RFC1691 - The Document Architecture for the Cornell Digital Library

Network Working Group W. Turner

Request for Comments: 1691 LTD

Category: Informational August 1994

The Document Architecture for the Cornell Digital Library

Status of this Memo

This memo provides information for the Internet community. This memo

does not specify an Internet standard of any kind. Distribution of

this memo is unlimited.

Abstract

This memo defines an architecture for the storage and retrieval of

the digital representations for books, journals, photographic images,

etc., which are collected in a large organized digital library.

Two unique features of this architecture are the ability to generate

reference documents and the ability to create multiple views of a

document.

IntrodUCtion

In 1989, Cornell University and Xerox Corporation, with support from

the Commission on Preservation and Access and later Sun Microsystems,

embarked on a collaborative project to study and to prototype the

application of digital technologies for the preservation of library

material. During this project, Xerox developed the College Library

Access and Storage System (CLASS), and Cornell developed software to

provide network access to the CLASS Digital Library.

Xerox and Cornell University Library staff worked closely together to

define requirements for storing both low- and high-resolution

versions of images, so that the low-resolution images could be used

for browsing over the network and the high-resolution images could be

used for printing. In addition, substantial work was done to define

documents with internal structures that could be navigated. Xerox

developed the software to create and store documents, while Cornell

developed complementary software to allow library users to browse the

documents and request printed copies over the network.

Cornell has defined a document architecture which builds on the

lessons learned in the CLASS project, and is maintaining digital

library materials in that form.

Document Architecture Overview

Just as a conventional library contains books rather than pages, so

the electronic library must contain documents rather than images.

During the scanning process, images are automatically linked into

documents by creating document structure files which order the image

files in the same way the binding of a book orders the pages. Thus,

the digital book as currently configured consists of two parts: a set

of individual pages stored as discrete bit map image files, and the

document structure files which "bind" the image files into a

document. In addition, a database entry is made for each digital

document which permits searching by author and title (i.e.,

bibliographic information). Beyond the order of the pages, the

arrangement of a physical book provides information to readers. The

title page and publication information come first; the table of

contents usually precedes the text; the text is divided into sections

or chapters; if there is an index, it follows the text. The reader

often refers to these components of a book when browsing the library

shelves, in order to determine whether to read the book.

The document structure provides direct access to the components of an

electronic document, storing the information that would otherwise be

lost when the book is disbound for scanning.

Document Architecture Requirements

Listed below are the requirements that were initially set down for

the Cornell Digital Library Architecture.

1. The architecture must be open (i.e., published and freely

available).

2. The architecture should be as simple as possible (to facilitate

product development).

3. The architecture should assume data storage in UNIX file systems.

4. The architecture should allow for standard data usage, such as via

FTP and Gopher servers (i.e., pages of a document must exist in a

single Directory, and the naming convention used must order them

in the standard collating sequence, such as the series "0001.TIF,

0002.TIF,..., 0411.TIF" (NOTE: a series such as "1.TIF, 2.TIF,...,

10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ..." which is not

acceptable).

5. The architecture should provide for storing the same information

in different formats. For example, when a page of a document is

available at several different resolutions.

6. Low-resolution "thumbnail" images of each page must be stored to

facilitate browsing and sharing of data.

7. The architecture must support distribution of files so that

similar files may be stored together, permitting optimization of

storage use and performance.

8. The architecture must support documents that are composed of

references to all or part of other documents.

9. The architecture must support document components which are

stored on separate servers distributed across the network.

10. The architecture must support not only an hierarchical structure

for each document, but the ability to define multiple views of

each document.

11. The architecture should accept, rather than dictate, directory

structures in which documents will be stored. This will permit

documents created in other ways to be added to the Digital

Library simply by adding database information rather than by

copying or moving files.

Document Architecture Description

A digital library consists of a Digital Library Server, networked

storage, and a referencing database. A single digital library will

contain one or more collections. Each collection will contain one or

more documents.

The referencing database allows searching for documents by author,

title, and document ID. In the current implementation, the

referencing database is a relational SQL database, and each

collection is epresented by a table in the database. It is planned

to migrate to Z39.50 database searching as the preferred method, as

this protocol has been established as the standard for library

applications.

Authorization will be primarily collection-based, although the design

will permit authorization checking at any level down to the

individual file. Notification would come only when the patron

attempted to open the document or access the particular component.

Each document consists of three components: the logical structure;

the physical references; and the data files.

The logical structure is a logical description of the document.

Conceptually, a document is a tree, with the leaves being the data

files (pages). At a minimum, all documents have a logical structure

which lists the pages in the document and the order in which they

appear. Usually, documents will have a more elaborate structure.

The logical structure relates the logical structure of a document to

the physical references which make up the document.

These physical references map the lowest levels of the document's

logical structure (the leaves of the tree) to the files that contain

the data. Where there are multiple representations of a page, such

as images at various resolutions, these are linked together in the

physical references file.

The data files contain the data making up a document. Any format can

be accommodated: image files, ASCII text, PostScript, etc. However,

one-to-one correspondence between data files for a given physical

reference is assumed. That is, if there are multiple file types for

a single page, these files should represent exactly the same

information.

Physical References File

The Physical References file is the component of the document which

relates logical structures (logical components of documents) to

physical files. Document references, by which a document can be

composed of all or part of other documents possibly residing on

different servers, are handled in the Physical References file.

A document may contain multiple document objects, each of which

contains one or more data objects. When a document contains actual

physical data (for example, it is created by scanning or importing

images), a Master Document Object is created. When a document

incorporates components of other documents, a Reference Document

Object is created for each of the other documents. The Document

Objects are numbered with internal reference numbers, which are

included in the corresponding Data Object lines.

Data Object lines include the Document Object number, the file

reference number, and the file type. The Document Object number

refers to a Document Object line, from which the library name,

collection name, and document ID can be retrieved. The tuple

is guaranteed to locate a file. Each Data Object line refers to a

single file; where multiple file types of a single document page

exist, there will be multiple Data Object lines for that page.

In the file, all Document Object lines will preceed all Data Object

lines for a given document. Document Object lines may be either

grouped together at the beginning of the file, or may immediately

preceed the first Data Object line for the Document Object. Document

Object lines will appear in order by Document Object number. Data

Object lines will appear in order by sequence number, NOT by Document

Object number.

The fields in the Physical References file are delimited by vertical

bars.

Document Object Lines

Field Description Comments

----- ---------------------- ----------------------------

1 Document Object number 0 => Master Document Object

1-9 => Reference Document Object

2 Library name Server name

3 Collection name

4 Document ID 8-digit number

5 Author name

6 Volume

7 Title

8 Edition

Data Object Lines

Field Description Comments

----- ---------------------- ----------------------------

1 Document Object number Corresponds to above

2 Sequence number

3 File reference Reference number used to locate

file in filing system

4 Physical reference number Equal to Logical Structure file

5 File type 1 = TIFF 600dpi

2 = TIFF thumbnail

3 = ASCII version of page

(i.e., OCR output)

4 = ASCII notes

5 = Other

6 = TIFF 300dpi

6 Note

Physical References File Example

+0CORNELLOLINLIB00000001Boole, Mary EverestPhilosophy Of Algebra

010000000251 (File ref. #2 = Phys. ref. #5 = 600dpi TIFF image)

020000000352 (File ref. #3 = Phys. ref. #5 = 100dpi TIFF image)

030000000461 (File ref. #4 = Phys. ref. #6 = 600dpi TIFF image)

040000000562 (File ref. #5 = Phys. ref. #6 = 100dpi TIFF image)

Note that in the above, it is guaranteed that file references 2 and 3

are two different versions of the same page, as are file references 4

and 5.

Logical Structure File

The Logical Structure file is the component of the document structure

which offers "views" of a document and links images together

logically to define documents. The file is actually an unloaded tree;

when a document is "opened", the file is read and the tree

reconstructed. By convention, all Logical Structure files contain one

logical structure "PAGES" which defines the document by listing the

pages in the order in which they appeared in the original document.

Document Structure lines

Field Description Comments

----- ---------------------- ----------------------------

1 Parent structure number Structure is a child of...

2 Sequence number

3 Logical Structure name Label for this structure

4 Structure number Equal to Physical Reference file

5 Logical Children # of logical children of this

structure

Document Structure lines (continued)

Field Description Comments

----- ---------------------- ----------------------------

6 Physical Children # of physical children of this

structure

7 References # of references to this

structure within this document

(for how many structures is this

a substructure)

Logical Structure File Example

00ROOT0400 Structure 0, ROOT, has 4 logical children

01PAGES110001 Str. 1, PAGES, has 100 logical children

02CONTENTS22201 Str. 2, CONTENTS, has 22 logical children

...has no physical children

...

11Production note5022 Str. 5 is child of structure 1

...has a label "Production note"

...has no logical children

...has 2 physical references

...is referenced twice in this document

126021 Str. 6 has no label

137021 Str. 7 has 2 physical references

148021 Str. 8 is referenced only here

159021 Str. 9 is 5th sequential child of PAGES

...

199103022

1100104022

21Production note105101 Str. 105 is a child of str. 2

22Title page106101 Str. 106 has 1 logical child

23Table of contents107201

24Chapter 1. From Arithmetic to Algebra108601

25Chapter 2. The Making of Algebras109401

26Chapter 3. Simultaneous Problems110401

27Chapter 4. Partial Solutions...111301

28Chapter 5. Mathematical Certainty...112301

29Chapter 6. The First Hebrew Algebra113801

210Chapter 7. How to Choose our Hypotheses114901

211Chapter 8. The Limits of the Teachers Function115501

212Chapter 9. The Use of Sewing Cards116401

...

220Chapter 17. From Bondage to Freedom124501

221Appendix125211

222advertisements126412

1051Production note5022 Str. 5 is a child of str. 105

1061Title page11022 2nd reference to str. 11

1071715022

1072816022

...

1264104022

Implementation Details

The tuple <library ID>+<collection ID>+<document ID>+<filetype>+

<file reference> is guaranteed to locate a file. A file locator

program will translate between this tuple and the fully-qualified

path and file name in the underlying file system. While a library

will always have a hierarchical nature corresponding to UNIX file

systems, the order of the hierarchy will be flexible to accommodate

optimization efforts. Each level of the hierarchy will have an INFO

file that describes the order of the lower levels of the hierarchy.

The file locator program will read these files as it navigates the

directory structure of the file system when a library, collection, or

document is opened. Two examples follow:

Example 1. Hierarchy is LIBRARY, COLLECTION, DOCUMENT, FILETYPE.

/<library name>

LIBINFO.TXT Description of library

/<collection name>

COLINFO.TXT Description of collection

/<document ID>

DOCINFO.TXT Description of document

LOGSTR.000 Logical structure file

PHYSREF.000 Physical reference file

/<filetype1>

00001.TIF

00002.TIF

...

/<filetype2>

00001.TIF

00002.TIF

...

Example 2. Hierarchy is LIBRARY, FILETYPE, COLLECTION, DOCUMENT.

/<library name>

LIBINFO.TXT Description of library

/<filetype1>

/<collection name>

COLINFO.TXT Description of collection

/<document ID>

DOCINFO.TXT Description of document

LOGSTR.000 Logical structure file

PHYSREF.000 Physical reference file

00001.TIF

00002.TIF

...

/<filetype2>

/<collection name>

COLINFO.TXT Description of collection

/<document ID>

DOCINFO.TXT Description of document

LOGSTR.000 Logical structure file

PHYSREF.000 Physical reference file

00001.TIF

00002.TIF

....

This implementation involves some redundancy, but it permits complete

copies of a collection to be mounted on different file systems for

performance considerations. In particular, the second scheme would

facilitate storing all low-resolution images on high-speed magnetic

disk for fast access, and all high-resolution images on slower, less

eXPensive storage. This will also facilitate authorizing access to

low-resolution images by other software systems (FTP, Gopher) while

restricting access to high-resolution images.

Security Considerations

Security issues are not discussed in this memo.

References

[1] Turner, W., "Cornell Digital Library Document Architecture,

Version 1.1 - 3/22/94", Library Technology Department, Cornell

University.

Author's Address

William Turner

Library Technology

502 Olin Library

Cornell University

Ithaca, NY 14853

Phone: 607-255-9098

Fax: 607-255-9346