This article was presented as a paper at the Research Libraries Group Forum, May 20-21, 1999, at Emory University.

The ATLAS Project of the Center for Electronic Texts in Religion:

50 Years of 50 Journals

James R. Adair
Director, ATLA Center for Electronic Texts in Religion

On January 1, 1999, the American Theological Library Association created the new Center for Electronic Texts in Religion (CETR), based in Atlanta. The purpose of CETR is to disseminate electronic texts of interest to scholars of religion, to promote the publication of original scholarly works in formats compatible with online study and distribution, to support other efforts to move the academic study of religion into the information age, and to remain at the forefront of advances in technology through a commitment to research and development. CETR will plan, evaluate, and direct a variety of electronic projects dealing with the academic study of religion. The first project sponsored by CETR will be ATLAS, the ATLA Serials project.

An Overview of ATLAS

ATLAS is designed to take 50 religion journals and 50 years' worth of volumes of each, for those that go back that far, digitize them, and make them accessible on the Web. In some cases, where a journal has been in existence for more than 50 years, ATLAS may work with the publisher to include the entire run of the journal. Having access to earlier scholarly research is important in a field like religion, where modern theories and approaches to the subject frequently build on, or react to, work that is decades old. We will market ATLAS journals both to institutions and to individual scholars.

Phase I: Encapsulated Images and Electronic Journals

Textual data can be digitized in two formats, encoded text and images, and ATLAS will digitize every journal in both formats. In Phase I of ATLAS, we will scan in all the pages of every journal as 600 dpi TIFF images. After they are converted to low resolution GIFs or JPEGs for display, the TIFF images will be archived (see below). When I was involved with the SELA journals project, a joint effort of Scholars Press and the Emory University Libraries, we experimented with different ways of displaying page images, and we settled on a method that involved associating each page image with an SGML "envelope" that contained a limited amount of information about the image (most importantly journal title and page number). We used the Ebind DTD developed at Berkeley to create valid SGML documents that allowed readers to look at the table of contents of the issue of a journal, select an article, then download the first page of the article and begin reading. Readers who already knew the specific page they were interested in could go directly to it (see http://shemesh.scholar.emory.edu/cgi-bin/Ebind2html/1/BA60.2 for an example).

In ATLAS Phase I we will use a similar, but improved, method of displaying the page images. First, we will use XML rather than SGML envelopes to surround the images. For the SELA project, we had to convert SGML pages to HTML on the fly. For ATLAS, we expect that XML browsers will be widely available by the time the first journals are available for beta testing (1 August 2000), so the problems inherent in converting to HTML should not impede the implementation of the project. Second, whereas the Ebind DTD only allows the user to record page numbers and the author and title of each article, the enhanced DTD that we develop will allow us to record a sizeable amount of additional metadata, in a format compatible with USMARC, for each article, including topics, chronological and geographical information, and scripture references treated. Furthermore, we will use the ATLA Religion Database (RDB) as a model for creating the front-end of our ATLAS search engine. Since all ATLAS journals are indexed in the RDB, the metadata for each article already exists in electronic form. By associating this metadata with the page images, we will create a powerful search tool, even though only page images will be available at this stage. However, full-text searches will have to wait for the fully encoded texts.
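
The idea of an XML envelope around a page image can be illustrated with a minimal sketch. The element names (page, journal, metadata, subject, scripture, image), the file name, and the sample values below are all invented for illustration; they are not taken from the Ebind DTD or from the enhanced DTD, which has yet to be written.

```python
import xml.etree.ElementTree as ET


def build_envelope(journal, volume, page, image_file, subjects, scripture_refs):
    """Wrap one scanned page image in a small XML envelope.

    All element and attribute names here are hypothetical placeholders,
    not the names used by any actual ATLAS or Ebind DTD.
    """
    root = ET.Element("page")
    ET.SubElement(root, "journal").text = journal
    ET.SubElement(root, "volume").text = str(volume)
    ET.SubElement(root, "number").text = str(page)

    # Article-level metadata of the kind an index record might supply.
    meta = ET.SubElement(root, "metadata")
    for subject in subjects:
        ET.SubElement(meta, "subject").text = subject
    for ref in scripture_refs:
        ET.SubElement(meta, "scripture").text = ref

    # The envelope points at the display image rather than embedding it.
    ET.SubElement(root, "image", {"href": image_file})
    return root


envelope = build_envelope(
    journal="Example Journal",        # placeholder values only
    volume=1,
    page=17,
    image_file="example_v01_p017.gif",
    subjects=["Example subject"],
    scripture_refs=["Genesis 1:1"],
)
xml_text = ET.tostring(envelope, encoding="unicode")
print(xml_text)
```

Because the metadata travels with each page, a search over subjects or scripture references can return page images even before any full text exists.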

The XML-encapsulated page images will be the first pages that we make available online, for two reasons. First, producing them is a relatively quick process, compared with fully encoded XML text, so we will be able to provide access to numerous journals fairly quickly. Second, even after the fully encoded XML texts are available, scholars will undoubtedly find errors in the encoded texts. Having access to the page images will allow them to determine whether the errors were originally present in print or whether they were introduced in the process of converting from print to electronic format. In the latter case, we will make the necessary corrections to the encoded text. In the former case, we will preserve both the original and the corrected forms of the text.

ATLA does not currently index electronic journals in the RDB, but it plans to begin indexing selected e-journals in early 2000. ATLAS Phase I will integrate the e-journals indexed by the RDB into its database, and since digitization is not an issue, the inclusion of e-journals should be relatively straightforward. Two issues will need to be addressed, however: archiving and varied HTML formatting. It is possible, of course, simply to archive the HTML format, but HTML does not allow the richness of markup possible in more sophisticated XML DTDs, so it is not an ideal archiving format. Furthermore, as XML browsers become widely accessible, many e-journals will begin to make the transition from HTML to XML in order to take advantage of its many powerful features, including increased metadata capability, enhanced encoding possibilities, improved display and linking mechanisms, and Unicode support. Migrating from HTML to XML is one step in the direction of determining an archiving format, but unless some consistency in encoding among various e-journals can be achieved, the e-journals included in ATLAS will not be as usable as the print journals. To address the problem of varied HTML formatting among e-journals, ATLAS staff will work with e-journal publishers and consortia like the Association of Peer-Reviewed Electronic Journals in Religion (http://purl.org/apejr) in order to develop an encoding scheme that will be viable for individual e-journals and for ATLAS itself.

Phase II: Fully Encoded Texts

Once the production and testing of the encapsulated image format is well under way (by the middle of 2000), the journals will be encoded in XML, probably in a DTD related to the SGML Text Encoding Initiative (TEI) DTD, though the specific DTD has yet to be determined. The fully encoded text will of course contain the same metadata as the encapsulated page images, so searching over a variety of fields will be possible, but full-text searching (including sophisticated Boolean searches, proximity searches, and searches based on the XML encoding) will be an added bonus. A moderate amount of tagging will be employed in ATLAS Phase II, with emphasis on those items that users are most likely to search for, including bibliographical information, scripture references, and citations of other works. The ATLAS search engine will be the most powerful and useful tool of this sort available to religion scholars, allowing them to search the collection for individual words (or parts of words), subjects, or scripture references, in many different combinations.
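
To make the difference between Boolean and proximity searching concrete, here is a minimal sketch over plain strings. It is not the ATLAS search engine; the function names and the sample sentence are assumptions made for the example, and a production system would work from an index rather than rescanning the text.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def contains_all(text, *words):
    """Boolean AND: every search word occurs somewhere in the text."""
    tokens = set(tokenize(text))
    return all(word.lower() in tokens for word in words)


def within(text, word_a, word_b, max_distance):
    """Proximity search: the two words occur at most max_distance words apart."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)


sample = "The covenant at Sinai is central to the theology of Exodus"
print(contains_all(sample, "covenant", "exodus"))
print(within(sample, "covenant", "sinai", 3))
print(within(sample, "covenant", "exodus", 3))
```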

ATLAS as a Preservation Project

The idea for ATLAS developed out of earlier discussions about creating a digital archive for paper journals, and though our plan for the project has moved well beyond those earlier musings, the importance of having a digital archive has always remained one of the core components of the project. While ATLAS was being planned, ATLA was also engaged in a pilot project involving the digitization of one volume each of five different journals, and our experiments with these journals led us to choose to digitize pages as 600 dpi TIFF images. At this resolution, even the smallest footnotes in older journals will be readable. This resolution is also sufficient to record illustrations, line drawings, and even photographic reproductions that are present in the journals we plan to digitize. Furthermore, 600 dpi TIFF images for printed pages meet or surpass current recommendations of the Library of Congress (see http://lcweb.loc.gov/webstyle/fileform.html).

We will store the 600 dpi images of the journal pages in a database, thus creating a digital archive (the ATLAS archive will also include both the encapsulated images and the fully encoded texts). A debate is currently raging in the archiving community over the question of whether a digital archive is really an archive at all, since most digital storage media degrade rather rapidly over time. Perhaps more significantly, technology is advancing so quickly that even if twenty-year-old media are still intact, they often cannot be read, because (1) no machines that are capable of reading the media still work and (2) software doesn't exist to convert from the older format to formats that are used today. The solution, many archivists say, is to continue using microfilm as the archival medium of choice. ATLA has been archiving journals on microfilm for fifty years and recognizes the advantages that microfilm continues to offer even now at the beginning of the digital age, so we plan to seek additional funding to preserve ATLAS journals on microfilm. However, we are also convinced of the viability of digital archives, and we consider the images and encoded text stored on electronic media to be just as important to archive as the microfilm. Unlike a microfilm archive, in which the camera master is stored in a vault, where it might remain for up to 500 years, a digital archive must be refreshed periodically to ensure that the media on which the images are stored are still valid, and the formats must be updated as new formatting standards are developed. A digital archive is thus somewhat more difficult to maintain, but it has a number of advantages as well: ability to produce as many perfect digital copies as needed without degradation of the original, much quicker copying time, and much lower storage costs. The ATLAS digital archive will employ an integrity checking scheme based on checksums to ensure that the archive remains viable over time. Copies of the ATLAS archive will be maintained locally, in mirror locations, and in other electronic archives.
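
A checksum-based integrity check of the kind mentioned above might look like the following minimal sketch. The paper does not specify an algorithm or a manifest layout; the use of SHA-256, the function names, and the dictionary manifest are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of one file, reading it in chunks."""
    digest = hashlib.new(algorithm)  # SHA-256 is an assumed choice
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir):
    """Record a checksum for every file under archive_dir."""
    root = Path(archive_dir)
    return {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(archive_dir, manifest):
    """Return relative paths that are missing or no longer match the manifest."""
    root = Path(archive_dir)
    damaged = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or file_checksum(candidate) != expected:
            damaged.append(relative_path)
    return damaged
```

Running verify on each refresh cycle, and against each mirror, would flag any file that needs to be restored from another copy.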

Standards Employed in ATLAS

The bane of many an electronic project is proprietary software that only runs on certain platforms or data structures that can only be read by proprietary programs. The ATLAS designers acknowledge the importance of using existing and developing standards in the project. XML was chosen as the primary encoding scheme for the ATLAS project because it is an open standard: a W3C Recommendation defined as a subset of SGML, which is itself an international standard recognized by the ISO (ISO 8879). In addition to XML, other standards that will be used in the ATLAS project include USMARC (a library cataloging standard), the Dublin Core (a content description model for electronic resources), and RDF (a standard framework for processing metadata). In addition to these widely recognized standards, ATLAS designers will work with the Electronic Standards for Biblical Language Texts seminar of the Society of Biblical Literature, which is working on a variety of standards of interest to scholars of religion.

Problems to Overcome

The biggest technical question in terms of display concerns languages like Hebrew and Arabic, which are written right-to-left. If the XML browsers that become available fully conform to the XML standard, they will be fully Unicode compliant as well. On one level, Unicode is a character encoding scheme that uses two bytes (sixteen bits) to represent each distinct character, unlike ASCII, which uses seven bits, and its eight-bit extensions such as Latin 1, which use one byte. Whereas at most 256 characters can be represented in a single byte, 65,536 can be represented in two. Intermingling English text with Greek and Hebrew, for example, in a one-byte encoding requires the use of multiple fonts, since more than 256 characters are required to display all the letters, numerals, diacritical marks, punctuation, and special characters in these three languages. In Unicode, however, each distinct script has its own block of code points. So, for example, Western European languages can be represented by the 256 characters of ASCII and its Latin 1 extension, and Greek has its own block of characters, as do Hebrew (also used for Aramaic and Yiddish), Arabic, and Devanagari, the script used for Hindi (the special considerations for dealing with Chinese, Japanese, and Korean are not considered here; see the Unicode Standard, version 2.0). Reserving a block of characters for each script is not all Unicode does. It also defines which direction scripts run (directionality, e.g., left-to-right, right-to-left) and how characters should be displayed when surrounded by certain other characters (contextual characters, e.g., a final sigma in Greek or a medial nun in Syriac). Fully Unicode compliant XML browsers will solve these display problems that have haunted HTML for years (for a fuller discussion of the problems and various solutions to displaying multilingual documents on the Web, see my article "TC: A Journal of Biblical Textual Criticism: A Modern Experiment in Studying the Ancients," Journal of Electronic Publishing 3 [1997]; URL: http://www.press.umich.edu:80/jep/03-01/TC.html).
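
The per-script blocks and the directionality property described above can be inspected directly. The following sketch uses Python's standard unicodedata module, which reflects a much later version of the Unicode character database than version 2.0; the helper name describe and the choice of sample characters are assumptions made for the example.

```python
import unicodedata


def describe(ch):
    """Return the code point, official name, and bidirectional class of a character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))


# One letter each from the Latin, Greek, Hebrew, and Arabic blocks.
# Class "L" is left-to-right, "R" is right-to-left, and "AL" is
# right-to-left Arabic letter.
for ch in "A\u03b1\u05d0\u0627":
    print(describe(ch))
```

A renderer that honors these classes can lay out mixed English and Hebrew text correctly without the author switching fonts or reversing strings by hand.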

One of the challenges that faces any encoding project is integrating the various documents that are part of the project with one another and, equally importantly, with documents external to the project, some of which might not have been digitized, or even have been created, yet. One of the problems we will address in the ATLAS project is the issue of linking documents internal to ATLAS with those that either already or may someday exist elsewhere in electronic format. What we propose to do is use standard reference schemes such as ISBN and ISSN to identify documents uniquely. These schemes will have to be extended to allow us to refer to subparts (e.g., articles within a specific volume of a journal). Furthermore, since many of the works that are of interest to religion scholars were created before the advent of ISBN and ISSN, I am proposing a supplemental reference scheme, tentatively called the International Standard Catalog Number (ISCN), that can be used to identify older works. ISCN numbers would be based on the models of ISBN and IP numbers and would be assigned by organizations such as the ALA and the ATLA. The various standards used to identify written, graphic, and other works will be built into a permanent Uniform Resource Name scheme that will allow them to be referenced from ATLAS documents, whether or not they already exist online.
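
As a sketch of how a standard identifier could be validated and then extended to a subpart, the following code checks an ISSN's check digit (weights 8 down to 2, modulo 11, with X standing for 10) and builds an article-level reference from it. The layout of the extended reference is invented purely for illustration and is not a registered URN syntax or the scheme ATLAS will adopt; the ISSN in the usage line is simply a check-digit-valid sample input.

```python
def issn_check_digit(first_seven):
    """Compute the ISSN check character for the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_valid_issn(issn):
    """Check the form NNNN-NNNC, where C is the computed check character."""
    digits = issn.replace("-", "").upper()
    if len(digits) != 8 or not digits[:7].isdigit():
        return False
    return issn_check_digit(digits[:7]) == digits[7]


def article_reference(issn, volume, first_page):
    """Build a hypothetical subpart identifier for an article.

    The 'vol' and 'p' components are an assumed extension for illustration.
    """
    if not is_valid_issn(issn):
        raise ValueError(f"not a valid ISSN: {issn!r}")
    return f"urn:issn:{issn}:vol{volume}:p{first_page}"


print(article_reference("0378-5955", 60, 65))
```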

The reference schemes discussed to this point are designed to identify a unique state of a document, such as the second edition of a particular systematic theology textbook. However, many of the references in documents related to religion are canonical rather than specific references. That is, they refer to a particular passage within a body of literature but not to a specific translation or edition of that passage. For example, a religion article is more likely to refer to Genesis 1:1, a canonical reference, than to the New Revised Standard Version's English translation of Genesis 1:1. Canonical references are not limited to the scripture of various religious traditions but apply as well to classical texts (e.g., Xenophon, Anabasis 4.2.7-12). In conjunction with the Electronic Standards for Biblical Language Texts seminar, ATLAS personnel will develop a scheme for referencing canonical passages. Users who click on a link to a canonical reference will be allowed to choose which specific instance of currently available versions of the canonical reference they would like to view. Preliminary work in this area is already being undertaken with regard to the biblical text, thanks to a grant from the Society of Biblical Literature.
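
The distinction between a canonical reference and its specific instances can be sketched as a lookup from one parsed reference to the editions that contain it. Nothing below reflects the scheme under development; the CanonicalRef type, the simple "Work chapter:verse" pattern, and the EDITIONS registry with its version labels and paths are all hypothetical.

```python
import re
from typing import NamedTuple


class CanonicalRef(NamedTuple):
    work: str
    chapter: int
    verse: int


# Handles simple forms such as "Genesis 1:1" or "1 Kings 2:3" only.
REF_PATTERN = re.compile(
    r"^\s*(?P<work>[1-3]?\s?[A-Za-z]+)\s+(?P<chapter>\d+):(?P<verse>\d+)\s*$"
)


def parse_reference(text):
    """Parse a 'Work chapter:verse' string into a CanonicalRef."""
    match = REF_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized reference: {text!r}")
    work = " ".join(match.group("work").split())
    return CanonicalRef(work, int(match.group("chapter")), int(match.group("verse")))


# Hypothetical registry: version label -> passages it contains -> location.
EDITIONS = {
    "version-a": {
        CanonicalRef("Genesis", 1, 1): "editions/version-a/genesis/1/1",
        CanonicalRef("Genesis", 1, 2): "editions/version-a/genesis/1/2",
    },
    "version-b": {
        CanonicalRef("Genesis", 1, 1): "editions/version-b/genesis/1/1",
    },
}


def available_versions(ref, editions=EDITIONS):
    """List the versions that offer an instance of the canonical passage."""
    return sorted(name for name, passages in editions.items() if ref in passages)


print(available_versions(parse_reference("Genesis 1:1")))
```

A link in an article would carry only the canonical reference; the list returned here is what a reader would be offered to choose from.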

The Future of ATLAS

The ATLAS project is a three-year project slated to digitize approximately 50 journals. The ATLA indexes over 600 journals in its Religion Database and adds additional titles every year (including electronic journals beginning in 2000), so there is plenty of relevant material to digitize. We have designed ATLAS to be self-sustaining by the end of the third year of the project, and we project sufficient revenues to allow us to digitize an additional 20 to 30 journals per year. However, in order to make more material available in a timely manner, we will seek additional funding to enlarge the ATLAS archive more quickly.

Prospective CETR Projects

As noted earlier, ATLAS is only the first of many projects that will be managed by the ATLA Center for Electronic Texts in Religion. Other projects that we are interested in working on include the digitization of manuscript facsimiles and early critical editions of the biblical text, the online library catalog as a research tool, the digitization of denominational newspapers and historical documents, and the establishment of a clearing house for electronic projects of interest to religion scholars. The ATLAS online religion journal collection is a project that is being created for religion scholars by religion scholars. We hope that it will be only the first of many CETR undertakings that will be of value to scholars and students of religion.