Advanced Technology and the History of the Book


Daniel V. Pitti
Project Director
Institute for Advanced Technology in the Humanities
University of Virginia


I. Introduction

 

One of the most often stated and least justified claims of apologists for the "digital revolution" has been that the Internet has succeeded in replacing an obsolete print culture. Such claims have met with justified skepticism among students of the history of the book, but, unfortunately, justifiable skepticism has all too frequently given way to a less discriminating hostility. Despite exaggerated claims and special pleading from those who hype the Internet, scholars and students of the history of the book should recognize that advanced network and computing technologies have the potential to significantly advance their studies by facilitating collaboration, improving access to essential resources, and providing new methods of publication. While the technology should be approached cautiously, much of what is currently available is sufficiently mature, stable, and broadly supported to merit its use by archives, librarians, and scholars.

For disciplines, such as the history of the book, that are inherently international, network and computing technology can not only facilitate communication among scholars distributed around the world, but also provide universal, union access to distributed resources essential to the study of the book. In some cases, selective access to digital representations of primary source materials can (for some but by no means all research purposes) function as adequate surrogates for the originals themselves. For other research objectives, they can facilitate analysis and research that would be difficult, or perhaps practically impossible, when using the original materials. The purpose of this paper is to lay before you the progress along these lines made by a number of intellectual disciplines closely related to the history of the book. Although I do not want to positively assert that any of what follows is a road map for what ought to be done by the "community" of book historians, I hope that you will see that these approaches to advanced technology offer rigorous and potentially viable means of pursuing fruitful computer assisted research in the history of the book.

 

II. Technology: Evaluation

Many librarians, archivists, curators, book historians, and other book professionals have been reluctant to embrace advanced technology. In part this is a response to the exaggerated claims and naïve predictions of technology enthusiasts, and in part it is an understandable and prudent response to emerging and unproven technologies.

For well over thirty years, "visionaries" have been predicting the death of cataloging, books, libraries, and publishers, and in some Post-Modern inspirations, authors and readers as well. It is easy to dismiss the enthusiasts, if not ignore them. Many of them, based on their affiliations, are obviously inspired more by social and commercial self-interest than they are by the technology. Many of their predictions have simply failed to materialize, or attempts have resulted in humiliating disasters. They frequently are forced to humbly retreat when reality turns out to be more complex than they initially thought, or ask, yet again, for a lot more money and a little more time. The complex nature and role of the book and the institutions supporting and depending on it have proven not to be easily reducible.

In spite of the enthusiasts among them, many thoughtful technologists have learned from their technical (and political) failures, and have applied the lessons to developing technologies that enable users to define and solve their own problems in ways that are appropriate and responsible. In this regard, they have made particularly important advances in developing publicly owned standards that ameliorate the dependency of data on proprietary hardware and software (and technologists). These standards enable users to take control over their data, representing and exploiting it to serve their own interests and objectives. Rather than technologists determining the future, inventing and imposing on others solutions to problems imagined by them, users now increasingly have it within their power to master the technology and employ it to serve their own objectives. It is now the responsibility of users to recognize and take advantage of the emerging and existing opportunities.

The experience of librarians and technologists in applying technology to cataloging provides a useful example from which we can learn both good and bad methodologies. Early collaborations between technologists and librarians were fraught with misunderstanding and miscommunication. The technologists initially woefully underestimated the complexity of books, publishing, and cataloging. Many naïvely viewed libraries as large warehouses, and catalogs as inventories of the items stored in them. Completely overlooked or ignored was that most warehouses contain a large number of a small variety of items, while libraries contain a large number of mostly unique items.1 Further, the unique items are frequently "ill behaved," defying easy categorization and description, and exist in extremely complex interrelations with one another. All of this and more make even the most basic catalog far more complex than a simple inventory. Catalogers were also naïve about the technology. They frequently had little or no experience with it, and tended either to accept the view of the technologists without question, or to reject the technology out of hand. There was little or no mutual understanding or shared terminology.

Despite all of this, the technologists and catalogers managed to invent something extremely useful, machine-readable cataloging, or MARC, as we have come to know it. But they also made some serious mistakes, some of which led to damage that was expensive and, in some cases, impossible to repair. A noteworthy and instructive example is in the area of authority control and keyword access.

When keyword searching became possible, many technologists predicted that it rendered authority control obsolete. Based on this prediction, many library administrators instructed their catalogers to stop doing authority control. While the librarians and the users of catalogs quickly determined that keyword searching was a powerful and useful tool, enabling retrieval impossible with printed cards, they also determined that it was no substitute for experienced professionals making difficult judgments and distinctions, and recording them in machine-readable and therefore exploitable form. For example, computers are still incapable of recognizing that all of the following names refer to the Muslim philosopher al-Ghazzali (1058-1111): Ghazzali, Gazali, Abu Hamid Muhammad ibn Muhammad ibn Ahmad al-Ghazzali, Ghasali, Algazali, Algazel, Ghazali, Al-Ghazali, Houjjatoul-Islam, Mohamed Mohamed Toussi, Mohamed Mohamed al-Ghazali, and Ebu Hâmid Muhammed el-Gazâlî. Or that the following all refer to William Shakespeare (1564-1616): William Shakespeare, William Shakspeare, Uiliam Sek`spiri, Gouilliam Saixper, William Shakspere, Wilyam Shikisbir, Wiliam Szekspir, Sekspyras, Vil'iam Shekspir, Viljem Sekspir, Tsikinya-chaka, Sha-shih-pi-ya, Shashibiya, Vilyam Shekspir, Vilyam Shakspir, Syeiksup`io, William Szekspir, Guglielmo Shakespeare, William Shake-speare, Sha-o, Sekspir, Uiliam Shekspir, Vilijam Sekspir, V. Shekspir, and U. Shekspir.2

On the other hand, the computer can do some wonderful and some not too wonderful things with these distinctions, once they are recorded in computer-readable form. Using keyword access, computers can easily direct users from unused to used headings. Discovery that might take a great deal of time and persistence in a card catalog, and might not be possible at all, is made efficient. But these same distinctions, if not carefully used, can also lead to embarrassing and perhaps disastrous consequences.
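
The redirection from unused to used headings can be illustrated with a minimal sketch in Python. It assumes a tiny, hand-built authority file; the form of the two authorized headings and the selection of variants (taken from the lists above) are illustrative only and are not drawn from any actual Library of Congress authority record. The names AUTHORITY_FILE, normalize, build_index, and resolve are hypothetical.

```python
# Hypothetical authority file: one authorized heading, several unused variants.
# The headings and the selection of variants are illustrative assumptions.
AUTHORITY_FILE = {
    "Ghazzali, 1058-1111": [
        "Gazali", "Ghasali", "Algazali", "Algazel", "Ghazali", "Al-Ghazali",
    ],
    "Shakespeare, William, 1564-1616": [
        "William Shakspeare", "William Shakspere", "Wiliam Szekspir",
        "Guglielmo Shakespeare",
    ],
}


def normalize(name):
    """Fold case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.casefold().split())


def build_index(authority_file):
    """Map every authorized heading and every variant to its authorized heading."""
    index = {}
    for heading, variants in authority_file.items():
        index[normalize(heading)] = heading
        for variant in variants:
            index[normalize(variant)] = heading
    return index


def resolve(name, index):
    """Return the authorized heading for a name, or None if a person must decide."""
    return index.get(normalize(name))


index = build_index(AUTHORITY_FILE)
print(resolve("algazel", index))
print(resolve("Guglielmo Shakespeare", index))
print(resolve("Madonna", index))
```

The lookup itself is trivial; everything of value lies in the recorded judgments that populate the authority file, which is the point made above.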

One particular mishap in computer assisted authority control has become notorious. In the 1980s, OCLC, the large, international bibliographic utility located in Dublin, Ohio, decided that it needed to "clean up its catalog" after many of its clients complained repeatedly that it had a serious authority control problem. The programmers at OCLC decided to write a program that would match headings used in catalog records against unused variants found in Library of Congress authority records, and where they found matches, substitute the heading in the authority record for the heading in the bibliographic record. On the surface, this seemed like a perfectly reasonable thing to do. Unfortunately, it had many unforeseen consequences. One has become well known: the program changed all of the headings for Madonna, the popular singer, into "Mary, Blessed Virgin, Saint." Many librarians specializing in authority control took this as clear evidence that computers could never do authority control, as any reasonably informed human being would not make such a stupid mistake.

This evaluation, however, was not entirely sound. Many of the librarians saw only the "collateral damage," and failed to recognize that a good portion of the program worked quite well, and accomplished the desired goal. The programmers involved, having experienced success tempered by embarrassing humiliation, analyzed their failures, and began to approach the problem of identifying when the same name did and did not apply to the same entity more carefully. They improved the algorithms to recognize when a match was "safe," for example a personal name qualified by one or more life dates, and when it was not, for example, a personal name with only two components and without qualification. A careful, fair analysis of the results demonstrated that they could do a lot programmatically, but that there would always be a remainder that only a trained, intelligent, professional person could sort out. They switched from trying to make the computer do everything, to trying to do as much as could be done accurately and reliably while leaving the remainder to the professionals with suggestions and information to help them in their problem solving. Technology, carefully and deliberatively applied, could perform many of the most routine and tedious chores, while isolating the most challenging tasks for librarians. The result is that catalogers now have more time to spend on the problems and challenges that resist reduction to computer algorithms.

The development and application of MARC has taught technologists and librarians a great deal. Much though not all of the early overconfidence in technology, on one hand, and overly skeptical assessment of it, on the other hand, have been displaced. In their place are clear, realistic collaborative assessments of what technology can and cannot do, applications that exploit computation while respecting, facilitating, and exploiting the irreducible contributions of professional catalogers, and more careful, reversible experiments that test and extend the current technological limits.
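
The distinction between "safe" and unsafe matches described above can be sketched as follows. This is not OCLC's program, whose details are not given here; it is a minimal illustration in Python that assumes a single rule, namely that a heading qualified by life dates may be changed automatically and anything else is set aside for a cataloger. The names LIFE_DATES, is_safe_match, and triage are hypothetical.

```python
import re

# A four-digit year, a hyphen, and an optional closing year,
# for example "1564-1616" or an open-ended "1958-".
LIFE_DATES = re.compile(r"\b\d{4}-(\d{4})?")


def is_safe_match(heading):
    """Assumed rule: a heading qualified by life dates is safe to change automatically."""
    return bool(LIFE_DATES.search(heading))


def triage(headings):
    """Split headings into those handled by program and those left to a cataloger."""
    automatic, for_review = [], []
    for heading in headings:
        if is_safe_match(heading):
            automatic.append(heading)
        else:
            for_review.append(heading)
    return automatic, for_review


automatic, for_review = triage([
    "Shakespeare, William, 1564-1616",
    "Madonna",
    "Smith, John",
])
print("automatic:", automatic)
print("for review:", for_review)
```

The design choice worth noting is that the program does not attempt the doubtful cases at all; it only isolates them for professional judgment.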

MARC has been unquestionably successful. The emergence of online, networked catalogs has realized for the first time in history the long held dream of the "universal catalog." Since the late 1980s, there have been major advancements in computer and network technologies and our power to represent and exploit information to serve a wide variety of goals and interests. As we encounter these emerging technologies, we need to do so with a methodology informed by our past experience.

Historians of the book can draw on the experience of the librarians in adapting technology to serve their professional objectives. Historians need to acquaint themselves with the technology, to understand and evaluate what it can and cannot do, and to determine appropriate and responsible uses of it. They also need to collaborate and work with technologists. This collaboration necessarily requires assisting technologists in understanding and respecting the complexities of their discipline. Shared terminology needs to be negotiated. Technologists alone should not determine the applications. The historians should proceed deliberatively, based on informed and careful consideration of what is and is not possible now, and what is and is not likely to be possible in the near future. They should welcome what is clearly useful, adapting it to serve their professional goals and interests, and defer use of technologies whose stability, utility, and independence are not demonstrated.3

 

III. Technology: Overview

Since the early development and use of MARC, there have been major technological developments that are having an extensive, pervasive impact on all of the major institutions of modern society and culture. The Internet lies at the center of these developments. It interconnects and thus makes other technologies far more powerful than they would be in isolation. At the same time and as a direct result of its empowering of other technologies, it constitutes the most important economic force driving much of the investment and development of computer and communication technology. Database technologies have matured significantly, enabling complex, large-scale representation and manipulation of certain classes of information. Since the early 1980s, the emergence and development of markup technologies has for the first time made it possible to accurately represent texts of arbitrary length and complexity based on user supplied specifications. A variety of technologies for the digitization of existing media and original digital creation of analogues of existing media has also emerged. Finally, the emergence of affordable, increasingly powerful personal computers and increasingly accessible software has put all of these technologies into almost all major institutions, and into the homes of many private citizens as well.

Internet
The Internet stands out as the most transforming of the computing related technologies to emerge in the last twenty years because it interconnects computers and the people using them wherever they are in the world, and at any time of the day or night. While most of us first became aware of the Internet in the early 1990s, the research and early prototypes that led to it began in the 1960s. As the network began to be realized in the 1970s and 1980s, its potential for facilitating communication between computers and through them people became increasingly clear. New standards and software emerged to take advantage of the potential. Telnet emerged as a way for a user on one computer to connect to another computer. Researchers developed File Transfer Protocol (FTP) to enable moving files back and forth between computers. Other researchers developed Electronic mail (Email) to enable sending and receiving messages. Related to FTP, researchers developed client-server technology, to support distributed processing of information. In conjunction with Email, scientists developed listserv technology to support group discussion and communication. Still later, researchers invented Hypertext Markup Language (HTML) to enable online viewing of texts with "links" that led to yet other texts. To take full advantage of hypertext, researchers developed browsers that enabled not only interrelating texts, but also interrelating texts and other digital media. The emergence of browsers and hypermedia coupled with the Internet in the early 1990s spawned the current Internet "phenomenon."

The Internet has had an effect on most if not all of us, professionally or personally, or both. We are more than likely to use email on a daily basis to communicate with individuals within our own institutions, and with colleagues in others. We are also likely to subscribe to one or more listservs that focus on one or more of our professional responsibilities or intellectual interests. We can easily send messages and files. We are also likely to use browsers and various indexing utilities such as Google to find, retrieve, and read or print articles and essays of interest to us, or to access the Oxford English Dictionary, the Encyclopedia Britannica, or the bibliographic catalogs in various research libraries. While the technology is far from flawless (proprietary files have been a persistent problem), it has progressed significantly and continuously over its short history, to the point where many of us cannot remember a time when it was not there. We mostly do not realize how much we have come to depend on it, except in moments when we do not have access when we expect to. For many scholars, especially humanists and social scientists, who had no access to the early networks and frequently had little or no opportunity to collaborate and communicate easily and regularly, it has improved immeasurably our communication with colleagues, and our access to information.

Database Technology

Database technology is designed to store, manipulate, and access large volumes of highly regularized data. Modern database technology began in the early 1960s with efforts to develop techniques to conceptualize, structure, and manipulate data independent of the specific hardware used. The principal types of database are hierarchical, network, relational, and object-oriented, with relational databases being the most prevalent. While object-oriented databases have not been widely implemented, the technology has contributed conceptual and functional models that have influenced the most recent relational database implementations. Relational databases with functionality inspired by object-oriented databases are called object-relational databases. Affordable yet sophisticated relational database software frequently comes packaged with personal computers.

The widespread availability of database technology enables individual scholars to compile and manipulate large amounts of data to support research. In addition to individual projects, the technology supports large collaborative projects, enabling scholars and researchers to cooperatively build shared sets of various kinds of data: bibliographic, census and demographic, statistical, genealogical, and others. In addition, database technology provides the data infrastructure for sophisticated Geographic Information Systems and computer graphics. Database technology also provides the infrastructure for MARC-based online catalogs and maintenance systems, as well as access, description, and control systems in archives and museums.

Database technology is most useful for representing and exploiting specific kinds of data.4 The kind of information found in forms and questionnaires fits well in databases. For example, personnel records and job applications are perfect candidates, as are records describing publishers and book traders.5 In general, database suitable documents have the following characteristics: each document has the same set of data elements; the order of the data elements in any given record is not important; and the data elements in any record have few or no hierarchical relations with one another. Documents that are not generally suitable for databases have the following characteristics: the documents differ from one another in the number, kinds, or sequence of components; the order of the document components is important; and components have many, frequently unbounded hierarchical relations with one another. Texts, such as those found in books and journals, belong to this type, and markup technology rather than database technology has emerged as the optimum way to represent and exploit them.
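
A record type that meets the three criteria for database suitability can be sketched with the SQLite library that ships with Python. The table name, the column names, and the three records below are invented for illustration and describe no real book traders; the sketch only shows that when every record has the same flat set of elements, retrieval is a matter of a simple query.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Every record has the same elements, their order carries no meaning,
# and no element is nested inside another.
conn.execute(
    "CREATE TABLE book_trader (name TEXT, city TEXT, trade TEXT, active_from INTEGER)"
)

# Invented records, for illustration only.
records = [
    ("Trader A", "London", "printer", 1650),
    ("Trader B", "Leiden", "bookseller", 1662),
    ("Trader C", "London", "bookseller", 1671),
]
conn.executemany("INSERT INTO book_trader VALUES (?, ?, ?, ?)", records)

london = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM book_trader WHERE city = ? ORDER BY name", ("London",)
    )
]
print(london)
```

A novel or a journal article has no comparable fixed set of columns, which is why the discussion turns next to markup.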

Markup Technologies

All information in computers is encoded to facilitate processing it. In the early history of computing, the codes associated with textual information were typically procedural codes. Procedural codes specify certain operations or procedures that are to be applied to the information. Word processing programs, the most common text application, associate codes with text to facilitate printing it. The various codes represent different styles that are to be applied to information. For example, the title of an article might have codes that will facilitate centering it on the page, and printing it in a large, bold font. Most procedural encoding is proprietary and devoted to one output, the most common of which is print.

In the late 1970s and early 1980s, an alternative to procedural encoding emerged. Instead of embedding procedural codes in texts, descriptive or declarative codes were embedded. Descriptive encoding of text specifies what the text and its components are rather than the procedures to be applied to it. Descriptive encoding is a process of naming, as opposed to procedural encoding, which is a process of associating verbs or actions with text and text components. The descriptive or declarative approach has the major advantage of supporting multiple procedures, even procedures not anticipated at the time of encoding. For example, declarative markup might state that a given string of text is a title. This string can then be printed using one set of procedures, displayed on a computer screen using another, and indexed using yet another.
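
The title example can be made concrete with a minimal sketch in Python. The element names article, title, and para are assumptions made for this illustration and belong to no particular standard; the three functions stand in for printing, screen display, and indexing, and are hypothetical.

```python
import xml.etree.ElementTree as ET

# The markup says what each piece of text is, not how to process it.
SOURCE = (
    "<article>"
    "<title>Advanced Technology and the History of the Book</title>"
    "<para>One paragraph of text.</para>"
    "</article>"
)

doc = ET.fromstring(SOURCE)
title = doc.findtext("title")


def for_print(text, width=60):
    """One procedure: center the title in capitals on a line of fixed width."""
    return text.upper().center(width)


def for_screen(text):
    """A second procedure: set the title off with simple markers."""
    return f"== {text} =="


def for_index(text):
    """A third procedure: extract the longer words as index terms."""
    return sorted({word.casefold() for word in text.split() if len(word) > 3})


print(for_print(title))
print(for_screen(title))
print(for_index(title))
```

None of the three procedures was named in the encoding itself, and a fourth could be added later without touching the text.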

Standard Generalized Markup Language (SGML), first codified in 1986 by the International Organization for Standardization (ISO), is a descriptive method of representing or encoding textual information in computers. While SGML is both standard and generalized, it does not provide an off the shelf markup language that one can simply take home and apply to a letter, novel, article, catalog record, or finding aid. Instead it is a markup language meta-standard, or in simpler words, a standard for constructing markup languages. SGML provides conventions for naming the logical components of documents, and a syntax and meta-language for defining and expressing the logical structure and relations among the components. SGML is a set of formal rules for defining specific markup languages for individual kinds of documents. Using these formal rules, members of a community sharing a particular type of document can work together to create a markup language specific to their shared document type.

The specific markup languages expressing these analytic models and written in compliance with formal SGML requirements are called Document Type Definitions, or DTDs. For example, the Association of American Publishers has developed four DTDs: for books, journals, journal articles, and mathematical formulae. After thorough revision, this standard has been released as an ANSI/NISO/ISO standard, 12083.6 A consortium of software developers and producers has developed a DTD for computer manuals called DocBook. The Text Encoding Initiative (TEI) has developed a complex suite of DTDs for the representation of literary and linguistic materials.7 Archivists have developed a DTD for archival description or finding aids called Encoded Archival Description (EAD).8 There are even several DTDs for representing various varieties of MARC. A large number of government, education and research, business, industry, and other institutions and professions are currently developing DTDs for shared document types.9 DTDs shared and followed by a community can themselves be standards. ANSI/NISO/ISO 12083, DocBook, TEI, and EAD are all standard DTDs.

HyperText Markup Language (HTML) is an SGML DTD that has enjoyed enormous success as the encoding standard underpinning the World Wide Web. As a specific application of SGML, the HTML DTD limits itself to simple procedural encoding dedicated to online display and hypermedia linking. Constraining the set of tags has made it easy to build applications that make life relatively easy for authors and Web publishers. The ease of use has been a major factor in the Web's remarkable success.

The developers of HTML, the World Wide Web Consortium (W3C), recognized that HTML, as useful and popular as it has been, would not support complex, community-based use of shared information on the Internet. Because HTML implements a small, closed set of procedurally oriented tags, it is incapable of supporting sophisticated searching, navigation, display, and communication. Evidence of HTML's limited ability to support intelligent searching and document discovery, let alone complex display, navigation and other processing, is not difficult to find. Many of us have used Web search engines to look for both known items and items on a particular topic. More often than not, we are overwhelmed by voluminous results, with many and perhaps most of them being irrelevant. Our patience frequently is exhausted looking for an item or two that satisfies our need. The small, closed tag set has thus come at a price: HTML has extremely limited functionality.

The W3C recognized in SGML's declarative approach and extensibility the means to overcome the limits of HTML, but they also noted that SGML presents its own set of problems. It is very complex for software developers, and as a result, software products for exploiting the richness of the descriptive encoding have been limited in number and almost always expensive. In 1996, the World Wide Web Consortium (W3C) founded the EXtensible Markup Language (XML) Working Group to address this problem.10 The Working Group, in a short period of time, wrote a specification for a simplified subset of SGML named XML. They simply eliminated the features of SGML that were problematic for programmers. XML is simplified or normalized SGML.

XML encoding of text provides a means of representing textual semantics and structure, but it does not in itself provide support for the procedures that are likely to be applied to texts. Presenting or displaying text on a computer screen and printing it on paper are two obvious procedures. The W3C has developed EXtensible Stylesheet Language for standardizing both of these procedures, as well as other transformations of text. XML Linking Language (XLink) is standardizing hypertext and hypermedia behavior. In addition to supporting the kinds of links familiar currently on the Web, XLink will enable linking not only to other documents and digital media, but also into them, even when the (author of the) referencing document does not control the referenced document. XLink will also support annotation of texts and objects, again regardless of owner. XML Query (XQuery) will support the standardization of searching texts (and databases as well).11 Together with XML, these standards and other related, supporting standards represent a relatively complete, standard approach to textual information.

While the origins of SGML lie in the processing of texts, XML is also being used as the basis for encoding and communicating many kinds of data. A large number of the current XML initiatives involve data that is created and maintained in databases, but is communicated among databases and published on the Internet using XML. Many of these initiatives involve commercial databases and business transactions. Still others involve what has come to be called, for better or worse, metadata.12 Noteworthy also is Scalable Vector Graphics (SVG), an emerging standard developed by the W3C "for describing two-dimensional graphics in XML." A companion standard, also under development by the W3C, is Synchronized Multimedia Integration Language (SMIL), which supports, as the name suggests, integrated presentation of multiple media.13 SVG and SMIL appear likely to have an impact in the presentation of geographic information, though yet another effort is devoted to creating XML-based representation of the geographic information itself. This effort, the Geography Markup Language (GML), is organized and led by the Open GIS Consortium.14 The Web3D Consortium is developing an XML-based standard for three-dimensional graphics, Extensible 3-D (X3D). It is based on the ISO standard Virtual Reality Modeling Language (VRML).15 X3D will provide a standard encoding of data supporting an extremely wide variety of three-dimensional objects. Examples are rooms, buildings, automobiles, and, of course, books. Like SVG, X3D also promises to provide support for the representation of geographic topographical information, as well as Computer Aided Design (CAD), used extensively in architectural and engineering design. All of these efforts are beginning what is likely to be an extended period of standardizing machine-readable data for various media. Some, such as XML itself, are well along in development, with an increasingly wide range of products of increasing quality available. Others are in various stages, from just underway to nearing approval by the W3C and other authoritative bodies.

Both markup technologies, as represented in XML and related standards, and database technologies are particularly significant because they enable users to represent semantically and structurally rich intellectual understandings of text and other data in machine-readable form that can be exploited using a wide array of existing procedures as well as procedures yet to be devised. The wide array of current XML and database initiatives demonstrates the importance of this descriptive and representational power. In the humanities research community, semantically rich machine-readable expressions have already significantly enhanced intellectual access to cultural objects, through MARC and other cataloging standards, and are beginning to enhance analysis and interpretation of them as well.

Markup and database technologies are also significant because they ameliorate the dependency of data on hardware and software. Migrating information out of an obsolete standard into a new, better standard is an important procedure that will sooner or later become necessary. This is typically overlooked in the conception and design of projects and programs. Fortunately, standard, descriptive encodings inherently provide more support for this procedure than do proprietary, procedural encodings. When you know what the information and its parts are, migrating from one standard to another is a matter of semantic mapping rather than procedural mapping. XML even has its own standard for accomplishing this transformation, the Extensible Stylesheet Language (XSL).
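
Migration as semantic mapping can be sketched briefly. In practice an XSL stylesheet would express the transformation; the Python below is a stand-in for illustration, and both tag vocabularies in TAG_MAP, as well as the function name migrate, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from an older tag vocabulary to a newer one.
TAG_MAP = {"doc": "article", "head": "title", "p": "para"}


def migrate(xml_text, tag_map):
    """Rename each element according to the mapping; unknown tags pass through unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


old = "<doc><head>A Title</head><p>Some text.</p></doc>"
new = migrate(old, TAG_MAP)
print(new)
```

Because the old encoding states what each component is, the whole migration reduces to a table of equivalences. A procedural encoding would first require inferring what each formatting instruction was meant to signify.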

While markup and database technologies enable semantic encodings, they do not in themselves define the semantics. Specific cultural heritage disciplines and communities sharing intellectual and professional objectives must be responsible for the analysis, specification, expression, and application of the semantics. Libraries have accomplished the most in this area, though archives and museums are also actively engaged in standards development. Primarily through the Text Encoding Initiative, humanists have also begun to define standards, though a great deal of work remains to be done, much of it in collaboration between these communities. Developing shared semantics and structures represents the most important challenge facing the cultural heritage disciplines and communities in the near future.

Digitization Technologies

Pictorial, audio, and audio-video media are also increasingly standardized. For pictorial material there is one major de facto industry standard: Tagged Image File Format (TIFF).16 While we should prefer open, public standards to de facto industry standards, when a proprietary encoding achieves widespread industry support, its fate is no longer tied to one company, and one or two computer programs, and thus the dependency and risk involved in its use is ameliorated. TIFF as well as de facto standards such as PostScript and Portable Document Format (PDF) fall into this category. Because an image stored in TIFF preserves all of the information captured, it is generally considered acceptable as an archival encoding. TIFF file sizes, though, tend to be quite large, making them unsuitable for Internet transmission given current bandwidths. Thus smaller files using techniques for compressing the files are generally used for Internet publishing. The most popular of these compression formats is the ISO JPEG standard. The current version of JPEG is not appropriate for use in archiving and preservation because the techniques it employs lead to information loss (lossy compression). Currently ISO is developing a successor to JPEG called JPEG2000 that will support compression without loss of information (lossless compression), and thus will be suitable as an archival format.17
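
The difference between lossless and lossy compression can be shown with a small sketch in Python. It does not implement JPEG or JPEG2000: the lossless case uses the general-purpose zlib library, and the lossy case is a deliberately crude stand-in that discards the low four bits of every sample before compressing. The synthetic "image" is an invented pattern of byte values, not a real scan.

```python
import zlib

# A synthetic 32 x 32 grayscale "image": one byte per pixel, invented for illustration.
pixels = bytes((x * 7 + y * 13) % 256 for y in range(32) for x in range(32))

# Lossless: decompression returns exactly the bytes that were captured.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)

# Lossy (crude stand-in, not JPEG): discard the low four bits of each sample first.
quantized = bytes(p & 0xF0 for p in pixels)
lossy_packed = zlib.compress(quantized)
lossy_restored = zlib.decompress(lossy_packed)

print("lossless round trip identical:", restored == pixels)
print("lossy round trip identical:", lossy_restored == pixels)
```

Once the detail has been discarded, no later procedure can recover it, which is why a lossy format is unsuitable as the archival copy.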

ISO also has developed standard encodings for audio and audio-visual materials, though these are primarily compression standards and thus are designed for audio and audio-visual files used on the Internet. Audio (and thus by implication also audio-visual) capture that is sufficient for long-term archiving and preservation is problematic because sound is continuous, and digital capture, by definition, is discontinuous. Analog is converted to digital through sampling, and until sampling rates achieve an acceptable threshold there is significant loss of information. Nevertheless, research and development continue in this area with the expectation that capturing audio data in digital form using high sampling rates will make digital preservation of audio and audio-visual material feasible in the near future. Despite these limitations, used cautiously, many of these de facto and public standards are sufficient for many purposes.
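
The effect of sampling rate can be illustrated with a small sketch. It assumes a pure 440 Hz tone as the "analog" signal and compares a high sampling rate with one that is too low; the function names and the two rates are illustrative choices, not drawn from any audio standard.

```python
import math

def tone(t, freq_hz=440.0):
    """A continuous 440 Hz sine wave, standing in for an analog signal."""
    return math.sin(2 * math.pi * freq_hz * t)

def sample(signal, duration_s, rate_hz):
    """Return the values of `signal` measured `rate_hz` times per second."""
    n = int(duration_s * rate_hz)
    return [signal(i / rate_hz) for i in range(n)]

def zero_crossings(samples):
    """Count sign changes between neighbouring samples (about 2 per cycle)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

high = sample(tone, 1.0, 8000)  # well above twice the tone's frequency
low = sample(tone, 1.0, 500)    # below twice the tone's frequency

print("crossings at 8000 Hz:", zero_crossings(high))
print("crossings at 500 Hz:", zero_crossings(low))
```

At the higher rate the count of sign changes is close to the 880 expected of a 440 Hz tone over one second. At the lower rate the samples trace a much slower wave, so the captured data misrepresent the original pitch.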

Taken together these technologies and standards offer book professionals many possible opportunities to apply advanced technology in their research and teaching. Many of the technologies are well established and proven, and based on solid, open, public standards. Others are also well established and understood, but not yet based on standards. Nevertheless, serious standards efforts are underway for many of these as well. Certainly not all problems have been solved, and there are risks, especially for naïve users. Many of the conveniences that are introduced by the advanced technology also introduce uncertainties that threaten to upset the balance of control and interests upon which the current "order of the books" rests.18 With due thought and caution, though, much of the technology is sufficiently mature, stable, and broadly supported to merit its use.

 

IV. History of the Book: Collaboration and Community

One of the great opportunities presented by advanced technology is that of facilitating collaboration. The sciences and the social sciences, even before the advent of the Internet and related technologies, frequently employed collaborative projects to achieve shared research objectives. The scale and complexity of many projects motivated these collaborations. Complex, labor-intensive projects required group effort to be successful. Without collaboration, the research was simply impossible. Collaboration had a major secondary benefit as well. Designing and carrying out complex projects required intensive exchange of ideas and negotiation that led to many intellectual advances and breakthroughs. Many of these would have been realized with difficulty or more slowly, if at all, by scholars working alone.

Outside of a few dictionary and encyclopedia projects, humanists rarely have collaborated in this manner. The ongoing building of bibliographic catalogs stands out as a major exception. Most often humanists toil in solitude, communicating now and again with one or two trusted colleagues. Until now, they have lacked by and large both the means and the motivation to engage in collaborative research activities.

Technology provides both the opportunity and the motivation for historians of the book to engage in large-scale, complex projects. The history of the book is international in nature. Historians are distributed throughout the world, as are the tools and resources employed in their research. The book trade itself is international, with many of the significant figures and firms operating across borders. The international nature of the book trade accounts for some of the distribution of resources, though collectors have also contributed. The distribution of people and resources has been a major, time-consuming, and frequently prohibitive obstacle to both scholarly communication and research. Advanced technology provides the means to overcome this obstacle, though only if the historians of the book collaborate with one another, and with librarians, archivists, publishers, and others, in complex and intellectually intensive collaboration.

Existing and emerging technologies present several opportunities to historians of the book. They make it possible to provide universal, union intellectual access to resources in the form of specialized bibliographic catalogs and archival description systems. They also make it possible to provide selective access to digital representations of bibliographic and archival resources that can function as adequate surrogates for the originals for some, though by no means all, research purposes. In some cases, these digital representations may also facilitate analysis and research that would be difficult, or perhaps practically impossible, when using original materials. The technology also enables building analytic and pedagogical tools that can be shared. Finally, it offers the opportunity to create new forms of publication and pedagogy employing these resources.

As the foregoing has illustrated, achieving such things in a manner that will assure their usefulness over time requires the disciplined efforts of a community. An essential factor in establishing a collaborative community or consortium is having one or two lead institutions that are willing to provide hardware, software, and technical expertise to host, maintain, and publish resources, and to facilitate communication among participants. While some of these functions can be distributed, such as distributing responsibility for communication to one institution, and resource maintenance and delivery to another, distributing resources is problematic technically, especially if the resources constitute a wide variety, and at the same time have many interrelating links. For now the existing technology does not easily support such distribution. There is a major economic advantage to centralizing some of the more complex operations, as it relieves participating individuals and institutions of having to invest time and money in mastering complex supporting technology. Distribution of creation and maintenance activities, however, is absolutely essential, as the expertise needed to gather resources and the resources themselves are distributed.

The major challenges in building a history of the book consortium on the Internet are not intellectual and technical, as difficult as these are, but political. Politics, in the more attractive sense of the term, is community building. Communities first and foremost must articulate and share common interests and goals. Agreement can be difficult to achieve, as it requires negotiation and sacrifice. Individuals will only voluntarily participate in a community if it enables them to more effectively pursue individual interests, and to achieve goals that are difficult and perhaps impossible to achieve working alone.

 

V. History of the Book: Project Suggestions

The necessary first step in building a consortium is developing a shared vision of objectives. In what follows, I will suggest some possible objectives. Coming from an outsider with no detailed knowledge of the nature and methods of the discipline, they are all offered merely as suggestions, and not as recommendations. They are intended to promote discussion, criticism, and counterproposals. Similar projects may well be underway already, either in print or digitally.

The suggestions begin with establishing the communication necessary for collaboration. Immediately following these are proposals to improve intellectual access to resources. Following are proposals for providing structural and digital image representations of resources. Intellectual access is intentionally presented before digitizing, as it is especially important, and it provides the foundation necessary for building digital resource collections. A final section groups together reference materials, critical secondary resources, and geographic information. While most and perhaps all of these proposals are ambitious, perhaps immodestly so, building a consortium should begin with modest, realistic projects that are relatively easy to accomplish. Early, modest successes will establish the trust and expertise necessary to accomplish more ambitious objectives.

Listservs

Generally, to carry out collaborative projects such as I have been proposing requires the establishment of one or more listservs to facilitate communication. How many listservs, and devoted to what purpose, depends upon the number, size, and complexity of the projects undertaken, and thus the degree to which specialization is necessary. At a minimum, an emerging consortium needs at least one list to organize and discuss the activities of the consortium itself. Over time, the need may arise for specialized discussion groups devoted to activities such as administration and governance, technical infrastructure, and intellectual and technical standards. In addition to administrative communication, listservs can serve to facilitate scholarly communication. There are already several discussion lists devoted to the study of books and related technologies:

  • ExLibris: Rare books and special collections
  • Book_Arts-L: All book arts
  • TYPO-L: Type and typographic design
  • PAPER-L: Paper
  • LETPRESS: Letterpress
  • CALLIG: Calligraphy
  • SHARP-L: Society for the History of Authorship, Reading and Publishing (SHARP)

Rather than establishing competing lists, a consortium might instead choose to concentrate on lists that complement them.

Access to Historical Evidence

Since books themselves obviously constitute a major source of evidence for book historical research, improving intellectual access to especially significant books and book evidence would greatly expedite and improve research. Existing online bibliographic catalogs have already improved access, though they have a number of disadvantages. They are distributed, and thus require serial searching. A researcher must have a good idea of where a particular book is likely to be located before beginning a search. Catalog interfaces also vary, adding to the complexity of the challenge. Both OCLC and RLG, though, have come a long way in solving this problem. More difficult, though, than the distribution of catalogs is that most MARC cataloging lacks forms of access that would be useful to students of the book. Existing practices at the Bancroft Library at the University of California, Berkeley may serve as a useful example of improving access for book historians. Using more detailed MARC records than are typical, the Bancroft Library is providing specialized access to its collections.19 Through the online catalog, the following indices are available:

  • Chronological: inverted geographic access to place of publication, subarranged by date
  • Typographical: access to printer or publisher
  • Binders: access to bookbinders
  • Association: access to former owners (provenance)
  • Genre/Form: access to printing and publishing evidence; binding, genre, and paper terms; provenance evidence; and typographical evidence
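
The chronological index in the list above, with access by place of publication subarranged by date, can be read as a simple inverted index. The following is a minimal sketch of that idea only, not a description of the Bancroft Library's system; the record fields, titles, places, and years are hypothetical placeholders.

```python
from collections import defaultdict

def build_place_index(records):
    """Map each place of publication to its records, subarranged by year."""
    index = defaultdict(list)
    for rec in records:
        index[rec["place"]].append(rec)
    # Sort the places alphabetically, and each place's records by date.
    return {place: sorted(recs, key=lambda r: r["year"])
            for place, recs in sorted(index.items())}

# Hypothetical catalog records; every value here is a placeholder.
records = [
    {"title": "Title A", "place": "Venice", "year": 1501},
    {"title": "Title B", "place": "Leiden", "year": 1620},
    {"title": "Title C", "place": "Venice", "year": 1495},
]

for place, recs in build_place_index(records).items():
    print(place, [(r["year"], r["title"]) for r in recs])
```

The same pattern would apply to the other indices by keying on printer, binder, or former owner instead of place.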

Using the Bancroft Library's and similar efforts as a starting point, a consortium could develop content and encoding standards for specialized headings. The consortium could work actively to encourage use of these standards, and lobby the large bibliographic utilities and vendors of MARC catalogs to provide specialized searching based on them. A more ambitious project might bypass the utilities and vendors by creating its own union, international catalog incorporating the specialized headings. Such a catalog might be based on MARC, or alternatively on XML, using, for example, the UNIMARC XML DTD developed by the BiblioML project in France. This would make it possible to use XML indexing and publishing software instead of a MARC system.20 Mapping the various dialects of MARC and non-MARC records into a UNIMARC DTD would be quite complex, though feasible, as demonstrated by the Manuscripts and Letters via Integrated Networks in Europe (MALVINE) project, discussed below. Any or all of these initiatives would improve access to book evidence.
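
To suggest what mapping one MARC dialect into a UNIMARC-based XML record involves, here is a minimal sketch. The crosswalk covers only four subfields and is illustrative; the XML element names (record, field, subfield) are assumptions of mine and are not the BiblioML DTD, and a real mapping would need to handle indicators, repeated fields, and character encodings.

```python
import xml.etree.ElementTree as ET

# Partial MARC 21 -> UNIMARC crosswalk for four subfields; illustrative only.
CROSSWALK = {
    ("245", "a"): ("200", "a"),  # title proper
    ("260", "a"): ("210", "a"),  # place of publication
    ("260", "b"): ("210", "c"),  # name of publisher
    ("260", "c"): ("210", "d"),  # date of publication
}

def marc21_to_unimarc_xml(fields):
    """Convert [(tag, code, value), ...] into a hypothetical XML record."""
    record = ET.Element("record")
    by_tag = {}
    for tag, code, value in fields:
        target = CROSSWALK.get((tag, code))
        if target is None:
            continue  # unmapped subfields are skipped in this sketch
        new_tag, new_code = target
        field = by_tag.get(new_tag)
        if field is None:
            field = ET.SubElement(record, "field", tag=new_tag)
            by_tag[new_tag] = field
        ET.SubElement(field, "subfield", code=new_code).text = value
    return record

# Placeholder values, not a real catalog record.
source = [("245", "a", "Example title"), ("260", "a", "Example place"),
          ("260", "c", "1700"), ("999", "x", "local data")]
print(ET.tostring(marc21_to_unimarc_xml(source), encoding="unicode"))
```

Keeping the crosswalk in a table rather than in program logic means that librarians, rather than programmers, could review and extend the mapping.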

Encoded Archival Description (EAD)21 is an emerging international standard for encoding detailed archival descriptions of fonds or collections.22 EAD provides a standard representation of descriptions of the records of corporate bodies, and the papers of individuals and families. While EAD has many rationales, perhaps the most compelling is that standard archival description supports the long-cherished dream of providing both professional and public researchers universal, union access to primary resources. Currently there are dozens of institutions throughout the world using EAD, with the number growing rapidly. Many of the EAD implementations are consortia with several repositories participating. These are generally organized geographically, such as the Online Archive of California, or by discipline or subject, such as the Physics History Finding Aids project.23 The MALVINE project is organized geographically (European Union) and by genre (letters and manuscripts).24 The Research Libraries Group (RLG), an international consortium of research archives, libraries, and museums, is currently providing union access to finding aids from throughout the world through its Archival Resources service.25

EAD makes it possible to greatly improve access to the records of individuals, families, and firms that have made significant contributions to the history of the book. Like the history of physics and many other disciplines, the history of the book transcends national borders, with many of the significant individuals, families, and firms active in more than one country. The records documenting their activities are distributed across many countries and, within countries, across more than one repository. A worthy project might focus on identifying the significant fonds and collections distributed in European repositories, and centralize descriptive access to them using EAD. This would necessarily involve a wide variety of activities. Identifying what collections are processed and described, and evaluating the quality of existing descriptions would be a first step. Developing a strategy for converting existing descriptions into EAD would follow. Creating the archival description system would require converting print finding aids into machine-readable form, and mapping and writing conversion scripts for finding aids in word processing and database formats. EAD implementers already have extensive experience in working with vendors that convert paper finding aids, and vast in-house experience converting word-processed and database finding aids. Organizing and seeking funds for processing unprocessed collections would follow conversion of existing finding aids. The existing EAD consortia all have a lead repository or institution hosting and publishing the contributed finding aids. Providing international, union access to significant archival evidence would greatly facilitate research, and complement the access to book evidence discussed above.
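
A conversion script for a database-held finding aid might, in its simplest form, look like the sketch below. It uses a handful of EAD element names (ead, eadheader, eadid, archdesc, did, unittitle, unitdate, repository) but makes no attempt to produce a valid EAD instance, which requires considerably more; the input row, its keys, and its values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

def finding_aid_to_ead(row):
    """Build a skeletal EAD-shaped description from one database row."""
    ead = ET.Element("ead")
    header = ET.SubElement(ead, "eadheader")
    ET.SubElement(header, "eadid").text = row["id"]
    # The collection-level description, here marked as a fonds.
    archdesc = ET.SubElement(ead, "archdesc", level="fonds")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = row["title"]
    ET.SubElement(did, "unitdate").text = row["dates"]
    ET.SubElement(did, "repository").text = row["repository"]
    return ead

# A placeholder row, as it might be exported from a repository database.
row = {"id": "example-0001", "title": "Records of an example printing firm",
       "dates": "1750-1820", "repository": "Example Repository"}
print(ET.tostring(finding_aid_to_ead(row), encoding="unicode"))
```

In practice most of the effort lies in the mapping decisions that precede such a script, not in the script itself.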

The Manuscripts and Letters via Integrated Networks in Europe (MALVINE) project provides an excellent European example of how such an initiative might be organized and implemented. MALVINE is funded by the European Union. There are fifteen participating archives, libraries, museums, and documentation centers, located in nine European countries. MALVINE is coordinated by the Staatsbibliothek zu Berlin, though the Humanities Information Technologies Research Programme at the University of Bergen coordinates and provides technical support for conversion and publishing. Other responsibilities are distributed among the other participants. Key components of the success of MALVINE have been the collegiality of the participants and the highly capable technical expertise brought to the effort. MALVINE represents an excellent model for European collaboration.

To complement access to bibliographic and archival evidence, a consortium might also explore options for describing and providing access to tools and apparatus used in the production of books. The International Council of Museums' Conceptual Reference Model, and other museum initiatives and standards efforts should be explored in this regard.26 Some museums are also experimenting with EAD to provide access to museum artifacts.27

Representation and Analysis of Evidence

In addition to enabling enhanced access to book and archival evidence, existing and emerging technologies also enable creating machine-readable representations of the evidence itself. Such representations will be suitable for some research purposes, and may improve existing analytic methods and inspire new methods.

Machine-readable representation of evidence can be done in three ways. First, XML can be used to create descriptive, structural representations of objects, for example, the physical features of books. Second, two-dimensional imaging can be used to capture pages, manuscript and print archival resources, and similar flat resources. Third, three-dimensional imaging can be used for books, tools and apparatus used in the production of books, and similar resources.

Using the TEI DTD, Terry Catapano and Syd Bauman provide an example of a machine-readable structural representation of the physical features of a book.28 The TEI, by design, is optimized for encoding the intellectual structure of a book: chapters, paragraphs, poems, lines of poems, and so on. It is sufficiently flexible, however, that Bauman and Catapano were able to devise a prototype description of the physical structure of a book. They state that there are certain advantages to such a representation:

The re-arrangement of the pages of text as imposed for printing may make apparent places where the text was affected by typographical exigencies. It is also useful in electronic bibliographic analysis

  • for identifying which compositor set which forme, in order to distinguish their individual spelling and punctuation habits;
  • to track identifiable pieces of type to determine the order of printing; and
  • to discover the course of proofreading and correction.29

Their example is intended to demonstrate, in a very preliminary way, the feasibility of using XML to represent the physical structure of a book. It is not a fully developed system. Using this demonstration as a starting place, a project might attempt to design and develop a comprehensive XML DTD for representing the physical features of books and attempt a variety of computer-assisted analyses to determine its utility. If successful, such a DTD would facilitate building a shared collection of representations of individual, exemplary books that could be made available to the scholarly community for analysis, discussion, and teaching. Such a DTD optionally might be developed in collaboration with the TEI Consortium.
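
As a rough indication of what a structural representation might make computable, the sketch below rearranges the pages of one sheet into the two formes from which it was printed and writes the result as XML. It assumes the common case of a single sheet folded as a folio, quarto, or octavo gathering; other impositions differ. The element and attribute names are hypothetical and are neither TEI nor the encoding devised by Bauman and Catapano.

```python
import xml.etree.ElementTree as ET

def formes(pages_per_sheet):
    """Split the pages of one folded sheet into outer and inner formes."""
    pages = range(1, pages_per_sheet + 1)
    # In the assumed schemes, pages 1, 4, 5, 8, ... share the outer forme.
    outer = [p for p in pages if p % 4 in (0, 1)]
    inner = [p for p in pages if p % 4 in (2, 3)]
    return outer, inner

def sheet_to_xml(signature, pages_per_sheet):
    """Encode one sheet, its two formes, and their pages as XML."""
    sheet = ET.Element("sheet", signature=signature)
    for side, pages in zip(("outer", "inner"), formes(pages_per_sheet)):
        forme = ET.SubElement(sheet, "forme", side=side)
        for p in pages:
            ET.SubElement(forme, "page", n=str(p))
    return sheet

# A quarto sheet has four leaves, hence eight pages.
print(formes(8))
print(ET.tostring(sheet_to_xml("A", 8), encoding="unicode"))
```

Grouping pages by forme in this way is the precondition for the analyses quoted above, since compositors and type are identified forme by forme rather than in reading order.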

While TEI provides a comprehensive scheme for text description and representation that facilitates literary, editorial, historical, and linguistic analysis, historians of the book are likely to find much in it that they do not want or need, and not find many elements that they do. Medievalists, for example, have found that TEI lacks sufficient detail for both description and representation of medieval manuscripts.30 Projects such as the Manuscript Access through Standards for Electronic Records (MASTER) and Electronic Access to Medieval Manuscripts (EAMMS) are working with the TEI Consortium to extend TEI to optimize its use in the study of medieval manuscripts. Initiatives similar to these might be appropriate for the history of the book community.

Page imaging of books also offers significant opportunities for collaboration. There is an ongoing and well-justified controversy concerning the use of page imaging in preservation. Preservation is an extremely complex issue, as all migration of information from one medium to another involves loss of information, namely the medium from which the information is transferred or migrated. When the medium itself is the primary evidence, as it is in the case of books significant in book history, media transfer is an unacceptable preservation method. The only acceptable preservation method for significant books is preservation of the book itself. Page imaging, though, can be useful in facilitating access to and analysis of evidence for some purposes, for example, the study of typography, and for bringing together distributed materials for comparison. The William Blake Archive provides an excellent example of the use of page imaging.31 A worthwhile area of collaboration would be in experimenting with and establishing best practices in image capture and quality to increase the utility of imaging in support of research and analysis. Once image quality standards are established, collaboratively building and sharing collections of page images would seem a worthwhile undertaking.

Widely demonstrated and accepted standards and techniques for high quality three-dimensional imaging have yet to emerge, though research, prototyping, and development are well underway. Large-scale use of this technology would be premature at this time, but exploring and experimenting with the technology to identify useful applications would position the community to take full advantage of the technology as it becomes standardized and widely used. Uncle Tom's Cabin & American Culture at the Institute for Advanced Technology in the Humanities provides simple but suggestive examples of three-dimensional book imaging using QuickTime®.32 An experimental project, the Brazil Rendering System, provides some striking examples of three-dimensional renderings exported to standard two-dimensional imaging formats.33 In addition to books, three-dimensional technology would also be useful in representing tools and apparatus used in book production.

Other Projects

While improving access to primary resources should be the first priority of the history of the book community, there are also a number of other potentially useful project opportunities. Reference materials and authoritative secondary literature can aid in the use of the primary resources, as can geographic information.

Securing digital rights to significant secondary works and reference materials, and publishing them on the Internet is certainly worth considering. Materials that fall into this category are landmark histories of the book and dictionaries and glossaries, in essence, the kinds of materials that any historian of the book has within arm's reach in his or her office, and which are consulted regularly. Reference materials of this kind are especially useful online, as the technology facilitates quick access, and the information sought is typically brief and thus easily readable on a screen.

Making important geographic information available can be extremely useful for reference, research, and teaching. Geographic Information Systems (GIS) provide sophisticated access to highly accurate spatial information. Geographic Information Systems also can be linked with social science data systems and other relevant datasets. For example, GIS linked to datasets might provide dynamic graphical presentations of the location and movement of people, firms, trade, materials, technology, and ideas over time. Even static page images of maps can be extremely useful. Two projects at the University of Virginia demonstrate the use of map information in conjunction with primary resources. The Valley of the Shadow Project uses animated maps to trace the movements of soldiers from Pennsylvania and Virginia during the American Civil War.34 While still under development, the Salem Witch Trials project is linking geographic and temporal information with a database documenting families, individuals, institutions, significant events, and documentary sources in a variety of formats. A prototype map displays in space (location of homes and households) and over time (from 29 February to 31 March, 1692) accusers, accused, and accusations. When the project is completed, users will be able to visually follow the spatial-temporal unfolding of the events that gripped Salem in the seventeenth century, and access detailed descriptive information on individuals, institutions, events, and related documents by clicking on icons on the map.35
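
The linking of temporal and spatial information described above can be sketched in a few lines. This is a hypothetical illustration, not the data model of either Virginia project: the records, labels, and grid coordinates are placeholders, and only the date range is taken from the description of the prototype map.

```python
from datetime import date

# Hypothetical records: labels and grid coordinates are placeholders, not project data.
EVENTS = [
    {"label": "event A", "date": date(1692, 2, 29), "xy": (1, 4)},
    {"label": "event B", "date": date(1692, 3, 12), "xy": (3, 2)},
    {"label": "event C", "date": date(1692, 3, 31), "xy": (5, 5)},
]

def visible_on(events, day):
    """Return the events dated on or before `day`, oldest first."""
    return sorted((e for e in events if e["date"] <= day),
                  key=lambda e: e["date"])

# Each "frame" of a map animation shows everything that has happened so far.
for frame in (date(1692, 2, 29), date(1692, 3, 15), date(1692, 3, 31)):
    shown = visible_on(EVENTS, frame)
    print(frame.isoformat(), [(e["label"], e["xy"]) for e in shown])
```

A working system would draw each frame on a map and link each icon to the descriptive records behind it; the filtering step shown here is the part that ties place to time.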

The proposals offered here are intended to initiate discussion and to invite criticism. Some of them, upon close scrutiny, may turn out to be impractical, simply not useful, or in fact already underway or completed. The list is also by no means comprehensive. Certainly worthy of discussion are projects devoted to developing and sharing pedagogical materials, to establishing one or more online, peer-reviewed journals, and to providing access to relevant social science datasets. No doubt there are many other promising candidates as well. Individual scholars, librarians, collectors, and students in the field are quite likely already undertaking important projects that would benefit from collaboration. Many scholars in the field, undoubtedly, could reel off a fairly long list of such projects, probably associated with the names of imaginative colleagues. Ultimately, the community itself needs to identify its most important needs, and determine whether and how collaborative use of technology can address them.


VI. Conclusion

The technological landscape has changed considerably in the last twenty years. At the close of the 1970s, computing technology was generally not available for most humanists. The equipment was expensive, and it required engineering and programming expertise that few humanists had the time or the inclination to master. The Internet was available to only a few, and was primitive when compared to today. Most computing was devoted to "number crunching," for which most humanists had little or no use. Database applications were still quite crude by current standards. Markup and related text technologies and imaging, audio, and audio-visual technologies were only on the horizon. Standards were virtually nonexistent. All of this has changed significantly.

Particularly important among the technologies and standards that have emerged are those associated with databases and text markup. These two technologies, for the first time, make it possible for humanists to rigorously articulate structures that reflect their intellectual interests, to instantiate them in machine-readable form, and use the instantiations in computations that exploit the structures. For many years now, librarians have demonstrated the power of database technology for representing and exploiting descriptive cataloging. Archivists, librarians, and humanities scholars have also successfully applied markup technologies for accurately describing and representing both the physical characteristics and intellectual content of cultural objects and collections of objects. While there are a number of existing and emerging standards already in place that would benefit the history of the book community, many others remain to be identified and developed. Database and markup technologies, in concert with the other technologies described here, present the field with the opportunity to determine its own future and the appropriate role for technology in it.

While technology presents great opportunity, it also presents very real dangers. Book historians cannot simply rely on technologists for guidance, especially when the technologists have conflicting interests. The community needs to develop its own technology experts. Because the technology presents such a wide-open space for the imagination, there is great risk that time and money will be invested in activities that ultimately are all form, and no content. This danger, however, should not stifle experimentation. The technology represents, in many respects, terra incognita, and to determine what does and does not work, and what is and is not useful, will require exploration. Experiments, especially those involving external funds, need to be carefully designed, with clear hypotheses and methods for evaluating results. The most serious danger, though, is that fear of the dangers leads to doing nothing as a community. Inaction will have two probable outcomes. Individuals will engage in isolated projects that will make building a community at some later date much more difficult, as the individuals will be very reluctant to sacrifice their work. The other probable outcome is that technologists and other outsiders will determine the technological future of the history of the book community. Carpe diem!

Endnotes

  1. Unique is used here, of course, to mean unique in the context of the individual library, not unique in the bibliographic universe.
  2. The lists of variant names for al-Ghazzali and Shakespeare are both taken from the Library of Congress Name Authority File. There is a certain additional complexity that is not apparent in the two examples. Some of the entries represent cataloger transliteration of non-Roman alphabet text (which may themselves be transliterations from Roman-alphabet texts), and others are transliterations in the published texts.
  3. We find ourselves in much the same position as the protagonist in Goethe's Faust. After Faust conjures up Mephistopheles, he asks him who he is. Mephistopheles answers that he is "part of a power that alone works evil, but engenders good."
  4. With a bit of creativity, the technology can be used to work with data for which it is not optimized. Some data is not easily classified, which is to say, it has characteristics of more than one type. For such data, it is sometimes best to base the selection of what technology to employ on the most important functional objectives, perhaps sacrificing less important objectives in the process.
  5. The description of types of data that follows is derived in part from Steven J. DeRose, "Navigation, Access, and Control Using Structured Information," American Archivist (Chicago: Society of American Archivists), vol. 60, no. 3 (summer 1997).
  6. For more information on ANSI/NISO/ISO 12083 see http://www.xmlxperts.com/12083.htm.
  7. For more information on TEI, see http://www.tei-c.org/.
  8. For more information on EAD, see http://www.loc.gov/ead/ead.html and http://jefferson.village.virginia.edu/ead/.
  9. For a list of current DTD initiatives, see http://www.oasis-open.org/cover/xml.html - applications.
  10. "The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding." For more information, see http://www.w3c.org/.
  11. For more information on XSL, XLink, and XQuery, see http://www.w3.org/Style/XSL/, http://www.w3.org/XML/Linking, and http://www.w3.org/XML/Query.
  12. Metadata is first and foremost what librarians call descriptive cataloging, used for providing intellectual access to and description and control of bibliographic entities. The definition of metadata has been extended to cover a wide variety of other kinds of control, such as rights management, age appropriate filtering, and communicating and controlling the ordering and interrelation of complex-compound digital objects. This latter category is frequently called "structural metadata." An example of structural metadata is the information needed "to bind together" into a "book" large numbers of digital page images.
  13. For more information on SVG and SMIL, see http://www.w3.org/Graphics/SVG/Overview.htm8 and http://www.w3.org/AudioVideo/.
  14. For more information on GML, see http://www.opengis.net/gml/01-029/GML2.html. For more information on the Open GIS Consortium, see http://www.opengis.org/.
  15. For more information on the Web3D Consortium, VRML and X3D, see http://www.web3d.org/.
  16. For an excellent introduction to archive and library imaging, see Anne R. Kenney and Oya Y. Rieger, Moving Theory into Practice (Menlo Park: RLG, 2000).
  17. For more information on JPEG2000, see http://etro.vub.ac.be/~chchrist/recpad00_paper.pdf.
  18. The expression "order of books" is taken from Roger Chartier's The Order of Books (Stanford: Stanford University Press, 1994).
  19. For a detailed description of the Bancroft Library's specialized indices, see http://www.lib.berkeley.edu/BANC/specialfiles.html.
  20. For more information on BiblioML, see http://www.culture.fr/biblioml/.
  21. EAD-related information and files are available at http://lcweb.loc.gov/ead/ and http://jefferson.village.virginia.edu/ead. The following publications are also available at http://www.archivists.org/catalog/index.html: Encoded Archival Description: Context, Theory, and Case Studies (Chicago: Society of American Archivists, 1998); Encoded Archival Description Tag Library: Version 1.0 (Chicago: Society of American Archivists, 1998); and Encoded Archival Description Application Guidelines: Version 1.0 (Chicago: Society of American Archivists, 1999). The Tag Library is also available at http://lcweb.loc.gov/ead/tglib/tlhome.html.
  22. The International Council on Archives' General International Standard Archival Description (ISAD(G)) defines a fonds as "a complex body of materials, frequently in more than one form or medium, sharing a common provenance." For more information on ISAD(G), see http://www.ica.org/ISAD(G)E-pub.pdf.
  23. More information on the Online Archive of California can be found at http://www.oac.cdlib.org/. The Physics History Finding Aids is organized and sponsored by the American Institute of Physics, with additional funding coming from the National Endowment for the Humanities. Information about the consortium and currently available finding aids can be found at http://www.aip.org/history/ead/index.html.
  24. Malvine is a European initiative to provide access to "disparate holdings of modern manuscripts and letters, kept and catalogued in European libraries, archives, documentation centres and museums." For more information on Malvine, see http://www.malvine.org/.
  25. More information on Archival Resources can be found at http://www.rlg.org/arr/index.html. Archival Resources currently provides access to approximately 20,000 finding aids and is growing at a rate of 1,000 each month. The number of finding aids and contributing repositories is expected to grow steadily and considerably for the foreseeable future, as more repositories adopt EAD, with many of them also choosing to contribute their finding aids to this growing international database.
  26. For more information on the International Council of Museums Conceptual Reference Model, see http://cidoc.ics.forth.gr/index.html?crm_index.html.
  27. For more information on the Museums and the Online Archive of California, see http://www.bampfa.berkeley.edu/moac/.
  28. Syd Bauman and Terry Catapano, TEI and the Encoding of the Physical Structure of Books.
  29. Ibid.
  30. For more information on the TEI Consortium, MASTER, and EAMMS, see http://www.tei-c.org/, http://www.cta.dmu.ac.uk/projects/master/index.html, and http://www.hmml.org/eamms/index.html.
  31. The William Blake Archive can be found at http://www.blakearchive.org/. While all of the plates are available for comparison when more than one copy exists, the following link will provide access to one example. Near the bottom of the screen is a "compare" button: http://www.blakearchive.org/cgi-bin/nph-dweb/blake/Illuminated-Book/MHH/mhh.f/@Generic__BookTextView/685;cv=java;pt=450.
  32. Uncle Tom's Cabin & American Culture can be accessed at http://www.iath.virginia.edu/utc/index2f.html, and the QuickTime movie of an edition of Uncle Tom's Cabin can be accessed at http://www.iath.virginia.edu/utc/uncletom/editions/edhp.html. Access the three-dimensional images of "the books on the shelf" by clicking on the spines; then click, hold down the left mouse button, and drag to rotate and open the book.
  33. The Brazil Rendering System site can be accessed at http://www.splutterfish.com/sf/index.php3. In the "Gallery" there are a number of interesting examples, such as the following: http://www.blur.com/blurbeta/brazilgallery/img_Lunarwolf_pump_test_7.jpg.
  34. Valley of the Shadow can be accessed at http://jefferson.village.virginia.edu/vshadow2, and an animated theater map can be accessed at http://jefferson.village.virginia.edu/vshadow2/MAPDEMO/theaterintro.html.
  35. The Salem Witch Trial prototype map can be accessed at http://jefferson.village.virginia.edu/~bcr/salem/salem.html.

Adriaan van der Weel, email
Last revised: 08-05-01