Daniel V. Pitti
Project Director
Institute for Advanced Technology in the Humanities
University of Virginia
I. Introduction
One of the most often stated and least justified claims of apologists for the "digital revolution" has been that the Internet has succeeded in replacing an obsolete print culture. Such claims have met with justified skepticism among students of the history of the book, but, unfortunately, justifiable skepticism has all too frequently given way to a less discriminating hostility. Despite exaggerated claims and special pleading from those who hype the Internet, scholars and students of the history of the book should recognize that advanced network and computing technologies have the potential to significantly advance their studies by facilitating collaboration, improving access to essential resources, and providing new methods of publication. While the technology should be approached cautiously, much of what is currently available is sufficiently mature, stable, and broadly supported to merit its use by archivists, librarians, and scholars.
For disciplines, such as the history of the book, that are inherently international, network and computing technology can not only facilitate communication among scholars distributed around the world, but also provide universal, union access to distributed resources essential to the study of the book. In some cases, selective access to digital representations of primary source materials can, for some but by no means all research purposes, function as adequate surrogates for the originals themselves. For other research objectives, they can facilitate analysis and research that would be difficult, or perhaps practically impossible, when using the original materials. The purpose of this paper is to lay before you the progress along these lines made by a number of intellectual disciplines closely related to the history of the book. Although I do not want to positively assert that any of what follows is a road map for what ought to be done by the "community" of book historians, I hope that you will see that these approaches to advanced technology offer rigorous and potentially viable means of pursuing fruitful computer-assisted research in the history of the book.
II. Technology: Evaluation
Many librarians, archivists, curators, book historians, and other book professionals have been reluctant to embrace advanced technology. In part this is a response to the exaggerated claims and naïve predictions of technology enthusiasts, and in part it is an understandable and prudent response to emerging and unproven technologies.
For well over thirty years, "visionaries" have been
predicting the death of cataloging, books, libraries, and publishers, and in some
Post-Modern inspirations, authors and readers as well. It is easy to dismiss the
enthusiasts, if not ignore them. Many of them, based on their affiliations, are
obviously inspired more by social and commercial self-interest than they are by
the technology. Many of their predictions have simply failed to materialize, or
attempts have resulted in humiliating disasters. They frequently are forced to
humbly retreat when reality turns out to be more complex than they initially thought,
or ask, yet again, for a lot more money and a little more time. The complex nature
and role of the book and the institutions supporting and depending on it have
proven not to be easily reducible.
In spite of the enthusiasts among them, many thoughtful
technologists have learned from their technical (and political) failures, and
have applied the lessons to developing technologies that enable users to define
and solve their own problems in ways that are appropriate and responsible. In
this regard, they have made particularly important advances in developing publicly
owned standards that ameliorate the dependency of data on proprietary hardware
and software (and technologists). These standards enable users to take control
over their data, representing and exploiting it to serve their own interests
and objectives. Rather than technologists determining the future, inventing
and imposing on others solutions to problems imagined by them, users now increasingly
have it within their power to master the technology and employ it to serve their
own objectives. It is now the responsibility of users to recognize and take
advantage of the emerging and existing opportunities.
The experience of librarians and technologists in applying technology to cataloging
provides a useful example from which we can learn both good and bad methodologies.
Early collaborations between technologists and librarians were fraught with
misunderstanding and miscommunication. The technologists initially woefully
underestimated the complexity of books, publishing, and cataloging. Many naïvely
viewed libraries as large warehouses, and catalogs as inventories of the items
stored in them. Completely overlooked or ignored was that most warehouses contain
a large number of a small variety of items, while libraries contain a large
number of mostly unique items.1 Further, the unique
items are frequently "ill behaved," defying easy categorization and description,
and exist in extremely complex interrelations with one another. All of this
and more make even the most basic catalog far more complex than a simple inventory.
Catalogers were also naïve about the technology. They frequently had little
or no experience with it, and tended either to accept the view of the technologists
without question, or to reject the technology out of hand. There was little
or no mutual understanding or shared terminology.
Despite all of this, the technologists and catalogers
managed to invent something extremely useful, machine-readable cataloging, or
MARC, as we have come to know it. But they also made some serious mistakes,
some of which led to damage that was expensive and, in some cases, impossible
to repair. A noteworthy and instructive example is in the area of authority
control and keyword access.
When keyword searching became possible, many technologists predicted that it rendered authority control obsolete. Based on this prediction, many library administrators instructed their catalogers to stop doing authority control. While the librarians and the users of catalogs quickly determined that keyword searching was a powerful and useful tool, enabling retrieval impossible with printed cards, they also determined that it was no substitute for experienced professionals making difficult judgments and distinctions, and recording them in machine-readable and therefore exploitable form. For example, computers are still incapable of recognizing that all of the following names refer to the Muslim philosopher al-Ghazzali (1058-1111): Ghazzali, Gazali, Abu Hamid Muhammad ibn Muhammad ibn Ahmad al-Ghazzali, Ghasali, Algazali, Algazel, Ghazali, Al-Ghazali, Houjjatoul-Islam, Mohamed Mohamed Toussi, Mohamed Mohamed al-Ghazali, and Ebu Hâmid Muhammed el-Gazâlî. Or that the following all refer to William Shakespeare (1564-1616): William Shakespeare, William Shakspeare, Uiliam Sek`spiri, Gouilliam Saixper, William Shakspere, Wilyam Shikisbir, Wiliam Szekspir, Sekspyras, Vil'iam Shekspir, Viljem Sekspir, Tsikinya-chaka, Sha-shih-pi-ya, Shashibiya, Vilyam Shekspir, Vilyam Shakspir, Syeiksup`io, William Szekspir, Guglielmo Shakespeare, William Shake-speare, Sha-o, Sekspir, Uiliam Shekspir, Vilijam Sekspir, V. Shekspir, and U. Shekspir.2
On the other hand, the computer can do some wonderful and some not too wonderful things with these distinctions, once they are recorded in computer-readable form. Using keyword access, computers can easily direct users from unused to used headings. Discovery that might take a great deal of time and persistence in a card catalog, and might not be possible at all, is made efficient. But these same distinctions, if not carefully used, can also lead to embarrassing and perhaps disastrous consequences.
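The redirection from unused to used headings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authority file is a small hand-built dictionary, the form of the authorized headings is illustrative, and the names AUTHORITY_FILE, normalize, build_index, and resolve are hypothetical rather than part of MARC or any library system. The point is that the computer does not discover the equivalences; it only exploits the ones a cataloger has recorded.

```python
import unicodedata

# Hypothetical, hand-built authority records: each authorized heading
# lists a few of the variant forms quoted in the text above.
AUTHORITY_FILE = {
    "Shakespeare, William, 1564-1616": [
        "William Shakspere", "Wiliam Szekspir", "Guglielmo Shakespeare",
        "Vilyam Shekspir", "Shashibiya",
    ],
    "Ghazzali, 1058-1111": [
        "Algazel", "Al-Ghazali", "Gazali", "Ebu Hâmid Muhammed el-Gazâlî",
    ],
}


def normalize(name):
    """Fold case, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_marks.casefold().split())


def build_index(authority_file):
    """Map every recorded form (used or unused) to its authorized heading."""
    index = {}
    for heading, variants in authority_file.items():
        index[normalize(heading)] = heading
        for variant in variants:
            index[normalize(variant)] = heading
    return index


INDEX = build_index(AUTHORITY_FILE)


def resolve(name):
    """Return the authorized heading for a recorded form, or None if unknown."""
    return INDEX.get(normalize(name))
```

A search for "Algazel" or "WILLIAM SHAKSPERE" is directed to the authorized heading, while a form that no cataloger has recorded, such as "Madonna" in this tiny file, simply returns nothing.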
One particular
mishap in computer assisted authority control has become notorious. In the 1980s,
OCLC, the large, international bibliographic utility located in Dublin, Ohio,
decided that it needed to "clean up its catalog" after many of its clients complained
repeatedly that it had a serious authority control problem. The programmers
at OCLC decided to write a program that would match headings used in catalog
records against unused variants found in Library of Congress authority records,
and where they found matches, substitute the heading in the authority record
for the heading in the bibliographic record. On the surface, this seemed like
a perfectly reasonable thing to do. Unfortunately, it had many unforeseen consequences.
One has become well known: the program changed all of the headings for Madonna,
the popular singer, into "Mary, Blessed Virgin, Saint." Many librarians specializing
in authority control took this as clear evidence that computers could never
do authority control, as any reasonably informed human being would not make
such a stupid mistake.
This evaluation, however, was not entirely sound. Many of the
librarians saw only the "collateral damage," and failed to recognize that a
good portion of the program worked quite well, and accomplished the desired
goal. The programmers involved, having experienced success tempered by embarrassing
humiliation, analyzed their failures, and began to approach the problem of identifying
when the same name did and did not apply to the same entity more carefully.
They improved the algorithms to recognize when a match was "safe," for example
a personal name qualified by one or more life dates, and when it was not, for
example, a personal name with only two components and without qualification.
A careful, fair analysis of the results demonstrated that they could do a lot
programmatically, but that there would always be a remainder that only a trained,
intelligent, professional person could sort out. They switched from trying to
make the computer do everything, to trying to do as much as could be done accurately
and reliably while leaving the remainder to the professionals with suggestions
and information to help them in their problem solving. Technology, carefully
and deliberatively applied, could perform many of the most routine and tedious
chores, while isolating the most challenging tasks for librarians. The result
is that catalogers now have more time to spend on the problems and challenges
that resist reduction to computer algorithms.
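The division of labor described above, in which the program acts only on "safe" matches and hands the remainder to a professional, might be sketched as follows. The rule used here (a heading counts as safe only if it ends with life dates) is an assumption made for illustration; it is not the algorithm OCLC actually implemented, and the names is_safe_match and triage are hypothetical.

```python
import re

# Assumption for illustration: a heading is "qualified" if it ends with
# life dates such as ", 1564-1616" or an open range such as ", 1958-".
LIFE_DATES = re.compile(r",\s*\d{3,4}-(\d{3,4})?$")


def is_safe_match(heading):
    """Return True when the matched heading carries qualifying life dates."""
    return bool(LIFE_DATES.search(heading.strip()))


def triage(candidates):
    """Split (matched_heading, proposed_heading) pairs into two queues.

    Pairs with a qualified heading can be replaced automatically; all
    others are set aside, with the proposal attached, for a cataloger.
    """
    automatic, needs_review = [], []
    for matched, proposed in candidates:
        if is_safe_match(matched):
            automatic.append((matched, proposed))
        else:
            needs_review.append((matched, proposed))
    return automatic, needs_review


automatic, needs_review = triage([
    ("Shakspere, William, 1564-1616", "Shakespeare, William, 1564-1616"),
    ("Madonna", "Mary, Blessed Virgin, Saint"),
])
```

Under this rule the unqualified heading "Madonna" is never changed by the program; the proposed substitution is passed to a person as a suggestion.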
The development and application of MARC has taught technologists and librarians
a great deal. Much though not all of the early overconfidence in technology,
on one hand, and overly skeptical assessment of it, on the other hand, have
been displaced. In their place are clear, realistic collaborative assessments
of what technology can and cannot do, applications that exploit computation
while respecting, facilitating, and exploiting the irreducible contributions
of professional catalogers, and more careful, reversible experiments that test
and extend the current technological limits.
MARC has been unquestionably successful. The emergence of online, networked catalogs has realized for the first time in history the long held dream of the "universal catalog." Since the late 1980s, there have been major advancements in computer and network technologies and our power to represent and exploit information to serve a wide variety of goals and interests. As we encounter these emerging technologies, we need to do so with a methodology informed by our past experience.
Historians of the book can draw on the experience of the librarians in adapting technology to serve their professional objectives. Historians need to acquaint themselves with the technology, to understand and evaluate what it can and cannot do, and to determine appropriate and responsible uses of it. They also need to collaborate and work with technologists. This collaboration necessarily requires assisting technologists in understanding and respecting the complexities of their discipline. Shared terminology needs to be negotiated. Technologists alone should not determine the applications. The historians should proceed deliberatively, based on informed and careful consideration of what is and is not possible now, and what is and is not likely to be possible in the near future. They should welcome what is clearly useful, adapting it to serve their professional goals and interests, and defer use of technologies whose stability, utility, and independence are not demonstrated.3
III. Technology: Overview
Since the early development and use of MARC, there have been major technological developments that are having an extensive, pervasive impact on all of the major institutions of modern society and culture. The Internet lies at the center of these developments. It interconnects and thus makes other technologies far more powerful than they would be in isolation. At the same time and as a direct result of its empowering of other technologies, it constitutes the most important economic force driving much of the investment and development of computer and communication technology. Database technologies have matured significantly, enabling complex, large-scale representation and manipulation of certain classes of information. Since the early 1980s, the emergence and development of markup technologies has for the first time made it possible to accurately represent texts of arbitrary length and complexity based on user supplied specifications. A variety of technologies for the digitization of existing media and original digital creation of analogues of existing media has also emerged. Finally, the emergence of affordable, increasingly powerful personal computers and increasingly accessible software has put all of these technologies into almost all major institutions, and into the homes of many private citizens as well.
Internet
The Internet has had an effect on most if not all of us, professionally or
personally, or both. We are more than likely to use email on a daily basis to
communicate with individuals within our own institutions, and with colleagues
in others. We are also likely to subscribe to one or more listservs that focus
on one or more of our professional responsibilities or intellectual interests.
We can easily send messages and files. We are also likely to use browsers and
various indexing utilities such as Google to find, retrieve, and read or print
articles and essays of interest to us, or to access the Oxford English Dictionary,
the Encyclopedia Britannica, or the bibliographic catalogs of various
research libraries. While the technology is far from flawless (proprietary
files have been a persistent problem), it has progressed significantly and
continuously over its short history, to the point where many of us cannot remember
a time when it was not there. We mostly do not realize how much we have come
to depend on it, except in moments when we do not have access when we expect
to. For many scholars, especially humanists and social scientists, who had no
access to the early networks and frequently had little or no opportunity to
collaborate and communicate easily and regularly, it has immeasurably improved
our communication with colleagues and our access to information.
Database Technology
Database technology is designed to store, manipulate,
and access large volumes of highly regularized data. Modern database technology
began in the early 1960s with efforts to develop techniques to conceptualize,
structure, and manipulate data independent of the specific hardware used. The
principal types of database are hierarchical, network, relational, and
object-oriented, with relational databases being the most prevalent. While
object-oriented databases have not been widely implemented, the technology has contributed
conceptual and functional models that have influenced the most recent relational
database implementations. Relational databases with functionality inspired by
object-oriented databases are called object-relational databases. Affordable yet
sophisticated relational database software frequently comes packaged with personal
computers.
The widespread availability of database technology
enables individual scholars to compile and manipulate large amounts of data
to support research. In addition to individual projects, the technology supports
large collaborative projects, enabling scholars and researchers to cooperatively
build shared sets of various kinds of data: bibliographic, census and demographic,
statistical, genealogical, and others. In addition, database technology provides
the data infrastructure for sophisticated Geographic Information Systems and
computer graphics. Database technology also provides the infrastructure for
MARC-based online catalogs and maintenance systems, as well as access, description,
and control systems in archives and museums.
Database technology is most useful for representing
and exploiting specific kinds of data.4 The kind of information
found in forms and questionnaires fits well in databases. For example, personnel
records and job applications are perfect candidates, as are records describing
publishers and book traders.5 In general, database-suitable documents
have the following characteristics: each document has the same set of data elements;
the order of the data elements in any given record is not important; and the data
elements in any record have few or no hierarchical relations with one another.
Documents that are not generally suitable for databases have the following
characteristics: the documents differ from one another in the number, kinds, or
sequence of components; the order of the document components is important; and
components have many, frequently unbounded hierarchical relations with one another.
Texts, such as those found in books and journals, belong to this type, and markup
technology rather than database technology has emerged as the optimum way to represent
and exploit them.
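As a minimal sketch of what "database suitable" means in practice, the following Python fragment stores a few records describing publishers in a relational table using the standard sqlite3 module. The table layout, the column names, and the three publishers are invented for illustration. Every record has the same four data elements, their order carries no meaning, and none is nested inside another, which is exactly why a flat table serves.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publisher ("
    "name TEXT, city TEXT, active_from INTEGER, active_to INTEGER)"
)

# Invented records: each has the same set of data elements.
rows = [
    ("Example Press", "London", 1701, 1745),
    ("Sample & Sons", "Edinburgh", 1720, 1790),
    ("Placeholder Printing House", "London", 1760, 1802),
]
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?, ?)", rows)


def publishers_in(city):
    """Names of publishers recorded for a given city, in alphabetical order."""
    cursor = conn.execute(
        "SELECT name FROM publisher WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in cursor]


def active_in(year):
    """Names of publishers whose recorded activity spans the given year."""
    cursor = conn.execute(
        "SELECT name FROM publisher "
        "WHERE active_from <= ? AND active_to >= ? ORDER BY name",
        (year, year),
    )
    return [name for (name,) in cursor]
```

The text of a book cannot be reduced to rows of this kind without discarding the order and nesting of its parts, which is the contrast drawn above.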
Markup Technologies
All information in computers is encoded to facilitate
processing it. In the early history of computing, the codes associated with textual
information were typically procedural codes. Procedural codes specify certain
operations or procedures that are to be applied to the information. Word processing
programs, the most common text application, associate codes with text to facilitate
printing it. The various codes represent different styles that are to be applied
to information. For example, the title of an article might have codes that will
facilitate centering it on the page, and printing it in a large, bold font. Most
procedural encoding is proprietary and devoted to one output, the most common
of which is print.
In the late 1970s and early 1980s, an alternative to procedural encoding emerged.
Instead of embedding procedural codes in texts, descriptive or declarative codes
were embedded. Descriptive encoding of text specifies what the text and
its components are rather than the procedures to be applied to it. Descriptive
encoding is a process of naming, as opposed to procedural encoding, which is
a process of associating verbs or actions with text and text components. The
descriptive or declarative approach has the major advantage of supporting multiple
procedures, even procedures not anticipated at the time of encoding. For example,
declarative markup might state that a given string of text is a title. This
string can then be printed using one set of procedures, displayed on a computer
screen using another, and indexed using yet another.
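A small sketch may make the advantage concrete. In the Python fragment below, a text is marked up descriptively (the tag names article, title, and para are invented for the example), and three independent procedures are then applied to the same encoding: one lays it out for print, one produces a screen version, and one builds an index of title words. None of the three procedures is mentioned in the markup itself. The functions are illustrative only and omit details such as escaping special characters.

```python
import xml.etree.ElementTree as ET

# Descriptive encoding: the markup says what each component is,
# not how it should be printed or displayed.
SOURCE = """
<article>
  <title>On the History of the Book</title>
  <para>Books are mostly unique items.</para>
  <para>Catalogs are more than inventories.</para>
</article>
"""

doc = ET.fromstring(SOURCE)


def render_for_print(article, width=40):
    """Procedure 1: center the title in capitals, indent paragraphs."""
    lines = []
    for element in article:
        if element.tag == "title":
            lines.append(element.text.upper().center(width))
        elif element.tag == "para":
            lines.append("    " + element.text)
    return "\n".join(lines)


def render_for_screen(article):
    """Procedure 2: map the same components to HTML display tags."""
    parts = []
    for element in article:
        if element.tag == "title":
            parts.append(f"<h1>{element.text}</h1>")
        elif element.tag == "para":
            parts.append(f"<p>{element.text}</p>")
    return "".join(parts)


def index_titles(article):
    """Procedure 3: collect the distinct words that occur in titles."""
    words = {w.lower() for t in article.iter("title") for w in t.text.split()}
    return sorted(words)
```

A fourth procedure, not anticipated when the text was encoded, could be added later without touching the encoded text.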
Standard Generalized Markup Language (SGML), first codified in 1986 by the
International Organization for Standardization (ISO), is a descriptive method of representing
or encoding textual information in computers. While SGML is both standard and
generalized, it does not provide an off the shelf markup language that one can
simply take home and apply to a letter, novel, article, catalog record, or finding
aid. Instead it is a markup language meta-standard, or in simpler words, a standard
for constructing markup languages. SGML provides conventions for naming the
logical components of documents, and a syntax and meta-language for defining
and expressing the logical structure and relations among the components. SGML
is a set of formal rules for defining specific markup languages for individual
kinds of documents. Using these formal rules, members of a community sharing
a particular type of document can work together to create a markup language
specific to their shared document type.
The specific markup languages expressing these analytic models and written
in compliance with formal SGML requirements are called Document Type Definitions,
or DTDs. For example, the Association of American Publishers has developed four
DTDs: for books, journals, journal articles, and mathematical formulae. After
thorough revision, this standard has been released as an ANSI/NISO/ISO standard,
12083.6 A consortium of software developers and producers has
developed a DTD for computer manuals called DocBook. The Text Encoding Initiative
(TEI) has developed a complex suite of DTDs for the representation of literary
and linguistic materials.7 Archivists have developed a DTD
for archival description or finding aids called Encoded Archival Description
(EAD).8 There are even DTDs for representing several
varieties of MARC. A large number of government, education and research, business,
industry, and other institutions and professions are currently developing DTDs
for shared document types.9 DTDs shared and followed by a community
can themselves be standards. ANSI/NISO/ISO 12083, DocBook, TEI, and EAD are
all standard DTDs.
HyperText Markup Language (HTML) is an SGML DTD
that has enjoyed enormous success as the encoding standard underpinning the
World Wide Web. As a specific application of SGML, the HTML DTD limits itself
to simple procedural encoding dedicated to online display and hypermedia linking.
Constraining the set of tags has made it easy to build applications that make
life relatively easy for authors and Web publishers. The ease of use has been
a major factor in the Web's remarkable success.
The developers of HTML, the World Wide Web Consortium
(W3C), recognized that HTML, as useful and popular as it has been, would not
support complex, community-based use of shared information on the Internet.
Because HTML implements a small, closed set of procedurally oriented tags, it
is incapable of supporting sophisticated searching, navigation, display, and
communication. Evidence of HTML's limited ability to support intelligent searching
and document discovery, let alone complex display, navigation and other processing,
is not difficult to find. Many of us have used Web search engines to look for
both known items and items on a particular topic. More often than not, we are
overwhelmed by voluminous results, with many and perhaps most of them being
irrelevant. Our patience frequently is exhausted looking for an item or two
that satisfies our need. The small, closed tag set has thus come at a price:
HTML has extremely limited functionality.
The W3C recognized in SGML's declarative approach and
extensibility the means to overcome the limits of HTML, but they also noted that
SGML presents its own set of problems. It is very complex for software developers,
and as a result, software products for exploiting the richness of the descriptive
encoding have been limited in number and almost always expensive. In 1996, the
World Wide Web Consortium (W3C) founded the EXtensible Markup Language (XML) Working
Group to address this problem.10 The Working Group, in a short
period of time, wrote a specification for a simplified subset of SGML named XML.
They simply eliminated the features of SGML that were problematic for programmers.
XML is simplified or normalized SGML.
XML encoding of text provides a means of representing textual semantics and
structure, but it does not in itself provide support for the procedures
that are likely to be applied to texts. Presenting or displaying text on a computer
screen and printing it on paper are two obvious procedures. The W3C has developed
Extensible Stylesheet Language (XSL) for standardizing both of these procedures, as
well as other transformations of text. XML Linking Language (XLink) is standardizing
hypertext and hypermedia behavior. In addition to supporting the kinds of links
familiar currently on the Web, XLink will enable linking not only to other documents
and digital media, but also into them, even when the (author of the)
referencing document does not control the referenced document. XLink will also
support annotation of texts and objects, again regardless of owner. XML Query
(XQuery) will support the standardization of searching texts (and databases
as well).11 Together with XML, these standards and other related,
supporting standards represent a relatively complete, standard approach to textual
information.
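To suggest what searching descriptively encoded data looks like, the following Python sketch queries a small XML document by element content and by attribute value. It is a stand-in rather than XQuery itself: it relies on the limited XPath subset supported by Python's standard xml.etree.ElementTree module, and the catalog, its tag names, and the functions titles_by and titles_from_year are invented for the example.

```python
import xml.etree.ElementTree as ET

# An invented catalog fragment; the tag and attribute names are illustrative.
CATALOG = """
<catalog>
  <book year="1623">
    <author>Shakespeare, William</author>
    <title>Comedies, Histories, and Tragedies</title>
  </book>
  <book year="1609">
    <author>Shakespeare, William</author>
    <title>Sonnets</title>
  </book>
  <book year="1667">
    <author>Milton, John</author>
    <title>Paradise Lost</title>
  </book>
</catalog>
"""

root = ET.fromstring(CATALOG)


def titles_by(author):
    """Titles of books whose <author> child has exactly this text."""
    books = root.findall(f"./book[author='{author}']")
    return [book.findtext("title") for book in books]


def titles_from_year(year):
    """Titles of books whose year attribute equals the given year."""
    books = root.findall(f"./book[@year='{year}']")
    return [book.findtext("title") for book in books]
```

Because the encoding names the author, title, and year explicitly, the query can ask for "books by this author" rather than "pages containing this string," which is the difference between this kind of searching and the keyword searching of undifferentiated Web pages.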
While the origins of SGML lie in the processing of
texts, XML is also being used as the basis for encoding and communicating many
kinds of data. A large number of the current XML initiatives involve data that
is created and maintained in databases, but is communicated among databases and
published on the Internet using XML. Many of these initiatives involve commercial
databases and business transactions. Still others involve what has come to be
called, for better or worse, metadata.12 Noteworthy also is
Scalable Vector Graphics (SVG), an emerging standard developed by the W3C "for
describing two-dimensional graphics in XML." A companion standard, also under development
by the W3C, is Synchronized Multimedia Integration Language (SMIL), which supports,
as the name suggests, integrated presentation of multiple media.13
SVG and SMIL appear likely to have an impact in the presentation of geographic
information, though yet another effort is devoted to creating XML-based representation
of the geographic information itself. This effort, the Geography Markup Language
(GML), is organized and led by the Open GIS Consortium.14 The
Web3D Consortium is developing an XML-based standard for three-dimensional graphics,
Extensible 3-D (X3D). It is based on the ISO standard Virtual Reality
Modeling Language (VRML).15 X3D will provide a standard encoding
of data supporting an extremely wide variety of three-dimensional objects. Examples
are rooms, buildings, automobiles, and, of course, books. Like SVG, X3D also promises
to provide support for the representation of geographic topographical information,
as well as Computer Aided Design (CAD), used extensively in architectural and
engineering design. All of these efforts are beginning what is likely to be an
extended period of standardizing machine-readable data for various media. Some,
such as XML itself, are well along in development, with an increasingly wide range
of products of increasing quality available. Others are in various stages, from
just underway to nearing approval by the W3C and other authoritative bodies.
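As a small illustration of graphics described in XML, the Python sketch below generates an SVG drawing of a page and its text block. The svg and rect elements and the namespace URI belong to SVG; the function name page_diagram, the choice of a page diagram, and the dimensions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def page_diagram(width_mm, height_mm, margin_mm):
    """Return SVG markup for a page outline with an inner text block."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "viewBox": f"0 0 {width_mm} {height_mm}",
    })
    # Outer rectangle: the edge of the page.
    ET.SubElement(svg, "rect", {
        "x": "0", "y": "0",
        "width": str(width_mm), "height": str(height_mm),
        "fill": "none", "stroke": "black",
    })
    # Inner rectangle: the text block, inset by the margin on every side.
    ET.SubElement(svg, "rect", {
        "x": str(margin_mm), "y": str(margin_mm),
        "width": str(width_mm - 2 * margin_mm),
        "height": str(height_mm - 2 * margin_mm),
        "fill": "none", "stroke": "gray",
    })
    return ET.tostring(svg, encoding="unicode")


markup = page_diagram(210, 297, 20)
```

The result is ordinary XML text, so the same tools that parse, search, and transform encoded texts can be applied to the drawing.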
Both markup technologies, as represented in XML and related standards, and
database technologies are particularly significant because they enable users
to represent semantically and structurally rich intellectual understandings
of text and other data in machine-readable form that can be exploited using
a wide array of existing procedures as well as procedures yet to be devised.
The wide array of current XML and database initiatives demonstrates the importance
of this descriptive and representational power. In the humanities research community,
semantically rich machine-readable expressions have already significantly enhanced
intellectual access to cultural objects, through MARC and other cataloging standards,
and are beginning to enhance analysis and interpretation of them as well.
Markup and database technologies are also significant because
they ameliorate the dependency of data on hardware and software. Migrating information
out of an obsolete standard into a new, better standard is an important procedure
that will sooner or later become necessary. This is typically overlooked in
the conception and design of projects and programs. Fortunately, standard, descriptive
encodings inherently provide more support for this procedure than do proprietary,
procedural encodings. When you know what the information and its parts are,
migrating from one standard to another is a matter of semantic mapping rather
than procedural mapping. XML even has its own standard for accomplishing this
transformation, XSL.
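What "semantic mapping" amounts to can be sketched briefly. The Python fragment below migrates a record from one invented encoding to another by renaming each element according to a table of equivalences. It is not XSL, which Python's standard library does not implement; the tag names in TAG_MAP and the function migrate are hypothetical. The point is only that, when the encoding says what each part is, migration reduces to stating which old name corresponds to which new one.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names in an old and a new encoding scheme.
TAG_MAP = {"record": "entry", "ttl": "title", "auth": "author", "yr": "date"}


def migrate(element):
    """Return a copy of the element tree with tags renamed via TAG_MAP.

    Tags without an entry in the map are carried over unchanged, as are
    attributes and text.
    """
    new = ET.Element(TAG_MAP.get(element.tag, element.tag), dict(element.attrib))
    new.text = element.text
    new.tail = element.tail
    for child in element:
        new.append(migrate(child))
    return new


OLD = "<record><ttl>Paradise Lost</ttl><auth>Milton, John</auth><yr>1667</yr></record>"
migrated = ET.tostring(migrate(ET.fromstring(OLD)), encoding="unicode")
```

A procedural encoding offers no such table: knowing that a string was centered and set in bold does not tell the migrating program whether it was a title.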
While markup and database technologies enable semantic
encodings, they do not in themselves define the semantics. Specific cultural heritage
disciplines and communities sharing intellectual and professional objectives must
be responsible for the analysis, specification, expression, and application of
the semantics. Libraries have accomplished the most in this area, though archives
and museums are also actively engaged in standards development. Primarily through
the Text Encoding Initiative, humanists have also begun to define standards, though
a great deal of work remains to be done, and much remains to be done in collaboration
between these communities. Developing shared semantics and structures represents
the most important challenge facing the cultural heritage disciplines and communities
in the near future.
Digitization Technologies ISO also has developed standard encodings for audio
and audio-visual materials, though these are primarily compression standards
and thus are designed for audio and audio-visual files used on the Internet.
Audio (and thus by implication also audio-visual) capture that is sufficient
for long-term archiving and preservation is problematic because sound is continuous,
and digital capture, by definition, is discontinuous. Analog is converted to
digital through sampling, and until sampling rates achieve an acceptable threshold
there is significant loss of information. Nevertheless, research and development
continues in this area with the expectation that capturing audio data in digital
form using high sampling rates will make digital preservation of audio and audio-visual
material feasible in the near future. Despite these limitations, used cautiously,
many of these de facto and public standards are sufficient for many purposes.
The Internet stands out as the most transforming of the computing related technologies
to emerge in the last twenty years because it interconnects computers and the
people using them wherever they are in the world, and any time of the day or night.
While most of us first became aware of the Internet in the early1990s, the research
and early prototypes that led to it began in the 1960s. As the network began to
be realized in the 1970s and 1980s, its potential for facilitating communication
between computers and through them people became increasingly clear. New standards
and software emerged to take advantage of the potential. Telnet emerged as a way
for a user on one computer to connect to another computer. Researchers developed
File Transfer Protocol (FTP) to enable moving files back and forth between computers.
Other researchers developed Electronic mail (Email) to enable sending and receiving
messages. Related to FTP, researchers developed client-server technology, to support
distributed processing of information. In conjunction with Email, scientists developed
listserv technology to support group discussion and communication. Still later,
researchers invented Hypertext Markup Language (HTML) to enable online viewing
of texts with "links" that led to yet other texts. To take full advantage of hypertext,
researchers developed browsers that enabled not only interrelating texts, but
also interrelating texts and other digital media. The emergence of browsers and
hypermedia coupled with the Internet in the early 1990s spawned the current Internet
"phenomenon."
Pictorial, audio, and audio-video media are also increasingly standardized.
For pictorial material there is one major de facto industry standard:
Tagged Image File Format (TIFF).16 While we should prefer
open, public standards to de facto industry standards, when a proprietary
encoding achieves widespread industry support, its fate is no longer tied to
one company and one or two computer programs, and thus the dependency and risk
involved in its use are reduced. TIFF as well as de facto standards
such as PostScript and Portable Document Format (PDF) fall into this category.
Because an image stored in TIFF preserves all of the information captured, it
is generally considered acceptable as an archival encoding. TIFF file sizes,
though, tend to be quite large, making them unsuitable for Internet transmission
given current bandwidths. Thus smaller files using techniques for compressing
the files are generally used for Internet publishing. The most popular of these
compression formats is the ISO JPEG standard. The current version of JPEG is
not appropriate for use in archiving and preservation because the techniques
it employs lead to information loss (lossy compression). Currently ISO is developing
a successor to JPEG called JPEG2000 that will support compression without loss
of information (lossless compression), and thus will be suitable as an archival
format.17
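The difference between lossless and lossy compression can be illustrated with a short sketch. It rests on stated assumptions: it uses Python's standard zlib module for the lossless round trip, and it reduces each sample to its four most significant bits as a crude stand-in for lossy encoding. It is not the JPEG algorithm, and the sample data is synthetic rather than a real scan.

```python
import zlib

# Synthetic 8-bit grayscale samples (an assumption for illustration, not a real scan).
pixels = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib; every byte is recovered exactly."""
    return zlib.decompress(zlib.compress(data, 9))


def lossy_round_trip(data: bytes) -> bytes:
    """Discard the low 4 bits of each sample before compressing.

    This is a crude stand-in for lossy encoding, not JPEG itself: the
    discarded bits cannot be recovered after decompression.
    """
    reduced = bytes(sample & 0xF0 for sample in data)
    return zlib.decompress(zlib.compress(reduced, 9))


restored = lossless_round_trip(pixels)
degraded = lossy_round_trip(pixels)

# Largest difference between an original sample and its lossy counterpart.
max_error = max(abs(a - b) for a, b in zip(pixels, degraded))

print("lossless copy identical:", restored == pixels)
print("lossy copy identical:", degraded == pixels)
print("largest sample error after lossy step:", max_error)
```

The lossless copy matches the original byte for byte, which is why such an encoding can serve as an archival master. The lossy copy differs from the original, and the discarded information cannot be restored from the stored file.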
Taken together these technologies and standards offer book professionals many possible opportunities to apply advanced technology in their research and teaching. Many of the technologies are well established and proven, and based on solid, open, public standards. Others are also well established and understood, but not yet based on standards. Nevertheless, serious standards efforts are underway for many of these as well. Certainly not all problems have been solved, and there are risks, especially for naïve users. Many of the conveniences that are introduced by the advanced technology also introduce uncertainties that threaten to upset the balance of control and interests upon which the current "order of the books" rests.18 With due thought and caution, though, much of the technology is sufficiently mature, stable, and broadly supported to merit its use.
IV. History of the Book: Collaboration and Community
One of the great opportunities presented by advanced technology is that
of facilitating collaboration. The sciences and the social sciences, even before
the advent of the Internet and related technologies, frequently employed collaborative
projects to achieve shared research objectives. The scale and complexity of
many projects motivated these collaborations. Complex, labor-intensive projects
required group effort to be successful. Without collaborating, the research
was simply impossible. Collaboration also had a major secondary benefit.
Designing and carrying out complex projects required intensive exchange of ideas
and negotiation that led to many intellectual advances and breakthroughs. Many
of these would have been realized more slowly, if at all, by scholars
working alone.
Outside of a few dictionary and encyclopedia projects,
humanists rarely have collaborated in this manner. The ongoing building of bibliographic
catalogs stands out as a major exception. Most often humanists toil in solitude,
communicating now and again with one or two trusted colleagues. Until now, they
have lacked by and large both the means and the motivation to engage in collaborative
research activities.
Technology provides both the opportunity and the motivation for historians
of the book to engage in large-scale, complex projects. The history of the book
is international in nature. Historians are distributed throughout the world,
as are the tools and resources employed in their research. The book trade itself
is international, with many of the significant figures and firms operating across
borders. The international nature of the book trade accounts for some of the
distribution of resources, though collectors have also contributed. The distribution
of people and resources has been a major, time-consuming, and frequently prohibitive
obstacle to both scholarly communication and research. Advanced technology provides
the means to overcome this obstacle, though only if the historians of the book
collaborate with one another, and with librarians, archivists, publishers, and
others, in complex and intellectually intensive collaboration.
Existing and emerging technologies present several
opportunities to historians of the book. They make it possible to provide universal,
union intellectual access to resources in the form of specialized bibliographic
catalogs and archival description systems. They also make it possible to provide
selective access to digital representations of bibliographic and archival resources
that can function as adequate surrogates for the originals for some, though
by no means all, research purposes. In some cases, these digital representations
may also facilitate analysis and research that would be difficult, or perhaps
practically impossible, when using original materials. The technology also enables
building analytic and pedagogical tools that can be shared. Finally, it offers
the opportunity to create new forms of publication and pedagogy employing these
resources.
As the foregoing has illustrated, achieving such things in a manner that will
assure their usefulness over time requires the disciplined efforts of a community.
An essential factor in establishing a collaborative community or consortium
is having one or two lead institutions that are willing to provide hardware,
software, and technical expertise to host, maintain, and publish resources,
and to facilitate communication among participants. While some of these functions
can be distributed, such as distributing responsibility for communication to
one institution, and resource maintenance and delivery to another, distributing
resources is problematic technically, especially if the resources constitute
a wide variety, and at the same time have many interrelating links. For now
the existing technology does not easily support such distribution. There is
a major economic advantage to centralizing some of the more complex operations,
as it relieves participating individuals and institutions of having to invest
time and money in mastering complex supporting technology. Distribution of creation
and maintenance activities, however, is absolutely essential, as the expertise
needed to gather resources and the resources themselves are distributed.
The major challenges in building a history of the book consortium on the Internet are not intellectual and technical, as difficult as these are, but political. Politics, in the more attractive sense of the term, is community building. Communities first and foremost must articulate and share common interests and goals. Agreement can be difficult to achieve, as it requires negotiation and sacrifice. Individuals will only voluntarily participate in a community if it enables them to more effectively pursue individual interests, and to achieve goals that are difficult and perhaps impossible to achieve working alone.
V. History of the Book: Project Suggestions
The necessary first step in building a consortium is developing a shared vision
of objectives. In what follows, I will suggest some possible objectives. Coming
from an outsider with no detailed knowledge of the nature and methods of the
discipline, they are all offered merely as suggestions, and not as recommendations.
Similar projects may well be underway, either in print, or digitally. They are
intended to promote discussion, criticism, and counterproposals.
The suggestions begin with establishing the communication
necessary for collaboration. Immediately following these are proposals to improve
intellectual access to resources. Following are proposals for providing structural
and digital image representations of resources. Intellectual access is intentionally
presented before digitizing, as it is especially important, and it provides the
foundation necessary for building digital resource collections. A final section
groups together reference materials, critical secondary resources, and geographic
information. While most and perhaps all of these proposals are ambitious, perhaps
immodestly so, building a consortium should begin with modest, realistic projects
that are relatively easy to accomplish. Early, modest successes will establish
the trust and expertise necessary to accomplish more ambitious objectives.
Listservs
Generally, carrying out collaborative projects such as those I have been proposing requires establishing one or more listservs to facilitate communication. How many listservs, and devoted to what purpose, depends upon the number, size, and complexity of the projects undertaken, and thus the degree to which specialization is necessary. At a minimum, an emerging consortium needs one list to organize and discuss the activities of the consortium itself. Over time, the need may arise for specialized discussion groups devoted to activities such as administration and governance, technical infrastructure, and intellectual and technical standards. In addition to administrative communication, listservs can serve to facilitate scholarly communication. There are already several discussion lists devoted to the study of books and related technologies.
Rather than establishing competing lists, a consortium might instead choose to concentrate on lists that complement them.
Access to Historical Evidence
Since books themselves obviously constitute a major source of evidence for book historical research, improving intellectual access to especially significant books and book evidence would greatly expedite and improve research. Existing online bibliographic catalogs have already improved access, though they have a number of disadvantages. They are distributed, and thus require serial searching. A researcher must have a good idea of where a particular book is likely to be located before beginning a search. Catalog interfaces also vary, adding to the complexity of the challenge. Both OCLC and RLG, though, have come a long way in solving this problem. A more difficult problem than the distribution of catalogs is that most MARC cataloging lacks forms of access that would be useful to students of the book. Existing practices at the Bancroft Library at the University of California, Berkeley may serve as a useful example of improving access for book historians. Using more detailed MARC records than are typical, the Bancroft Library is providing specialized access to its collections.19 Through the online catalog, a number of specialized indices are available.
Using the Bancroft Library's and similar efforts as a starting point, a consortium could develop content and encoding standards for specialized headings. The consortium could work actively to encourage use of these standards, and lobby the large bibliographic utilities and vendors of MARC catalogs to provide specialized searching based on them. A more ambitious project might bypass the utilities and vendors by creating its own international union catalog incorporating the specialized headings. Such a catalog might be based on MARC, or alternatively on XML, using, for example, the UNIMARC XML DTD developed by the BiblioML project in France. This would make it possible to use XML indexing and publishing software instead of a MARC system.20 Mapping the various dialects of MARC and non-MARC records into a UNIMARC DTD would be quite complex, though feasible, as demonstrated by the Manuscripts and Letters via Integrated Networks in Europe (MALVINE) project, discussed below. Any or all of these initiatives would improve access to book evidence.
Encoded Archival Description (EAD)21 is an emerging international standard for encoding detailed archival descriptions of fonds or collections.22 EAD provides a standard representation of descriptions of the records of corporate bodies, and the papers of individuals and families. While EAD has many rationales, perhaps the most compelling is that standard archival description supports the long-cherished dream of providing both professional and public researchers universal, union access to primary resources. Currently there are dozens of institutions throughout the world using EAD, with the number growing rapidly. Many of the EAD implementations are consortia with several repositories participating. These are generally organized geographically, such as the Online Archive of California, or by discipline or subject, such as the Physics History Finding Aids project.23 The MALVINE project is organized geographically (European Union) and by genre (letters and manuscripts).24 The Research Libraries Group (RLG), an international consortium of research archives, libraries, and museums, is currently providing union access to finding aids from throughout the world through its Archival Resources service.25
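The kind of union access that standardized archival description supports can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a complete or valid EAD instance: it uses only a handful of EAD element names (ead, archdesc, did, unittitle, unitdate, repository), omits the header that a real finding aid requires, and the records, repository names, and the helper functions record_to_ead and union_search are all hypothetical.

```python
import xml.etree.ElementTree as ET


def record_to_ead(record: dict) -> ET.Element:
    """Wrap one flat description in a simplified, EAD-style structure.

    Assumption: a real EAD finding aid carries far more (a header,
    component hierarchy, and so on); this keeps only the top-level
    descriptive identification.
    """
    ead = ET.Element("ead")
    archdesc = ET.SubElement(ead, "archdesc", {"level": record["level"]})
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = record["title"]
    ET.SubElement(did, "unitdate").text = record["dates"]
    ET.SubElement(did, "repository").text = record["repository"]
    return ead


def union_search(descriptions, term: str):
    """Search titles across descriptions contributed by many repositories."""
    term = term.lower()
    hits = []
    for ead in descriptions:
        title = ead.findtext("archdesc/did/unittitle", default="")
        if term in title.lower():
            repository = ead.findtext("archdesc/did/repository", default="")
            hits.append((title, repository))
    return hits


# Hypothetical placeholder records, as if contributed by two repositories.
records = [
    {"level": "fonds", "title": "Example Printing Firm records",
     "dates": "1850-1900", "repository": "Repository One"},
    {"level": "collection", "title": "Example Family papers",
     "dates": "1790-1820", "repository": "Repository Two"},
]
descriptions = [record_to_ead(r) for r in records]
hits = union_search(descriptions, "printing")
print(hits)
```

Because every contributed description shares the same structure, a single query can run over all of them without regard to where each was created.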
EAD makes it possible to greatly improve access to the records of individuals, families, and firms that have made significant contributions to the history of the book. Like the history of physics and many other disciplines, the history of the book transcends national borders, with many of the significant individuals, families, and firms active in more than one country. The records documenting their activities are distributed across many countries, and within countries, across more than one repository. A worthy project might focus on identifying the significant fonds and collections distributed in European repositories, and centralizing descriptive access to them using EAD. This would necessarily involve a wide variety of activities. Identifying what collections are processed and described, and evaluating the quality of existing descriptions, would be a first step. Developing a strategy for converting existing descriptions into EAD would follow. Creating the archival description system would require converting print finding aids into machine-readable form, and mapping and writing conversion scripts for finding aids in word processing and database formats. EAD implementers already have extensive experience in working with vendors that convert paper finding aids, and vast in-house experience converting word-processed and database finding aids. Organizing and seeking funds for processing unprocessed collections would follow conversion of existing finding aids. The existing EAD consortia all have a lead repository or institution hosting and publishing the contributed finding aids. Providing international, union access to significant archival evidence would greatly facilitate research, and complement the access to book evidence discussed above.
The Manuscripts and Letters via Integrated Networks in Europe (MALVINE) project provides an excellent European example of how such an initiative might be organized and implemented. MALVINE is funded by the European Union. There are fifteen participating archives, libraries, museums, and documentation centers, located in nine European countries. MALVINE is coordinated by the Staatsbibliothek zu Berlin, though the Humanities Information Technologies Research Programme at the University of Bergen coordinates and provides technical support for conversion and publishing. Other responsibilities are distributed among the other participants. Key components of the success of MALVINE have been the collegiality of the participants and the highly capable technical expertise brought to the effort. MALVINE represents an excellent model for European collaboration.
To complement access to bibliographic and archival evidence, a consortium might also explore options for describing and providing access to tools and apparatus used in the production of books. The International Council of Museums' Conceptual Reference Model, and other museum initiatives and standards efforts should be explored in this regard.26 Some museums are also experimenting with EAD to provide access to museum artifacts.27
Representation and Analysis of Evidence
In addition to enabling enhanced access to book and archival evidence, existing and emerging technologies also enable creating machine-readable representations of the evidence itself. Such representations will be suitable for some research purposes, and may improve existing analytic methods and inspire new methods.
Machine-readable representation of evidence can be done in three ways.
First, XML can be used to create descriptive, structural representations of objects,
for example, the physical features of books. Second, two-dimensional imaging can be
used to capture page images, manuscript and print archival resources, and similar
flat resources. Third, three-dimensional imaging can be used for books, tools and
apparatus used in the production of books, and similar resources.
Using the TEI DTD, Terry Catapano and Syd Bauman provide an example of a machine-readable
structural representation of the physical features of a book.28
The TEI, by design, is optimized for encoding the intellectual structure of
a book: chapters, paragraphs, poems, lines of poems, and so on. It is sufficiently
flexible, however, that Bauman and Catapano were able to devise a prototype
description of the physical structure of a book. They state that there are certain
advantages to such a representation:
The re-arrangement of the pages of text as imposed
for printing may make apparent places where the text was affected by typographical
exigencies. It is also useful in electronic bibliographic analysis
Their example is intended to demonstrate, in a very preliminary way, the feasibility of using XML to represent the physical structure of a book. It is not a fully developed system. Using this demonstration as a starting place, a project might attempt to design and develop a comprehensive XML DTD for representing the physical features of books and attempt a variety of computer-assisted analysis to determine its utility. If successful, such a DTD would facilitate building a shared collection of representations of individual, exemplary books that could be made available to the scholarly community for analysis, discussion, and teaching. Such a DTD optionally might be developed in collaboration with the TEI Consortium.
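What a structural representation of a book's physical features might look like, and how a program could exploit it, can be sketched briefly. The example below is an assumption-laden illustration only: the element and attribute names (book, gathering, leaf, page, sig, conjugate) are hypothetical and are not taken from the TEI or from the Bauman and Catapano prototype, and it assumes regular gatherings in which the first leaf is conjugate with the last, the second with the second to last, and so on.

```python
import xml.etree.ElementTree as ET


def build_book(signatures, leaves_per_gathering):
    """Describe a book as gatherings of leaves, each leaf with two pages.

    Assumption: every gathering is regular, so leaf n is conjugate with
    leaf (leaves_per_gathering + 1 - n) in the same gathering.
    """
    book = ET.Element("book")
    for sig in signatures:
        gathering = ET.SubElement(book, "gathering", {"sig": sig})
        for n in range(1, leaves_per_gathering + 1):
            partner = leaves_per_gathering + 1 - n
            leaf = ET.SubElement(
                gathering,
                "leaf",
                {"id": f"{sig}{n}", "conjugate": f"{sig}{partner}"},
            )
            ET.SubElement(leaf, "page", {"side": "recto"})
            ET.SubElement(leaf, "page", {"side": "verso"})
    return book


# Two hypothetical gatherings, signed A and B, of four leaves each.
book = build_book(["A", "B"], 4)
xml_text = ET.tostring(book, encoding="unicode")

# A simple computation over the structure: which leaf is joined to which.
conjugates = {leaf.get("id"): leaf.get("conjugate") for leaf in book.iter("leaf")}
page_count = len(list(book.iter("page")))

print(xml_text)
print(conjugates)
print(page_count)
```

Once the physical structure is explicit in this way, questions such as which leaves belong to the same sheet can be answered by a query rather than by re-examining the volume.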
While TEI provides a comprehensive scheme for text description and representation that facilitates literary, editorial, historical, and linguistic analysis, historians of the book are likely to find much in it that they do not want or need, and not find many elements that they do. Medievalists, for example, have found that TEI lacks sufficient detail for both description and representation of medieval manuscripts.30 Projects such as the Manuscript Access through Standards for Electronic Records (MASTER) and Electronic Access to Medieval Manuscripts (EAMMS) are working with the TEI Consortium to extend TEI to optimize its use in the study of medieval manuscripts. Initiatives similar to these might be appropriate for the history of the book community.
Page imaging of books also offers significant opportunities for collaboration. There is an ongoing and well-justified controversy concerning the use of page imaging in preservation. Preservation is an extremely complex issue, as all migration of information from one medium to another involves loss of information, namely the medium from which the information is transferred or migrated. When the medium itself is the primary evidence, as it is in the case of books significant in book history, media transfer is an unacceptable preservation method. The only acceptable preservation method for significant books is preservation of the book itself. Page imaging, though, can be useful in facilitating access to and analysis of evidence for some purposes, for example, the study of typography, and for bringing together distributed materials for comparison. The William Blake Archive provides an excellent example of the use of page imaging.31 A worthwhile area of collaboration would be in experimenting with and establishing best practices in image capture and quality to facilitate the utility of imaging in support of research and analysis. Subsequent to establishing image quality standards, collaboratively building and sharing collections of page images would seem a worthwhile undertaking.
Widely demonstrated and accepted standards and techniques for high quality three-dimensional imaging have yet to emerge, though research, prototyping, and development are well underway. Large-scale use of this technology would be premature at this time, but exploring and experimenting with the technology to identify useful applications would position the community to take full advantage of the technology as it becomes standardized and widely used. Uncle Tom's Cabin & American Culture at the Institute for Advanced Technology in the Humanities provides simple but suggestive examples of three-dimensional book imaging using QuickTime®.32 An experimental project, the Brazil Rendering System, provides some striking examples of three-dimensional renderings exported to standard two-dimensional imaging formats.33 In addition to books, three-dimensional technology would also be useful in representing tools and apparatus used in book production.
Other Projects
While improving access to primary resources should be the first priority
of the history of the book community, there are also a number of other potentially
useful project opportunities. Reference materials and authoritative secondary
literature can aid in the use of the primary resources, as can geographic information.
Securing digital rights to significant secondary works
and reference materials, and publishing them on the Internet is certainly worth
considering. Materials that fall into this category are landmark histories of
the book and dictionaries and glossaries, in essence, the kinds of materials that
any historian of the book has within arm's reach in his or her office, and which
are consulted regularly. Reference materials of this kind are especially useful
online, as the technology facilitates quick access, and the information sought
is typically brief and thus easily readable on a screen.
Making important geographic information available can be extremely useful
for reference, research, and teaching. Geographic Information Systems (GIS) provide
sophisticated access to highly accurate spatial information. Geographic Information
Systems can also be linked with social science data systems and other relevant
datasets. For example, GIS linked to datasets might provide dynamic graphical
presentations of the location and movement of people, firms, trade, materials,
technology, and ideas over time. Even static page images of maps can be extremely
useful. Two projects at the University of Virginia demonstrate the use of map
information in conjunction with primary resources. The Valley of the Shadow
Project uses animated maps to trace the movements of soldiers from Pennsylvania
and Virginia during the American Civil War.34 While still
under development, the Salem Witch Trials project is linking geographic and temporal
information with a database documenting families, individuals, institutions,
significant events, and documentary sources in a variety of formats. A prototype
map displays in space (location of homes and households) and over time (from
29 February to 31 March, 1692) accusers, accused, and accusations. When it is completed,
users will be able to visually follow the spatial-temporal unfolding of the
events that gripped Salem in the seventeenth century, and access detailed descriptive
information on individuals, institutions, events, and related documents by clicking
on icons on the map.35
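The linking of spatial and temporal information described above can be sketched in outline. The example below is a simplified illustration and not a description of how either project is implemented: the events, household labels, and grid coordinates are invented placeholders, and the functions frame and places_on_map are hypothetical names. It shows only the core idea of selecting what a map should display at a given date.

```python
from datetime import date

# Invented placeholder events with arbitrary grid coordinates (not real data).
EVENTS = [
    {"label": "Accusation 1", "place": "Household A", "x": 1, "y": 2,
     "when": date(1692, 2, 29)},
    {"label": "Accusation 2", "place": "Household B", "x": 4, "y": 1,
     "when": date(1692, 3, 11)},
    {"label": "Accusation 3", "place": "Household A", "x": 1, "y": 2,
     "when": date(1692, 3, 31)},
]


def frame(events, day):
    """Return the events that have occurred on or before `day`, oldest first."""
    visible = [e for e in events if e["when"] <= day]
    return sorted(visible, key=lambda e: e["when"])


def places_on_map(events, day):
    """Group the visible events by place, as one layer of a map might."""
    layer = {}
    for e in frame(events, day):
        layer.setdefault(e["place"], []).append(e["label"])
    return layer


# Step through three dates, as an animated map would step through frames.
for day in (date(1692, 2, 29), date(1692, 3, 15), date(1692, 3, 31)):
    print(day.isoformat(), places_on_map(EVENTS, day))
```

Each date yields one frame; stepping through the dates in order produces the unfolding over time, and the labels attached to each place are what a user would reach by clicking an icon.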
The proposals offered here are intended to initiate
discussion and to invite criticism. Some of them, upon close scrutiny, may turn
out to be impractical, simply not useful, or in fact already underway or completed.
The list is also by no means comprehensive. Certainly worthy of discussion are
projects devoted to developing and sharing pedagogical materials, to establishing
one or more online, peer-reviewed journals, and to providing access to relevant
social science datasets. No doubt there are many other promising candidates
as well. Individual scholars, librarians, collectors, and students in the field
are quite likely already undertaking important projects that would benefit from
collaboration. Many scholars in the field, undoubtedly, could reel off a fairly
long list of such projects, probably associated with the names of imaginative
colleagues. Ultimately, the community itself needs to identify its most important
needs, and determine whether and how collaborative use of technology can address
them.
VI. Conclusion
The technological landscape has changed considerably in the last twenty years. At the close of the 1970s, computing technology was generally not available for most humanists. The equipment was expensive, and it required engineering and programming expertise that few humanists had the time or the inclination to master. The Internet was available to only a few, and was primitive when compared to today. Most computing was devoted to "number crunching," for which most humanists had little or no use. Database applications were still quite crude by current standards. Markup and related text technologies and imaging, audio, and audio-visual technologies were only on the horizon. Standards were virtually nonexistent. All of this has changed significantly.
Particularly important among the technologies and standards that have emerged are those associated with databases and text markup. These two technologies, for the first time, make it possible for humanists to rigorously articulate structures that reflect their intellectual interests, to instantiate them in machine-readable form, and use the instantiations in computations that exploit the structures. For many years now, librarians have demonstrated the power of database technology for representing and exploiting descriptive cataloging. Archivists, librarians, and humanities scholars have also successfully applied markup technologies for accurately describing and representing both the physical characteristics and intellectual content of cultural objects and collections of objects. While there are a number of existing and emerging standards already in place that would benefit the history of the book community, many others remain to be identified and developed. Database and markup technologies, in concert with the other technologies described here, present the field with the opportunity to determine its own future and the appropriate role for technology in it.
While technology presents great opportunity, it
also presents very real dangers. Book historians cannot simply rely on technologists
for guidance, especially when the technologists have conflicting interests.
The community needs to develop its own technology experts. Because the technology
presents such a wide-open space for the imagination, there is great risk that
time and money will be invested in activities that ultimately are all form,
and no content. This danger, however, should not stifle experimentation. The
technology represents, in many respects, terra incognita, and to determine
what does and does not work, and what is and is not useful, will require exploration.
Experiments, especially those involving external funds, need to be carefully
designed, with clear hypotheses and methods for evaluating results. The most
serious danger, though, is that fear of the dangers leads to doing nothing
as a community. Inaction will have two probable outcomes. Individuals will engage
in isolated projects that will make building a community at some later date
much more difficult, as the individuals will be very reluctant to sacrifice
their work. The other probable outcome is that technologists and other outsiders
will determine the technological future of the history of the book community.
Carpe diem!
Endnotes
Last revised: 08-05-01