Digital Text and the Gutenberg Heritage

Ch. 3: The Concept of Markup {Status=draft}

© Copyright 2001 by Adriaan van der Weel


1. Implicit Markup

1a. Markup in the history of human communications

As readers, we only need to glance at a printed page to recognise segments of text as footnotes, quotations, marginal glosses and so on. Without reading a letter of the text we are able to identify title pages, chapter openings and other major divisions within a book. All these structural elements are conveniently distinguished from the body of the text for the ease of the reading experience. They are rendered distinct by a variety of typographic means, such as type size, the use of bold, italics, typefaces, white space. Though we tend to be less conscious of it, the same structuring takes place on the word and sentence level. The flow of words in a conversation is uninterrupted; segmenting speech into meaningful units is one of the major challenges confronting anyone who sets out to learn a new language. In chirographic practice (the practice of writing by hand), it took a long time before word spaces were standardly used as a structuring element. In Roman times a raised dot was often used before the Greek custom of scripta continua (running words together) was adopted. It was only when Irish Christian scribes reinvented the word space around the fifth century AD that it came to stay. {Gumbert, "Typography in the Manuscript Book", p. 10.}

From the earliest times, writing has been governed by conventions. Conventions rule, for example, the direction in which we write (right to left, left to right, top to bottom or even boustrophedonic, which is to say the way the ox turns when ploughing: from left to right and right to left in turn); the way we end one sentence and begin another; the meaning we attribute to punctuation marks and the white space surrounding characters. Although the early mediaeval scribes, writing on vellum, identified the first letter of a paragraph by highlighting it, sometimes simply in red, often more ornately, with artistic flamboyance, they lacked many of the other punctuation and spacing conventions to which we are used. The text proceeded in a virtually unbroken line, with the exception of closely spaced full stops, down two columns on each page, until the next paragraph.

As more people learned to read and the practice of silent reading developed, the demand for structuring devices such as word spacing and clearer rubrication for punctuation and section headings increased. The invention of printing called for a further elaboration and codification of typographic conventions. Rubrication, after all, was scribal handwork which in the long run went against the nature of printing. Print both demanded and made it possible to broaden the array of typographical structuring devices. Print is more precise than handwriting in rendering subtle spatial variations, differences in type size and weight and so on. {Ong, Orality and Literacy, p. 128.} It fostered the development of standards and conventions through which the structure of a text, and thus its meaning, could be transferred more faithfully.

We use the term markup to refer to the conventions used to present words visually, whether in manuscript, print or electronic form. These include the use of white space (e.g. the space between paragraphs and words, the space surrounding titles, etc.), punctuation, and the form, size and type of the letters themselves (e.g. bold lettering, italics, etc.).

[Sidebar] Major categories of structuring devices:

I. General
- Illustrations
- Ornaments (e.g., fleurons, rules, boxes, shading)
- Running headers or footers
- Folios (page numbers)
- Special signs (e.g., paragraph signs; brackets)

II. Type
- Capitals vs lower case
- Punctuation (originally designed to render aspects of speech; Gumbert 1993, p. 11)
- Typeface
- Type size
- Ornamented or dropped capitals
- Justification (left, right, centred, fully justified)
- Bold, italic, small caps, underlining

III. The arrangement of blank space for
- Word spacing
- Leading (interlinear spacing)
- Margins (size, and proportions, of the type page compared to the page as a whole)
- Indents
- Columns
- Tables
- Letter spacing

IV. Colour (including grey)
- Rubrication (mainly in MS and early printed book practice)
- Background highlighting (e.g., tables, boxes, sidebars etc.)

This markup usually aims to obey the typographic convention of the period: it answers to the unspoken expectations of the reader. But of course typographic markup may serve other purposes beyond structuring the text for the reader's convenience. It may, for example, be closely intertwined with an aesthetic purpose pursued for its own sake. Or again, a more subtle semiotic purpose may be distinguished: the general appearance of a text may convey a message about the nature of the text, or about the social position of its reader, owner or user. {Gumbert, "Typography in the Manuscript Book", p. 6.}

Redundancy
We are quite used to a certain redundancy in the typographic signposting of structure. In Western usage, a new sentence, for example, is indicated by three means: a full stop (or other end-of-sentence marker); a white space; and a capital letter. In Caroline script, for example, there was less such redundancy. Word spaces were not consistently used, and a medial dot served variously as our modern full stop when it is followed by a capital letter, or as our comma when it is followed by a lower case letter. {Kendrick, 1985-87, p. 126.} Similarly, a paragraph in English is usually indicated, in addition to the new sentence indicators, by the start of a new line and an indent. In Dutch, perhaps half of all books published only employ a new line to begin a new paragraph. Dutch practice is thus less redundant, but causes an occasional ambiguity when sentence and line endings coincide. Increasingly international practice shows that a paragraph may also start on a new line following a line of blank space, and dispense with the indent. This practice may cause ambiguity when the blank space coincides with the end of a page. Again, there are many ways to indicate a long quotation, i.e. one not run on within the text in prose: it may be indented left and/or right; it might be set with less leading; it might be set in a smaller font, or any combination of these devices.

Ambiguity
Besides the redundancy of several typographical devices being employed to indicate one structural item, we also see the reverse. One typographical device may represent various structural items. We may use italics, for example, to indicate emphasis, words from a foreign language, a book title, structural hierarchy (e.g., in headings) and so on. A space may be used variously to divide words, sentences or thousands (the "thin space" in 100 000). A full stop may represent a decimal divider, a numerical divider (as in chapter numbering: "1. The economic view" or "1.2"), an end of sentence indicator, part of the mark indicating elision (...), file name extension divider (markup.html), etcetera.
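The ambiguity of the full stop can be made concrete with a small sketch in Python. The sample text and the function name naive_sentences are invented for the purpose of illustration. A program that takes every full stop for an end of sentence indicator goes wrong as soon as the same character appears in one of its other roles:

```python
# Illustrative sketch: treat every full stop as the end of a sentence.
# The sample text is made up; it uses the full stop as numerical divider,
# decimal divider, end of sentence indicator and file name extension divider.
text = "Section 1.2 costs 3.50 guilders. See markup.html for details."


def naive_sentences(s):
    """Split on every full stop, whatever its function in the text."""
    return [part.strip() for part in s.split(".") if part.strip()]


parts = naive_sentences(text)
print(len(parts))  # 5 fragments, where a human reader sees 2 sentences
for part in parts:
    print(repr(part))
```

A human reader resolves the four different functions of the full stop without effort; the program has only the form of the character to go by.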


Coding and decoding
Both redundancy and ambiguity are the result of the implicit nature of all of the examples of markup discussed so far. That is to say, the markup never states explicitly what it means; rather, we rely on unspoken conventions for the use we make of it. Partly there is of course no need in written and printed communication to be explicit (human beings are very good at understanding typography), partly it is simply not feasible. The result, at any rate, is that in typographic practice no one-to-one relationship exists between form and function.

In the absence of an explicit, universal typographic code, usually (if the text is sufficiently long) underlying conventions can be deduced to aid us in the decoding of the typographic code. But this process is complicated, for example, by the fact that the code tends to be adapted to the circumstances of place and time. In a trendy youth magazine the conventions will be different from those observed in a staid scholarly journal, not to mention the national cultural differences. Even if underlying assumptions are comparable or even largely the same, that underlying matrix is too far removed from the implementation to be a practicable guide. For humans it is often difficult enough to decode codes with which they are not familiar; for computers it is not possible at all. Certainly the outcome is not reliable, because not objective.

If the analysis of typographical encoding and decoding is fraught with difficulties, the terminology we have to describe the result is also defective:

Despite a tradition of book design going back centuries, and despite the efforts of many devoted critics, no one could claim that there is a consensus on how to describe a printed page in detail. Some critics speak of the "bibliographic codes" that form part of the publication of any work. But the term is, for now, still more a metaphor than a sober description. The characteristic of any code is that it is made up of a finite set of signs, which as Saussure teaches us are arbitrary linkages of signifier and signified. For artificial languages, the sets of signifiers and their meanings are given by the creator of the artificial language. For natural languages, dictionaries attempt to catalog the signifiers and their significance; grammars attempt to explain the rules for combining signs into utterances. We have nothing equivalent for the physical appearance of texts in books. Any serious attempt to record the bibliographic codes built into the book design and typography of a literary work must begin by specifying the set of signs to be distinguished. Is 24-point type different from 10-point type as a bibliographic code? In most circumstances, yes. Is 10-point type from one type foundry different from 10-point type in the same face, produced by a different foundry? In most circumstances, no. What about 10- and 11-point type? Ten and 12? To specify a formal language for expressing significant differences of typographic treatment, we need to reach some agreement about what constitutes a significant difference – what the minimal pairs are. (Michael Sperberg-McQueen, "Textual Criticism and the Text Encoding Initiative", p. 54)

The electronic environment
In an electronic environment, the implicit nature of conventional markup, with the attendant unreliability of its decoding, is unacceptable. Computers have compelled us to specify our markup explicitly. Because computers are incapable of processing in a controlled fashion anything which has not been explicitly defined, we are forced, when dealing with them, to be very clear about our definitions and usage of text and markup.

In the case of the simplest usage of text, on the character level, including word spaces, capital vs lower case letters and punctuation ("punctuational" markup) this explicit definition, as we have seen, is taken care of by the ASCII character set, and now increasingly by the Unicode character set. All character sets encode capitals and lower case letters, numbers, spaces and the commonest punctuation marks: ",", ".", ";", ":", etcetera. However, even here you might say that the designers of the ASCII table were guided primarily by the form of the character (taking their cue from the typewriter keyboard), whereas it is the difference in function that matters. There is, after all, a major functional difference between the full stop as the indication of a sentence end and the full stop as a decimal separator. It could perhaps be argued that it is a great pity, and a lost opportunity, that the computer industry did not make better use of the computer's ability to distinguish between characters. That a colon and a semicolon only differ by the tiniest pen stroke in their graphic representation does not make them any more similar to a computer than A and z. To a computer all that is relevant is their function: form is no issue. In letterpress printing a p could, at a pinch, be made to serve as a d by turning it upside down. The saying "Mind your p's and q's" illustrates that p and q could be easily confused because the compositor worked with the mirror images of characters. In the computer every character is equally unique; there is no greater similarity between two characters either because their appearance is similar, or because their ASCII values show greater resemblance. In fact, the whole notion of resemblance does not exist for a computer. Conversely, a distinction between the different functions of the full stop, for which a case can be made in the ASCII system, would make no sense whatsoever in the case of metal type.
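The point can be checked in a few lines of Python, given here as a minimal sketch (the sample strings are invented). Each character is stored as a number, and that number records which character it is, not what it looks like or what function it fulfils:

```python
# Each character is stored as a number (its code point in ASCII/Unicode).
for ch in ":;Az":
    print(repr(ch), ord(ch))

# The full stop has a single code, whatever its function in the text.
decimal_point = "3.14"[1]       # full stop as decimal separator
sentence_end = "The end."[-1]   # full stop as end of sentence indicator
same_code = ord(decimal_point) == ord(sentence_end)
print(same_code, ord(decimal_point))
```

The colon and semicolon happen to have adjacent codes (58 and 59), but nothing in the machine treats them as more alike than A (65) and z (122). The two full stops, on the other hand, are indistinguishable: both are stored as 46.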

That computers were not developed with such enhanced capabilities is primarily because they are simply automated typewriters and, with the exception of the addition of some few control keys, the computer keyboard imitates the typewriter keyboard. But of course the whole issue at hand has only been brought into being by the growing importance of electronic text, and thus by the growing importance of the computer itself. We are, in other words, making an anachronistic wish. But even if someone had recognised the unique opportunity offered by the need to design an alphabet for computers, no keyboard could have accommodated even the most frequently used of the endless range of our typographic devices.


2. Explicit Markup

Markup through markup language
As we have seen, the distinction between form and function is crucial in electronic text. And not only did the design of the ASCII character set miss the opportunity to make computers recognise the sort of distinctions humans can make effortlessly on the character level, there are all of the problems inherent in the electronic representation of typographical markup discussed in the previous chapter. Notoriously, there is the problem of the many different schemes of proprietary markup, involving binary codes, used by word processors and layout programs to encode typographical markup (such as italics, bold, new page, indents, columns etc.). Documents created in one program cannot usually be read by another, at least not without the aid of a conversion filter, as most people will have had ample occasion to lament. This problem is exacerbated when the documents are transferred from one software platform to another: DOS to MS Windows; Windows to Macintosh; Macintosh to Unix. And even if the document can be read, its graphic representation may be different on different computers owing to differences in personal preferences. De facto standards spring up, and perish again, making electronic textual transmission a shaky affair, with further-reaching implications as the internet's grasp on human communication gets firmer.

To achieve the purpose of the interchange of texts between people and the hardware and software they use without communication breaking down, while at the same time circumventing the limitations of ASCII (which cannot deal very well with accented characters, let alone foreign alphabets, or with space and other typographic features) the concept of a descriptive markup language was invented. A markup language is a language that can describe explicitly any features that may be in danger of not being understood, or of being misunderstood, by computers or by other human beings or by both. These explicit descriptions take the form of codes either embedded in the text and clearly marked as codes, or stored outside it and keyed to it.

The history of generic markup, like that of word processing and page layout programs, goes back to typesetting:

[That g]eneric markup can help us to reintroduce the important separation between structure and appearance ... was realized at the time of the confusion over specific markup with photo-typesetting systems. A movement was started to create a standard markup language, which all typesetting vendors would be persuaded to accept as input. It would be the typesetting houses' problem to translate this language into the language of their own photocomposer machines.
To be able to do this, a generic markup language was needed. Generic markup means adding information to the text indicating the logical components of a document, such as paragraphs, headers, footnotes. This initial generic markup effort was led by the GCA, an industry group who owns the trademark "GenCode" which is the name of the generic markup language intended for typesetters. (Van Herwijnen, p. 20)

However, there also exist markup languages that store codes offline and key them to the text. The relative merits of both will be discussed in Chapter 5, "Markup continued".

By way of an example of how markup works we may look at HTML (HyperText Markup Language), the most widely familiar markup language in actual use today. It was developed by Tim Berners-Lee in # specifically to provide a graphic navigation interface for the World Wide Web. It is an application of the Standard Generalised Markup Language (SGML), which is a so-called "metalanguage", meaning a language in which markup languages are written. HTML provides a system to encode some of the most frequently used typographic features, and it allows hyperlinking from one document on the World Wide Web to another. It performs these functions by marking up the textual content (written mostly in a natural language) with the markup codes defined in the HyperText Markup Language. Perhaps surprisingly, in view of the observation made in Chapter 2 that the ASCII character set does not contain the means to add more than the most elementary typographic formatting to text, this generic markup solution employs the ASCII character set. Bold text, for example, cannot be represented directly in a plain ASCII text file, which is why word processing programs encode it using proprietary binary codes. In HTML, the notation for bold text, using only ASCII characters, is the code <B>. A text preceded by <B> and followed by </B> will be presented as bold type by an internet browser:

<B> This text is meant to appear in bold type.</B>

(<B> is the "start-tag" for bold type and "</B>" is the "end-tag" specifying where the bold type ends.) These "tags" are examples of "explicit markup" (i.e. codes that specify explicitly some typographic or semantic feature of the text), which the computer therefore can process unambiguously. In the example the markup is <B> and </B>, which explicitly specifies that a part of the text (the text between the markup codes) is to be shown typographically as bold. The code is made to stand out from the text by being enclosed within the reserved marker characters "<" and ">". Apart from the code for bold, HTML has codes for many other typographic features such as blank lines, tables, indented quoted text, lists, etcetera.
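That explicit tags can be processed unambiguously may be illustrated with a minimal sketch in Python, using the HTMLParser class from the standard library. The class name BoldCollector and the sample sentence around the example are our own inventions; the sketch merely collects whatever text stands between <B> and </B>:

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the text found between <B> and </B> tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <B> elements currently open
        self.bold = []   # text fragments seen while inside <B>

    def handle_starttag(self, tag, attrs):
        if tag == "b":   # html.parser reports tag names in lower case
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.bold.append(data)


collector = BoldCollector()
collector.feed(
    "Plain text. <B>This text is meant to appear in bold type.</B> Plain again."
)
collector.close()
bold_text = "".join(collector.bold)
print(bold_text)  # This text is meant to appear in bold type.
```

Because the start-tag and end-tag are set apart by the reserved characters "<" and ">", the program needs no knowledge of typographic convention to find the marked segment.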

The hyperlinking ability of HTML uses the same principle of markup language. To create a hyperlink in a WWW document, you mark up the text segment you wish to link with the "anchor" start and end tags <A> and </A>, and specify the name and location of the document to which you wish to refer as the value of an attribute called "HREF" (Hypertext Reference). The whole looks like this:

People will find <A HREF="otherdoc.html">this document</A> handy.

The internet browser will show all text between <A> and </A> as a link, and the target of the link, where the browser will take you if you click on the link, is a document (in the same directory on the same server) called "otherdoc.html". Hyperlinks can point to another place inside the same document; to another document in the same location (as illustrated here); or to another document anywhere on the WWW.
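How a program might read the link target from the markup, and work out where it leads, can be sketched as follows. This is a simplified illustration in Python, not a description of how any particular browser works; the class name LinkCollector, the address of the referring document and the two additional targets are all hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the HREF attribute value of every <A> start-tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # html.parser reports tag and attribute names in lower case
            for name, value in attrs:
                if name == "href":
                    self.targets.append(value)


# Hypothetical address of the document that contains the link.
base = "http://www.example.org/book/ch03.html"

parser = LinkCollector()
parser.feed('People will find <A HREF="otherdoc.html">this document</A> handy.')
parser.close()

# The three kinds of target: the same document, the same location, elsewhere.
# The first and last values are invented for illustration.
examples = ["#notes"] + parser.targets + ["http://other.example.com/doc.html"]
resolved = [urljoin(base, target) for target in examples]
for address in resolved:
    print(address)
```

The relative reference "otherdoc.html" is resolved against the location of the referring document, which is why it is found in the same directory on the same server.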

Almost all HTML tags are designed to render typographic information of the implicit kind. In fact HTML replaces one type of implicit encoding by another. That is to say that while a text coded in HTML can be processed on any computer platform for simple viewing (showing a text placed within the codes <I> and </I> as italics, for example), HTML cannot describe much in the way of function or structure. A computer will still not be able to tell whether a fragment of text identified as italics is a book title or a phrase in a foreign language. Not only that, but even in its possibilities to represent typographic information HTML has its limitations. Especially space (curiously perhaps in view of the fact that it is, as we have seen, the single most important structuring device) still presents a major problem: HTML has trouble with, for example, tabs, and ignores multiple spaces.
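The treatment of space may be approximated in a short Python sketch. It is a rough imitation of what a browser does with ordinary running text, under the assumption that every run of spaces, tabs and line breaks is shown as a single space; it does not reproduce the actual rules of any browser, and the function name and sample text are invented:

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs and line breaks by one space."""
    return " ".join(text.split())


# Columns carefully aligned with tabs and multiple spaces in the source...
source = "Name:\t\tH. Nygh\nPlace:      Rotterdam"

# ...are reduced to a single run of words on display.
shown = collapse_whitespace(source)
print(shown)  # Name: H. Nygh Place: Rotterdam
```

Any structure that the author expressed through spacing alone is lost in the process, unless it is also stated explicitly in the markup.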

However, the limitations of HTML are not inherent in the concept of markup language per se: they are merely the result of the design of a particular markup language. Descriptive markup as a concept offers many more possibilities than HTML utilises. The following is an example of HTML:

(a) Example of HTML:

<HTML>

<HEAD>
<TITLE>Letter from Cassell to Nijgh, 07-04-1863</TITLE>
</HEAD>
<BODY>
<P>Ludgate Hill, E.C.<BR>
London 7th April 1863</P>
<P>An H. Nygh<BR>
Uitgever <BR>
Rotterdam</P>
<P>Dear Sir</P>
<P>We are favoured with yours of the 3rd mrt. </P>
<P>The First Number of our Illustrated Bunyans Pilgrim's Progress shall be sent you at the Earliest moment: but as we shall not publish it till the completion of our Bible (which will take place next month) we will in the meantime send you a specimen part. We ought to inform you that the post which brought us your letter, in which you ask us to quote a price for clich&eacute;s of the woodcuts, brought us a similar application from another house in Holland. Inasmuch as we have had the pleasure of corresponding with your house for some time we shall be glad to do business with you in this work if we can come to terms. The Engravings will be superb. It is our intention to have them executed in the very first style of art, and they will consequently be very expensive. Do you see your way to pay us one shilling &amp; three pence 1/3 per square inch wich, assuming that the specimen parts which we purpose sending you should meet your approval?</P>
<P>We shall be glad to hear from you at your Earliest convenience with your views upon this. We ask for an Early communication from you that we may know how to deal with the other application referred to above.</P>
<P>We are, Dear Sir <BR>
Yours faithfully <BR>
Cassell Petter &amp; Galpin </P>
<P>P.S. <I>The Pilgrims Progress</I> will make about 60 numbers (weekly)</P>
</BODY>
</HTML>

Illustration [#x; file Ch03_Picture_1.pict] shows the result in an internet browser window. Note that it makes no difference to the computer whether it is given example (a) [file: Ch03_Letter.html] or [#x; file: Ch03_Letter2.html] to show; the use of white space to separate and indent lines is for the convenience of the human reader only.

The limitations of HTML in representing structural information about a text can be illustrated by contrasting the HTML encoding of the letter in example (a) with the same letter encoded in a hypothetical markup language, Correspondence Markup Language (CML) in example (b).


(b) Hypothetical Correspondence Markup Language (CML)

<CML>

<HEAD>
<TITLE>Letter from Cassell to Nijgh, 07-04-1863</TITLE>
</HEAD>
<BODY>
<OPENER>
<DATELINE>
<ADDRESS>Ludgate Hill, E.C.</ADDRESS>
<PLACE>London </PLACE> <DATE>7th April 1863</DATE>
</DATELINE>
<ADDRESSEE>
<NAME>An H. Nygh
Uitgever </NAME>
<ADDRESS><PLACE>Rotterdam</PLACE></ADDRESS>
</ADDRESSEE> [#Check Vanhoutte]
<SALUTE>Dear Sir</SALUTE>
</OPENER>
<P>We are favoured with yours of the 3rd mrt. [#inst?]</P>
<P>The First Number of our Illustrated Bunyans Pilgrim's Progress shall be sent you at the Earliest moment: but as we shall not publish it till the completion of our Bible (which will take place next month) we will in the meantime send you a specimen part. We ought to inform you that the post which brought us your letter, in which you ask us to quote a price for clichés of the woodcuts, brought us a similar application from another house in Holland. Inasmuch as we have had the pleasure of corresponding with your house for some time we shall be glad to do business with you in this work if we can come to terms. The Engravings will be superb. It is our intention to have them executed in the very first style of art, and they will consequently be very expensive. Do you see your way to pay us one shilling &amp; three pence 1/3 per square inch wich, assuming that the specimen parts which we purpose sending you should meet your approval?</P>
<P>We shall be glad to hear from you at your Earliest convenience with your views upon this. We ask for an Early communication from you that we may know how to deal with the other application referred to above.</P>
<CLOSER>
<SALUTE>We are, Dear Sir
Yours faithfully </SALUTE>
<SIGNED>Cassell Petter &amp; Galpin</SIGNED>
<POSTSCRIPT>P.S. <TITLE>The Pilgrims Progress</TITLE> will make about 60 numbers (weekly)</POSTSCRIPT>
</CLOSER>
</BODY>
</CML>

HTML is clearly more limited in its capacity to encode the letter structure than CML. Where the HTML markup is limited to identifying paragraphs (<P>), line breaks (<BR>) and italics (<I>), markup in the Correspondence Markup Language example informs the computer (as well as any human reader) explicitly of the structural function of the various parts of the text. This information can then be used for all sorts of further processing purposes. For example, in a database of letters, the computer may be asked to find all letters signed by Cassell Petter & Galpin, or all letters sent between a certain range of dates. The computer would then look for "Cassell Petter & Galpin" only inside the markup codes <SIGNED></SIGNED>, or for dates only inside the <DATELINE></DATELINE> codes. Note that if you attempted to ask these questions of a collection of documents encoded in HTML like example (a), you would not be able to confine your search in the same way. A search for Cassell Petter & Galpin would return every occurrence of the name anywhere in a document, including as addressee or as subject.
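The difference between the two kinds of search can be sketched in Python. The sketch rests on several assumptions of our own: the hypothetical CML is written here as well-formed XML so that the parser in the Python standard library can read it; the letters are abridged; and the second letter, a reply from Nygh, is invented purely to give the search something to exclude.

```python
import xml.etree.ElementTree as ET

# Two abridged letters in the hypothetical CML, written as well-formed XML.
# The second letter is invented for the sake of the example.
letters = [
    """<CML><BODY>
         <OPENER><DATELINE><DATE>7th April 1863</DATE></DATELINE>
           <ADDRESSEE><NAME>H. Nygh</NAME></ADDRESSEE></OPENER>
         <P>We are favoured with yours.</P>
         <CLOSER><SIGNED>Cassell Petter &amp; Galpin</SIGNED></CLOSER>
       </BODY></CML>""",
    """<CML><BODY>
         <OPENER><ADDRESSEE><NAME>Cassell Petter &amp; Galpin</NAME></ADDRESSEE></OPENER>
         <P>Your specimen part has arrived.</P>
         <CLOSER><SIGNED>H. Nygh</SIGNED></CLOSER>
       </BODY></CML>""",
]


def signed_by(document, name):
    """True if the name occurs as the content of a <SIGNED> element."""
    root = ET.fromstring(document)
    return any((el.text or "").strip() == name for el in root.iter("SIGNED"))


def mentions(document, name):
    """True if the name occurs anywhere in the text, whatever its function."""
    return name in "".join(ET.fromstring(document).itertext())


name = "Cassell Petter & Galpin"
structured_hits = [i for i, doc in enumerate(letters) if signed_by(doc, name)]
plain_hits = [i for i, doc in enumerate(letters) if mentions(doc, name)]
print(structured_hits)  # [0]
print(plain_hits)       # [0, 1]
```

The search confined to <SIGNED> finds only the letter written by the firm; the unconfined search also returns the letter in which the firm is merely the addressee, which is all that a search of the HTML version could achieve.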

Apart from enabling further processing as in the search examples above, markup can also be linked to a typographic style for presentation on screen. In simple HTML browsers, this presentation style is "hard wired" in the program code. This means that any material between <P></P> codes will always be rendered in the same way, for example as 12pt Times Roman preceded by a line of white space, and any material between <I></I> codes will always be rendered as, for example, 12pt Times Italic. More sophisticated browsers can use stylesheets that associate markup tags with specific typographic formatting.

The example of the hypothetical Correspondence Markup Language illustrates the main goal of using descriptive markup: typographic form as an implicit marker of structure has been replaced by descriptive markup as an explicit marker of structure. The structural information represented by the markup has, moreover, been clearly separated from the content of the text, and each is represented in an explicit form that can be read and "understood" by both humans and computers.

One of the major advantages of separating the typographic form (representing structure) of a text from its content is the extreme versatility of output it allows. A text can be typeset by associating a particular markup code with a typographic form; it can be published in any number of electronic forms, from PDF to CD-ROM; or it can be stored in a database.
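This versatility can be illustrated with a final sketch in Python, again on the assumption that the hypothetical CML is written as well-formed XML. The table html_style stands in for a stylesheet and is entirely our own invention, as are the function names; the same marked-up fragment is turned once into HTML for display and once into a record of the kind a database might hold:

```python
import html
import xml.etree.ElementTree as ET

# A fragment of the hypothetical CML, written as well-formed XML.
closer = ET.fromstring(
    "<CLOSER><SALUTE>Yours faithfully</SALUTE>"
    "<SIGNED>Cassell Petter &amp; Galpin</SIGNED></CLOSER>"
)

# Invented "stylesheet": CML element name -> HTML start and end codes.
html_style = {
    "SALUTE": ("<P>", "</P>"),
    "SIGNED": ("<P><I>", "</I></P>"),
}


def to_html(element):
    """Render each child element with the codes the stylesheet assigns it."""
    lines = []
    for child in element:
        start, end = html_style[child.tag]
        lines.append(start + html.escape(child.text) + end)
    return "\n".join(lines)


def to_record(element):
    """Turn the same elements into a field/value record for a database."""
    return {child.tag.lower(): child.text for child in element}


page = to_html(closer)
record = to_record(closer)
print(page)
print(record)
```

Changing the appearance of every signature in a collection would then be a matter of changing one entry in the table, without touching the text of a single letter.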

Markup languages similar to the hypothetical Correspondence Markup Language used in example (b) above have been in existence since the 1980s. The advantages of such languages for computer processing of text are many, and it may be wondered why more use is not made of them. It has, for example, been suggested (e.g. in Coombs, Renear, and DeRose, "Markup Systems and the Future of Scholarly Text Processing") that the use of explicit markup, in place of a typographic representation of structure, would free writers from the need to think of typographic structuring, allowing them instead to concentrate on the writing task proper. This notion is based on the misguided assumption that people find it easy to separate form from content. They don't. Human beings have been conditioned by centuries of typographic structuring, and have come to depend on the visual cues that typography can provide for structuring arguments. Hence the tremendous success of WYSIWYG environments discussed in Chapter 2. Text editors are unsatisfactory for composing text because they don't allow authors to structure text typographically.

Chapter 5, "Markup Continued", takes a more in-depth technical view of descriptive markup, and discusses some more advanced features. The next chapter will deal with hypertext (and hypermedia), one of the new ways of ordering text, images and sound made possible by the digital media.

 

A.H. van der Weel; Tel. 071-5272974

This page or one of its nested pages last updated: 06-09-2001