Digital Text and the Gutenberg Heritage
Ch. 3: The Concept of Markup
{Status=draft}
© Copyright 2001 by Adriaan van der Weel
1. Implicit Markup
1a. Markup in the history of human communications
As readers, we only need to glance at a printed page to recognise segments of
text as footnotes, quotations, marginal glosses and so on. Without reading a
letter of the text we are able to identify title pages, chapter openings and
other major divisions within a book. All these structural elements are conveniently
distinguished from the body of the text for the ease of the reading experience.
They are rendered distinct by a variety of typographic means, such as type size,
the use of bold, italics, different typefaces and white space. Though we tend to be less
conscious of it, the same structuring takes place on the word and sentence level.
The flow of words in a conversation is uninterrupted; segmenting speech into
meaningful units is one of the major challenges confronting anyone who sets
out to learn a new language. In chirographic practice (the practice of writing
by hand), it took a long time before word spaces were standardly used as a structuring
element. In Roman times a raised dot was often used before the Greek custom
of scripta continua (running words together) was adopted. It was only
when Irish Christian scribes reinvented the word space around the fifth century
AD that it came to stay. {Gumbert, "Typography in the Manuscript Book", p. 10.}
From the earliest times, writing has been governed by conventions. Conventions rule, for example, the direction in which we write (right to left, left to right, top to bottom or even boustrophedonic, which is to say the way the ox turns when ploughing: from left to right and right to left in turn); the way we end one sentence and begin another; the meaning we attribute to punctuation marks and the white space surrounding characters. Although the early mediaeval scribes, writing on vellum, identified the first letter of a paragraph by highlighting it, sometimes simply in red, often more ornately, with artistic flamboyance, they lacked many of the other punctuation and spacing conventions to which we are used. The text proceeded in a virtually unbroken line, with the exception of closely spaced full stops, down two columns on each page, until the next paragraph.
As more people learned to read and the practice of silent reading developed, the demand for structuring devices such as word spacing and clearer rubrication for punctuation and section headings increased. The invention of printing called for a further elaboration and codification of typographic conventions. Rubrication, after all, was scribal handwork which in the long run went against the nature of printing. Print both demanded and made it possible to broaden the array of typographical structuring devices. Print is more precise than handwriting in rendering subtle spatial variations, differences in type size and weight and so on. {Ong, Orality and Literacy, p. 128.} It fostered the development of standards and conventions through which the structure of a text, and thus its meaning, could be transferred more faithfully.
We use the term markup to refer to the conventions used to present words visually, whether in manuscript, print or electronic form. This may be the use of white space (e.g. space between paragraphs and words, the space surrounding titles, etc.), punctuation, or the form, size or type of the letters themselves (e.g. bold lettering, italics, etc.), and so on.
[Sidebar] Major categories of structuring devices:
I. General
- Illustrations
- Ornaments (e.g., fleurons, rules, boxes, shading)
- Running headers or footers
- Folios (page numbers)
- Special signs (e.g., paragraph signs; brackets)
II. Type
- Capitals vs lower case
- Punctuation (originally designed to render aspects of speech; Gumbert 1993, p. 11)
- Typeface
- Type size
- Ornamented or dropped capitals
- Justification (left, right, centred, fully justified)
- Bold, italic, small caps, underlining
III. The arrangement of blank space for
- Word spacing
- Leading (interlinear spacing)
- Margins (size, and proportions, of the type page compared to the page as a whole)
- Indents
- Columns
- Tables
- Letter spacing
IV. Colour (including grey)
- Rubrication (mainly in MS and early printed book practice)
- Background highlighting (e.g., tables, boxes, sidebars etc.)
This markup usually aims to obey the typographic convention of the period:
it answers to the unspoken expectations of the reader. But of course typographic
markup may serve other purposes beyond structuring the text for the reader's
convenience. It may, for example, be closely intertwined with an aesthetic purpose
pursued for its own sake. Or again, a more subtle semiotic purpose may be distinguished:
the general appearance of the text may convey a message about the text's
nature, or about the social position of its reader, owner or user. {Gumbert, "Typography
in the Manuscript Book", p. 6.}
Redundancy
We are quite used to a certain redundancy in the typographic signposting of
structure. In Western usage, a new sentence, for example, is indicated by three
means: a full stop (or other end-of-sentence marker); a white space; and a capital
letter. In Caroline script, for example, there was less such redundancy. Word
spaces were not consistently used, and a medial dot served variously as our
modern full stop when it is followed by a capital letter, or as our comma when
it is followed by a lower case letter. {Kendrick, 1985-87, p. 126.} Similarly,
a paragraph in English is usually indicated, in addition to the new sentence
indicators, by the start of a new line and an indent. In Dutch, perhaps half
of all books published only employ a new line to begin a new paragraph. Dutch
practice is thus less redundant, but causes an occasional ambiguity when sentence
and line endings coincide. Increasingly, international practice shows that a
paragraph may also start on a new line following a line of blank space, and
dispense with the indent. This practice may cause ambiguity when the blank space
coincides with the end of a page. Again, there are many ways to indicate a long
quotation, i.e. one not run on within the text in prose: it may be indented
left and/or right; it might be set with less leading; it might be set in a smaller
font, or any combination of these devices.
Ambiguity
Besides the redundancy of several typographical devices being employed to indicate
one structural item, we also see the reverse. One typographical device may represent
various structural items. We may use italics, for example, to indicate emphasis,
words from a foreign language, a book title, structural hierarchy (e.g., in
headings) and so on. A space may be used variously to divide words, sentences
or thousands (the "thin space" in 100 000). A full stop may represent a decimal
divider, a numerical divider (as in chapter numbering: "1. The economic view"
or "1.2"), an end of sentence indicator, part of the mark indicating elision
(...), a file name extension divider (markup.html), etcetera.
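The ambiguity of the full stop can be made concrete with a small Python sketch. It is an illustration only: the function name and the sample sentence are invented here, and the "reader" it models knows nothing but form, treating every full stop as the end of a sentence.

```python
def naive_sentences(text):
    """Split on every full stop, as a purely form-based reading would."""
    return [part.strip() for part in text.split(".") if part.strip()]


# Invented sample: two sentences to a human reader, but five full stops
# serving as numerical divider, extension divider, sentence end and decimal divider.
sample = "See section 1.2 of markup.html. It costs 3.50."

print(naive_sentences(sample))
# ['See section 1', '2 of markup', 'html', 'It costs 3', '50']
```

A human reader sees two sentences; the form-only split yields five fragments, because nothing in the character itself states which function it performs.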
Coding and decoding
Both redundancy and ambiguity are the result of the implicit nature of all of
the examples of markup discussed so far. That is to say, the markup never states
explicitly what it means; rather, we rely on unspoken conventions for the use
we make of it. Partly there is of course no need in written and printed communication
to be explicit (human beings are very good at understanding typography), partly
it is simply not feasible. The result, at any rate, is that in typographic practice
no one-to-one relationship exists between form and function.
In the absence of an explicit, universal typographic code, usually (if the text is sufficiently long) underlying conventions can be deduced to aid us in the decoding of the typographic code. But this process is complicated, for example, by the fact that the code tends to be adapted to the circumstances of place and time. In a trendy youth magazine the conventions will be different from those observed in a staid scholarly journal, not to mention the national cultural differences. Even if underlying assumptions are comparable or even largely the same, that underlying matrix is too far removed from the implementation to be a practicable guide. For humans it is often difficult enough to decode codes with which they are not familiar; for computers it is not possible at all. Certainly the outcome is not reliable, because not objective.
If the analysis of typographical encoding and decoding is fraught with difficulties,
the terminology we have to describe the result is also defective:
In the case of the simplest usage of text, on the character level, including word spaces, capital vs lower case letters and punctuation ("punctuational" markup) this explicit definition, as we have seen, is taken care of by the ASCII character set, and now increasingly by the Unicode character set. All character sets encode capitals and lower case letters, numbers, spaces and the commonest punctuation marks: ",", ".", ";", ":", etcetera. However, even here you might say that the designers of the ASCII table were guided primarily by the form of the character (taking their cue from the typewriter keyboard), whereas it is the difference in function that matters. There is, after all, a major functional difference between the full stop as the indication of a sentence end and the full stop as a decimal separator. It could perhaps be argued that it is a great pity, and a lost opportunity, that the computer industry did not make better use of the computer's ability to distinguish between characters.
That a colon and a semicolon differ only by the tiniest pen stroke in their graphic representation does not make them any more similar to a computer than A and z. To a computer all that is relevant is their function: form is no issue. In letterpress printing a p could, at a pinch, be made to serve as a d by turning it upside down. The saying "Mind your p's and q's" illustrates that p and q could easily be confused because the compositor worked with the mirror images of characters. In the computer every character is equally unique; there is no greater similarity between two characters either because their appearance is similar, or because their ASCII values show greater resemblance. In fact, the whole notion of resemblance does not exist for a computer. Conversely, the case that can be made for distinguishing between the different functions of the full stop in the ASCII system would not make any sense whatsoever in the case of metal letters.
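The point about form and function on the character level can be sketched in a few lines of Python. This is an illustration, not part of any standard's text: the variable names and sample strings are invented, and only the built-in `ord` function is used to show the number under which each character is stored.

```python
# Characters that look alike on the page are simply different numbers.
colon_code, semicolon_code = ord(":"), ord(";")
p_code, d_code = ord("p"), ord("d")

# One and the same code serves two different functions of the full stop.
sentence_end = "It rained."[-1]   # full stop closing a sentence
decimal_divider = "3.14"[1]       # full stop as decimal divider

print(colon_code, semicolon_code)                   # 58 59
print(p_code, d_code)                               # 112 100
print(ord(sentence_end), ord(decimal_divider))      # 46 46
print(sentence_end == decimal_divider)              # True
```

The computer compares the characters for identity only: ":" and ";" are as distinct as any other pair, while the two full stops, different in function, are indistinguishable.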
That computers were not developed with such enhanced capabilities is primarily because they are simply automated typewriters and, with the exception of the addition of a few control keys, the computer keyboard imitates the typewriter keyboard. But of course the whole issue at hand has only been brought into being by the growing importance of electronic text, and thus by the growing importance of the computer itself. We are, in other words, making an anachronistic wish. But even if someone had recognised the unique opportunity offered by the need to design an alphabet for computers, no keyboard could have accommodated even the most frequently used of the endless range of our typographic devices.
2. Explicit Markup
Markup through markup language
As we have seen, the distinction between form and function is crucial in electronic
text. And not only did the design of the ASCII character set miss the opportunity
to make computers recognise the sort of distinctions humans can make effortlessly
on the character level, there are all of the problems inherent in the electronic
representation of typographical markup discussed in the previous chapter. Notoriously,
there is the problem of the many different schemes of proprietary markup (involving
binary codes) used by word processors and layout programs to encode typographical
markup (such as italics, bold, new page, indents, columns etc.). Documents created
in one program cannot usually be read by another, at least not without the aid
of a conversion filter, as most people will have had ample occasion to lament.
This problem is exacerbated when the documents are transferred from one software
platform to another: DOS to MS Windows; Windows to Macintosh; Macintosh to Unix.
And even if the document can be read, its graphic representation may be different
on different computers owing to variations in personal preferences. De facto
standards spring up, and perish again, making electronic textual transmission
a shaky affair, with further-reaching implications as the internet's grasp on
human communication gets firmer.
To allow texts to be interchanged between people, and between the hardware
and software they use, without communication breaking down, while at the same
time circumventing the limitations of ASCII (which cannot deal very well with
accented characters, let alone foreign alphabets, or with space and other
typographic features), the concept of a descriptive markup language was invented.
A markup language is a language that can describe explicitly any features that
may be in danger of not being understood, or of being misunderstood, by computers,
by other human beings or by both. These explicit descriptions take the form of
codes either embedded in the text and clearly marked as codes, or stored outside
it and keyed to it.
The history of generic markup, like that of word processing and page layout
programs, goes back to typesetting:
By way of an example of how markup works we may look at HTML (HyperText Markup Language), the most widely familiar markup language in actual use today. It was developed by Tim Berners-Lee in # specifically to provide a graphic navigation interface for the World Wide Web. It is an application of the Standard Generalised Markup Language (SGML), which is a so-called "metalanguage", meaning a language in which to write markup languages. HTML provides a system to encode some of the most frequently used typographic features, and it allows the practice of hyperlinking from one document on the World Wide Web to another. It performs these functions by marking up the textual content (written mostly in a natural language) with the markup codes defined in the HyperText Markup Language. Perhaps surprisingly, in view of the observation made in Chapter 2 that the ASCII character set does not contain the means to add more than the most elementary typographic formatting to text, this generic markup solution employs the ASCII character set. For example, bold text cannot be represented in a text editor using plain ASCII text, while in word processing programs bold text is encoded using proprietary binary codes. In HTML, the notation for bold text, using ASCII characters, is the code <B>. A text preceded by <B> and followed by </B> will be presented as bold type by an internet browser:
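A minimal sketch in Python may make this concrete. It uses the standard library's `html.parser` module to read a short, invented sample string and report which stretch of text the <B> codes mark as bold; the class and function names are hypothetical, and a real browser of course does far more than this.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the text that sits between <B> and </B> codes."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <B> codes are currently open
        self.bold_runs = []   # one string per stretch of bold text

    def handle_starttag(self, tag, attrs):
        if tag == "b":        # the parser reports tag names in lower case
            if self.depth == 0:
                self.bold_runs.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.bold_runs[-1] += data


def bold_text(source):
    """Return the stretches of text marked up as bold in an HTML string."""
    parser = BoldCollector()
    parser.feed(source)
    parser.close()
    return parser.bold_runs


sample = "Only <B>this phrase</B> is bold."   # invented sample text
print(sample.isascii())    # True: content and markup are both plain ASCII
print(bold_text(sample))   # ['this phrase']
```

The sample consists of ASCII characters only, yet a program that knows the convention can recover the instruction "set this phrase in bold" from it.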
Almost all HTML tags are designed to render typographic information of the implicit kind. In fact HTML replaces one type of implicit encoding by another. That is to say that while a text coded in HTML can be processed on any computer platform for simple viewing (showing a text placed within the codes <I> and </I> as italics, for example), HTML cannot describe much in the way of function or structure. A computer will still not be able to tell whether a fragment of text identified as italics is a book title or a phrase in a foreign language. Not only that, but even in its possibilities to represent typographic information HTML has its limitations. Especially space (curiously perhaps in view of the fact that it is, as we have seen, the single most important structuring device) still presents a major problem: HTML has trouble with, for example, tabs, and ignores multiple spaces.
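The treatment of space can be approximated in a few lines of Python. This is a simplified sketch of how a browser displays ordinary running text, not a statement of the full HTML rules: the function name and the sample string are invented, and text inside a <PRE> element, which browsers display with its spacing intact, is deliberately left out of account.

```python
import re


def collapse_whitespace(text):
    """Reduce every run of spaces, tabs and line breaks to a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


# Invented sample: a tabbed column and an indented second line.
source = "column one\t\tcolumn two\n    indented line"

print(collapse_whitespace(source))
# column one column two indented line
```

The tabs that aligned the columns and the spaces that indented the second line are gone: whatever structure they carried for the human reader is lost in display.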
However, the limitations of HTML are not inherent in the concept of markup language per se: they are merely the result of the design of a particular markup language. Descriptive markup as a concept offers many more possibilities than HTML utilises. The following is an example of HTML:
(a) Example of HTML:
<HTML> The limitations of HTML in representing structural information about a text
can be illustrated by contrasting the HTML encoding of the letter in example
(a) with the same letter encoded in a hypothetical markup language, Correspondence
Markup Language (CML) in example (b).
<CML> Apart from enabling further processing as in the search examples above, markup
can also be linked to a typographic style for presentation on screen. In simple
HTML browsers, this presentation style is "hard wired" in the program code.
This means that any material between <P></P> codes will always be
rendered in the same way, for example as 12pt Times Roman preceded by a line
of white space, and any material between <I></I> codes will always
be rendered as, for example, 12pt Times Italics. More sophisticated browsers
can use stylesheets that associate markup tags with specific typographic formatting.
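The difference between "hard wired" presentation and a stylesheet can be sketched as a simple lookup table in Python. Everything here is hypothetical: the table, the function name and the "house style" are invented for illustration, and only the <P> and <I> defaults echo the examples given in the text.

```python
# Built-in defaults, standing in for what a simple browser has "hard wired".
HARD_WIRED = {
    "P": {"font": "Times Roman", "size_pt": 12, "blank_lines_before": 1},
    "I": {"font": "Times Italic", "size_pt": 12},
}


def style_for(tag, stylesheet=None):
    """Return the formatting for a tag; a stylesheet may override the defaults."""
    style = dict(HARD_WIRED.get(tag.upper(), {}))
    if stylesheet:
        style.update(stylesheet.get(tag.upper(), {}))
    return style


# An invented stylesheet that changes how paragraphs are presented.
house_style = {"P": {"font": "Garamond", "size_pt": 11}}

print(style_for("p"))
# {'font': 'Times Roman', 'size_pt': 12, 'blank_lines_before': 1}
print(style_for("p", house_style))
# {'font': 'Garamond', 'size_pt': 11, 'blank_lines_before': 1}
```

The marked-up text itself is the same in both cases; only the table that associates each code with a typographic form differs.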
The example of the hypothetical Correspondence Markup Language illustrates
the main goal of using descriptive markup: typographic form as an implicit
marker of structure has been replaced by descriptive markup as an explicit
marker of structure. The structural information represented by the markup has,
moreover, been clearly separated from the content of the text, and each is represented
in an explicit form that can be read and "understood" by both humans and computers.
One of the major advantages of separating the typographic form (representing
structure) of a text from its content is the extreme versatility of output it
allows. A text can be typeset by associating a particular markup code with a
typographic form; it can be published in any number of electronic forms, from
PDF to CD-ROM, or it can be stored in a database.
Markup languages similar to the hypothetical Correspondence Markup Language
used in example (b) above have been in existence since the 1980s. The advantages
of such languages for computer processing of text are many, and it may be wondered
why more use is not made of them. It has, for example, been suggested (e.g.
in Coombs, Renear, and DeRose, "Markup Systems and the Future of Scholarly Text
Processing") that the use of explicit markup, in place of a typographic representation
of structure, would free writers from the need to think of typographic structuring,
allowing them instead to concentrate on the writing task proper. This notion
is based on the misguided assumption that people find it easy to separate form
from content. They don't. Human beings have been conditioned by centuries of
typographic structuring, and have come to depend on the visual cues that typography
can provide for structuring arguments. Hence the tremendous success of WYSIWYG
environments discussed in Chapter 2. Text editors are unsatisfactory for composing
text because they don't allow authors to structure text typographically.
Chapter 5, "Markup Continued", takes a more in-depth technical view of descriptive
markup, and discusses some more advanced features. The next chapter will deal
with hypertext (and hypermedia), one of the new ways of ordering text, images
and sound made possible by the digital media.
</HTML>
Illustration [#x; file Ch03_Picture_1.pict] shows the result in an internet browser
window. Note that it makes no difference to the computer whether it is given example
(a) [file: Ch03_Letter.html] or [#x; file: Ch03_Letter2.html] to show; the use
of white space to separate and indent lines is for the convenience of the human
reader only.
(b) Hypothetical Correspondence Markup Language (CML)
<CML>
HTML is clearly more limited in its capacity to encode the letter structure than
CML. Where the HTML markup is limited to identifying paragraphs (<P>), line
breaks (<BR>) and italics (<I>), the markup in the Correspondence Markup
Language example informs the computer (as well as any human reader) explicitly
of the structural function of the various parts of the text. This information
can then be used for all sorts of further processing purposes. For example, in
a database of letters, the computer may be asked to find all letters signed by
Cassell Petter & Galpin, or all letters sent between a certain range of dates.
The computer would then look for "Cassell Petter & Galpin" only inside the
markup codes <SIGNED></SIGNED>,
or for dates only inside the <DATELINE></DATELINE>
codes. Note that if you attempted to put these questions to a collection of
documents encoded in HTML like example (a), you would not be able to confine your
search in the same way. A search for Cassell Petter & Galpin would return every
occurrence of the name anywhere in a document, including as addressee or as
subject.
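The difference between the two kinds of search can be sketched in Python. The sketch rests on several assumptions that should be stated plainly: CML is hypothetical, only the element names SIGNED and DATELINE come from the text, and the other names (LETTERS, LETTER, ADDRESSEE, BODY), the ID attribute and all the content are invented here. For convenience the letters are written as XML, so that the standard library's `xml.etree.ElementTree` can read them and the ampersand has to be written as `&amp;`.

```python
import xml.etree.ElementTree as ET

# Two invented letters in a hypothetical, XML-style CML.
COLLECTION = """
<LETTERS>
  <LETTER ID="1">
    <DATELINE>placeholder date</DATELINE>
    <ADDRESSEE>Cassell Petter &amp; Galpin</ADDRESSEE>
    <BODY>placeholder text</BODY>
    <SIGNED>A. N. Author</SIGNED>
  </LETTER>
  <LETTER ID="2">
    <DATELINE>placeholder date</DATELINE>
    <ADDRESSEE>A. N. Author</ADDRESSEE>
    <BODY>placeholder text</BODY>
    <SIGNED>Cassell Petter &amp; Galpin</SIGNED>
  </LETTER>
</LETTERS>
"""


def signed_by(root, name):
    """Letters whose SIGNED element contains exactly this name."""
    return [letter.get("ID") for letter in root.iter("LETTER")
            if (letter.findtext("SIGNED") or "").strip() == name]


def mentioned_anywhere(root, name):
    """Letters in which the name occurs anywhere at all (the HTML situation)."""
    return [letter.get("ID") for letter in root.iter("LETTER")
            if name in "".join(letter.itertext())]


root = ET.fromstring(COLLECTION.strip())
name = "Cassell Petter & Galpin"

print(signed_by(root, name))           # ['2']
print(mentioned_anywhere(root, name))  # ['1', '2']
```

Confining the search to the SIGNED element returns only the letter actually signed by the firm; the unconfined search also returns the letter in which the firm is merely the addressee.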
A.H. van der Weel; Tel. 071-5272974
This page or one of its nested pages last updated: 06-09-2001