KEYWORDS: text encoding, SGML, MECS
AFFILIATION: University of Bergen
E-MAIL: Claus.Huitfeldt@hd.uib.no
FAX NUMBER: +47-55-58 94 70
PHONE NUMBER: +47-55-58 29 50
1. Text Objects and Their Properties
Letters of the alphabet, numbers, punctuation marks and a few other conventional signs are basic constituents of any text written in an alphabetical writing system. Any computing system which is capable of representing strings of alphanumeric characters is therefore also capable of representing at least the most basic linguistic contents of texts written in such writing systems.
However, looking back on a long tradition of manuscripts and printed books, we are definitely not prepared to admit that there is nothing more to texts than that. Not only can written texts contain graphical elements such as drawings, but the page layout, typography and graphical design may also play a crucial role in identifying and emphasising parts of a text, increasing their readability, and conveying structural relationships between them.
Since a computer text file is in a certain sense simply a long string of characters, it is by marking them up with reserved character combinations that text processing systems let us represent such properties and structures. Text encoding systems such as SGML are an attempt to systematize, generalize and standardize such markup.
The marks or tags serve to identify specific parts or elements of the text and to ascribe specific properties to these elements. In a text encoded in accordance with the TEI's Guidelines for SGML encoding, e.g., we will frequently find structures such as:
... <emph> ... </emph> ...
The start tag and the end tag (i.e., the reserved character combinations '<...>' and '</...>') indicate the start and end of an element and ascribe to this element a property indicated by the generic identifier (the GI, in this case 'emph'). The TEI Guidelines tell us that this particular GI indicates the property 'emphasized', which 'marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect' (TEI P3, p 955).
Broadly speaking, SGML-encoded documents consist entirely of such elements, which may be nested within other elements (cf. the OHCO-model of texts in DeRose et al.). We could therefore also say that an SGML document is an ordered sequence of characters and markup, the markup ascribing certain properties to parts of the sequence and indicating certain relationships between these parts.
This seems to invite a rather clear-cut conception of texts as collections of objects with properties: the characters are, so to speak, the basic building blocks or elementary particles which cannot be decomposed any further; they are the smallest possible objects out of which higher-level objects, elements, are built. An object is either a character string or an element, and properties can be ascribed to such objects by GIs.
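The conception just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class name `Element`, the helper functions and the sample sentence (including the GI 'p') are hypothetical and not part of SGML or the TEI Guidelines.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """A text object: one generic identifier (GI) plus ordered content."""
    gi: str
    content: List[Union[str, "Element"]] = field(default_factory=list)


def text_of(node: Union[str, Element]) -> str:
    """Recover the bare character string by discarding all markup."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.content)


def gis_of(node: Union[str, Element]) -> List[str]:
    """List the GIs in document order, outer elements before inner ones."""
    if isinstance(node, str):
        return []
    found = [node.gi]
    for child in node.content:
        found.extend(gis_of(child))
    return found


# Hypothetical sample: <p>This is <emph>very</emph> important.</p>
sample = Element("p", ["This is ", Element("emph", ["very"]), " important."])
```

In this sketch every object is either a character string or an element, and the only way to ascribe a property to an object is to wrap it in an element carrying the corresponding GI.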
2. Attributes
It may be that one and the same object has more than one property, or that we want to classify or qualify an ascribed property further. SGML allows us to express such features by means of attributes:
... <foreign lang=fr> ... </foreign> ...
In this case, the GI 'foreign' ascribes the property of 'belonging to some language other than that of the surrounding text' (TEI P3, p 981) to the element, while the attribute 'lang' identifies this language more specifically as being French (indicated by the attribute value 'fr'). In other cases, the attribute may supply further information about the same element:
... <foreign lang=fr rend=italics> ... </foreign> ...
The value 'italics' of the attribute 'rend' (for 'rendition') does not provide a further classification or qualification of the language in question, but indicates that the element was or should be printed in italics.
Attributes can be useful since they allow us to express complex structures in a regular way, allowing for various sorts of processing depending on the purpose at hand:
<foreign lang=fr rend=italics> ... </foreign>
<emph rend=italics> ... </emph>
<name type=person reg='Smith, John' rend=bold> ...
</name>
SGML also lets us enforce rules e.g. to the effect that the attribute 'reg' is required on the GI 'name' but not allowed on the GIs 'emph' and 'foreign', that the attribute 'rend' is allowed on all GIs and required on 'emph', etc. The SGML attribute mechanism thus gives us a very strong tool to describe textual structures.
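In SGML itself such rules are declared in the document type definition. The following Python sketch only mimics the rules named above so that their effect can be tried out; the table `RULES` and the function `check` are hypothetical names, and the entries for 'type' and 'lang' are assumptions read off the examples.

```python
# Hypothetical rule table mirroring the rules named in the text:
# 'reg' is required on 'name' and not allowed on 'emph' or 'foreign';
# 'rend' is allowed on every GI and required on 'emph'.
RULES = {
    "name": {"required": {"reg"}, "allowed": {"reg", "type", "rend"}},
    "emph": {"required": {"rend"}, "allowed": {"rend"}},
    "foreign": {"required": set(), "allowed": {"lang", "rend"}},
}


def check(gi, attrs):
    """Return a list of rule violations for one start tag."""
    rule = RULES[gi]
    present = set(attrs)
    problems = [
        f"{gi}: missing required attribute '{name}'"
        for name in sorted(rule["required"] - present)
    ]
    problems += [
        f"{gi}: attribute '{name}' not allowed"
        for name in sorted(present - rule["allowed"])
    ]
    return problems
```

For instance, `check("emph", {})` reports the missing 'rend', while the 'name' start tag shown above passes without complaint.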
In practice one will usually design an SGML encoding system so that what are perceived as primary properties are represented by GIs, whereas attributes either qualify or classify these primary properties or add secondary properties to those ascribed by the GI.
What is considered primary and secondary will vary from context to context. What for certain purposes may be encoded like this:
<foreign rend=italics> ... </foreign>
<emph rend=italics> ... </emph>
<name rend=bold> ... </name>
may for other purposes more suitably be encoded like this:
<italics type=foreign> ... </italics>
<italics type=emph> ... </italics>
<bold type=name> ... </bold>
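That the two encodings carry the same information can be illustrated by a small Python sketch which converts one into the other. The function name `swap_primary` and the representation of a start tag as a pair of GI and attribute dictionary are our own assumptions.

```python
def swap_primary(gi, attrs, attr, new_attr="type"):
    """Promote the value of attrs[attr] to GI and demote the old GI
    to the value of a new attribute named new_attr.

    ('foreign', {'rend': 'italics'}) becomes ('italics', {'type': 'foreign'}).
    """
    remaining = {name: value for name, value in attrs.items() if name != attr}
    remaining[new_attr] = gi
    return attrs[attr], remaining


# The first encoding above, converted into the second:
converted = [
    swap_primary("foreign", {"rend": "italics"}, "rend"),
    swap_primary("emph", {"rend": "italics"}, "rend"),
    swap_primary("name", {"rend": "bold"}, "rend"),
]
```

Applying the function again with the roles of 'rend' and 'type' exchanged restores the original encoding, so as long as each element has exactly one property of each kind nothing is lost either way.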
It is sometimes said that the choice of whether to represent a certain property as a GI or as an attribute is a matter of taste and style. But while an element can have several attributes, it can have only one GI. This may lead to problems in cases where one and the same element has two properties which would both normally be indicated by GIs.
Assume e.g. that one has chosen to represent emphasized phrases and names as GIs with attributes as illustrated above, and then encounters an emphasized name printed in bold italics. Either one must add a new attribute to the system, indicating one of the properties which would normally have been represented as a GI, e.g. like this:
<name type=person reg='Smith, John'
rend='bold italics' mode=emph>
... </name>
(The sole purpose of the attribute 'mode' is to carry the value of what would otherwise have been a GI.) Or one must nest two elements with the relevant GIs inside each other, and either duplicate common attributes or decide which of the two elements should carry them, e.g. like this:
<name type=person reg='Smith, John'
rend=bold>
<emph rend=italics>
... </emph></name>
Both cases seem to leave room for some slack or even inconsistency in encoding practice, and they mean that the same phenomena will be encoded by different mechanisms or in different ways from case to case.
The latter case also raises the question of which should be the outer and which the inner of the two elements, leaving additional room for slack and inconsistency.
3. Encoding Without Attributes
Among the aims of the Wittgenstein Archives at the University of Bergen is to transcribe the (mostly unpublished) 20,000-page manuscript Nachlass of the Austrian philosopher Ludwig Wittgenstein. The encoding system used in this project is based on MECS (cf. Huitfeldt 1993 and 1995), which in all respects relevant for the present discussion is identical to SGML.
In this project, we decided not to make use of attributes at all.
Instead, a separate GI was introduced for every possible combination of properties, i.e. for what would otherwise have been represented as a combination of GIs and attributes. One of the reasons for this decision was that it was rather difficult to decide which were to count as primary and which as secondary properties of the texts.
E.g., Wittgenstein frequently marks parts of his texts with underlining. There are several different types of underlining, such as straight, wavy, dotted and broken lines, and underlinings with one, two or several lines. We know that Wittgenstein had his personal conventions for such markings in the manuscripts, and that the different kinds of underlining have different meanings. We know e.g. that a straight line means emphasis and that wavy lines in general indicate dissatisfaction with content or formulation, but we do not know the exact meaning of all these conventions. And although we do know that Wittgenstein indicated emphasis and dissatisfaction also by other means, a lot of uncertainty of interpretation usually pertains to these other occurrences.
Therefore, we limit our interpretation of the text to identifying the convention used; we do not take the further step of interpreting what the convention stands for in each individual case, i.e. we indicate the underlinings, not their meaning.
In SGML, we might have encoded all these properties with one GI and two attributes, e.g.
<u shape=s number=1> for 1 straight line,
<u shape=s number=2> for 2 straight lines,
<u shape=w number=1> for 1 wavy line,
etc.
Instead, we encode like this:
<us1> for 1 straight line,
<us2> for 2 straight lines,
<uw1> for 1 wavy line,
etc.
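The correspondence between the two encodings is mechanical, as the following Python sketch suggests. Only the shape codes 's' and 'w' and the numbers 1 and 2 are taken from the examples above; the function names, and the assumption that every combined GI consists of 'u', a one-letter shape code and a number, are ours.

```python
from itertools import product

# Shape codes from the examples above; further shapes are omitted here.
SHAPES = {"s": "straight", "w": "wavy"}
NUMBERS = (1, 2)


def combined_gi(shape, number):
    """Fold the two attribute values into one attribute-free GI."""
    return f"u{shape}{number}"


def split_gi(gi):
    """Inverse mapping: recover (shape, number) from a combined GI."""
    return gi[1], int(gi[2:])


# One GI per combination of property values.
ALL_GIS = [combined_gi(shape, number) for shape, number in product(SHAPES, NUMBERS)]
```

The number of GIs is the product of the numbers of values of the combined properties, which is why the approach remains manageable as long as those numbers are small and fixed.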
The number of possible combinations of such properties is considerable but limited. The number of GIs we have to handle becomes larger than the number we would have had to deal with had we used a system with attributes, but it is manageable. The Wittgenstein Archives encoded texts for several years in this manner, and everything seemed to function well.
4. New problems
However, after a while we proceeded to a part of the Wittgenstein Nachlass which did cause us problems. Some texts have been edited by several different individuals or by Wittgenstein himself at different times, i.e. they are written in different "hands". Text originally written in one hand has sometimes been subject to cancellation, modification or addition by a later hand, which may in turn have been subject to alteration by a yet later hand, etc.
E.g. a word originally written in one hand (by Wittgenstein himself) may be underlined by a later hand (e.g. his colleague Russell), the underlining may have been cancelled by a third hand (e.g. Ramsey) and the cancellation cancelled by a fourth hand (e.g. Wittgenstein himself, thus dismissing Ramsey and agreeing with Russell). The third hand, which cancelled the underlining, may also have deleted (i.e. cancelled) the word itself, and again this cancellation may have been lifted by a later hand, and so on.
The introduction of new GIs in order to cover all combinations of these parameters throughout 20,000 pages became a rather hopeless business in the long run.
It is worth noticing that the complexity which threatened to break our system down was not the number of properties involved (in fact, only two properties, 'hand' and 'cancelled', were involved). Neither was it the number of values that these properties could have ('cancelled' has only two values, and the number of hands in Wittgenstein's manuscripts is not very large). Nor was the number of different GIs which could have these properties overwhelming.
What was special about these properties was that they could not only be properties of a text element, but also properties of properties, properties of properties of properties, and so on: A word can be cancelled (deleted), the cancellation can be cancelled, the cancellation of the cancellation can be cancelled - and on each of these levels the cancellation in question can have the property of being made in a different hand.
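The recursive character of these properties is perhaps easiest to see when written as a data structure. The Python sketch below is only an illustration: the class `Cancellation`, its field names and the two helper functions are hypothetical, and the hand labels are taken from the example given earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cancellation:
    """A cancellation made in some hand, which may itself be cancelled."""
    hand: str
    cancelled_by: Optional["Cancellation"] = None


def in_effect(cancellation: Optional[Cancellation]) -> bool:
    """A cancellation is in effect unless it has itself been
    cancelled by a cancellation that is in effect."""
    if cancellation is None:
        return False
    return not in_effect(cancellation.cancelled_by)


def levels(cancellation: Optional[Cancellation]) -> int:
    """Count the levels of cancellation stacked on one another."""
    if cancellation is None:
        return 0
    return 1 + levels(cancellation.cancelled_by)


# The underlining is cancelled by Ramsey; Wittgenstein cancels the cancellation.
on_underlining = Cancellation("Ramsey", cancelled_by=Cancellation("Wittgenstein"))
```

Here `in_effect(on_underlining)` is false, so the underlining stands. Nothing in the structure limits the number of levels, and at each level a hand is recorded, which is exactly what a fixed inventory of GIs cannot express.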
5. Multilevel Attribution
Although the question as to whether properties should be represented in the form of GIs or attributes is mostly a practical one, we have now found a criterion to identify certain properties that cannot be represented with GIs only.
What is characteristic about these properties is that they can occur at any level, that they can be used recursively, and that there is in principle no limit to the number of levels at which they can apply.
Since the difference between MECS and SGML is that SGML has attributes whereas MECS has none, one would think that for the Wittgenstein Archives the above was a strong argument in favour of SGML. However, it turns out to be just as difficult to take the above factors into account in SGML as in MECS. (The TEI's encoding of certainty and responsibility (TEI P3, p 521-528) and its use of feature structure mechanisms (TEI P3, p 475-519) are examples of solutions to similar problems in SGML.)
As mentioned earlier, an SGML attribute qualifies an element or its GI, not the other attributes of the same element. We can of course design special attributes for each of the levels involved on an ad hoc basis, but these will be entirely dependent upon some specific application for their correct interpretation.
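What such an ad hoc design might look like can be sketched in Python. The attribute naming scheme 'c1hand', 'c2hand', ... is purely hypothetical, as are the two function names; the point is only that the levels exist in the naming convention and not in the markup itself.

```python
def flatten(hands):
    """Encode a chain of cancellation hands as level-numbered attributes.

    hands[0] cancelled the element, hands[1] cancelled that cancellation, etc.
    """
    return {f"c{level}hand": hand for level, hand in enumerate(hands, start=1)}


def unflatten(attrs):
    """Application-specific decoding which relies entirely on the
    (hypothetical) c<N>hand naming convention to recover the levels."""
    numbered = sorted(
        (int(name[1:-4]), hand)
        for name, hand in attrs.items()
        if name.startswith("c") and name.endswith("hand") and name[1:-4].isdigit()
    )
    return [hand for _, hand in numbered]
```

To the markup system the attributes 'c1hand' and 'c2hand' are simply two unrelated attributes of the same element; that the second qualifies the first is known only to the application which implements `unflatten`. Moreover, since every attribute has to be declared in advance, a maximum number of levels must be fixed beforehand.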
6. Conclusion
Multi-level properties, i.e. properties which can be properties of properties, and which can occur at any level of attribution, cannot be represented by GIs and must be represented by some kind of attribute mechanism.
However, such an attribute mechanism must have capabilities we do not find in SGML, i.e. certain attributes must be attributable to elements as well as to other attributes, and this attribution must be able to occur without restriction at any level of recursion.
References
ISO: "Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML)", International Organization for Standardization, ISO 8879-1986, Geneva 1986.
C.M. Sperberg-McQueen and Lou Burnard (eds.): "Guidelines for the Encoding and Interchange of Machine-Readable Texts (TEI P3)", Chicago and Oxford April 1994.
Claus Huitfeldt: "MECS - A MultiElement Code System", forthcoming in Working Papers from the Wittgenstein Archives at the University of Bergen, 1995.
Claus Huitfeldt: "MECS-WIT - A Registration Standard for the Wittgenstein Archives at the University of Bergen", forthcoming in Working Papers from the Wittgenstein Archives at the University of Bergen, 1995.
DeRose, Durand, Mylonas, and Renear: "What is Text, Really?" in Journal of Computing in Higher Education, Winter 1990, Vol I (2), p 3-26.