
Attributes: A Solution

Peter Cripps
The Wittgenstein Archives at the University of Bergen, Harald Haarfagresgt. 31, N-5007 Bergen, Norway

KEYWORDS: text encoding, SGML, MECS

AFFILIATION: The Wittgenstein Archives at the University of Bergen

E-MAIL:       Peter.Cripps@hd.uib.no
FAX NUMBER:   +47-55 58 94 70
PHONE NUMBER: +47-55 58 94 73


1. The problem restated

The principal objective of The Wittgenstein Archives at Bergen University is to publish the literary estate - the Nachlass - of the Austrian philosopher Ludwig Wittgenstein in machine-readable form. This Nachlass comprises some 20,000 pages of which roughly 65% are manuscripts and the remainder typescripts. The greater part of the handwritten material, as well as much of the typed, is replete with later alterations, deletions, insertions, rearrangements, cross-references and the like, the sum of which constitutes a formidable challenge to text encoding.

Some of the textual features we wish to record in our transcription work invite the use of a markup language which can describe properties on more than one level.

For instance, a word or words constituting a later addition to a text might be inserted either above or below the original line of writing; such an insertion may or may not include a marking to indicate that it is confirmed as an addition to the main text; such a confirmation marking, if present, might in turn be modified by means of a wavy underlining to indicate that the addition has subsequently been called into doubt, and this wavy underlining might itself be crossed through, thereby effectively reaffirming what was formerly doubted.

And still this isn't the whole story. In addition to the different markings we can often distinguish that each step in the process makes use of a different ink, and this information needs to be recorded independently of what that ink was used for. I.e. we can never assume a consistent correlation between, say, confirmation markings and blue ink, or doubt markings and red ink.

What we have here is a text feature with one essential property - its being a later addition to the base text - and a number of secondary properties, such as position (above or below the line) and degree of authorial approval ("marked" = confirmed, "marked with added wavy underlining" = doubted, "marked with added wavy underlining, wavy underlining cancelled" = reaffirmed).

In Standard Generalized Markup Language (SGML) secondary properties such as an insertion's position, its degree of authorial approval (status) and the type of ink it was written with would commonly be handled by means of attributes. But when we consider the problem of Wittgenstein's insertions closely it becomes clear that SGML attributes are not capable of describing such features in all their complexity.

Suppose in the above case that we have a single SGML type generic identifier (GI) to describe the primary property of a string's being inserted and that all the secondary properties are to be accounted for by means of attributes. The insertion's position is easy enough to deal with, since for any one insertion this will be invariable. Thus we might have a tag which looks like this:

<insertion position="above line"> ... </insertion>

Neither should the confirmation marking be a problem so long as it isn't modified by anything else:

<insertion position="above line"
marking=confirmation> ... </insertion>

But what should we do if the confirmation marking has subsequently been modified by a wavy underlining to indicate doubt? The only way to quote an attribute in SGML is to list it in the open-tag of an element along with any others that might be relevant. Both "marking=confirmation" and "marking=doubt" appear to be relevant in our example. Yet the insertion as a whole cannot be both confirmed and doubted!

And how should we represent the use of different inks? We cannot simply add an attribute "ink=red" if only the confirmation marking is in red ink whereas the text of the insertion is in black.

What we seem to be dealing with is properties of properties: the confirmation marking represents a property of the insertion as a whole, whereas the doubt marking annuls the confirmation; the fact of being written in black ink is a property of the insertion as a whole, whereas that of being written in red describes the confirmation marking alone. And so on. What we need here is a system whereby attributes can qualify each other as well as the GIs that describe the element's primary property.


2. MECSA - a more flexible attribute syntax
MECSA is an attribute syntax currently being developed as an adjunct to the Multi-Element Code System (MECS), the markup language used at the Wittgenstein Archives in Bergen. Well-formed MECS texts are convertible to and derivable from SGML texts, and it is envisaged that MECSA attributes will likewise be translatable to and from SGML attributes.

MECSA incorporates a number of innovations, the most significant of which is that it permits attributes to qualify not only GIs but also other attributes. This it does simply by allowing for bracketing. For example, if an attribute "att2" describes a property of an attribute "att1" (rather than an immediate property of the feature described by the GI) then one can express the relationship in MECSA by quoting "att2" in brackets(1) immediately after "att1", thus:

<GI att1=val1(att2=val2)> ... </GI>

Suppose further that "att2" requires the qualification of an attribute "att3". This we would express by appending "att3" as a parenthesis to "att2", thus:

<GI att1=val1(att2=val2(att3=val3))> ... </GI>

And so on.

When two or more attributes apply to the same object (either the GI or another attribute) they are simply listed concurrently within the relevant bracket:

<GI att1=val1 att2=val2(att3=val3(att4=val4) att5=val5)> ... </GI>

In this case "att1" and "att2" apply directly to "GI", "att3" and "att5" apply to "att2", and "att4" applies to "att3".
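The bracketing rule can be illustrated with a small recursive parser. This is only a sketch under assumptions of our own: names and values are taken to be simple tokens without whitespace, "(" and ")" serve as the sub-attribute delimiters, and the names `Attr` and `parse_attrs` are hypothetical, not part of MECSA itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attr:
    """One attribute together with the sub-attributes that qualify it."""
    name: str
    value: str
    subs: List["Attr"] = field(default_factory=list)


def parse_attrs(text, open_delim="(", close_delim=")"):
    """Parse e.g. 'a=1 b=2(c=3)' into a list of Attr trees (illustrative only)."""
    pos = 0
    stops = "=" + open_delim + close_delim

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def read_token(stop_chars):
        nonlocal pos
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and text[pos] not in stop_chars):
            pos += 1
        return text[start:pos]

    def parse_list(nested):
        nonlocal pos
        attrs = []
        while True:
            skip_ws()
            if pos >= len(text):
                if nested:
                    raise ValueError("missing sub-attribute close delimiter")
                return attrs
            if text[pos] == close_delim:
                if not nested:
                    raise ValueError("unexpected sub-attribute close delimiter")
                pos += 1
                return attrs
            name = read_token(stops)
            if not name or pos >= len(text) or text[pos] != "=":
                raise ValueError("expected name=value near offset %d" % pos)
            pos += 1  # skip '='
            value = read_token(open_delim + close_delim)
            if not value:
                raise ValueError("missing value for attribute %r" % name)
            attr = Attr(name, value)
            skip_ws()  # a bracket may follow on the next line
            if pos < len(text) and text[pos] == open_delim:
                pos += 1
                attr.subs = parse_list(True)  # attributes qualifying this one
            attrs.append(attr)

    return parse_list(False)


tree = parse_attrs("att1=val1 att2=val2(att3=val3(att4=val4) att5=val5)")
```

Here `tree` holds two first-level attributes; the second carries "att3" and "att5" as sub-attributes, and "att3" in turn carries "att4".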

MECSA makes it appropriate to talk of attributes occurring on different levels. In the above example "att1" and "att2" are on the highest level and could be called primary - or first-level - attributes, while "att3", "att4" and "att5" are sub-attributes (on lower levels), whereof "att3" and "att5" are second-level attributes and "att4" is on the third level. In this way we could say that SGML provides a single-level attribute syntax, whereas that of MECSA is multi-level.
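The notion of level can be made concrete by walking the nested structure and counting the depth of each attribute. The following is a sketch only; the tuple representation (name, value, sub-attributes) and the function name `attribute_levels` are assumptions made for illustration.

```python
def attribute_levels(attrs, level=1):
    """Yield (name, level) pairs; attrs is a list of (name, value, subs) tuples."""
    for name, _value, subs in attrs:
        yield name, level
        # Sub-attributes sit one level below the attribute they qualify.
        yield from attribute_levels(subs, level + 1)


# The tag <GI att1=val1 att2=val2(att3=val3(att4=val4) att5=val5)>, written out by hand.
example = [
    ("att1", "val1", []),
    ("att2", "val2", [
        ("att3", "val3", [("att4", "val4", [])]),
        ("att5", "val5", []),
    ]),
]

levels = dict(attribute_levels(example))
```

An application that wishes to ignore everything but the primary attributes would simply keep the entries whose level is 1.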

The possible uses for such a system are numerous. In a particular application one might choose to ignore everything but the primary attributes, or alternatively to "work out" the brackets, beginning with the most deeply nested and progressing outwards, before putting the primary attribute(s) to their task(s). This will be clearer if we consider a practical example.

Let us imagine some MECSA attributes to describe the different properties associated with Wittgenstein's insertions as these are outlined above. The different kinds of markings, which serve to confirm or doubt an insertion, can be accounted for by an attribute called "marking" which takes the values "confirmation" (for the initial insertion marking) or "doubt" (for the wavy underlining indicative of doubt). The attribute "ink" takes one of the values "black", "blue" or "red". Allowing that the "marking" attribute can be applied to itself and the "ink" attribute to both the GI and the "marking" attribute, we might then tag a particular insertion thus:

<insertion ink=black marking=confirmation
(ink=blue marking=doubt(ink=red))> ...
</insertion>

Suppose now that we are interested in Wittgenstein's text as it looked after he had reworked it in blue ink. At that stage he was evidently in approval of the inserted material, and consequently it should be included in the text we retrieve. In other contexts the attribute "marking=doubt" might well be used to suppress the effect of "marking=confirmation". But by defining "ink=red" such that it first suppresses the effect of "marking=doubt", we can leave the attribute "marking=confirmation" to function uninhibited. In this way we suppress what was in itself a suppressor.

On another occasion we might wish to view the text as it looked after the first pass. To achieve this we can use the "ink=blue" attribute to suppress any effect the "marking=confirmation" attribute might have. And so on.
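The two views just described can be sketched as a small evaluation that works outwards from the most deeply nested bracket. The rule used here is an assumption for the sake of illustration, not a definition taken from MECSA: a marking is in force if its ink belongs to the stage being viewed and no marking nested within it is in force. The dictionary layout and the function names are likewise hypothetical.

```python
def in_force(marking, visible_inks):
    """True if the marking counts at this stage (assumed rule, see lead-in)."""
    if marking is None:
        return False
    if marking["ink"] not in visible_inks:
        return False  # written in a later ink: suppressed for this view
    # A nested marking that is itself in force annuls the marking it qualifies.
    return not in_force(marking.get("marking"), visible_inks)


def is_confirmed(insertion, visible_inks):
    """True if the insertion is present and confirmed in the chosen view."""
    return (insertion["ink"] in visible_inks
            and in_force(insertion.get("marking"), visible_inks))


# <insertion ink=black marking=confirmation(ink=blue marking=doubt(ink=red))>
insertion = {
    "ink": "black",
    "marking": {
        "value": "confirmation",
        "ink": "blue",
        "marking": {"value": "doubt", "ink": "red"},
    },
}

first_pass = is_confirmed(insertion, {"black"})
after_blue = is_confirmed(insertion, {"black", "blue"})
after_red = is_confirmed(insertion, {"black", "blue", "red"})
```

Under these assumptions the insertion is unconfirmed after the first pass, confirmed after the reworking in blue, and no longer confirmed once the red doubt marking is taken into account.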

It is not difficult to imagine further applications for such a system.


3. Parsing MECSA attributes
In MECSA, details about the legal combinations of attributes and GIs are recorded in an Attribute Definition Table (ADT), which in most respects serves the same purposes as the ATTLIST declarations in an SGML DTD.

One of the functions of the SGML ATTLIST, or - in the case of MECSA - the ADT, is to specify the value for a legally applicable attribute in the case that none is made explicit in a particular document. In SGML this is a straightforward process. If the GI in a particular tag lacks a legal attribute, then a parser can supply that attribute together with its default value in accordance with the appropriate ATTLIST. But since MECSA allows one and the same attribute to apply to itself as well as to a GI, an ADT cannot handle the ascription of attributes to attributes in the same way as it handles the ascription of attributes to GIs. The danger we have to guard against is, of course, the possibility of an infinite regress. If a parser were asked to render explicit all the legal attributes in a system where "att1" can apply to "att1", it would have to supply "att1" to "att1" to "att1", and so on without end!

MECSA avoids this danger by allowing for the supplementation (extension) only of first-level attributes. If the ADT specifies that "att1" can legally be applied to "GI1" then a parser can, on encountering a tag where "GI1" lacks "att1", supply the attribute plus its default value. But although the ADT might specify that "att1" can legally be applied to "att1", a MECSA parser will not supply a lower-level occurrence of "att1" beneath one that is already present. The ADT information that "att1" can be applied to "att1" is available for the purpose of checking whether the attributes which are already explicit in a document are legal, not for the purpose of supplying default values where they are absent.
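The distinction between checking at every level and supplying defaults at the first level only can be sketched as follows. The ADT is modelled here as a plain dictionary; its layout, the key prefixes "GI:" and "ATT:", the default values and the function names are all assumptions made for illustration and do not reflect the actual ADT format.

```python
# Hypothetical ADT: which attributes may qualify which object, plus defaults.
ADT = {
    "allowed": {
        "GI:insertion": {"ink", "marking"},
        "ATT:marking": {"ink", "marking"},  # "marking" may qualify itself
    },
    "defaults": {"ink": "black", "marking": "none"},  # invented values
}


def check(owner, attrs, adt):
    """Validate explicit attributes recursively, at every level."""
    legal = adt["allowed"].get(owner, set())
    for attr in attrs:
        if attr["name"] not in legal:
            raise ValueError("%s may not qualify %s" % (attr["name"], owner))
        check("ATT:" + attr["name"], attr["subs"], adt)


def supply_defaults(gi, attrs, adt):
    """Add missing attributes with default values, first level only."""
    check("GI:" + gi, attrs, adt)
    present = {attr["name"] for attr in attrs}
    missing = sorted(adt["allowed"].get("GI:" + gi, set()) - present)
    # No recursion here: sub-attributes are never supplied, so no regress.
    extra = [{"name": n, "value": adt["defaults"][n], "subs": []} for n in missing]
    return attrs + extra


explicit = [{"name": "marking", "value": "confirmation", "subs": []}]
completed = supply_defaults("insertion", explicit, ADT)
```

In this sketch `completed` gains a first-level "ink" attribute, while the explicit "marking" attribute is left without any sub-attributes even though the table permits "ink" and "marking" beneath it.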

In an SGML ATTLIST the legal values for each attribute are listed together with the legal attributes for the particular GI. But in the MECSA ADT the information about the values an attribute may take is handled separately from the information about the range of attributes which may legally be ascribed to a particular attribute object (GI or other higher level attribute). This separation is again a consequence of the fact that MECSA allows attributes to qualify attributes. (It would not be feasible to express this possibility using the structure of the SGML ATTLIST.)

It is worth noting, however, that a MECSA parser does not necessarily presuppose the existence of an ADT. In the absence of an ADT a MECSA parser may still check the attribute syntax. But it may also, if desired, compile a minimal-ADT from the document itself. In doing so it notes which attributes are applied to which GIs, which to other attributes, and which values are already attached to those attributes. This information is then arranged in the format of a normal ADT, such that this record can, if desired, be used as a measure of the correctness of further document instances. The one thing such a syntax-checker cannot do is deduce an attribute's default value. Instead, in compiling a minimal-ADT, it supplies a System Default Value (SDV) in the place where an explicit default value would stand in a full-ADT. This SDV is a reserved character which ensures that attributes remain functionally meaningless when inserted into documents automatically on the basis of a minimal-ADT. If and when a document containing such attributes is checked against a full-ADT, these SDVs can then be replaced by meaningful values.
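The compilation of a minimal-ADT can be sketched in a few lines. As before, this is an illustration under stated assumptions: the dictionary layout, the function name `compile_minimal_adt` and the choice of "#" as the reserved SDV character are hypothetical. Note that the table of legal values is kept separate from the table of legal attribute combinations.

```python
SDV = "#"  # hypothetical reserved character standing in for a default value


def compile_minimal_adt(elements):
    """Build a minimal ADT from (gi, attrs) pairs found in a document."""
    adt = {"allowed": {}, "values": {}, "defaults": {}}

    def record(owner, attrs):
        for attr in attrs:
            # Which attributes are applied to which GI or attribute.
            adt["allowed"].setdefault(owner, set()).add(attr["name"])
            # Which values are attached to each attribute.
            adt["values"].setdefault(attr["name"], set()).add(attr["value"])
            # A default cannot be deduced, so the SDV takes its place.
            adt["defaults"][attr["name"]] = SDV
            record("ATT:" + attr["name"], attr["subs"])

    for gi, attrs in elements:
        record("GI:" + gi, attrs)
    return adt


# <insertion ink=black marking=confirmation(ink=blue marking=doubt(ink=red))>
document = [
    ("insertion", [
        {"name": "ink", "value": "black", "subs": []},
        {"name": "marking", "value": "confirmation", "subs": [
            {"name": "ink", "value": "blue", "subs": []},
            {"name": "marking", "value": "doubt", "subs": [
                {"name": "ink", "value": "red", "subs": []},
            ]},
        ]},
    ]),
]

minimal_adt = compile_minimal_adt(document)
```

A record of this kind could then serve as a measure of the correctness of further document instances, with the SDV entries replaced once a full-ADT becomes available.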


4. Conclusion
Due to the nature of certain textual phenomena encountered in Wittgenstein's Nachlass, the need arose for an attribute syntax of greater descriptive resolution than that offered by SGML. MECSA is a system which promises to satisfy this need by allowing attributes to qualify not only GIs but also one another. In designing this system such that documents using its syntax can be converted to and obtained from SGML documents, it is hoped that MECSA will be of use also in contexts other than the work currently being done at the Wittgenstein Archives in Bergen.


Notes
(1)
Technically speaking, MECSA provides for a "sub-attribute open delimiter" and a "sub-attribute close delimiter", the functions of which could be assigned to any suitable characters. I.e. there is nothing compulsory about the characters "(" and ")".


Bibliography
Charles F. Goldfarb, "The SGML Handbook", Oxford 1990.

Claus Huitfeldt, "MECS - A Multi-Element Code System", Bergen 1992; Working Papers from the Wittgenstein Archives at the University of Bergen, 1995.

C.M. Sperberg-McQueen & Lou Burnard (eds.), "Guidelines for Electronic Text Encoding and Interchange - TEI P3", Chicago/Oxford, 1994.

