KEYWORDS: non-linear text, programming tools
AFFILIATION: Max-Planck-Institut für Geschichte, Göttingen
E-MAIL: mthalle@gwdg.de
FAX NUMBER: +49 - 551 - 495670
PHONE NUMBER: +49 - 551 - 495664
The administration of non-linearly encoded texts on computer systems has traditionally been seen as a relatively high-level problem in systems design. That is, the relevant systems usually take a rather traditional approach to low-level programming and add the functionality required for the presentation of nonlinear texts, and/or for linkages between digitally represented and transcribed texts, by specifying it within the environment of the particular application system.
This creates problems when a non-linear property transcends a specific component of an application. Assume, e.g., a text which is marked up according to two overlapping hierarchies. While these two hierarchies are represented in the markup as equally important, "loading" the text into almost any target system usually means that a browser converts the text into a data object which represents exactly one hierarchy and simply ignores the other. That is very convenient from the point of view of the target system, since the whole question of co-existing hierarchies can be ignored when it is implemented. The point of designing a markup scheme which allows for overlapping hierarchies, only to lose this property when the text is actually browsed into an application, is not immediately understandable, however.
As a solution we propose to implement a data type "extended string", which replaces the traditional concept of a "string" in programming application systems. This means that any application program accepts "external information", which is browsed into the internal extended string representation, processed in that form and re-converted into some kind of external representation before being displayed on an appropriate medium. This is far from new, of course: a good example might be a system like X-Windows, where this general logic is used to allow an applications programmer to manipulate strings with completely traditional tools, while the internal string representation takes care of all aspects of processing necessary to handle font properties in display.
We assume, however, that this logic can be carried considerably further. Let us assume that a given text is marked up by two overlapping hierarchies, one representing the division of the text into reference units, like pages, and the other some semantic division, like the names of the fields of a database system to which a specific substring belongs. Even if the text is marked up in a way which preserves both types of division, once it is browsed and loaded into the underlying database structure we will normally no longer be able to access the reference units. More explicitly: if such a text is browsed into a database system which has been realized in C, the function call
strcmp(name1,name2)
will yield the same value, irrespective of whether name1 and name2 are contained on the same page or not.
To change this, we propose the implementation of a data type "extended string", which has a comparison function
estrcmp(environment,name1,name2)
which by default should act just like strcmp() above. If, however, within an application program it is preceded by a call to an environment-changing function
estrsetsensitive(environment,PageSensitivity,On)
any following call to
estrcmp(environment,name1,name2)
should result in different return values, reflecting whether name1 and name2 are on the same page or on different ones.
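The behaviour described above can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the projects: the struct layouts (Estr, EstrEnvironment), the pointer-based signatures, and the idea of storing a single page number per string are hypothetical; only the names estrcmp, estrsetsensitive, PageSensitivity and On are taken from the text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical property flags; only PageSensitivity is named in the text. */
typedef enum { PageSensitivity = 1 } EstrProperty;
typedef enum { Off = 0, On = 1 } EstrSwitch;

/* Hypothetical environment: a bit set of the currently active sensitivities. */
typedef struct { unsigned active; } EstrEnvironment;

/* Hypothetical extended string: the characters plus the page they occur on. */
typedef struct { const char *text; int page; } Estr;

/* Switch one sensitivity on or off within the given environment. */
void estrsetsensitive(EstrEnvironment *env, EstrProperty prop, EstrSwitch state)
{
    assert(env != NULL);
    if (state == On)
        env->active |= (unsigned)prop;
    else
        env->active &= ~(unsigned)prop;
}

/* Compare like strcmp(); if page sensitivity is active, strings with
   identical characters on different pages no longer compare as equal. */
int estrcmp(const EstrEnvironment *env, const Estr *a, const Estr *b)
{
    assert(env != NULL && a != NULL && b != NULL);
    int r = strcmp(a->text, b->text);
    if (r != 0)
        return r;
    if ((env->active & (unsigned)PageSensitivity) && a->page != b->page)
        return (a->page < b->page) ? -1 : 1;
    return 0;
}
```

With a zero-initialized environment, two strings with the same characters on pages 12 and 13 compare as equal; after estrsetsensitive(&env, PageSensitivity, On) the same call returns a non-zero value. Keeping the sensitivities in an environment object rather than in the strings themselves leaves existing comparison code unchanged until an application explicitly asks for page awareness.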
Taking examples from a series of ongoing projects which use experimental software based on the concept of a data type "extended string" as introduced above, the proposed paper first discusses some practical problems of its realization and the interrelationship of such an implementation with existing programming tools, taking as an example the embedding of the data type into an X-compatible widget.
It should be emphasized again that this is just an introductory example: the number of string properties handled in that form is rather large and goes considerably beyond the scope of overlapping hierarchies. A complete description of the concept of an extended string can be found in M. Thaller, "The Processing of Manuscripts", in: M. Thaller (Ed.), Images and Manuscripts in Historical Computing (= Halbgraue Reihe zur Historischen Fachinformatik, vol. A 14), St Katharinen, 1992, 97-121. All the properties in question can be divided into three groups: (a) those which are necessary to implement nonlinearity (from which our initial example has been taken), (b) those which are necessary to connect transcribed parts of a text to bitmaps of the image it describes or the manuscript it transcribes, and (c) those which deal with "graphic" properties of portions of a text.
In all three cases the questions raised relate to two different fields. On the one hand they are connected to the practical dimension of programming; this aspect is supposed to be covered by the example quoted above. On the other hand, however, the actual policies to be implemented by such a purely technical solution heavily reflect the conceptual decisions about what a specific property of a text actually means within the context of a given discipline.
This shall be described with regard to the question of how much information is actually related to the third of the three problem areas given above, the graphic properties of a text, within historical research. Speaking on the most general level, we consider a text to be "historical" when it describes a situation where we know for sure neither what the situation was "in reality" nor according to which rules it was converted into a written report about reality. On an intuitive level this is exemplified by cases where two people with the same graphic representation of their names are mentioned in a set of documents: these could be two occurrences of the same "real" individual, but could equally be homographic symbols for two completely different biological entities. At a more subtle level, a change in the color of the ink a given person uses in an official correspondence of the 19th century could be an indication that the original supply of ink had dried up, or of a considerable rise of the author within the bureaucratic ranks. Let us just emphasize for non-historians that the second example is anything but artificial: indeed, in the 19th century the different colors of comments on drafts of diplomatic documents are quite often the only mark identifying which diplomatic agent added which opinion.
What these introductory examples should demonstrate is that the text - the computer-interpretable representation of a written document - forms in historical research an intermediate layer between two other layers of information. At the one extreme we have abstract factual knowledge about the various entities described in a text, which allows its interpretation; at the other there are purely graphical characteristics of the written document, which may carry meaning, but need not do so.
That the second problem is a genuine markup problem is probably obvious: if we use a computer to prepare diplomatic drafts of the 19th century for printing, we obviously need a way to describe a portion of the document as being "written with blue pencil". At the time of the first transcription this is exactly what it says, a literal description of a graphic property, though during the process of research it may well acquire a more abstract connotation, like "author=M. Simpson". This could of course be taken to mean that such properties are eminently suited to abstract rules for markup, because at the time of producing the markup we do not yet have the faintest idea what the final representation in print, if any, of the specific graphic property is to be. The problem is, however, that part of the research which is supposed to be supported - at least within an archival environment - is dedicated precisely to finding out what the observable graphical properties mean. If a computer system is therefore to support historical research, as opposed to conveniently administering the results of historical research, it must be capable of administering graphical properties as what they are: able to switch to a more abstract interpretation in time, but always able to fall back on what can actually be observed.
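The requirement of keeping the observation while allowing a later interpretation might be sketched as follows. This is a hedged illustration only: the type GraphicProperty, its fields, and the functions graphic_reading and graphic_interpret are hypothetical names introduced here and are not part of the extended string implementation described in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical annotation of the text portion [start, end). */
typedef struct {
    size_t start;
    size_t end;
    const char *observed;     /* what can be seen, e.g. "written with blue pencil" */
    const char *interpreted;  /* NULL until research assigns a meaning */
} GraphicProperty;

/* Return the most abstract reading available, falling back to the observation. */
const char *graphic_reading(const GraphicProperty *p)
{
    assert(p != NULL);
    return p->interpreted != NULL ? p->interpreted : p->observed;
}

/* Attach an interpretation, or withdraw it by passing NULL.
   The observed description is never overwritten. */
void graphic_interpret(GraphicProperty *p, const char *meaning)
{
    assert(p != NULL);
    p->interpreted = meaning;
}
```

A portion first recorded as "written with blue pencil" can thus later be read as "author=M. Simpson", and if that hypothesis is abandoned the system still holds the literal description.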
To put it pointedly: almost all the examples given in the discussions on standardization during the last few years dealt with how to tag a structure which is clearly understood and where the graphic representation is accidental. Historical work deals with structures in a text which we want to discover, where the graphics we see may be all the clues we ever get.
In conclusion, the paper shows how these considerations fit in with those that led to the first example given, and how they can be turned into an organic extension of an implementation such as the X-compatible "extended string widget" introduced above.