
TEXT ENCODING IN THEORY AND PRACTICE


A Course in SGML, TEI, MECS, and an introduction to tools in
text encoding for humanistic research

A course to be held at ALLC/ACH '96 in Bergen

The organizers of ALLC/ACH '96 are pleased to announce a pre-conference course on text encoding.

TITLE: A Course in SGML, TEI, MECS, and an introduction to tools in text encoding for humanistic research
TIME: Saturday 22 June, 1 pm - 7 pm
Sunday 23 June, 10 am - 6 pm
PLACE: University of Bergen
INSTRUCTORS: Lou Burnard, Peter Cripps, Claus Huitfeldt, and C. M. Sperberg-McQueen
REGISTRATION FEE: 400 NOK
REGISTRATION DEADLINE: 1 June 1996
This course will introduce the fundamental problems of text encoding and the representation of texts in electronic form, using the Standard Generalized Markup Language (SGML), the SGML-based encoding scheme of the Text Encoding Initiative (TEI), and the Multi-Element Code System (MECS) developed by the Wittgenstein Archives at the University of Bergen. The main focus will be on fundamental issues of encoding, but specifics of the tag sets defined by the TEI and the Wittgenstein Archives, and practical issues facing academic projects, will also be touched upon.

Topics to be covered include:
* General Principles of Text Markup: What is markup for? Varieties of markup; effect of markup. What are electronic texts for? Markup and interpretation. Markup as a means of enabling intelligent retrieval.

* Basics of SGML: What it is and isn't; the case for using it. Basic SGML syntax for the document instance (tags, entity references, comment declarations). Examination and explication of simple examples.

* Basics of MECS: What it is and isn't; syntax of document and declarations; simple examples. Why isn't MECS the same as SGML?

* Document Analysis: What document analysis is, and why it is an essential part of any e-text project. Phases of document analysis. Group document analysis of a sample text.

* Basics of the TEI: origins and goals of the TEI, overall organization of the TEI encoding scheme, basic structural notions of the TEI DTD and the pizza model: the base, additional, and core tag sets, and how they may be extended, modified, and documented.

* Group tagging of the sample text in TEI and MECS.

* Special problems: discussion of sample texts posing special problems for markup (e.g. hypertext, text with critical apparatus, text with literary, philosophical, or linguistic analysis), with tagging in MECS and TEI, and comparison of the two.

* Group tagging of further examples.

* Practical issues: types of software available for work with electronic texts in SGML and MECS, issues of project organization, publication on the net, and a review of where to go for further information.

THE TEXT ENCODING INITIATIVE

The Text Encoding Initiative (TEI) is an international cooperative research effort, the goal of which is to define a set of generic Guidelines for the representation of all kinds of textual materials in electronic form, in such a way as to enable researchers in any discipline to interchange texts and datasets in machine readable form, independently of the software or hardware in use, and also independently of the particular application for which such electronic resources are used. The first full version of the TEI Guidelines was published in May, 1994, after six years of development in Europe and the US. It takes the form of a substantial reference manual, documenting a modular and extensible SGML document type definition (DTD), which can be used to describe electronic encodings of all kinds of texts, of all times and in all languages. It is sometimes said that the Standard Generalized Markup Language (SGML: ISO 8879) provides only the syntax for text markup; the TEI aims to provide a semantics.

Computer-aided research now crosses many political, linguistic, temporal, and disciplinary boundaries; the TEI Guidelines have been designed to be applied to texts in any language, from any period, in any genre, encoded for research of any kind. As far as possible, the Guidelines eschew controversy; where consensus has not been established, only very general recommendations are made. The object is to help the researcher make his or her position explicit, not to dictate what that position should be.

Viewed as a standard, the TEI scheme attempts to occupy the middle ground. It offers neither a single all-embracing encoding scheme, solving all problems once for all, nor an unstructured collection of tag sets. Rather it offers an extensible framework containing a common core of features, a choice of frameworks or bases, and a wide variety of optional additions for specific application areas. Somewhat light-heartedly, we refer to this as the Chicago Pizza model (in which the customer chooses a particular base - say deep dish or whole crust - and adds the toppings of his or her choice), by contrast with both the Chinese menu or laissez-faire approach (which allows for any combinations of dishes, even the ridiculous) and the set meal approach, in which you must have the entire menu.

THE MULTI-ELEMENT CODE SYSTEM

The Multi-Element Code System (MECS) is a markup scheme developed at the Wittgenstein Archives at the University of Bergen for use in the transcription of Ludwig Wittgenstein's posthumous papers. It builds on long experience at the Archives with the transcription and processing of complex manuscript materials, but is purposely made general enough to be applicable not only to manuscripts but to any type of document. While the syntax of MECS is designed to accommodate SGML syntax as a simple case, the system is designed to handle non-hierarchical and non-nesting phenomena as readily as nesting hierarchical phenomena; there is thus less need, with MECS, to make special-case tags to handle overlapping segments of text.
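To illustrate what "non-nesting" means here, the following is a minimal sketch in Python, not MECS software and not MECS syntax. It assumes each encoded feature is reduced to a pair of character offsets (start, end), and it reports the pairs of features that overlap without one containing the other. These are the cases a strictly nested hierarchy cannot express directly. The feature names and offsets are invented for illustration.

```python
from itertools import combinations


def crosses(a, b):
    """Return True when spans a and b overlap but neither contains the other.

    Spans are (start, end) character offsets, with end exclusive.
    """
    (first_start, first_end), (second_start, second_end) = sorted([a, b])
    return first_start < second_start < first_end < second_end


def crossing_pairs(spans):
    """List the pairs of named spans that cannot be nested inside one another."""
    return [
        (name_a, name_b)
        for (name_a, span_a), (name_b, span_b) in combinations(spans.items(), 2)
        if crosses(span_a, span_b)
    ]


# Hypothetical features of a manuscript passage (offsets are made up):
# a sentence that runs across a line break, and a deletion that also spans it.
features = {
    "sentence": (0, 40),
    "line1": (0, 25),
    "line2": (25, 60),
    "deletion": (20, 30),
}

print(crossing_pairs(features))
```

In this example the sentence crosses the second line, and the deletion crosses both lines, so no single tree can hold all four features as properly nested elements.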

In contrast to the top-down approach of SGML, whereby the structure of any legal document is determined by a DTD, MECS allows you to introduce codes into a document as and when the need for them arises. From a document written in this way one can then derive a minimal code definition table (CDT) which will, if desired, serve to limit the introduction of further codes. This way of working is of value when developing a code set by means of a pilot project or similar. Alternatively, MECS allows you to impose upon a user the need for a CDT, thereby ensuring that only a limited set of codes is used.
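The bottom-up workflow described above can be sketched roughly as follows. This is an illustration only: it uses simplified SGML-style start and end tags as a stand-in rather than actual MECS code syntax, and the function names derive_code_table and undeclared_codes are hypothetical, not part of any MECS tool. The first function collects the codes actually used in a pilot transcription; the second checks a later document against that list.

```python
import re

# Matches simplified start tags and end tags such as <del> or </del>.
# This is an assumed stand-in notation, not the real MECS code syntax.
TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def derive_code_table(text):
    """Return the sorted list of distinct code names used in the text."""
    return sorted(set(TAG_PATTERN.findall(text)))


def undeclared_codes(text, code_table):
    """Return the codes used in the text that are absent from the code table."""
    return [name for name in derive_code_table(text) if name not in code_table]


# Step 1: encode a pilot document freely, then derive the minimal code list.
pilot = "<p>He wrote <del>quickly</del> <add>slowly</add>.</p>"
table = derive_code_table(pilot)
print(table)

# Step 2: use that list to restrict the codes allowed in later documents.
later = "<p>A <hi>new</hi> remark.</p>"
print(undeclared_codes(later, table))
```

A real code definition table would presumably record more than names, but the sketch shows the order of work: encode first, derive the table afterwards, then enforce it.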

It is at present possible to define MECS such that all MECS-documents are SGML conformant document instances and vice versa. There also exists a range of MECS software which includes tools for code-syntax and/or CDT document validation, document formatting, SGML document conversion, code extraction, statistical document analysis, spell-checking etc.

MATERIALS AND PRESENTERS

All participants will be provided with a printed introductory summary guide to the TEI and MECS schemes, and supporting materials on PC disks, including full versions of the TEI DTDs, public domain SGML software and sample TEI texts. Subject to availability, participants may purchase the CD-ROM of the TEI Guidelines at a discounted price. MECS software and manuals will also be available, either in book form or on diskette.

The tutorial will be taught by four instructors:


Lou Burnard (Oxford University Computing Services)
Peter Cripps (Wittgenstein Archives, University of Bergen)
Claus Huitfeldt (Wittgenstein Archives, University of Bergen)
C. M. Sperberg-McQueen (Computer Center, University of Illinois at Chicago)

Participation will be limited according to the space available.

Updated 15.04.96
Claus Huitfeldt