Chapter 2. Basic units: characters and words

Version 2.0 beta (15 September 2004)

 

2.1 Introduction
2.2 Characters
2.3 Words


 

2.1 Introduction

The basic unit in any transcription of an alphabetic script is the individual letter. In a linguistic context a distinction is often made between the abstract entity of a grapheme and its representation as graphs in a written document. Variant forms are referred to as allographs, e.g. the Roman type of s and the Fraktur (black letter) type. The terminology is analogous to the distinction between phonemes, phones and allophones. For a general introduction to this terminology, see Sture Allén 1971 or, more recently, Manfred Kohrt 1985.

In this handbook we shall adopt the terminology of the Unicode Standard. The fundamental distinction drawn is between characters and glyphs. Characters are, as Unicode defines them, “the smallest components of written language that have semantic value”, while glyphs are “the shapes that characters can have when they are rendered or displayed” (cf. Unicode 4.0, ch. 2.2). What the transcriber sees in the source document is a series of individual glyphs, and the act of transcribing essentially involves connecting these glyphs to the characters at the transcriber's disposal.

The concept of a character is similar to, but not identical with, the linguistic concept of a grapheme. These concepts are notoriously difficult, but for the purposes of this handbook we believe that the Unicode usage is robust and sufficiently well-defined.

The Unicode Standard puts great emphasis on the fact that individual characters may be represented by a number of glyphs, and is therefore reluctant to accept as new characters what it perceives to be variant glyphs. It will be obvious to most people that the various shapes of letters in printed typefaces, such as Baskerville, Palatino, Helvetica etc., should not be seen as different characters, as shown in fig. 2.1.

Fig. 2.1 Various shapes (glyphs) of the characters “A” and “a” in Courier, Times and Lucida typefaces

Unicode draws a distinction between small (minuscule) characters such as “a” and large (majuscule) characters such as “A”, since there is a possible semantic value attached to each set of characters. Thus, “the white house” can refer to any house which is white in colour, while “the White House” refers (normally) to one specific building. It can be argued that the same applies to the distinction between a Roman “a” and an italic “a”. For example, while “Metope” in quotation marks refers to a poem by the Norwegian author Olaf Bull, Metope set in italics (according to a widespread bibliographical practice) refers to the book in which this poem is published (a book which, coincidentally, bears the same name as one of the poems it contains). However, Unicode does not regard italics (or bold type) as individual characters. There are good reasons for this, but the example serves to illustrate the fact that the definition of a character is not always clear-cut.

 

2.2 Characters

Medieval Nordic manuscripts were written in the Latin alphabet from the very beginning. The basic inventory is thus the characters a-z / A-Z. They were supplemented with a number of new (or borrowed) characters, several ligatures and a variety of diacritical marks. There was also a large number of abbreviation marks in use, especially in Old Icelandic and Old Norwegian manuscripts. We shall go through the inventory of ordinary characters, i.e. those based on the set a-z / A-Z, in ch. 5 and abbreviation marks in ch. 6, and we shall refer to both types as characters. In fact, some abbreviation marks behave as ordinary characters in the sense that they occupy a separate position on the base line. On the other hand, many components of ordinary characters are diacritical, i.e. placed above (or through or below) another character, and thus akin to typical abbreviation marks. This means that the rules for transcribing ordinary characters and abbreviation marks should be identical.

We believe that it is possible to identify a base line in all texts, as shown in fig. 2.2. We recommend that the transcriber identifies each separate character on the base line and records these characters in the same sequence as in the manuscript. Thus, the characters in fig. 2.2 would be transcribed as “abp&thorn;”. Note the use of an entity, “&thorn;”, for the last character. Entities are explained in ch. 1 and discussed further in ch. 5.

Fig. 2.2 Position of characters on the base line

If there are marks of any sort placed above, through or below any base line character, we recommend that these marks (if they are to be interpreted as characters) are transcribed immediately after the base line character. In general, we refer to these marks as diacriticals. As mentioned above, abbreviation marks are also frequently written above (and in some cases through or below) a base-line character. Assuming that the sign above “h” is referred to with the entity “&er;”, the transcription of the very first word in fig. 2.3 would be “h&er;”.

Fig. 2.3 Diacritical marks and abbreviation marks

Diacritical marks are often seen as forming an integral part of a base line character, and the whole is encoded as a single entity. This applies to accent marks, such as the one above “e” in fig. 2.3. This character is usually encoded as a single entity, in Unicode referred to as LATIN SMALL LETTER E WITH ACUTE, with the hexadecimal code value 00E9. As we shall see below, it is possible to decompose this letter in Unicode and refer to it as a combination of LATIN SMALL LETTER E and COMBINING ACUTE ACCENT. We would like to emphasize that the two encodings are equivalent.
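
The equivalence of the two encodings can be checked mechanically. The following is a minimal sketch, assuming Python and its standard unicodedata module; the function name canonically_equivalent is our own and not part of any standard.

```python
import unicodedata


def canonically_equivalent(first: str, second: str) -> bool:
    """Return True if both strings share the same NFC normal form."""
    return unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw code point sequences differ ...
print(composed == decomposed)                        # False
# ... but they normalise to the same sequence.
print(canonically_equivalent(composed, decomposed))  # True
# Decomposing the single character yields the two-part encoding.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

A plain string comparison treats the two encodings as different, so any tool that searches or compares transcriptions would need to normalise both sides first.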

Abbreviation marks, on the other hand, are usually treated as separate characters and encoded as entities in their own right. From a purely graphical point of view, the distinction between the acute accent in “é” and abbreviation marks such as the “zigzag” mark and the bar, both exemplified in fig. 2.3, is far from obvious, but the semantics are different. The acute accent may in some manuscripts be used to signify length, but it is often used quite freely, sometimes only to distinguish one minim character from another. Abbreviation marks have a definite (if sometimes ambiguous) meaning and can be expanded into one or more characters; the zigzag mark above “h” in fig. 2.3 signifies “er”, and the bar above “n” signifies another “n”.

 

2.2.1 Rules for encoding characters

We suggest the following basic rules for encoding characters, irrespective of whether they are ordinary (alphabetic) characters or abbreviation marks.

1. Each character is encoded according to its position in the direction of writing.

2. Alphabetical characters on the base line are encoded first. If the character belongs to the ordinary Latin character set a-z / A-Z (commonly known as ISO 646 or ASCII) it is encoded as such. If not, it is encoded as an entity, as explained in ch. 5 below.

3. Abbreviation marks occupying a separate position on the base line are encoded in the same manner as alphabetical characters. This applies to e.g. the Tironian nota for “et” (in Latin) or “ok” (in Old Norse), which is encoded with the entity “&et;” as explained in ch. 6.2.

4. Alphabetical characters with diacritical marks, e.g. “é”, are encoded in one of two equivalent ways:

4.1 As a base line character + one or more combining marks. Thus the character “é” would be encoded as “e” + “&combacute;” (the latter entity meaning COMBINING ACUTE ACCENT).
4.2 As a composite base line character and encoded with a single entity. Thus the character “é” would be encoded as “&eacute;”.

5. Characters with abbreviation marks are encoded in the same manner as alphabetical characters, i.e. in one of two equivalent ways:

5.1 As a base line character + one or more combining marks. Thus the first character in fig. 2.3 above would be encoded as “h” + “&er;” (the latter entity meaning COMBINING ABBREVIATION MARK “ER”).
5.2 As a composite base line character and encoded with a single entity. Thus the above character might be encoded with a single entity, e.g. as “&her;”.

As a rule, we would recommend the first solution, since the number of combinations of base line characters and combining abbreviation marks is very high. Cf. the discussion in ch. 6.3.

6. If there is more than one combining character, they are encoded in this order:

(a) Combinations with the base line character within the x height of the base line character.
(b) Combinations with the base line character outside its x height, but still in contact with it.
(c) Combinations with the base line character outside its x height and without any contact with it.

7. If there is more than one combining character in any of the three positions defined in (6) above, they are encoded in a clockwise direction, beginning at 6 o'clock and moving through 9 o'clock, 12 o'clock etc.
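
Rules (6) and (7) amount to a two-level sort, which may be sketched as follows. This is only an illustration under our own assumptions: the Mark class, the zone letters, the clock positions and the entity names in the example are all hypothetical and are not defined in this handbook.

```python
from dataclasses import dataclass

# Zones from rule 6: (a) within the x-height, (b) outside the x-height
# but touching the base character, (c) outside and without contact.
ZONE_ORDER = {"a": 0, "b": 1, "c": 2}


@dataclass(frozen=True)
class Mark:
    entity: str  # hypothetical entity name, for illustration only
    zone: str    # "a", "b" or "c" as in rule 6
    clock: int   # approximate clock position (1-12) around the base character


def clockwise_from_six(clock: int) -> int:
    """Rank a clock position so that 6 comes first, then 9, 12 and 3 (rule 7)."""
    return (clock - 6) % 12


def encoding_order(marks):
    """Sort combining marks first by zone (rule 6), then clockwise (rule 7)."""
    return sorted(marks, key=lambda m: (ZONE_ORDER[m.zone], clockwise_from_six(m.clock)))


marks = [
    Mark("&above;", "c", 12),
    Mark("&touch3;", "b", 3),
    Mark("&touch9;", "b", 9),
    Mark("&inner;", "a", 6),
]
print([m.entity for m in encoding_order(marks)])
# ['&inner;', '&touch9;', '&touch3;', '&above;']
```

The marks would then be written in this order immediately after the base line character.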

 

2.2.2 Entities and Unicode values

By using entities it is possible to define as many characters as one believes are necessary for the transcription of a certain corpus of texts. Entities are used in numerous encoding schemes, and for the sake of transparency and interchangeability, we recommend that entities as far as possible conform to the standard ISO entity sets. An updated list of ISO conformant entities can be found at the Oasis web site:

ISO entities

In addition to the ISO entities, we need a number of entities for characters not designated in this standard. The rules for constructing new entities are discussed in ch. 5 and 6 below.

Furthermore, entities need to be displayed by appropriate fonts. Therefore, we strongly recommend that all entities are defined and described with reference to the Unicode Standard. In this standard, each character is identified by a unique code point, exemplified by a typical graphic form (“glyph”), and given a descriptive name. An increasing number of fonts contain a large set of the characters in the Unicode Standard. This greatly facilitates the display of encoded texts.

The Basic Multilingual Plane of the Unicode Standard has 65,536 different code points. This includes a large Private Use Area (PUA), comprising 6,400 code points. This area is available for characters not (yet) defined in the Standard, and our present recommendation is to use it for such characters. It should be noted that the use of the PUA is an interim solution. A long-term solution is obviously to apply to Unicode for the inclusion of additional characters and/or use other rendering techniques (such as OpenType).

Code points in Unicode are usually given in hexadecimal format, in which each digit takes one of 16 values, 0-1-2-3-4-5-6-7-8-9-A-B-C-D-E-F. Thus, 0001 equals 1 in the decimal system, 000F equals 15, 0010 equals 16 etc. The whole range of the Basic Multilingual Plane thus goes from 0000 to FFFF (65,535 in decimal, giving 65,536 code points). The PUA is located at E000-F8FF.
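
The hexadecimal arithmetic is easy to verify. The short sketch below assumes Python; the helper names hex_to_int and in_pua are our own.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # Private Use Area in the Basic Multilingual Plane


def hex_to_int(code_point: str) -> int:
    """Convert a four-digit hexadecimal code point such as '00E9' to a decimal integer."""
    return int(code_point, 16)


def in_pua(code_point: str) -> bool:
    """Return True if the code point lies within E000-F8FF."""
    return PUA_START <= hex_to_int(code_point) <= PUA_END


print(hex_to_int("000F"))         # 15
print(hex_to_int("0010"))         # 16
print(hex_to_int("FFFF") + 1)     # 65536 code points, counting 0000
print(PUA_END - PUA_START + 1)    # 6400 code points in the PUA
print(in_pua("E406"), in_pua("00E9"))  # True False
```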

The Latin alphabet is the first to be described in the Unicode Standard. As was mentioned, many characters in Unicode can be defined in several ways, either as a single, composite character or as a combination of a base line character and one or more combining marks.

(a) Commonly used characters have a single description in Unicode. This applies to all base line characters in the Latin alphabet.

Glyph   Encoding   Code point   Unicode descriptive name
a       a          0061         LATIN SMALL LETTER A

(b) Composite characters may be described in more than one way. Thus “a with acute accent” can be encoded as a combination of “a” and a combining acute accent or as a single character, “a with acute accent”. Both descriptions are equivalent:

Glyph   Encoding          Code point    Unicode descriptive name
á       a + &combacute;   0061 + 0301   LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
á       &aacute;          00E1          LATIN SMALL LETTER A WITH ACUTE

Note that the entity &aacute; belongs to the ISO set, while &combacute; is an example of an entity defined in this handbook (cf. ch. 5 below for more information).

(c) Some characters are not found in Unicode and must therefore be allocated to the Private Use Area (PUA), either as a character with its own code point or as a combination of an existing character and a combining diacritical mark in the PUA. The ligature “av” is not included in Unicode, and since we (at the moment) would rather not encode it as a sequence of “a” + “zero width joiner” + “v”, we have allocated it to a code point in the PUA, E406.

Glyph         Encoding   Code point   Unicode descriptive name
(not shown)   &avlig;    E406         LATIN SMALL LIGATURE AV

This may look unnecessarily complicated. It should be borne in mind, however, that the great majority of characters are defined in Unicode, and in many transcriptions the need for special characters in the PUA will not arise. With appropriate fonts, the transcriber does not need to spend much time on the technicalities of this problem.

For a complete list of entities and Unicode code points, including PUA, cf. the character list.
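
The relation between entities and code points can be sketched as a simple lookup table. The example below assumes Python and contains only the entity and code point pairs mentioned in this chapter; the function name expand_entities is our own, and a real project would load the full character list instead of this small table.

```python
import re

# Entity names and code points as given in this chapter (a small excerpt only).
ENTITY_CODE_POINTS = {
    "aacute": 0x00E1,     # ISO entity: LATIN SMALL LETTER A WITH ACUTE
    "eacute": 0x00E9,     # ISO entity: LATIN SMALL LETTER E WITH ACUTE
    "combacute": 0x0301,  # handbook entity: COMBINING ACUTE ACCENT
    "avlig": 0xE406,      # handbook entity in the PUA: LATIN SMALL LIGATURE AV
}

ENTITY_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_entities(text: str) -> str:
    """Replace known entities with their characters; leave unknown ones untouched."""
    def replace(match):
        code_point = ENTITY_CODE_POINTS.get(match.group(1))
        return chr(code_point) if code_point is not None else match.group(0)
    return ENTITY_PATTERN.sub(replace, text)


print(expand_entities("a&combacute;") == "a\u0301")  # True
print(expand_entities("&aacute;") == "\u00e1")       # True
print(expand_entities("h&er;"))                      # h&er; (not in the table)
```

Leaving unknown entities in place, rather than dropping them, makes it easy to spot names that still lack a code point.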

 

2.2.3 Rules for naming of characters

For practical reasons, all characters needed for the transcription of medieval Nordic manuscripts should be given descriptive names. We have found the naming scheme in the Unicode Standard to be a good model. There are, however, a considerable number of characters which so far have not been defined and described in Unicode. For these characters we must resort to the Private Use Area, and we need rules for the naming of such characters.

Descriptive names have basically the same syntax as in rules (6) and (7) in ch. 2.2.1 above. The following examples refer to characters in the official Unicode Standard and thus serve to illustrate the naming scheme.

1. Base line character.

Glyph   Descriptive name
a       LATIN SMALL LETTER A

2. Modification of a base line character within its x-height.

Glyph   Descriptive name
ø       LATIN SMALL LETTER O WITH STROKE

3. Modification of a base line character touching the base character outside its x-height. As explained in ch. 2.2.2 above, this character can be encoded and described in two equivalent ways.

Glyph   Descriptive name
ǫ       LATIN SMALL LETTER O + COMBINING OGONEK
        = LATIN SMALL LETTER O WITH OGONEK

4. Modification of a base line character not touching the base line character itself. This character, too, can be encoded and described in two equivalent ways.

Glyph   Descriptive name
ǿ       LATIN SMALL LETTER O WITH STROKE + COMBINING ACUTE ACCENT
        = LATIN SMALL LETTER O WITH STROKE AND ACUTE

5. More than one modification. Here, there are in fact three equivalent ways of encoding and describing this character.

Glyph   Descriptive name
ǫ́       LATIN SMALL LETTER O + COMBINING OGONEK + COMBINING ACUTE ACCENT
        = LATIN SMALL LETTER O WITH OGONEK + COMBINING ACUTE ACCENT
        = LATIN SMALL LETTER O WITH OGONEK AND ACUTE

For a full discussion of characters and entity names, please refer to ch. 5 below.
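
The descriptive names in the examples above can be generated from the Unicode character database. The sketch below assumes Python's standard unicodedata module; the function name describe and the " + " joining convention are our own, chosen to mirror the notation used in this section.

```python
import unicodedata


def describe(text: str) -> str:
    """Join the Unicode names of all code points in a string with ' + '."""
    return " + ".join(unicodedata.name(ch) for ch in text)


# o + combining ogonek + combining acute accent, fully decomposed
decomposed = "o\u0328\u0301"
# The same sequence after canonical composition (NFC)
composed = unicodedata.normalize("NFC", decomposed)

print(describe(decomposed))
# LATIN SMALL LETTER O + COMBINING OGONEK + COMBINING ACUTE ACCENT
print(describe(composed))
# LATIN SMALL LETTER O WITH OGONEK + COMBINING ACUTE ACCENT

# o with stroke + combining acute accent composes into one character
print(describe(unicodedata.normalize("NFC", "\u00f8\u0301")))
# LATIN SMALL LETTER O WITH STROKE AND ACUTE
```

Note that normalisation stops at the second form for “o with ogonek and acute”: as far as we are aware, the character database used by Python has no single code point for the fully composite form, so that name cannot be produced this way.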

 

2.3 Words

2.3.1 Basic mark-up

As a rule, medieval Nordic manuscripts in the Latin alphabet are written with a clearly identifiable space between each word. This obviously facilitates the work of the transcriber, since the word is a basic linguistic unit in grammars and dictionaries. In a simple transcription, word division can simply be entered with the space bar on the keyboard. Thus, a piece of text (from Barlaams ok Josaphats saga ch. 48) might be transcribed as

En ef ver fallum i hinar fornno syndir oc huerfum aptr til hinna fyrrv misverka sem hundr til spyu sinnar þa kann lettlega at vera at oss kunni til hannda at berazt sem i guðspialleno segir.

Here, each word is delimited by a space (or a punctuation mark). However, for a more detailed analysis it can be convenient to identify each word with a separate <w> element (for “word”). The <w> element thus functions as a container for information on levels of text representation (cf. ch. 3 below) and grammatical analysis (cf. ch. 8). In this example, each word has been identified by the <w> element, and the lemma (dictionary entry) specified as an attribute to the <w> element:

<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="v&eacute;r">ver</w>
<w lemma="falla">fallum</w>
<w lemma="&iacute;">i</w>
<w lemma="hinn">hinar</w>
<w lemma="forn">fornno</w>
<w lemma="synd">syndir</w>
etc.
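
Once each word is wrapped in a <w> element, forms and lemmata can be read out by any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module. The enclosing <s> element is a hypothetical wrapper added only to make the fragment well-formed, and the entities have been replaced by literal characters so that no entity declarations are needed.

```python
import xml.etree.ElementTree as ET

# The words from the example above, inside a hypothetical <s> wrapper.
SAMPLE = """<s>
<w lemma="en">En</w>
<w lemma="ef">ef</w>
<w lemma="vér">ver</w>
<w lemma="falla">fallum</w>
<w lemma="í">i</w>
<w lemma="hinn">hinar</w>
<w lemma="forn">fornno</w>
<w lemma="synd">syndir</w>
</s>"""


def forms_and_lemmata(xml_text: str):
    """Return (form, lemma) pairs for every <w> element in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma")) for w in root.iter("w")]


def running_text(xml_text: str) -> str:
    """Rebuild the simple transcription by joining the word forms with spaces."""
    return " ".join(form for form, _ in forms_and_lemmata(xml_text))


print(forms_and_lemmata(SAMPLE)[2])  # ('ver', 'vér')
print(running_text(SAMPLE))          # En ef ver fallum i hinar fornno syndir
```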

TEI P5 aims to restrict the use of attributes. Rather than giving the lemma as an attribute to the <w> element, the lemma might be given in a separate <lemma> element within the <w> element. And, as argued in ch. 3, the level of text representation might be specified, e.g. with a <dipl> element for a diplomatic transcription (i.e. one that is fairly close to the source):

<w>
<lemma>en</lemma>
<dipl>En</dipl>
</w>

<w>
<lemma>ef</lemma>
<dipl>ef</dipl>
</w>

<w>
<lemma>v&eacute;r</lemma>
<dipl>ver</dipl>
</w>

<w>
<lemma>falla</lemma>
<dipl>fallum</dipl>
</w>

<w>
<lemma>&iacute;</lemma>
<dipl>i</dipl>
</w>

<w>
<lemma>hinn</lemma>
<dipl>hinar</dipl>
</w>

<w>
<lemma>forn</lemma>
<dipl>fornno</dipl>
</w>

<w>
<lemma>synd</lemma>
<dipl>syndir</dipl>
</w>

Ch. 3 will discuss further levels of transcription (facsimile and normalised), and ch. 8 how words can be marked for morphological categories.
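
The two styles of encoding shown in this section carry the same information, so one can be derived from the other. The sketch below assumes Python's standard xml.etree.ElementTree module; the function name to_element_style is our own, and the conversion is only an illustration, not a tool provided with this handbook.

```python
import xml.etree.ElementTree as ET


def to_element_style(word: ET.Element) -> ET.Element:
    """Convert <w lemma="X">Y</w> into <w><lemma>X</lemma><dipl>Y</dipl></w>."""
    converted = ET.Element("w")
    lemma = ET.SubElement(converted, "lemma")
    lemma.text = word.get("lemma")
    dipl = ET.SubElement(converted, "dipl")
    dipl.text = word.text
    return converted


attribute_style = ET.fromstring('<w lemma="falla">fallum</w>')
element_style = to_element_style(attribute_style)
print(ET.tostring(element_style, encoding="unicode"))
# <w><lemma>falla</lemma><dipl>fallum</dipl></w>
```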

2.3.2 Deviations in word division (words written together or apart)

Although words as a rule are separated by spaces in medieval Nordic manuscripts, there are many exceptions to this rule. For this reason, a distinction should be drawn between graphical words and lexical words. A graphical word is a sequence set out by space on either side, while a lexical word is a member of the set of word forms defined by grammars and dictionaries for the language in question. In the great majority of cases, graphical and lexical words are identical. However, we sometimes see that a preposition and its object may be written as a single word (“aveiðiskap” = “á veiðiskap”), or that compounds are written as two separate words (“veiði kona” = “veiðikona”).

Fig. 2.4 Text adapted from Barlaams ok Josaphats saga, Holm perg. fol. nr. 6, f. 138

If the transcriber wishes to analyse two (or more) graphical words as a single lexical word, we suggest that this is done by putting the whole sequence within the <w> element:

<w>vei&eth;i kona</w>

Information on e.g. lemma can be given either as an attribute to the <w> element,

<w lemma="vei&eth;ikona">vei&eth;i kona</w>

or by using a separate <lemma> element (as shown in 2.3.1 above):

<w>
<lemma>vei&eth;ikona</lemma>
<dipl>vei&eth;i kona</dipl>
</w>

In both examples, the sequence “vei&eth;i kona” appears within a single element. This means that the transcriber interprets it as one lexical word, “veiðikona”. The space is left untouched, so that in a display of the transcription, the sequence will still show up as two graphical words, “veiði” and “kona”. However, since both graphical words are placed within a single element the lemma will refer to both parts.

The converse case is a single graphical word which the transcriber would like to analyse as two (or more) lexical words, e.g. “aveiðiskap” = “á veiðiskap”. Each lexical word should be placed within a <w> element, and information on lemma, morphological form etc. can be given within each <w> element. However, to generate a correct display of the text, i.e. a display with no space between each part, we suggest that the <seg> element is used with a type attribute. The value “nb” would indicate that there is no break between the parts in the <w> element. If the lemma is given by way of an attribute, the encoding would look like this:

<seg type="nb">
<w lemma="&aacute;">a</w>
<w lemma="vei&eth;iskap">vei&eth;iskap</w>
</seg>

Alternatively, the lemma may be specified in a separate element:

<seg type="nb">
<w>
<lemma>&aacute;</lemma>
<dipl>a</dipl>
</w>
<w>
<lemma>vei&eth;iskap</lemma>
<dipl>vei&eth;iskap</dipl>
</w>
</seg>

In some rather marginal cases, a sequence may combine both types of deviation. A simplified example from Codex Regius is “aravk stola”, which should be read as “a ravkstola”. This sequence might be encoded in this way:

<seg type="nb">
<w>
<lemma>&aacute;</lemma>
<dipl>a</dipl>
</w>
<w>
<lemma>r&oogon;kst&oacute;ll</lemma>
<dipl>ravk stola</dipl>
</w>
</seg>

This encoding shows that “a” in “aravk stola” is a lexical word, sc. the preposition “á”, and that “ravk stola” is another lexical word, sc. the noun “rökstóll” (for practical reasons, “ö” is used here rather than “o ogonek”). It will also allow a correct display of the sequence, since it specifies that there should be no space between “a” and “ravk stola”, and the space between “ravk” and “stola” is also encoded (analogous to the encoding of “veiði kona” above).
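
How a display might be generated from this encoding can be sketched as follows. The example assumes Python's standard xml.etree.ElementTree module; the function name display and the enclosing <s> element are our own assumptions, and the entities have been replaced by literal characters.

```python
import xml.etree.ElementTree as ET


def display(container: ET.Element) -> str:
    """Join the <dipl> readings of a container element for display.

    Words inside <seg type="nb"> are run together without a space, all
    other words are separated by a single space, and any space inside a
    <dipl> reading is kept as it stands.
    """
    parts = []
    for child in container:
        if child.tag == "seg" and child.get("type") == "nb":
            words = child.findall("w")
            parts.append("".join(w.findtext("dipl", default="") for w in words))
        elif child.tag == "w":
            parts.append(child.findtext("dipl", default=""))
    return " ".join(parts)


# The Codex Regius example, inside a hypothetical <s> wrapper.
SAMPLE = """<s>
<seg type="nb">
<w><lemma>á</lemma><dipl>a</dipl></w>
<w><lemma>rǫkstóll</lemma><dipl>ravk stola</dipl></w>
</seg>
</s>"""

print(display(ET.fromstring(SAMPLE)))  # aravk stola
```

The lemmata remain attached to the two lexical words, while the display reproduces the two graphical words of the manuscript.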

2.3.3 Encoding of word constituents

The encoder might want to encode constituent parts of a word, e.g. prefixes, roots, derivational forms etc. We recommend using the <m> element (for "morpheme") in such cases (cf. TEI Guidelines ch. 15.1). This element may also be used for constituent parts such as “veiði” and “kona” in the examples above. It allows encoding of constituent parts in greater detail, e.g. with information on level of text representation, lemma etc. We shall repeat the encoding of “veiði kona” above:

<w>
<lemma>vei&eth;ikona</lemma>
<dipl>vei&eth;i kona</dipl>
</w>

Now, if the encoder wishes to add lexicographical (or other) information on the two constituent parts, that can easily be done by inserting <m> elements in the <w> element:

<w>
<lemma>vei&eth;ikona</lemma>
<dipl>vei&eth;i kona</dipl>
<m>
<lemma>vei&eth;i</lemma>
<dipl>vei&eth;i</dipl>
</m>
<m>
<lemma>kona</lemma>
<dipl>kona</dipl>
</m>
</w>

This encoding would make a clear distinction between lemmata on the first level of encoding, in this case “veiðikona”, and lemmata referring to constituent parts, “veiði” and “kona”.
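
The distinction between the two levels of lemmata can be retrieved directly from the element structure. The sketch below assumes Python's standard xml.etree.ElementTree module; the function names word_lemma and constituent_lemmata are our own, and the entities have been replaced by literal characters.

```python
import xml.etree.ElementTree as ET

WORD = """<w>
<lemma>veiðikona</lemma>
<dipl>veiði kona</dipl>
<m><lemma>veiði</lemma><dipl>veiði</dipl></m>
<m><lemma>kona</lemma><dipl>kona</dipl></m>
</w>"""


def word_lemma(word: ET.Element):
    """Return the lemma of the whole word (a direct child of <w>)."""
    return word.findtext("lemma")


def constituent_lemmata(word: ET.Element):
    """Return the lemmata of the constituent parts (children of <m>)."""
    return [m.findtext("lemma") for m in word.findall("m")]


w = ET.fromstring(WORD)
print(word_lemma(w))           # veiðikona
print(constituent_lemmata(w))  # ['veiði', 'kona']
```

Because the path "lemma" matches direct children only, the lemma of the word is never confused with the lemmata of its parts.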

Lemmatisation is further discussed in ch. 8 below and is here only given as an example of a word-based type of mark-up. Grammatical information can also be conveniently attached to the word through the pos (part of speech) attribute. This is also discussed in ch. 8.

 


 

Version 1.0 published 20 May 2003. Version 2.0 beta published 15 September 2004.