Medieval Unicode Font Initiative


Disclaimer: This site is managed by scholars in Medieval studies with the aim of establishing a consensus on the use of Unicode among medievalists. It is not affiliated with or endorsed by Unicode.


A proposal for subranges within the Private Use Area of Unicode:
Supplements to the Latin alphabet for Medieval texts

by Odd Einar Haugen, Department of Scandinavian languages and literature, University of Bergen
Version 1.0, 15 June 2002

 

Background

Unicode is a standard for encoding characters in as many languages on as many computer platforms as possible. A computer which supports Unicode can save and print a huge number of characters, and files can easily be transferred from one computer to another without loss or change of any characters. In version 3.2 of Unicode there are no fewer than 95156 characters, which cover the scripts of the principal written languages of the world (including Chinese, Japanese and Korean), as well as many mathematical and other symbols.

Unicode started out as a 16-bit system of (256 x 256) = 65536 characters. This is often referred to as the BMP (Basic Multilingual Plane), and fonts with a smaller or larger selection of these characters can be found on a large number of computers, especially within the Windows community (although the users may not know that they have a Unicode font on their computer). Unicode has since added 16 supplementary planes, each of the same size as the BMP, so that the total number of possible characters exceeds one million.
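The arithmetic behind "exceeds one million" can be checked in a few lines. This is only an illustrative sketch; the constant names are chosen here and are not part of Unicode or of this proposal. Note that the result counts code points (character slots), not characters actually assigned.

```python
# Each Unicode plane holds 256 x 256 code points.
PLANE_SIZE = 256 * 256

# The BMP plus the 16 supplementary planes.
PLANE_COUNT = 1 + 16

total_code_points = PLANE_SIZE * PLANE_COUNT

print(PLANE_SIZE)         # 65536
print(total_code_points)  # 1114112
```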

The BMP of Unicode contains a large repertoire of characters in the Latin alphabet; in addition to the base characters a-z /A-Z there are several hundred characters with one or more diacritical marks, such as accents, dots, hooks, strokes etc. All modern languages using the Latin alphabet are now well covered, and an increasing number of historical languages.

 

Characters for medievalists

Many medieval texts can easily be transcribed and edited using existing Unicode characters. For diplomatic transcriptions and editions, however, there are many characters not yet covered by Unicode. Until Unicode decides to include these characters in one of its official ranges, medievalists must resort to the Private Use Area. This is an area which will not be used for any official characters by Unicode, and which is open to scholarly and other types of communities. Characters in this area can be included in fonts for several computer platforms (e.g. Windows and Mac OS X) and they can be designed with several font editors (e.g. FontLab, for Windows and Mac).

 

Selection of characters

Only characters not defined by Unicode in one of its official ranges should be allocated to the Private Use Area. The selection of characters has so far been based on three sources:

1. The list of characters defined for the Medieval Nordic Text Archive in the Menota handbook, chs. 5 and 6.
2. The list of characters in the fonts ReykjavikTimes and AkureyriTimes, widely used in the Old Norse community (e.g. in publications by the Arnamagnæan Institutes in Reykjavík and Copenhagen).
3. The list of characters in the Private Use Area of Peter S. Baker's Junicode font for Old English.

The additional characters in the Cardo font by David J. Perry are being considered for inclusion.

Not all characters in these fonts are included, only those thought to be relevant for transcriptions and editions of Medieval primary sources.

Suggestions for additional characters (or, indeed, ranges) are most welcome. The proposal now comprises 340 characters.

 

Location of the Private Use Area

This proposal refers to the Private Use Area of the Basic Multilingual Plane, E000 - F8FF (57344 - 63743); note that Unicode refers to each character slot with a hexadecimal number. In this proposal, the Private Use Area has been subdivided into 25 subranges, each containing 256 code points (character slots). Presently, the proposal comprises 12 subranges with anything from 4 to 183 characters in each range. This means that there is ample room for additions in each range, and it is furthermore possible to define another 13 subranges, each containing 256 code points, all within the bounds of the Private Use Area of the Basic Multilingual Plane.
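The subdivision described above can be reproduced with a short sketch. The function name pua_subranges and the constants are illustrative choices made here, not part of the proposal; only the boundaries E000 and F8FF and the block size of 256 are taken from the text.

```python
PUA_START = 0xE000      # first code point of the BMP Private Use Area
PUA_END = 0xF8FF        # last code point of the BMP Private Use Area
SUBRANGE_SIZE = 256     # code points per subrange in this proposal


def pua_subranges():
    """Return (number, first, last) for each 256-code-point block."""
    blocks = []
    first = PUA_START
    number = 1
    while first <= PUA_END:
        blocks.append((number, first, first + SUBRANGE_SIZE - 1))
        first += SUBRANGE_SIZE
        number += 1
    return blocks


blocks = pua_subranges()
for number, first, last in blocks[:3]:
    print(f"{number:2d}  {first:04X} - {last:04X}")

print(len(blocks))                  # 25
print(PUA_END - PUA_START + 1)      # 6400
```

Blocks 1 to 12 correspond to the subranges defined so far; blocks 13 to 25 are the ones still free.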

 

Structure of the proposal

The proposal is divided into 12 subranges. In addition to a short introduction, each subrange contains a table with four fields for each proposed character:

(a) Glyph (i.e. an image of the character), based on Courier
Please note that these glyphs are supplied only for the sake of illustration. The glyphs in subrange 11 are based on Peter S. Baker's Junicode, as are a few of the Old English glyphs in subrange 1 (slightly modified to fit in with Courier). Courier is hardly Medieval in style, so many glyphs may look a little odd, but not too foreign, I hope.

(b) Suggested entity name, for use in SGML or XML documents.
These names are actually heavily abbreviated forms of the descriptive names suggested in (d) below. For more information on the structure of entity names, see the Menota handbook ch. 2. Note that all entity names must begin with "&" and end with ";".

(c) Suggested Unicode code points
Presently, code points have only been specified for the first character in each range. Note that code points are given in the hexadecimal number format.

(d) Suggested descriptive name
The vocabulary and syntax of the descriptive names are based on Unicode. In the Unicode standard, the small "a" in the Latin alphabet is described as LATIN SMALL LETTER A, the small "a" with acute accent as LATIN SMALL LETTER A WITH ACUTE, etc. By analogy, the small "a" with a macron and breve - not yet included in Unicode - is described in this proposal as LATIN SMALL LETTER A WITH MACRON AND BREVE, etc.
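The naming pattern can be inspected with Python's standard unicodedata module, which returns the official Unicode names. The helper proposed_name below is hypothetical: it merely imitates the "WITH ... AND ..." pattern for characters that Unicode does not define, and is not an official naming procedure.

```python
import unicodedata

# Official Unicode names for characters that are already encoded.
for ch in ("a", "\u00e1"):
    print(f"{ord(ch):04X}  {unicodedata.name(ch)}")
# 0061  LATIN SMALL LETTER A
# 00E1  LATIN SMALL LETTER A WITH ACUTE


def proposed_name(base_name, *marks):
    """Build a descriptive name by analogy with the Unicode pattern.

    Hypothetical helper: joins the marks with ' AND ' after ' WITH '.
    """
    if not marks:
        return base_name
    return base_name + " WITH " + " AND ".join(marks)


print(proposed_name("LATIN SMALL LETTER A", "MACRON", "BREVE"))
# LATIN SMALL LETTER A WITH MACRON AND BREVE
```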

 

Base and precomposed characters  

As mentioned above, Unicode contains several hundred characters of the Latin alphabet with diacritical marks, such as á (a with acute), ê (e with circumflex), ö (o with diaeresis) and many, many more. This is a legacy from the code pages of ISO, beginning with Basic Latin and continuing with several additions, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional (cf. the overview in The ISO 8859 Alphabet Soup). Since characters such as á can be analysed as a combination of a base character, "a", and a combining diacritical mark, "´", they are often referred to as precomposed.

The number of possible combinations of Latin characters and diacritics is very high. In fact, it is so high that Unicode is very reluctant to accept new precomposed characters in its official ranges. For example, Unicode has defined "o ogonek" (the ogonek looks like an inverted cedilla) as a character of its own (01EB). However, there is no "o ogonek with acute" in any official Unicode range, and there may never be one, although this character is frequently used in Old Norse. Unicode prescribes that such characters should be encoded as a combination of a base character, e.g. "o ogonek", and one or more combining marks, in this case a combining acute accent.
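The difference between a precomposed character and a combining sequence can be observed with the normalization functions in Python's standard unicodedata module. This is a minimal sketch; the variable names are chosen here for illustration, and the combining acute accent is assumed to be 0301 (COMBINING ACUTE ACCENT).

```python
import unicodedata

# A precomposed character can be decomposed into base + combining mark.
a_acute = "\u00e1"                       # LATIN SMALL LETTER A WITH ACUTE
decomposed = unicodedata.normalize("NFD", a_acute)
print([f"{ord(c):04X}" for c in decomposed])   # ['0061', '0301']

# "o ogonek with acute" has no precomposed form in Unicode, so even
# after composition (NFC) it remains a sequence of two code points.
o_ogonek_acute = "\u01eb\u0301"          # o ogonek + combining acute
recomposed = unicodedata.normalize("NFC", o_ogonek_acute)
print([f"{ord(c):04X}" for c in recomposed])   # ['01EB', '0301']
```

A precomposed slot in the Private Use Area would replace the two-code-point sequence with a single code point, at the price of being understood only by fonts and software that follow the same private agreement.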

This is fine in theory, and obviously saves a lot of space. In practice, however, there are hardly any computer platforms and applications which can handle combining characters as easily as the precomposed ones. For more information on how (and on which platforms) combining Unicode characters work, see the paper Unicode Polytonic Greek for the World Wide Web by Patrick Rourke and the booklet Word processing in Classical languages by David J. Perry (available as PDF download).

In this proposal, subranges 2 and 9 contain only precomposed characters, while the other subranges contain various base characters.

 

Proposed subranges

No.  Name of range                          Inventory  Allocated span
1    Mixed script characters                       19  E000 - E0FF
2    Precomposed diacritical characters           183  E100 - E1FF
3    Small capitals                                19  E200 - E2FF
4    Enlarged minuscules                           28  E300 - E3FF
5    Ligatures                                     15  E400 - E4FF
6    Punctuation marks                              4  E500 - E5FF
7    Base line abbreviation marks                  15  E600 - E6FF
8    Combining abbreviation marks                  11  E700 - E7FF
9    Precomposed abbreviated characters             8  E800 - E8FF
10   Superscript (interlinear) characters          22  E900 - E9FF
11   Metrical symbols                              12  EA00 - EAFF
12   Critical and epigraphical signs                4  EB00 - EBFF

Total number of characters included in this proposal: 340
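For readers who want to work with these subranges programmatically, the figures above can be held in a small data structure. This is a sketch only: the names SUBRANGES and subrange_for are invented here, and the data are copied from the list of proposed subranges, with each span assumed to cover 256 code points from its first code point.

```python
# (number, name, inventory, first code point of the allocated span)
SUBRANGES = [
    (1, "Mixed script characters", 19, 0xE000),
    (2, "Precomposed diacritical characters", 183, 0xE100),
    (3, "Small capitals", 19, 0xE200),
    (4, "Enlarged minuscules", 28, 0xE300),
    (5, "Ligatures", 15, 0xE400),
    (6, "Punctuation marks", 4, 0xE500),
    (7, "Base line abbreviation marks", 15, 0xE600),
    (8, "Combining abbreviation marks", 11, 0xE700),
    (9, "Precomposed abbreviated characters", 8, 0xE800),
    (10, "Superscript (interlinear) characters", 22, 0xE900),
    (11, "Metrical symbols", 12, 0xEA00),
    (12, "Critical and epigraphical signs", 4, 0xEB00),
]


def subrange_for(code_point):
    """Return (number, name) of the subrange holding code_point, or None."""
    for number, name, _inventory, first in SUBRANGES:
        if first <= code_point <= first + 0xFF:
            return number, name
    return None


total = sum(inventory for _n, _name, inventory, _first in SUBRANGES)
print(total)                    # 340
print(subrange_for(0xE412))     # (5, 'Ligatures')
print(subrange_for(0xEC00))     # None
```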


Notes

1. Presently, there are three areas for private use in Unicode with a total of 137468 code points:
(1) Private Use Area of Plane 0 (Basic Multilingual Plane): E000 - F8FF (57344 - 63743) containing 6400 code points.
(2) Private Use Plane 15: F0000 - FFFFD (983040 - 1048573) containing 65534 code points.
(3) Private Use Plane 16: 100000 - 10FFFD (1048576 - 1114109) containing another 65534 code points.

Each code point can accommodate one unique character, including combining characters (such as diacritical marks), so that the number of possible characters is extremely high.
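The sizes given in note 1 can be verified as follows. The list name PRIVATE_USE_AREAS is an illustrative choice; the boundaries are those stated in the note, and the check of the general category "Co" (private use) relies on Python's standard unicodedata module.

```python
import unicodedata

# (label, first code point, last code point), as given in note 1
PRIVATE_USE_AREAS = [
    ("Plane 0 (BMP)", 0xE000, 0xF8FF),
    ("Plane 15", 0xF0000, 0xFFFFD),
    ("Plane 16", 0x100000, 0x10FFFD),
]

sizes = {label: last - first + 1 for label, first, last in PRIVATE_USE_AREAS}
for label, size in sizes.items():
    print(f"{label}: {size}")
# Plane 0 (BMP): 6400
# Plane 15: 65534
# Plane 16: 65534

total_private_use = sum(sizes.values())
print(total_private_use)        # 137468

# The boundary code points all carry the general category "Co".
boundary_categories = [
    unicodedata.category(chr(cp))
    for _label, first, last in PRIVATE_USE_AREAS
    for cp in (first, last)
]
print(boundary_categories)
```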

2. For an explanation of the hexadecimal format, see e.g. this presentation. Unicode code points are often prefixed with "U+", e.g. U+E000 for the first code point of the Private Use Area, but in this proposal the prefix will be left out for the sake of brevity. Please note that all four-digit numbers in this proposal are given in the hexadecimal format.
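Converting between the hexadecimal notation used in this proposal and ordinary decimal numbers is straightforward. The two helper functions below are hypothetical conveniences written for this illustration; they accept or produce code points with or without the "U+" prefix.

```python
def parse_code_point(text):
    """Accept 'E000' or 'U+E000' and return the integer code point."""
    cleaned = text.strip().upper()
    if cleaned.startswith("U+"):
        cleaned = cleaned[2:]
    return int(cleaned, 16)


def format_code_point(value, with_prefix=False):
    """Return the code point as at least four hexadecimal digits."""
    digits = f"{value:04X}"
    return "U+" + digits if with_prefix else digits


print(parse_code_point("E000"))             # 57344
print(parse_code_point("U+F8FF"))           # 63743
print(format_code_point(57344))             # E000
print(format_code_point(63743, True))       # U+F8FF
```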


Version 1.0, 15 June 2002 OEH