|
A proposal for subranges within the Private Use Area of Unicode:
Supplements to the Latin alphabet for Medieval texts
by Odd Einar Haugen, Department of Scandinavian languages and literature, University of Bergen
Version 1.0, 15 June 2002
Background
Unicode
is a standard for encoding characters in as many
languages on as many computer platforms as
possible. A computer which supports Unicode can
save and print a huge number of characters, and
files can easily be transferred from one computer
to another without loss or change of any
characters. In version 3.2 of Unicode there
are no fewer than 95,156 characters, which cover the
scripts of the principal written languages of the
world (including Chinese, Japanese and Korean), as
well as many mathematical and other
symbols.
Unicode started
out as a 16-bit system of 256 x 256 = 65,536
characters. This is often referred to as the BMP
(Basic Multilingual Plane), and fonts with a
smaller or larger selection of these characters can
be found on a large number of computers, especially
within the Windows community (although the users
may not know that they have a Unicode font on their
computer). Now, Unicode has added 16 supplementary
planes of equal size to the BMP, so that the total
number of possible characters exceeds one
million.
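The arithmetic behind these figures can be checked with a few lines of Python. This is only an illustrative sketch; the constant names are chosen here and do not come from the Unicode standard.

```python
# Size of one plane: 16 bits, i.e. 256 x 256 code points.
PLANE_SIZE = 256 * 256

# The BMP plus the 16 supplementary planes mentioned above.
NUM_PLANES = 1 + 16

total_code_points = PLANE_SIZE * NUM_PLANES

print(PLANE_SIZE)                   # size of the BMP
print(total_code_points)            # total number of code points
print(hex(total_code_points - 1))   # highest code point, in hexadecimal
```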
The BMP of
Unicode contains a large repertoire of characters
in the Latin alphabet; in addition to the base
characters a-z /A-Z there are several hundred
characters with one or more diacritical marks, such
as accents, dots, hooks, strokes etc. All modern
languages using the Latin alphabet are now well
covered, as are an increasing number of historical
languages.
Characters for medievalists
Many medieval
texts can easily be transcribed and edited using
existing Unicode characters. For diplomatic
transcriptions and editions, however, there are
many characters not yet covered by Unicode. Until
Unicode decides to include these characters in one
of its official ranges, medievalists must resort to
the Private Use Area.
This is an area which will not be used for any
official characters by Unicode, and which is open
to scholarly and other types of communities.
Characters in this area can be included in fonts
for several computer platforms (e.g. Windows and
Mac OS X), and they can be designed in several font
editors (e.g. FontLab, for Windows and Mac).
Selection of characters
Only characters
not defined by Unicode in one of its official
ranges should be allocated to the Private Use Area.
The selection of characters has so far been based
on three sources:
1. The list of characters defined for the Medieval Nordic Text
Archive in the Menota handbook, chs. 5 and 6.
2. The list of characters in the fonts ReykjavikTimes and
AkureyriTimes, widely used in the Old Norse community (e.g. in
publications by the Arnamagnæan Institutes in
Reykjavík and Copenhagen).
3. The list of characters in the Private Use Area
of Peter S. Baker's Junicode font for Old English.
The additional
characters in the Cardo
font by David J. Perry are being considered for
inclusion.
Not all
characters in these fonts are included; only those
which are thought to be relevant for transcriptions
and editions of Medieval primary
sources.
Suggestions for
additional characters (or, indeed, ranges) are most
welcome. The proposal now comprises 340
characters.
Location of the Private Use Area
This proposal
refers to the Private Use Area of the Basic
Multilingual Plane, E000 - F8FF (decimal 57344 -
63743); note that Unicode refers to each character
slot by a hexadecimal number. In this proposal, the Private Use Area has
been subdivided into 25 subranges each containing
256 code points (character slots). At present, the
proposal comprises 12 subranges with between
4 and 169 characters in each range. This means that
there is ample room for additions in each range,
and it is furthermore possible to define another 13
subranges, each containing 256 code points, all
within the bounds of the Private Use Area of the
Basic Multilingual Plane.
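The subdivision described above can be sketched in Python as follows. The sketch assumes that the 256-code-point blocks are laid out consecutively from E000; the block index (0 to 24) and the function name `block_bounds` are illustrative only, and the sketch does not say which block corresponds to which numbered subrange of the proposal.

```python
PUA_START = 0xE000   # first code point of the BMP Private Use Area
PUA_END = 0xF8FF     # last code point of the BMP Private Use Area
BLOCK_SIZE = 256     # code points per subrange

PUA_SIZE = PUA_END - PUA_START + 1
NUM_BLOCKS = PUA_SIZE // BLOCK_SIZE


def block_bounds(index):
    """Return (first, last) code point of the block with a 0-based index.

    Assumption: blocks are consecutive and aligned on multiples of 256.
    """
    if not 0 <= index < NUM_BLOCKS:
        raise ValueError(f"block index must be in 0..{NUM_BLOCKS - 1}")
    first = PUA_START + index * BLOCK_SIZE
    return first, first + BLOCK_SIZE - 1


# List every block in the hexadecimal notation used by Unicode.
for i in range(NUM_BLOCKS):
    first, last = block_bounds(i)
    print(f"block {i:2d}: {first:04X} - {last:04X}")
```

With 12 subranges in use, the remaining 13 blocks are still free, which matches the count given above.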
Structure of the proposal
The proposal is
divided into 12 subranges. In addition to a short
introduction, each subrange contains a table with
four fields for each proposed character:
(a) Glyph (i.e. an image of the character), based on Courier.
Please note that these glyphs are supplied only for
the sake of illustration. The glyphs in subrange 11 are
based on Peter S. Baker's Junicode, as are a few of
the Old English glyphs in subrange 1
(slightly modified to fit in with Courier). Courier
is hardly medieval in style, so many glyphs
may look a little odd, though not, I hope, too foreign.
(b) Suggested entity name, for use in SGML or XML documents.
These names are heavily abbreviated forms
of the descriptive names suggested in (d) below.
For more information on the structure of entity
names, see the Menota handbook, ch. 2. Note that,
when used in a document, every entity name must be
preceded by "&" and followed by ";".
(c) Suggested Unicode code points.
Presently, code points have only been specified for
the first character in each range. Note that code
points are given in hexadecimal notation.
(d) Suggested descriptive name.
The vocabulary and syntax of the descriptive names
are based on Unicode. In the Unicode standard, the
small "a" in the Latin alphabet is described as
LATIN SMALL LETTER A, the small "a" with acute
accent as LATIN SMALL LETTER A WITH ACUTE, etc. By
analogy, the small "a" with a macron and breve -
not yet included in Unicode - is described in this
proposal as LATIN SMALL LETTER A WITH MACRON AND
BREVE, etc.
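Fields (b) to (d) can be illustrated with Python's standard `unicodedata` module, which knows the official Unicode names. This is a minimal sketch: the helper names `describe` and `entity_reference` are invented here, and the entity name "example" is a placeholder, not a name taken from the proposal or from the Menota handbook.

```python
import unicodedata


def describe(char):
    """Return the hexadecimal code point and the official Unicode name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


def entity_reference(name):
    """Wrap an entity name in the "&" and ";" delimiters."""
    return f"&{name};"


print(describe("a"))        # the small "a" of the Latin alphabet
print(describe("\u00e1"))   # the small "a" with acute accent
print(entity_reference("example"))  # "example" is a hypothetical name
```

Characters in the Private Use Area have no official name, so `unicodedata.name` raises `ValueError` for them unless a default is supplied; the descriptive names in this proposal are suggestions that follow the same pattern.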
Base and precomposed characters
As mentioned
above, Unicode contains several hundred characters
of the Latin alphabet with diacritical marks, such
as á (a with acute), ê
(e with circumflex), ö (o with
diaeresis) and many, many more. This is a legacy
from the code pages of ISO, beginning with
Basic Latin and continuing with several additions,
Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
and Latin Extended Additional
(cf. the overview in The ISO 8859 Alphabet Soup).
Since characters such as á can be
analysed as a combination of a base character, "a",
and a combining diacritical mark, "´", they
are often referred to as precomposed.
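The relation between a precomposed character and its parts can be shown with the normalization functions in Python's standard `unicodedata` module. The sketch below is illustrative only; the variable names are chosen here.

```python
import unicodedata

precomposed = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

# NFD (canonical decomposition) splits the precomposed character
# into its base character and its combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

for char in decomposed:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")

# NFC (canonical composition) joins the two parts again.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)
```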
The number of
possible combinations of Latin characters and
diacritics is very high. In fact, it is so high
that Unicode is very reluctant to accept new
precomposed characters in its official ranges. For
example, Unicode has defined "o ogonek" (the ogonek
looks like an inverted cedilla) as a character of
its own (01EB). However, there is no "o ogonek with
acute" in any official Unicode range, and there may
never be one, although this character is
frequently used in Old Norse. Unicode prescribes
that such characters should be encoded as a
combination of a base character, e.g. "o ogonek",
and one or more combining marks, in this case a
combining acute accent.
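The "o ogonek with acute" case can be tried out in Python. The following sketch builds the sequence prescribed by Unicode and shows that canonical composition (NFC) cannot reduce it to a single code point, since no precomposed form exists. The variable names are illustrative.

```python
import unicodedata

O_OGONEK = "\u01eb"         # LATIN SMALL LETTER O WITH OGONEK
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Base character followed by one combining mark.
sequence = O_OGONEK + COMBINING_ACUTE

# NFC composes wherever a precomposed character is available.
composed = unicodedata.normalize("NFC", sequence)

print(len(composed))  # number of code points after composition
for char in composed:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
```

The result still consists of two code points, so software must position the accent over the base character at display time.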
This is fine in
theory, and obviously saves a lot of space. In
practice, however, there are hardly any computer
platforms and applications which can handle
combining characters as easily as the precomposed
ones. For more information on how (and on which platforms)
combining Unicode characters work, see the paper
Unicode Polytonic Greek for the World Wide Web by
Patrick Rourke and the booklet
Word processing in Classical languages
by David J. Perry (available as PDF download).
In this proposal,
subranges 2 and 9 contain only precomposed
characters, while the other subranges contain
various base characters.
Proposed subranges
|