THE HELSINKI CORPUS OF ENGLISH TEXTS
CODING CONVENTIONS AND LISTS OF SOURCE TEXTS
Third Edition
Compiled by Merja Kytö
Department of English
University of Helsinki
Helsinki, 1996
Merja Kytö
Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts
Copyright: Department of English, University of Helsinki
Information on the Helsinki Corpus and the manual available from:
The HIT Centre/Humanities Information Technologies
Research Programme
Allégaten 27
N-5007 Bergen
Norway
E-mail: icame@hit.uib.no
URL: http://www.hit.uib.no/
Tel: +47 55 582954/55/56
Fax: +47 55 589470
The Oxford Text Archive
Oxford University Computing Services
13 Banbury Road
Oxford OX2 6NN
United Kingdom
E-mail: info@ota.ahds.ac.uk
URL: http://info.ox.ac.uk/~archive
Tel: +44 1865 273238
Fax: +44 1865 273275
Department of English
P.O. Box 4 (Yliopistonkatu 3)
FI-00014 University of Helsinki
Finland
E-mail: merja.kyto@helsinki.fi
merja.kyto@engelska.uu.se
Fax: +358 9 19123072
+46 18 4711229
Note:
For the Helsinki Corpus
* texts and file names, see Part One, Section 2
* distribution formats, see Part One, Section 4
For the use of the Helsinki Corpus
* and the Oxford Concordance Program, see Part One, Section 5.1.
* and the WordCruncher (4.1; 4.30), see Part One, Section 5.2.
For looking up the source texts
* according to authors and names of texts, see Part Two
* according to abbreviated titles, see Part Three
The Helsinki Corpus of English Texts: Diachronic and Dialectal is a computerized collection of extracts of continuous text. It is the result of a project commenced in 1984 and directed by Matti Rissanen and Ossi Ihalainen at the University of Helsinki. The Corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970's. The aim of the Corpus is to promote and facilitate the diachronic and dialectal study of English as well as to offer computerized material to those interested in the development and varieties of language. The material is intended for both mainframe and microcomputer use.
The aim of this guide is to help the users of the text files included in the diachronic part of the Corpus. It contains a key to coding conventions, and lists of source references and abbreviated titles to the extracts included. A volume by Rissanen et al. (1993, see Note 3) discusses the principles of compilation and offers a number of pilot studies illustrating the use of the material.
A brief introduction to the overall structure of the diachronic part will precede the list of source texts and the index of abbreviated titles. For the benefit of those interested in dialect material, a concise description of the dialectal part is included. The guide will give information on the character set and coding conventions used in the diachronic part, with practical advice on how to identify examples of data drawn from individual texts.
A number of the sections printed here accompany the Corpus as text files. These sections include Texts and File Names (Part One, Section 2), Character Set (Part One, Section 3.1.), Source Texts (Part Two) and Abbreviated Titles (Part Three). Owing to restrictions set by the lay-out format adopted and the character set used in most mainframe systems, other sections only appear printed in this guide.
The following researchers have participated in the team work resulting in the present version of the diachronic part of the Helsinki Corpus:
Leena Kahlas-Tarkka, Matti Kilpiö, Ilkka Mönkkönen, Aune Österman

Inkeri Blomstedt, Juha Hannula, Mailis Järviö, Leena Koskinen, Saara Nevanlinna, Tesma Outakoski, Päivi Pahta, Kirsti Peitsara, Irma Taavitsainen

Merja Kytö, Anneli Meurman-Solin, Terttu Nevalainen, Helena Raumolin-Brunberg, Ritva Tiusanen
The Old English section has profited especially from the work of Leena Kahlas-Tarkka and Matti Kilpiö, and the Middle English section from that of Saara Nevanlinna and Irma Taavitsainen. In compiling the Early Modern English section, Terttu Nevalainen and Helena Raumolin-Brunberg have been responsible for collecting the Southern British English texts, Anneli Meurman-Solin the Scottish texts, and Merja Kytö the American English texts.
The Old English section of the Corpus is based on material taken, with the kind permission of the Editors, from the machine-readable transcript (Release 1, October 1982) prepared for the Dictionary of Old English Project at the University of Toronto. We gratefully acknowledge the help offered by the Editors, Antonette diPaolo Healey and Ashley Crandell Amos, and the staff of the Dictionary.
The researchers who have contributed to building up the dialectal part of the Helsinki Corpus include Markku Filppula, Jussi Klemola, Anna-Liisa (Ojanen) Vasko, Kirsti Peitsara, Ossi Stigell and Irmeli Tammivaara-Balaam. A special mention should be given to Gunnel Melchers for permission to include her transcriptions of the Yorkshire material in the Corpus.1 The availability of the dialectal texts depends on the individual researchers, who have full copyright for the material (for permission to use the files, contact Matti Rissanen).
Matti Rissanen has supervised and coordinated the work on the diachronic part and Ossi Ihalainen on the dialectal part; Merja Kytö has acted as the project secretary coordinating the team work and devising the database arrangements.
The team of undergraduate students at the English Department who have made an invaluable contribution by keying in and proofreading the texts includes Kirsi Heikkonen, Jussi Klemola, Asta Kuusinen, Tuula Lehtonen, Tom Löfström, Arja Nurmi, Minna Palander, Tiina Selki and Päivi Öhman. We have also been helped in proofreading by Lisa Arnold, Päivi Käki and Deborah Ruuskanen.
We are most indebted to Norman Blake, Derek Pearsall, Martyn Wakelin, Roger Lass and Robert Stanton for helping us solve many problems concerning the compilation of the Corpus. A number of visiting scholars have generously given their time and expertise in proofreading the texts. They include Fran Colman, Christiane Dalton-Puffer, Anne Finell, Jonathan Hope, Hanna Mausch, Irmeli Valtonen, Susan Wright, and Brita Wårvik.
Invaluable help in more technical problems has been received from, among others, Stig Johansson, Knut Hofland, Lou Burnard, Daryl Gibb, Kimmo Koskenniemi, Visa Rauste, Hannu Hartikka, Teo Kirkinen, Leena Sadeniemi and Seppo Syrjänen.
We owe thanks to all those who have granted us permission to use the texts included in the Corpus. We are particularly happy to acknowledge the generosity of all authors of the editions, and especially the following persons, publishers and institutions:
Almqvist & Wiksell International, Stockholm, Sweden
American Philosophical Society, Philadelphia, Pa., U.S.A.
Edward Arnold, Kent, U.K.
A. Asher & Co. B.V., Amsterdam, The Netherlands
Cambridge University Press, Cambridge, U.K.
Centaur Press Ltd., Arundel, U.K.
The Chetham Society, Manchester, U.K.
Professor Peter Clemoes, Emmanuel College, Cambridge, U.K.
Columbia University Press, Irvington, Ny., U.S.A.
Constable Publishers, London, U.K.
J. M. Dent & Sons Ltd., London, U.K.
The Early English Text Society, Oxford, U.K.
Elsevier Science Publishers, Physical Sciences & Engineering Division, Amsterdam, The Netherlands
Garland Publishing Inc., New York, Ny., U.S.A.
H. Gianotten, Tilburg, The Netherlands
Hakluyt Society, London, U.K.
The Johns Hopkins University Press, Baltimore, Md., U.S.A.
Houghton Mifflin Company, Cambridge, Ma., U.S.A.
Indiana University Press, Bloomington, In., U.S.A.
Manchester University Press, Manchester, U.K.
The Modern Language Society, Helsinki, Finland
Norfolk Record Society, Norwich, U.K.
Oxford University Press, Oxford, U.K.
Professor Derek J. Price, Yale University, New Haven, Ct., U.S.A.
Princeton University Press, Princeton, Nj., U.S.A.
Random Century Group, London, U.K.
Royal Historical Society, London, U.K.
Scolar Press, Berkeley, Ca., U.S.A.
The Surtees Society, Durham, U.K.
Université de Liège, Liège, Belgium
The University of Michigan Press, Ann Arbor, Mi., U.S.A.
The University of Tennessee Press, Knoxville, Tn., U.S.A.
The University of Toronto Press, Toronto, Canada
Unwin Hyman Ltd., London, U.K.
The Wellcome Institute for the History of Medicine, London, U.K.
Carl Winter Universitätsverlag, Heidelberg, Germany
Yale University Press, New Haven, Ct., U.S.A.
A bibliography of the works published or in preparation is given in Appendix 2. The team of compilers wishes to be informed about published research based on the Corpus, and scholars using the Corpus are requested to report on studies published and in progress to Matti Rissanen, address: Department of English, P.O. Box 4 (Yliopistonkatu 3), FIN-00014 University of Helsinki, Finland.
In the second edition, Appendix 2 has been updated and Appendix 3 added. Bibliographical references and distribution information (formats, addresses etc.) given in Part I have, similarly, been brought up-to-date. A few minor corrections have been introduced in the text throughout.
In the third edition, Appendix 2 and the information given on distribution and other addresses have been updated.
CONTENTS
Preface
PART ONE
INTRODUCTION
1. Overall Structure
1.1. Diachronic Part
3.1. Character Set
Preliminary Remarks
1. General Format
(1a) Lines
(1b) Paragraphs
(1c) Word-boundaries and Punctuation
(1d) Hyphens
(1e) `Ash', `Eth', `Thorn' Etc.
(1f) Superscript (=)
(1g) Abbreviations (~)
(1h) Accents (`)
(1i) Material Excluded
2. `Text Level' Codes
Preliminary Remarks
(2a) `Font Other Than the Basic Font' (^...^)
(2b) `Foreign Language' (\...\)
(2c) `Runes' (}...})
(2d) `Emendation' [{...{] and `Editor's Comment' [\...\]
(2e) `Heading' [}...}]
(2f) `Our Comment' [^...^]
(2g) `Multiple Text Level Codes'
3.3.1. Preliminary Remarks
Preliminary Remarks
(1) <B = `Name of Text File'
(2) <Q = `Text Identifier'
(3) <N = `Name of Text'
(4) <A = `Author'
(5) <C = `Part of Corpus'
(6) <O = `Date of Original'
(7) <M = `Date of Manuscript'
(8) <K = `Contemporaneity'
(9) <D = `Dialect'
(10) <V = `Verse' or `Prose'
(11) <T = `Text Type'
(12) <G = `Relationship to Foreign Original'
(13) <F = `Foreign Original'
(14) <W = `Relationship to Spoken Language'
(15) <X = `Sex of Author'
(16) <Y = `Age of Author'
(17) <H = `Social Rank of Author'
(18) <U = `Audience Description'
(19) <E = `Participant Relationship'
(20) <J = `Interaction'
(21) <I = `Setting'
(22) <Z = `Prototypical Text Category'
(23) <S = `Sample'
(24) <P = `Page'
(25) <R = `Record'
4.1. Availability
5.1 Oxford Concordance Program
Notes to the Preface and Sections 1 to 5
PART TWO
SOURCE TEXTS
Introductory Note
PART THREE
ABBREVIATED TITLES
Introductory Note
Abbreviated Titles
Appendix 1.
Full List of `File Names' and `Text Identifiers'
Appendix 2.
Studies Using Evidence from the Diachronic Part of the Helsinki Corpus [OMITTED]
Appendix 3.
Lexa - Corpus Processing Software
PART ONE
INTRODUCTION
The principles of compilation2 of the Helsinki Corpus reflect the view that linguistic change should be approached through evidence based on synchronic variation inherent in the structure of the language studied.3 It follows that the researcher of variant expressions conveying one and (more or less) the same meaning welcomes access to a rich collection of texts, representative of various text types, levels of style and modes of expression, geographical and social varieties etc. The emergence of computer techniques has radically changed corpus-based study of language. Depending on the topic, hours of tedious work spent on the manual collecting of data can be cut down by using a computer for compiling and processing routines. The compilers of the Helsinki Corpus aim at facilitating access to versatile data for those interested in computer-based corpus linguistics and the structure and development of the English language.
The diachronic part of the Helsinki Corpus includes a basic selection of texts compiled from the Old, Middle and Early Modern (British) English periods, and a supplementary part focusing on regional varieties (Scots now available and early American English in preparation). Except for shorter texts given in toto, the length of the extracts varies from 2,000 to 10,000 words. At present the Old English section of the Corpus contains 413,300 words, the Middle English section 608,600 words and the British English section 551,000 words, a total of 1,572,800 words (the figures exclude passages in foreign languages, and our own and the editor's comments).4 In the supplementary part the Scots section contains 870,000 words and the early American English section 300,000 words. In the basic selection, the focus of this guide, the different sub-periods are represented as follows:
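Sub-period | Words | % |
OLD ENGLISH I -850 | 2 190 | 0.5 |
OLD ENGLISH II 850-950 | 92 050 | 22.3 |
OLD ENGLISH III 950-1050 | 251 630 | 60.9 |
OLD ENGLISH IV 1050-1150 | 67 380 | 16.3 |
OLD ENGLISH Total | 413 250 | 100.0 |
MIDDLE ENGLISH I 1150-1250 | 113 010 | 18.6 |
MIDDLE ENGLISH II 1250-1350 | 97 480 | 16.0 |
MIDDLE ENGLISH III 1350-1420 | 184 230 | 30.3 |
MIDDLE ENGLISH IV 1420-1500 | 213 850 | 35.1 |
MIDDLE ENGLISH Total | 608 570 | 100.0 |
EMODE, BRITISH I 1500-1570 | 190 160 | 34.5 |
EMODE, BRITISH II 1570-1640 | 189 800 | 34.5 |
EMODE, BRITISH III 1640-1710 | 171 040 | 31.0 |
EMODE, BRITISH Total | 551 000 | 100.0 |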
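The figures in the table can be cross-checked mechanically. The following is a minimal sketch only: the word counts are copied from the table, while the names `COUNTS`, `section_total` and `shares` are illustrative and not part of the Corpus or its software. The recomputed shares agree with the printed percentages to within 0.1 of a percentage point (the printed figures appear to have been adjusted so that each section sums to 100.0).

```python
# Word counts per sub-period, copied from the table of the basic selection.
COUNTS = {
    "OLD ENGLISH": {"I": 2190, "II": 92050, "III": 251630, "IV": 67380},
    "MIDDLE ENGLISH": {"I": 113010, "II": 97480, "III": 184230, "IV": 213850},
    "EMODE, BRITISH": {"I": 190160, "II": 189800, "III": 171040},
}


def section_total(section):
    """Sum the word counts of all sub-periods in one section."""
    return sum(COUNTS[section].values())


def shares(section):
    """Return each sub-period's share of its section, in per cent (one decimal)."""
    total = section_total(section)
    return {sub: round(100 * n / total, 1) for sub, n in COUNTS[section].items()}


if __name__ == "__main__":
    for section in COUNTS:
        print(section, section_total(section), shares(section))
    print("basic selection:", sum(section_total(s) for s in COUNTS))
```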
By and large, the selectional criteria adopted for including a text in the diachronic part of the Corpus reflect the principles of socio-historical variation analysis: the selection strives for a representative coverage of language written in a specific period. Periodization has been of primary importance (with regard to the date of the original text and the date of the manuscript), but attention has also been paid to geographical dialect, type and register of writing (text type, relationship to spoken language, setting on formal-informal axis) and sociolinguistic variation (different author-related parameters such as gender, age, social rank). For a more detailed discussion, cf. Rissanen et al. (1993, cf. Note 3). At the moment, no grammatical tagging is included, though texts are described by a system of parameter codings (cf. Section 3.3.).
As pointed out above, the Old English section of the Corpus is drawn from the text files of the Toronto Corpus.5 The material has been edited both by computer and manually to make it compatible, though not fully identical in all details, with the format adopted for the material keyed in at Helsinki. The main discrepancies between the Old English section and the Middle and Early Modern English sections of the Corpus are mentioned in the relevant parts of the guide.
A number of texts included in the Middle English section of the Corpus are based on computerized material obtained from the Oxford Text Archive. Both manual and automatic editing has been carried out to convert the material into the Helsinki Corpus format. The texts based on the Oxford Text Archive material include the extracts drawn from Katherine, Margarete, Juliane and Hali Meiðhad in the Katherine Group (MS Bodley), from Layamon's Brut (MS Caligula), Chaucer's Canterbury Tales, The Pricke of Conscience, and Cursor Mundi.
The texts keyed in at Helsinki are based on the best editions available. In most cases the use of original manuscripts has not been possible. All Middle and Early Modern English texts have been proofread twice and the Toronto-based Old English texts once. It is obvious, however, that a certain number of errors inevitably remains.
The material included in the dialectal part of the Helsinki Corpus has been transcribed orthographically by the postgraduate students who made the recordings in the 1970's. The material, based on interviews with elderly male and female natives of small rural villages, is unique in present-day dialectology, a field otherwise characterized by an interest in urban varieties. As the speakers may show great variation in some areas of grammar even within the range of the same local dialect, the aim has been to include a substantial sample of material from each individual speaker rather than take a great number of shorter passages from several speakers. The following geographical regions are covered at present (the material collected from the West Midlands of England will be included in the near future):
Region | Words |
EAST ANGLIA | 117,700 |
... | 254,000 |
... | 16,000 |
... | 18,900 |
Total | 406,600 |
The textual codes (in COCOA format, cf. Section 3.3.2.) introduce each sample of material, identifying the speaker and indicating, among other things, his or her age and occupation, the county, the village and the date of recording (<F = `file name'; <S = `speaker'; <G = `gender'; <A = `age'; <O = `occupation'; <C = `county'; <V = `village'; <D = `dialect'; <I = `interviewer'; <P = `page number').
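As an illustration of how such a header might be read by a program, the sketch below maps the code letters listed above to their glosses. It rests on two assumptions that are not confirmed in this section: that each reference occupies a line of its own, and that it has the COCOA shape `<X value>`. The sample lines and their values are invented placeholders, not Corpus data, and `parse_header` is a hypothetical helper.

```python
import re

# Code letters and glosses as listed for the dialectal part.
DIALECT_CODES = {
    "F": "file name", "S": "speaker", "G": "gender", "A": "age",
    "O": "occupation", "C": "county", "V": "village", "D": "dialect",
    "I": "interviewer", "P": "page number",
}

# Assumption: one reference per line, written as "<X value>".
COCOA_LINE = re.compile(r"^<([A-Z])\s+(.*?)>\s*$")


def parse_header(lines):
    """Collect the recognised reference codes from a sequence of lines."""
    header = {}
    for line in lines:
        match = COCOA_LINE.match(line)
        if match and match.group(1) in DIALECT_CODES:
            header[DIALECT_CODES[match.group(1)]] = match.group(2)
    return header


if __name__ == "__main__":
    # Invented placeholder values, for illustration only.
    sample = ["<F SAMPLE01>", "<S SPEAKER-X>", "<A 78>", "running text of the interview"]
    print(parse_header(sample))
```

Lines that do not match the assumed shape, such as the running text of an interview, are simply skipped.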
Tagging and parsing systems that will make it possible to add grammatical information to the data are being prepared.6 Experiments with the CLAWS1 tagging suite (originally devised for the LOB Corpus) have yielded promising results.7 Much interest lies in experiments aimed at linking the database with sound by way of an interface built on the Macintosh Hypercard and Mac-Recorder applications.
For the availability of the material included in the dialectal part, contact Matti Rissanen, cf. above.
The principles of organizing and coding the material included in the diachronic part will be discussed in the sections below.
The following short title list of texts and text file names presents the contents of the diachronic part of the Corpus in a condensed form. The list is organized to show how different texts and text files relate to sub-period and text type codings. The relevant word(s) locating an entry in Part Two of this guide are given in bold face (for the representation of non-ASCII characters, see below).
The list follows the order of sub-period sections. When dealing with texts which have different codes for the original and the manuscript versions, the date of the manuscript has been followed. Within the sub-periods, the texts are listed according to the text type code they have been ascribed.
The order of text types is, by and large, that given below. In most cases a file contains samples coded as representative of one text type only. Sometimes, however, it has been necessary to ascribe different text type codes to samples of text located in one and the same file. In the list below these files are indicated with asterisks (** = `the file contains several types of text'; * = `this extract can be found in the file with the same name marked by two asterisks').
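The asterisk convention just described lends itself to mechanical reading. The sketch below is one possible approach, not a tool distributed with the Corpus: it assumes that an entry ends in a parenthesised file name, optionally followed by one or two asterisks, and that file names consist of `CO`, `CM` or `CE` plus three to six letters or digits (a pattern observed in the short title list, not a documented rule). The names `ENTRY` and `parse_entry` are illustrative.

```python
import re

# Assumed shape of an entry: TITLE (FILENAME) with optional * or ** at the end.
ENTRY = re.compile(r"^(?P<title>.*\S)\s*\((?P<file>C[OME][A-Z0-9]{3,6})\)(?P<stars>\*{0,2})$")

MARKERS = {
    "": "file contains one text type",
    "**": "the file contains several types of text",
    "*": "this extract can be found in the file with the same name marked by two asterisks",
}


def parse_entry(line):
    """Split a list entry into (title, file name, marker gloss); None if it is not an entry."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    return match.group("title"), match.group("file"), MARKERS[match.group("stars")]


if __name__ == "__main__":
    for entry in (
        "PERI DIDAXEON (CMPERIDI)",
        "VESPASIAN HOMILIES, NO. III (cf. Homilies) (CMVESHOM)*",
        "VESPASIAN HOMILIES (cf. Philosophy) (CMVESHOM)**",
        "Homilies",
    ):
        print(parse_entry(entry))
```

Text type headings such as "Homilies" carry no file name and are returned as `None`, which makes it easy to separate headings from entries when walking through the list.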
TEXTS AND FILE NAMES:
PERIOD, TEXT TYPE AND TEXT (FILE NAME)
OLD ENGLISH
Documents
DOCUMENTS 1 (HARMER; ROBERTSON; BIRCH) (CODOCU1)
Text type undefined (OE verse)
CAEDMON'S HYMN; BEDE'S DEATH SONG; THE RUTHWELL CROSS; THE LEIDEN RIDDLE (CONORTHU)
Law
LAWS (ALFRED'S INTRODUCTION TO LAWS; ALFRED; INE) (COLAW2)
Documents
DOCUMENTS 2 (HARMER; ROBERTSON; SWEET-WHITELOCK) (CODOCU2)
Handbooks, medicine
LAECEBOC (COLAECE)
Philosophy
ALFRED'S BOETHIUS (COBOETH)
Religious treatises
ALFRED'S CURA PASTORALIS (COCURA)
Prefaces
ALFRED'S PREFACE TO CURA PASTORALIS (COPREFCP)
History
CHRONICLE MS A EARLY (COCHROA2)
BEDE'S ECCLESIASTICAL HISTORY (COBEDE)
OHTHERE AND WULFSTAN (MS L) (COOHTWU2)
ALFRED'S OROSIUS (COOROSIU)
Bible
THE VESPASIAN PSALTER (COVESPS)
Text type undefined (OE verse)
THE BATTLE OF BRUNANBURH (COBRUNAN)
Law
LAWS (ELEVENTH CENTURY) (COLAW3)
Documents
DOCUMENTS 3 (HARMER; ROBERTSON; WHITELOCK) (CODOCU3)
Handbooks, medicine
LACNUNGA (COLACNU)
QUADRUPEDIBUS (COQUADRU)
Science, astronomy
BYRHTFERTH'S MANUAL (COBYRHTF)
AELFRIC'S DE TEMPORIBUS ANNI (COTEMPO)
Homilies
WULFSTAN'S HOMILIES (O3) (COWULF3)
THE BLICKLING HOMILIES (COBLICK)
AELFRIC'S CATHOLIC HOMILIES (II) (COAELHOM)
AELFRIC'S HOMILIES (SUPPL. II)
Rules
THE BENEDICTINE RULE (COBENRUL)
THE DURHAM RITUAL (CODURHAM)
Religious treatises
AELFRIC'S FIRST AND SECOND LETTERS TO WULFSTAN; AELFRIC'S LETTER TO SIGEFYRTH (COAELET3)
Prefaces
AELFRIC'S PREFACE TO CATH. HOM. I; II; LIVES OF SAINTS; GRAMMAR (COAEPREF)
AELFRIC'S PREFACE TO GENESIS (COAEPREG)
History
CHRONICLE MS A LATE (O3) (COCHROA3)
OHTHERE AND WULFSTAN (MS G) (COOHTWU3)
Geography
MARVELS (COMARVEL)
Travelogue
ALEXANDER'S LETTER (COALEX)
Biography, lives
AELFRIC'S LIVES OF SAINTS (COAELIVE)
GREGORY THE GREAT, DIALOGUES (MS H) (COGREGD3)
MARTYROLOGY (COMARTYR)
Fiction
THE OLD ENGLISH APOLLONIUS OF TYRE (COAPOLLO)
Bible
THE OLD TESTAMENT (COOTEST)
THE PARIS PSALTER (COPARIPS)
WEST-SAXON GOSPELS (COWSGOSP)
LINDISFARNE GOSPELS (COLINDIS)
RUSHWORTH GOSPELS (CORUSHW)
Text type undefined (OE verse)
FATES OF APOSTLES; ELENE; JULIANA (COCYNEW)
GENESIS (COGENESI)
EXODUS (COEXODUS)
CHRIST (COCHRIST)
THE KENTISH HYMN, THE KENTISH PSALM (COKENTIS)
ANDREAS (COANDREA)
THE DREAM OF THE ROOD (CODREAM)
THE WANDERER; THE SEAFARER; WIDSITH; THE FORTUNES OF MEN; MAXIMS I; THE RIMING POEM; THE PANTHER; THE WHALE; THE PARTRIDGE; DEOR; WULF AND EADWACER; THE WIFE'S LAMENT (COEXETER)
BEOWULF (COBEOWUL)
RIDDLES (CORIDDLE)
THE METRICAL PSALMS OF THE PARIS PSALTER (COMETRPS)
PHOENIX (COPHOENI)
THE METERS OF BOETHIUS (COMETBOE)
Law
LAWS (LATE; WILLIAM) (COLAW4)
Documents
DOCUMENTS 4 (ROBERTSON; ROBERTSON, APPENDIX) (CODOCU4)
Handbooks, astronomy
PROGNOSTICATIONS (COPROGNO)
Philosophy
THE OLD ENGLISH DICTS OF CATO (CODICTS)
Homilies
WULFSTAN'S HOMILIES (O3/4) (COWULF4)
A HOMILY FOR THE SIXTH ... SUNDAY (COEPIHOM)
Rules
WULFSTAN'S `INSTITUTES OF POLITY' (COINSPOL)
Religious treatises
AELFRIC'S LETTER TO SIGEWEARD; WULFSIGE (COAELET4)
ADRIAN AND RITHEUS (COADRIAN)
SOLOMON AND SATURN (COSOLOMO)
AN OLD ENGLISH VISION OF LEOFRIC (COLEOFRI)
Prefaces
ALFRED'S PREFACE TO SOLILOQUIES (COPREFSO)
History
CHRONICLE MS E (O3/4); (O4) (COCHROE4)
Biography, lives
CHAD (COCHAD)
GREGORY THE GREAT, DIALOGUES (MS C) (COGREGD4)
A PASSION OF ST MARGARET (COMARGA)
MIDDLE ENGLISH
Handbooks, medicine
PERI DIDAXEON (CMPERIDI)
Philosophy
VESPASIAN HOMILIES, NO. III (cf. Homilies) (CMVESHOM)*
Homilies
ORM, THE ORMULUM (CMORM)
TRINITY HOMILIES (CMTRINIT)
VESPASIAN HOMILIES (cf. Philosophy) (CMVESHOM)**
BODLEY HOMILIES (CMBODLEY)
LAMBETH HOMILIES (CMLAMBET)
SAWLES WARDE (CMSAWLES)
Religious treatises
HISTORY OF THE HOLY ROOD-TREE (CMROOD)
ANCRENE WISSE (CMANCRE)
HALI MEIDHAD (CMHALI)
VICES AND VIRTUES (CMVICES1)
History
THE PETERBOROUGH CHRONICLE (CMPETERB)
LAYAMON (CMBRUT1)
Biography, lives
KATHERINE (CMKATHE)
MARGARETE (CMMARGA)
JULIANE (CMJULIA)
Documents
THE PROCLAMATION OF HENRY III (CMDOCU2)
Homilies
KENTISH SERMONS (CMKENTSE)
Religious treatises
DAN MICHEL, AYENBITE OF INWYT (CMAYENBI)
A BESTIARY (CMBESTIA)
History
ROBERT OF GLOUCESTER (CMROBGLO)
HISTORICAL POEMS (in MS Harley 2253) (CMPOEMH)
Biography, lives
THE LIFE OF ST. EDMUND (THE EARLY SOUTH-ENGLISH LEGENDARY) (CMSELEG)
Fiction
MAN IN THE MOON (CMMOON)
DAME SIRITH; INTERLUDE (CMSIRITH)
THE FOX AND WOLF IN THE WELL (CMFOXWO)
THE THRUSH AND THE NIGHTINGALE (CMTHRUSH)
Romances
THE ROMANCE OF SIR BEUES OF HAMTOUN (CMBEVIS)
KYNG ALISAUNDER (CMALISAU)
HAVELOK (CMHAVELO)
KING HORN (CMHORN)
Bible
THE EARLIEST COMPLETE ENGLISH PROSE PSALTER (CMEARLPS)
Text type undefined (ME verse)
SONG OF THE HUSBANDMAN; SATIRE ON THE CONSISTORY COURTS; SATIRE ON THE RETINUES (in MS Harley 2253) (CMPOEMS)
Documents
USK, APPEAL(S); PETITIONS (M3); RETURNS; JUDGEMENTS; TESTAMENTS AND WILLS; PROCLAMATIONS (CMDOCU3)
Handbooks, astronomy
CHAUCER, A TREATISE ON THE ASTROLABE (CMASTRO)
THE EQUATORIE OF THE PLANETIS (CMEQUATO)
Handbooks, medicine
A LATE MIDDLE ENGLISH TREATISE ON HORSES (CMHORSES)
Science, medicine
A LATIN TECHNICAL PHLEBOTOMY (CMPHLEBO)
Philosophy
CHAUCER, BOETHIUS (CMBOETH)
Idem, THE TALE OF MELIBEE (cf. Fiction) (CMCTPROS)*
Homilies
THE NORTHERN HOMILY CYCLE (THE EXPANDED VERSION) (CMNORHOM)
Sermons
ENGLISH WYCLIFFITE SERMONS (CMWYCSER)
Rules
THE BENEDICTINE RULE (CMBENRUL)
AELRED OF RIEVAULX'S DE INSTITUTIONE INCLUSARUM (MS VERNON) (CMAELR3)
Religious treatises
PURVEY, THE PROLOGUE TO THE BIBLE (CMPURVEY)
THE CLOUD OF UNKNOWING (CMCLOUD)
MANNYNG, ROBERT OF BRUNNE'S "HANDLYNG SYNNE" (CMHANSYN)
THE PRICKE OF CONSCIENCE (CMPRICK)
CHAUCER, THE PARSON'S TALE (CMCTPROS)*
History
CURSOR MUNDI (CMCURSOR)
THE BRUT OR THE CHRONICLES OF ENGLAND (CMBRUT3)
TREVISA, POLYCHRONICON (CMPOLYCH)
Travelogue
MANDEVILLE'S TRAVELS (CMMANDEV)
Fiction
CHAUCER, THE GENERAL PROLOGUE TO THE CANTERBURY TALES; THE WIFE OF BATH'S PROLOGUE; THE SUMMONER'S TALE; THE MERCHANT'S TALE (CMCTVERS)
CHAUCER, THE TALE OF MELIBEE (cf. Philosophy) (CMCTPROS)**
GOWER, CONFESSIO AMANTIS (CMGOWER)
Letters, non-private
HENRY V, LETTERS (AN ANTHOLOGY; A BOOK OF LONDON ENGLISH); LETTER(S), LONDON (CMOFFIC3)
Bible
THE OLD TESTAMENT (WYCLIFFE) (CMOTEST)
THE NEW TESTAMENT (WYCLIFFE) (CMNTEST)
Law
STATUTES (II) (CMLAW)
Documents
INDENTURE, PETITIONS (M4); SHILLINGFORD (DOCUMENT(S)) (cf. Proceedings, depositions) (CMDOCU4)**
Handbooks, medicine
THE `LIBER DE DIVERSIS MEDICINIS' IN THE THORNTON MS (CMTHORN)
Handbooks, other
REYNES, THE COMMONPLACE BOOK (CMREYNES)
METHAM, PHYSIOGNOMY (CMMETHAM)
Handbooks, astronomy
METHAM, DAYS OF THE MOON (CMMETHAM)
Science, medicine
THE CYRURGIE OF GUY DE CHAULIAC (CMCHAULI)
Sermons
MIDDLE ENGLISH SERMONS ... MS. ROYAL (CMROYAL)
CAPGRAVE, CAPGRAVE'S SERMON (CMCAPSER)
MIRK, MIRK'S FESTIAL (CMMIRK)
GAYTRYGE, DAN JON GAYTRYGE'S SERMON (CMGAYTRY)
IN DIE INNOCENCIUM (CMINNOCE)
FITZJAMES, SERMO DIE LUNE (CMFITZJA)
Rules
AELRED OF RIEVAULX'S DE INSTITUTIONE INCLUSARUM (MS BODLEY 423) (CMAELR4)
Religious treatises
THE BOOK OF VICES AND VIRTUES (CMVICES4)
KEMPE, THE BOOK OF MARGERY KEMPE (CMKEMPE)
JULIAN OF NORWICH, ... REVELATIONS OF DIVINE LOVE (CMJULNOR)
HILTON, ... EIGHT CHAPTERS ON PERFECTION (CMHILTON)
ROLLE, THE BEE AND THE STORK (CMROLLBE)
Idem, PROSE TREATISES (CMROLLTR)
Idem, THE PSALTER OR PSALMS OF DAVID (cf. Bible) (CMROLLPS)*
Prefaces
CAXTON, THE PROLOGUES AND EPILOGUES (CMCAXPRO)
Proceedings, depositions
DEPOSITIONS (cf. Documents) (CMDOCU4)*
History
CAPGRAVE, ... ABBREUIACION OF CRONICLES (CMCAPCHR)
GREGORY, THE HISTORICAL COLLECTIONS OF A CITIZEN OF LONDON (CMGREGOR)
Biography, lives
THE LIFE OF ST. EDMUND (MIDDLE ENGLISH RELIGIOUS PROSE) (CMEDMUND)
Fiction
CAXTON, THE HISTORY OF REYNARD THE FOX (CMREYNAR)
Romances
MALORY, MORTE DARTHUR (CMMALORY)
THE SIEGE OF JERUSALEM IN PROSE (CMSIEGE)
Drama, mystery plays
LUDUS COVENTRIAE (CMLUDUS)
MANKIND (CMMANKIN)
THE WAKEFIELD PAGEANTS IN THE TOWNELEY CYCLE (CMTOWNEL)
THE YORK PLAYS (CMYORK)
DIGBY PLAYS (CMDIGBY)
Letters, private
SHILLINGFORD (LETTERS); PASTON (CLEMENT; MARGARET; JOHN); MULL; STONOR; BETSON; CELY (GEORGE; RICHARD (THE YOUNGER)) (CMPRIV)
Letters, non-private
PASTON, WILLIAM (CMOFFIC4)
Bible
ROLLE, THE PSALTER OR PSALMS OF DAVID (cf. Religious treatises) (CMROLLPS)**
EARLY MODERN ENGLISH, BRITISH
Law
STATUTES (III) (CELAW1)
Handbooks, other
FITZHERBERT, THE BOOK OF HUSBANDRY (CEHAND1A)
TURNER, A NEW BOKE OF ... ALL WINES (CEHAND1B)
Science, medicine
VICARY, THE ANATOMIE OF THE BODIE OF MAN (CESCIE1A)
Science, other
RECORD, THE PATH-WAY ... OF GEOMETRIE (CESCIE1B)
Educational treatises
ELYOT, THE BOKE NAMED THE GOUERNOUR (CEEDUC1A)
ASCHAM, THE SCHOLEMASTER (CEEDUC1B)
Philosophy
COLVILLE, BOETHIUS (CEBOETH1)
Sermons
FISHER, SERMONS BY JOHN FISHER (CESERM1A)
LATIMER, SERMON ON THE PLOUGHERS; SEVEN SERMONS BEFORE EDWARD VI (CESERM1B)
Proceedings, trials
THE TRIAL OF SIR NICHOLAS THROCKMORTON (CETRI1)
History
MORE, THE HISTORY OF KING RICHARD III (CEHIST1A)
FABYAN, THE NEW CHRONICLES OF ENGLAND (CEHIST1B)
Travelogue
LELAND, THE ITINERARY OF JOHN LELAND (CETRAV1A)
TORKINGTON, YE OLDEST DIARIE (CETRAV1B)
Diaries
MACHYN, THE DIARY OF HENRY MACHYN (CEDIAR1A)
EDWARD VI, THE DIARY OF EDWARD VI (CEDIAR1B)
Biography, autobiography
MOWNTAYNE, THE AUTOBIOGRAPHY (CEAUTO1)
Biography, other
ROPER, WILLIAM, THE LYFE OF SIR THOMAS MOORE (CEBIO1)
Fiction
A HUNDRED MERY TALYS (CEFICT1A)
HARMAN, A CAVEAT ... FOR COMMEN CURSETORS (CEFICT1B)
Drama, comedies
UDALL, ROISTER DOISTER (CEPLAY1A)
STEVENSON (?), GAMMER GVRTONS NEDLE (CEPLAY1B)
Letters, private
BEAUMONT; PLUMPTON (AGNES; ISABEL; WILLIAM; DOROTHY; ROBERT); MORE (LETTER(S), THE CORRESPONDENCE); ROPER (MARGARET); CROMWELL (GREGORY); CUMBERLAND; SCROPE (CEPRIV1)
Letters, non-private
HOWARD; TUNSTALL; A LETTER BY THE LORDS; WOLSEY; HENRY VIII; BEDYLL; CROMWELL (THOMAS); MORE (LETTER(S), ORIGINAL LETTERS) (CEOFFIC1)
Bible
THE OLD TESTAMENT (TYNDALE) (CEOTEST1)
THE NEW TESTAMENT (TYNDALE) (CENTEST1)
Law
STATUTES (IV) (CELAW2)
Handbooks, other
GIFFORD, A DIALOGUE CONCERNING WITCHES (CEHAND2A)
MARKHAM, COUNTREY CONTENTMENTS (CEHAND2B)
Science, medicine
CLOWES, TREATISE FOR THE ARTIFICIALL CURE OF STRUMA (CESCIE2A)
Science, other
BLUNDEVILE, A BRIEFE DESCRIPTION OF THE TABLES ... LINES SECANT (CESCIE2B)
Educational treatises
BRINSLEY, LUDUS LITERARIUS OR THE GRAMMAR SCHOOLE (CEEDUC2A)
BACON, THE TWOO BOOKES ... ADVANCEMENT OF LEARNING (CEEDUC2B)
Philosophy
ELIZABETH I, BOETHIUS (CEBOETH2)
Sermons
HOOKER, TWO SERMONS UPON PART OF S. JUDES EPISTLE (CESERM2A)
SMITH, TWO SERMONS ON "OF USURIE" (CESERM2B)
Proceedings, trials
THE TRIAL OF THE EARL OF ESSEX (CETRI2A)
THE TRIAL OF SIR WALTER RALEIGH (CETRI2B)
History
STOW, THE CHRONICLES OF ENGLAND (CEHIST2A)
HAYWARD, ANNALS OF THE FIRST FOUR YEARS... (CEHIST2B)
Travelogue
TAYLOR (JOHN), THE PENNYLES PILGRIMAGE (CETRAV2A)
COVERTE, A TRVE AND ALMOST INCREDIBLE REPORT OF AN ENGLISHMAN (CETRAV2B)
Diaries
MADOX, AN ELIZABETHAN IN 1582: THE DIARY ... (CEDIAR2A)
HOBY, DIARY OF LADY MARGARET HOBY (CEDIAR2B)
Biography, autobiography
FORMAN, THE AUTOBIOGRAPHY (CEAUTO2)
Biography, other
PERROTT (?), THE HISTORY OF THAT MOST EMINENT STATESMAN, SIR JOHN PERROTT (CEBIO2)
Fiction
ARMIN, A NEST OF NINNIES (CEFICT2A)
DELONEY, JACK OF NEWBURY (CEFICT2B)
Drama, comedies
SHAKESPEARE, THE MERRY WIVES OF WINDSOR (CEPLAY2A)
MIDDLETON, A CHASTE MAID IN CHEAPSIDE (CEPLAY2B)
Letters, private
KNYVETT; HARLEY; PASTON (WILLIAM; KATHERINE); FERRAR (NICHOLAS; RICHARD); BARRINGTON (JOHN); MASHAM; BARRINGTON (THOMAS); EVERARD; PROUD; PETTIT; OXINDEN (RICHARD; KATHERINE); PEYTON; GAWDY (CEPRIV2)
Letters, non-private
CECIL (ROBERT); EDMONDES; ELIZABETH I (LETTERS); CECIL (WILLIAM); A LETTER BY THE FELLOWS OF TRINITY COLLEGE; CONWAY (CEOFFIC2)
Bible
THE OLD TESTAMENT (AUTHORIZED VERSION) (CEOTEST2)
THE NEW TESTAMENT (AUTHORIZED VERSION) (CENTEST2)
Law
STATUTES (VII) (CELAW3)
Handbooks, other
WALTON, THE COMPLEAT ANGLER (CEHAND3A)
LANGFORD, PLAIN AND FULL INSTRUCTIONS TO RAISE ALL SORTS OF FRUIT-TREES (CEHAND3B)
Science, other
HOOKE, MICROGRAPHIA (CESCIE3A)
BOYLE, ELECTRICITY & MAGNETISM (CESCIE3B)
Educational treatises
LOCKE, DIRECTIONS CONCERNING EDUCATION (CEEDUC3A)
HOOLE, A NEW DISCOVERY OF THE OLD ART OF TEACHING SCHOOLE (CEEDUC3B)
Philosophy
PRESTON, BOETHIUS (CEBOETH3)
Sermons
TILLOTSON, SERMONS (CESERM3A)
TAYLOR (JEREMY), THE MARRIAGE RING (CESERM3B)
Proceedings, trials
THE TRIAL OF TITUS OATES (CETRI3A)
THE TRIAL OF LADY ALICE LISLE (CETRI3B)
History
BURNET, ... HISTORY OF MY OWN TIME (CEHIST3A)
MILTON, THE HISTORY OF BRITAIN (CEHIST3B)
Travelogue
FIENNES, THE JOURNEYS OF CELIA FIENNES (CETRAV3A)
FRYER, A NEW ACCOUNT OF EAST INDIA (CETRAV3B)
Diaries
PEPYS, THE DIARY OF SAMUEL PEPYS (CEDIAR3A)
EVELYN, THE DIARY OF JOHN EVELYN (CEDIAR3B)
Biography, autobiography
FOX, THE JOURNAL OF GEORGE FOX (CEAUTO3)
Biography, other
BURNET, SOME PASSAGES OF THE LIFE AND DEATH OF ... EARL OF ROCHESTER (CEBIO3)
Fiction
PENNY MERRIMENTS (CEFICT3A)
BEHN, OROONOKO (CEFICT3B)
Drama, comedies
VANBRUGH, THE RELAPSE (CEPLAY3A)
FARQUHAR, THE BEAUX STRATAGEM (CEPLAY3B)
Letters, private
HADDOCK (RICHARD, SR; RICHARD, JR; NICHOLAS); STRYPE; OXINDEN (HENRY; ELIZABETH); HATTON (CHARLES; FRANCES; ALICE; ANNE; ELIZABETH); PINNEY (JANE; JOHN); HENRY (PHILIP) (CEPRIV3)
Letters, non-private
SOMERS; SPENCER; A LETTER BY THE PRIVY COUNCIL; CAPEL; CHARLES II; OSBORNE; AUNGIER; A LETTER BY THE COMMISSIONERS (CEOFFIC3)
A number of conventions have been introduced for coding special characters, typographical practices, editorial comments and so forth in the computerized version. The set of conventions used is introduced below.
Main Coding Key
The coding system is based on the set of ASCII codes (96 printable characters). In the list below the coding symbol is followed by the coded symbol.8 The character "=" stands for "represents". The use and functions of the codes are discussed in more detail in Section 3.2.
I CHARACTERS USED
A. Alphanumeric characters
The following characters represent themselves:
A = A
B = B
C = C
Etc.
a = a
b = b
c = c
0 = 0
1 = 1
2 = 2
B. Non-alphanumeric characters
B.1. The following characters represent themselves:
Dec.  Hex.  Char.  Description
34    22    "      quotation marks (double quote)
39    27    '      apostrophe (single quote)
40    28    (      opening parenthesis (cf. B.2.1., below)
41    29    )      closing parenthesis (cf. B.2.1., below)
45    2D    -      hyphen or minus (within or outside a word)
33    21    !      exclamation point
44    2C    ,      comma
46    2E    .      period or decimal point
47    2F    /      slash (cf. `reference codes', 3.3.2.; `punctuation', 3.2. (1c))
58    3A    :      colon
59    3B    ;      semicolon
63    3F    ?      question mark
Abbreviations: Dec. = decimal code; Hex. = hexadecimal code; Char. = character.
B.2. The following characters are used for coding purposes:
B.2.1. Characters used for coding `text levels':
Dec.  Hex.  Char.  Description
40    28    (      opening parenthesis (cf. B.1., above)
41    29    )      closing parenthesis (cf. B.1., above)
91    5B    [      opening bracket
93    5D    ]      closing bracket
123   7B    {      opening brace
125   7D    }      closing brace
92    5C    \      back slash
94    5E    ^      circumflex
`Text levels':
(^.....^) = `font other than the basic font'
(\.....\) = `foreign language'
(}.....}) = `runes'
[{.....{] = `emendation'
[\.....\] = `editor's comment'
[^.....^] = `our comment'
[}.....}] = `heading'
for `multiple text level' codes, see below
B.2.2. Characters other than those used for coding `text levels':
Dec.  Hex.  Char.  Description
38    26    &      ampersand (= `and', `ond', `et')
43    2B    +      plus (= `ash', `eth' etc.)
60    3C    <      less than (= `reference codes', cf. 3.3.2.)
62    3E    >      greater than (= `reference codes', cf. 3.3.2.)
61    3D    =      equals (= `superscript')
96    60    `      grave accent (= `accent')
126   7E    ~      tilde (= `abbreviation')
`Ash', `eth', `yogh', `thorn' etc.:
+A  = u.c. ash              +a  = l.c. ash
+D  = u.c. eth              +d  = l.c. eth
+G  = u.c. yogh             +g  = l.c. yogh
+T  = u.c. thorn            +t  = l.c. thorn
+TT = u.c. crossed thorn    +tt = l.c. crossed thorn
+Tt = u.c. crossed thorn
+L  = (£) pound sign        +e  = l.c. e caudata
For the Lexa Font Module, which enables the display of the original Old and Middle English characters, see Appendix 3.
II CHARACTERS NOT USED
The following characters are not used:9
Dec.  Hex.  Char.  Description
36    24    $      dollar sign
37    25    %      percent sign
42    2A    *      asterisk
64    40    @      commercial at
95    5F    _      underline
124   7C    |      vertical line
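The list of unused characters lends itself to a simple consistency check on corpus lines. The following is a minimal sketch, not part of the Corpus distribution: the function name `find_unused_chars` is hypothetical, and the treatment of characters above code 126 as errors rests on the assumption that the files are plain 7-bit ASCII.

```python
# Characters the manual lists under "II CHARACTERS NOT USED".
UNUSED_CHARS = set("$%*@_|")


def find_unused_chars(line: str) -> list[tuple[int, str]]:
    """Return (column, character) pairs for characters that should not occur.

    Columns are 1-based. Characters above code 126 are reported as well,
    on the assumption that corpus files contain 7-bit ASCII only.
    """
    return [
        (col, ch)
        for col, ch in enumerate(line, start=1)
        if ch in UNUSED_CHARS or ord(ch) > 126
    ]
```

A check of this kind would flag, for instance, Toronto Corpus symbols such as % and * that were left unconverted.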
Preliminary Remarks
When coding the material, the main aim has been to key in as much as possible of the source text as reliably and consistently as possible. When editorial interference has proved inevitable, the changes follow the format below, or are indicated by `our comments' in the computerized text. Obvious instances of ill-formed characters in the source text (e.g. a character produced upside down) have been silently corrected. In this section the following points will be taken up in more detail (for textual parameters, or `reference codes', cf. Section 3.3.):
1. GENERAL FORMAT
(1a) Lines
(1b) Paragraphs
(1c) Word-boundaries and punctuation
(1d) Hyphens
(1e) `Ash', `eth', `thorn' etc.
(1f) Superscript (=)
(1g) Abbreviations (~)
(1h) Accents (`)
(1i) Material excluded
2. `TEXT LEVEL' CODES
(2a) `Font other than the basic font' (^...^)
(2b) `Foreign language' (\...\)
(2c) `Runes' (}...})
(2d) `Emendation' [{...{] and `editor's comment' [\...\]
(2e) `Heading' [}...}]
(2f) `Our comment' [^...^]
(2g) `Multiple text level' codes
1. GENERAL FORMAT
(1a) Lines
In the Middle and Early Modern English sections of the Corpus the main line division of the source text has been preserved (i.e. a new line in the source text begins a new line in the computerized version). However, to leave some space for further coding (on an 80-character line), the maximum length of the line is no more than 65 characters in the computerized version: column 64 (when the left margin is set on 1) is reserved for a space, and column 65 for the `line continues' character (#). Lines longer than 65 characters in the source text have been continued by placing the `line continues' character in column 65 and keying in the remaining words on the next line.
In the Old English section of the Corpus the line divisions do not follow those of the source texts. Instead, the material based on the Toronto Corpus (with lines longer than 80 characters in the format delivered) has been converted into the Helsinki Corpus format by editing the line length accordingly.
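For the Middle and Early Modern English sections, the `line continues' convention can be undone mechanically when the original line division is wanted. The sketch below is illustrative only: `join_continued_lines` is a hypothetical helper, and it assumes that a continued line always ends in a space followed by #, as described above.

```python
def join_continued_lines(lines):
    """Merge physical lines that end in the `line continues' character (#).

    Assumes the marker is the last character of the line and is preceded
    by a space (columns 64 and 65 in the layout described in the manual).
    """
    logical = []
    pending = None  # text carried over from a line that ended in " #"
    for raw in lines:
        line = raw.rstrip()
        if pending is not None:
            line = pending + " " + line.lstrip()
            pending = None
        if line.endswith(" #"):
            pending = line[:-2].rstrip()
        else:
            logical.append(line)
    if pending is not None:  # input ended on a continued line
        logical.append(pending)
    return logical
```

Applied to the first two lines of Ex. 3 in Section 2, the function would return a single line ending in `fel in to an o+der senne,`.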
(1b) Paragraphs
A new paragraph begins on the fourth space (i.e. indented by three spaces). Strings of spaces in the original (marking a gap in the manuscript, or introducing a heading, for instance) have been ignored by moving the text to begin from the left margin. Paragraph-initial capitalized words (or capitalized words in comparable positions) have been standardized by capitalizing the first character only. Paragraph signs (¶) have been omitted (cf. item (1i)).
(1c) Word-boundaries and punctuation
A space (or, inadvertently, two or more spaces) indicates a typographic word-boundary in the source text. Extra spaces between words in the original have been omitted. As the main aim has been to prepare the word form for subsequent indexing and concordance work, it has sometimes been necessary to separate two words typed as one word in the source text or, conversely, join parts of words separated in the original, e.g.:
Ex. 1.
Source text:
- - Ah nis nawtbi þeos iseid. þ ha forrotieð þrin.' 3ef ha hare wed
lac lahe-liche haldeð. Ah þe ilke sari wrecches þe
iþe fule wurðinge. vnwedde waleweð.' beoð þe deof
les eaueres. þ rit ham & spureð ham to don al þ he
wule. þeos walewið iwurdinge & forrotieð þrin.
(Hali Meiðhad, in The Katherine Group. Edited from
Ms. Bodley 34, ed. S. T. R. O. d'Ardenne, Paris, 1977, p. 137).
Helsinki Corpus:
- - Ah nis nawtbi +teos iseid. +tt ha forrotie+d +trin; +gef ha hare wedlac
lahe-liche halde+d. Ah +te ilke sari wrecches +te
i +te fule wur+dinge. vnwedde walewe+d, beo+d +te deofles
eaueres. +tt rit ham & spure+d ham to don al +tt he
wule. +teos walewi+d i wurdinge & forrotie+d +trin.
The punctuation of the source text has been retained as far as possible. The main punctuation marks are as follows:
! = exclamation point
, = comma
. = period or decimal point
/ = slash
: = colon
; = semicolon
? = question mark
- = hyphen or minus (also within the word-boundaries, cf. (1d), below)
Line-initial periods (.) are preceded by a space. The slash and double slash (//) may occur as clause separators or in fractions (`1/2' for `½', etc.). For the sentential use of hyphens, cf. (1d), below; for punctuation and `text level' codes, cf. 2., Preliminary Remarks, below.
(1d) Hyphens
Used within words
Line-end hyphens in the source text, when regarded as additional to the normal spelling of the word, have been deleted. If space allows, the rest of the word is moved to the same line as the beginning of the word; if there is no space on the line or if the continuation of the word is the only item on the line, the whole word is moved to the next line (and the `line continues' character is placed at the end of the line). Hyphens have been preserved when deemed essential to the form of the word (e.g. in compounds); where doubt remains, the line-end hyphen has been retained or deleted according to the judgment of the coder.
Used outside words
Hyphens or dashes used outside the word-boundaries in the source text have been keyed in as hyphens, preceded and followed by a space (this type of hyphen may occur in a line-end position). Cf., for instance:
Ex. 2.
& tuss he toc forr+trihht anan
To m+alenn wi+t+t +te Laferrd;
Ma+g+gstre, - we witenn sikerrli+g
+Tatt tu +turrh Godess wille
& all o Godess hallfe arrt sennd
Larfaderr her to manne;
(The Ormulum (I-II), ed. R. Holt, Oxford, 1878, p. II, 225).
(1e) `Ash', `eth', `thorn' etc.
A set of compound characters has been introduced to key in the u.c. and l.c. `ash' (+A / +a), `eth' (+D / +d), `yogh' (+G / +g), `thorn' (+T / +t), `crossed thorn' (+Tt, +TT / +tt), and `e caudata' (+e) into the computerized version. `Yogh' does not occur in the Old English section of the Corpus, as this character is replaced by the letters g and G in the Toronto Corpus. The only compound characters to occur in the Early Modern English section of the Corpus are u.c. +A for `ash' and +L for the `pound sterling sign'.
For the Lexa Font Module, which enables the display of the original Old and Middle English characters, see Appendix 3.
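Where the original letter shapes are wanted on a modern system, the compound characters can be mapped to Unicode. The following is a sketch under stated assumptions, not a description of the Lexa Font Module: the choice of code points (in particular U+A764/U+A765 for `crossed thorn' and U+0119 for `e caudata') and the name `decode_compounds` are the editor's own. Longest match is tried first, so `+tt` is always read as `crossed thorn'; a thorn followed by the letter t would be misread, an ambiguity inherent in the coding.

```python
import re

# Compound character -> Unicode (code point choices are assumptions).
COMPOUNDS = {
    "+A": "\u00C6",   # u.c. ash
    "+a": "\u00E6",   # l.c. ash
    "+D": "\u00D0",   # u.c. eth
    "+d": "\u00F0",   # l.c. eth
    "+G": "\u021C",   # u.c. yogh
    "+g": "\u021D",   # l.c. yogh
    "+T": "\u00DE",   # u.c. thorn
    "+t": "\u00FE",   # l.c. thorn
    "+TT": "\uA764",  # u.c. crossed thorn
    "+Tt": "\uA764",  # u.c. crossed thorn
    "+tt": "\uA765",  # l.c. crossed thorn
    "+e": "\u0119",   # l.c. e caudata
    "+L": "\u00A3",   # pound sign
}

# Three-character codes are listed first so that they win over +T / +t.
COMPOUND_RE = re.compile(r"\+(?:TT|Tt|tt|[AaDdGgTteL])")


def decode_compounds(text: str) -> str:
    """Replace every compound character in `text` by its Unicode letter."""
    return COMPOUND_RE.sub(lambda m: COMPOUNDS[m.group()], text)
```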
(1f) Superscript (=)
Owing to possible syncretisms (e.g., or / oʳ for `or' / `our'), letters printed in superscript in the source text have been indicated as in, e.g., y=t= for yᵗ, yo=r= for yoʳ, Ma=tie= for Maᵗⁱᵉ, int=r=cess=rs= for intʳcessʳˢ, =xx=llll for ˣˣllll.
(1g) Abbreviations (~)
Abbreviations indicated by a tilde or dash above the letter(s), by a letter with a flourish, or (in rare cases) by an apostrophe in the source text, have been coded with the letter followed by a tilde (~), as in, e.g., fro~ for fro, p~voked for pvoked, co~mau~dyd for comaudyd, s=r~=prised for sr'prised, Cobhm~ for Cobhm, Cobh~m for Cobh'm etc.
(1h) Accents (`)
Accents of various types, when functional in the spelling of the word, have been coded by placing the grave accent (`) after the accented letter as in, e.g., cite` for cité, charite` for charité, me`me for même, tho` for thô etc. Other accents have been ignored, as in, e.g., swá, þám, pút etc.
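For rough word lists it may be convenient to strip the superscript, abbreviation and accent codes described in (1f)-(1h). The sketch below is only one possible normalisation; the name `index_form` is hypothetical, and discarding the codes loses exactly the distinctions (e.g. `or' / `our') the coding was designed to keep.

```python
import re

# A pair of equals signs encloses letters printed in superscript.
SUPERSCRIPT_RE = re.compile(r"=([^=\s]*)=")


def index_form(word: str) -> str:
    """Return `word` without superscript (=), abbreviation (~) and accent (`) codes."""
    word = SUPERSCRIPT_RE.sub(r"\1", word)  # y=t= -> yt
    return word.replace("~", "").replace("`", "")
```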
(1i) Material excluded
- Extra-textual material in the margins or in the text (titles, tables, diagrams, pictures, signs of the zodiac etc.) has been excluded when not relevant to the main line of the running text; explanatory comments on omissions have been added when deemed necessary.
- Paragraph signs (¶) have been omitted.
- Folio numbers have been omitted, as well as the characters (e.g. / and |) frequently used for introducing a new folio.
- Lists of names and longer extracts of foreign language, or verse in a prose text have been omitted.
2. `TEXT LEVEL' CODES
Preliminary Remarks
As editorial and typographical conventions vary in different source texts (e.g. emendations can be indicated by italics, parentheses, brackets etc.), a number of `text level' codes have been used to transfer the function of the convention to the computerized version, irrespective of the particular format followed in the source text.
The opening and closing parentheses or brackets are keyed in next to the following or preceding word. `Emendation' codes may occur within the word (e.g. r+a[{d{]e[{+d{] for ræ[d]e[ð]; w[{ha{]-swa for w[ha]-swa), but otherwise a space has been keyed in to separate the character (punctuation marks included) following the closing parenthesis or bracket, e.g.:
Ex. 3.
+Durh +dessere senne ic, un+gesali saule, fel in to an #
o+der senne,
+de is icleped (\propria voluntas\) , +tat is, au+gen-wille. #
(Vices and Virtues, Part I, ed. F. Holthausen (E.E.T.S., O.S. 89), London, 1888, p. 13).
Ex. 4.
- - So as saynt Paule maketh many hedes sayenge.
(\Caput mulieris vir. caput viri christus. christi
vero deus.\) Se here be thre heedes vnto a woman. god,
chryst, & hyr husbande. - -10
(The English Works of John Fisher, Bishop of Rochester,
Part I, ed. J. E. B. Mayor (E.E.T.S., E.S. 27), London, 1935
(1876), p. 321).
(2a) `Font other than the basic font' (^.....^)
In the Middle and Early Modern English sections of the Corpus typographical shifts (italics, bold face, gothic etc.) in the source text have been distinguished from the basic font. If the basic font has been italics (bold face, gothic etc.), fonts other than the basic font have been coded apart. Font changes coinciding with `foreign language' have not been coded, except in the instances of stage directions and names of characters found in drama, see (2g), A.1., p. 36. For italicized emendations and expansions, which occur repeatedly through the text, see (2d), p. 31 ff.
(2b) `Foreign language' (\.....\)
Coding language other than English separately from the running text has been deemed useful, as coded material can be excluded from subsequent word indexes, word lists and concordances. An effort has been made to distinguish between English and foreign-language material, but a certain amount of inconsistency has been unavoidable. Instances of loan words (e.g. technical and professional terms) have occasionally been somewhat problematic, particularly in pioneering texts dating from the periods of great influx of foreign elements into English.
Ex. 5a.
On +tam geare +te man h+at (\solarem\) on Lyden beo+d +treo
hund daga & fif & syxtig daga, & syx tida.
<R 64.9>
+ta synd on Lyden (\quadrantes\) genemned.
(Byrhtferth's Manual (A. D. 1011), I, ed. S. J. Crawford
(E.E.T.S., 177), London, 1966 (1929), p. 64).
Ex. 5b.
C.^) and (^B. D.^) & passing
both through
the Centre (^E.^) is diuided
into fower
Quadrantes or
quarters, the upper
Quadrante
whereof on the left hand is marked with the letters #
(^A. B. E.^) in
which Quadrant, the right perpendicular line marked with
the
letters F. H. betokeneth the right Sine of the giuen Arke #
(^A. F.^)
(Blundevile, A Briefe Description of the Tables etc.,
London, 1597, p. 49V).
Latin ligature characters have been replaced by combined characters in English (AE for Æ, ae for æ, OE for Œ, oe for œ, etc.). Material printed in other than the Roman alphabet (Greek, for the most part) has been omitted with a note given in `our comment'; for `accents', cf. (1h), above.
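The exclusion of `foreign language' material from word lists, mentioned at the beginning of (2b), can be sketched as follows. This is an illustration only: `drop_foreign` is a hypothetical helper, and it assumes that `foreign language' spans are never nested inside one another, so that the first closing \) after an opening (\ ends the span.

```python
import re

# (\ ... \) spans, possibly running over several lines.
FOREIGN_RE = re.compile(r"\(\\.*?\\\)", re.DOTALL)
# Runs of blanks left behind inside a line (paragraph indents are kept).
GAP_RE = re.compile(r"(?<=\S)[ \t]{2,}")


def drop_foreign(text: str) -> str:
    """Remove all `foreign language' spans from `text`."""
    return GAP_RE.sub(" ", FOREIGN_RE.sub("", text))
```

For the first line of Ex. 5a this would yield `On +tam geare +te man h+at on Lyden beo+d +treo`.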
(2c) `Runes' (}.....})
In the Old English section of the Corpus runes are distinguished from the nonrunic basic text (the conversion is based on the coding included in the Toronto Corpus).
(2d) `Emendation' [{.....{] and `editor's comment'[\.....\]
If italicized emendations and expansions occur frequently and repeatedly through the text, they have been left uncoded as in e.g. him, from, drihten, spæcon etc. When emendations indicated by italics are coded, the emendation code covers the whole word or words, even when only part of the word is italicized in the source text.
In the material drawn from the Toronto Corpus, instances of `emendation' (checked and corrected against manuscript readings by the Editors) are indicated with a percent sign (%) placed after the word (see examples given below). This sign has been replaced by the `emendation' signs used for the other sections of the Corpus. Whether to include several successive emended words within one `emendation' code or leave the Toronto Corpus single-word format as such has been up to the coder. Examples (line divisions in the material quoted from the Toronto Corpus have been edited to follow the Helsinki Corpus format):
Ex. 6.
Source text:
- - ado in hluttor eala;
beren[d]2 / & 3e3nid feowerti3 /
lybcorna [&]3 ado þonne
in[to]4
ðæm wyrtum; læt standan þreo niht; syle
drincan ær uhton
lytelne scænc fulne þæt se drænc sy
ðe ær 3eleored.
2 berend C; beren MS. L.
3 &. The MS. has, in error, a crossed l; expanded to
oððe L; omitted in text C.
4 into: in MS. CL.
(Lacnunga in Anglo-Saxon Magic and Medicine,
Illustrated Specially from the Semi-Pagan Text `Lacnunga',
ed. J. H. G. Grattan and C. Singer, London, 1972 (1952), p.
118).
Toronto Corpus:
ado in hluttor eala, berend% & gegnid feowertig
lybcorna
&% ado *onne into% =#m wyrtum, l#t standan
*reo
niht, syle drincan #r uhton lytelne sc#nc fulne *#t se
dr#nc sy =e #r geleored.
(Character key: % = `emendation'; * = l.c.
`thorn';
= = l.c. `eth'; # = l.c. `ash')
Helsinki Corpus:
ado in hluttor eala, [{berend{] & gegnid
feowertig lybcorna
[{&{] ado +tonne [{into{] +d+am
wyrtum, l+at standan +treo
niht, syle drincan +ar uhton lytelne sc+anc fulne +t+at se
dr+anc sy +de +ar geleored.
Ex. 7.
Source text:
- - seo is ealra2
duna mæst & hyhst.
[Þær syndon gedefelice menn þa habbað
him]3
to cynedome þone re[a
dan4 sæ & to anwalde - -
- -
hyda hy habbað him to hrægle gedon [þa
syndan]3 hundic
gean6 swiðast nemde7 .
& [fore hundum]3 tigras &
leon8
- -
2 K: ealra.
3 Bracketed words supplied from T.
4 K: re[a]dan. H: readan.
6 K: hunticgean. T: huntigystran. Lat. text: venatrices.
7 K, C, H: nemde.
8 C: leon[es].
(`Wonders of the East' in Three Old English Prose
Texts in Ms. Cotton Vitellius A XV, ed. S. Rypins (E.E.T.S.
161), New York, 1971 (1924), p. 64).
Toronto Corpus:
Seo is ealra duna m#st & hyhst.
<R 25.6>
*#r% syndon% gedefelice% menn% *a% habba=% him%
to cynedome *one readan s# & to anwalde. - -
- -
<R 26.1>
Ymb *as stowe beo= wif acenned, *a habba= beardas swa
side o= hyra breost, & horses hyda hy habba=
him to hr#gle gedon.
<R 26.3>
*a% syndan% hundic gean% swi=ast nemde, &
fore% hundum% tigras & leon
Helsinki Corpus:
Seo is ealra duna m+ast & hyhst.
<R 25.6>
[{+t+ar syndon gedefelice menn +ta habba+d
him{]
to cynedome +tone readan s+a & to anwalde. - -
- -
<R 26.1>
Ymb +tas stowe beo+d wif acenned, +ta habba+d beardas swa
side o+d hyra breost, & horses hyda hy habba+d
him to hr+agle gedon.
<R 26.3>
[{+ta{] [{syndan{]
hundic[{gean{] swi+dast nemde, &
[{fore{] [{hundum{] tigras & leon
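The word-by-word part of the conversion illustrated in Ex. 6 and Ex. 7 can be sketched as below. The sketch is not the procedure actually used by the compilers: `convert_word` and `convert_line` are hypothetical names, only the three lower-case symbols given in the character key are handled, and each emended word receives its own `emendation' code (merging successive emended words, or joining word parts as in hundic[{gean{], was left to the coder).

```python
# Toronto Corpus symbols from the character key in Ex. 6.
TORONTO_TO_HELSINKI = {"*": "+t", "=": "+d", "#": "+a"}


def convert_word(word: str) -> str:
    """Convert one Toronto Corpus word to Helsinki Corpus coding."""
    emended = word.endswith("%")
    core = word[:-1] if emended else word
    core = "".join(TORONTO_TO_HELSINKI.get(ch, ch) for ch in core)
    return "[{" + core + "{]" if emended else core


def convert_line(line: str) -> str:
    """Convert a whole line, treating blanks as word boundaries."""
    return " ".join(convert_word(w) for w in line.split())
```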
In the Middle and Early Modern English sections of the Corpus the use of the `emendation' code closely reflects the practice followed by the editor in the source text: in the computerized version the `emendation' brackets cover only the strings of characters (or words) as indicated in the source text. Compare:
Ex. 8.
Source text:
The .vij. day ys fortunat to begynne alle werkys vp-on;
that persone [that ys born]
that day schuld be dysposyd to
be sotel off wytt and dyuerse off
condycionnys and chongabyl,
and dysposyd to lyfe longe; and yff a body falle in-to
seke-
nes that day, he schuld sone
r[e]couer; and [qwat
that]
a man dremyth schuld turne to trwthe with-in half a
yere;
(Days of the Moon, in The Works of John Metham Including
the Romance of Amoryus and Cleopes, ed. H. Craig (E.E.T.S.,
O.S. 132), London, 1916, p. 150).
Helsinki Corpus:
The .vij. day ys fortunat to begynne alle werkys vp-on;
that persone [{that ys born{] that day schuld be
dysposyd to
be sotel off wytt and dyuerse off condycionnys and chongabyl,
and dysposyd to lyfe longe; and yff a body falle in-to sekenes
that day, he schuld sone r[{e{]couer; and
[{qwat that{]
a man dremyth schuld turne to trwthe with-in half a yere;
(2e) `Heading' [}.....}]
Text coded as `heading' is always in upper case. See, also, `multiple text level' codes (2g), below.
(2f) `Our comment' [^.....^]
(2g) `Multiple text level' codes
When two or three types of `text level' codes coincide, the codes are embedded as follows:
A. Combinations of two `text level' codes:
1. (^... (\...\) ...^)
2. (^... [{...{] ...^)
3. (\... [{...{] ...\)
4. (\... [\...\] ...\)
5. (\... [^...^] ...\)
6. [{... (^...^) ...{]
7. [{... (\...\) ...{]
8. [\... (\...\) ...\]
9. [^... [{...{] ...^]
10. [}... (^...^) ...}]
11. [}... (\...\) ...}]
12. [}... [{...{] ...}]
13. [}... [\...\] ...}]
14. [}... [^...^] ...}]
Examples:
1. `font other than the basic font' and `foreign language' (in Middle and Early Modern English drama texts only):
(^ (\Dixit angelus:\) ^)
(^ (\Angelus.\) ^)
O Joseph, ryse vp, and loke thu tary nought!
Take Mary with the, and into Egipt flee!
(The Late Medieval Religious Plays of Bodleian MSS Digby 133 and E Museo 160, ed. D. C. Baker, J. L. Murphy, and L. B. Hall, Jr. (E.E.T.S. 283), Oxford, 1982, p. 104).
2. `font other than the basic font' and `emendation'
(^Dwelle we not in ofte etyngis and
drunkenesse[{s{] ^) +tat #
sue+t aftur. - -
(English Wycliffite Sermons, I, ed. A. Hudson, Oxford,
1983, p. 478).
3. `foreign language' and `emendation'
And +tan till his men +tus he said,
(\Cecus autem si ceco ducatum prestet,
ambo in fou[{e{]am cadunt.\)
(The Northern Homily Cycle, Part II, ed. S. Nevanlinna
(Société Néophilologique de Helsinki, 41),
Helsinki, 1973, p. 73).
4. `foreign language' and `editor's comment'
EXPANDED\] \)
(An Anthology of Chancery English, eds. J. H. Fisher, M.
Richardson and J. L. Fisher, Knoxville, 1984, p. 197).
5. `foreign language' and `our comment'
to minde what (^Cicero^) saide, when hee gaue generall
thanks. (\Difficile [^EDITION:
difffcile^] non aliquem; #
ingratum quenquam praeterire:\)
(Francis Bacon, The Twoo Bookes of the Proficience and
Advancement of Learning (1605) (English Experience, 218),
Amsterdam and New York, 1970, p. 3R).
6. `emendation' and `italics'
20 Our Viccar preached on 11: (^Heb:^) 7: In the afternoone
#
our
Curate on [{(II. 1. (^Cor:^)
24){] :
(The Diary of John Evelyn, ed. E. S. de Beer, London,
New York and Toronto, 1959, p. 928).
7. `emendation' and `foreign language'
thynges I wold do. First I wold shewe that the instruccyons
of this holy gospell perteyneth to the vniuersal
chirche of chryst. Secondly that the heed of the
vnyuersall chirche [{ (\iure diuino\) {]
is the pope. - -
(The English Works of John Fisher, Bishop of Rochester,
Part I, ed. J. E. B. Mayor (E.E.T.S., E.S. 27), London, 1935
(1876), p. 314).
8. `editor's comment' and `foreign language'
(The Wakefield Pageants in the Towneley Cycle, ed. A. C.
Cawley, Manchester, 1958, p. 24).
9. `our comment' and `emendation'
INDICATED BY SQUARE BRACKETS IN THE EDITION
ARE SURROUNDED BY ROUND BRACKETS AND CODED
AS `EMENDATIONS' IN THE VERSION BELOW:
[{(.....){] ^]
(The Diary of John Evelyn, as above).
10. `heading' and `italics'
(Blundevile, A Briefe Description of the Tables etc.,
London, 1597, p. 155R).
11. `heading' and `foreign language'
(Ælfric's De Temporibus Anni, ed. H. Henel
(E.E.T.S. 213), London, 1942, p. 18).
12. `heading' and `emendation'
- -
HIE NE WINNA+D WI+D [{+D{]ONE GODCUNDAN
DOM.}]
(King Alfred's West-Saxon Version of Gregory's Pastoral
Care, Parts I-II, ed. H. Sweet (E.E.T.S., O.S. 45, 50),
London, 1958 (1871), p. 47).
13. `heading' and `editor's comment'
[} [\SIR THOMAS MORE TO CARDINAL
WOLSEY.\] }]
(Original Letters, Illustrative of English History;
Including Numerous Royal Letters, Third Series, I,
ed. H. Ellis, London, 1846, p. 203).
14. `heading' and `our comment'
(The Camden Miscellany, Volume the Eighth: Containing ...
Correspondence of the Family of Haddock, 1657-1719, ed. E.
M. Thompson (Camden Society, N.S. XXXI), London, 1965 (1883),
p. 44).
B. Combinations of three `text level' codes:
1. [}... (\... [{...{] ...\) ...}]
2. [}... [\... (\...\) ...\] ...}]
Examples:
1. `heading', `foreign language' and `emendation'
[} (\DOMINICA QUARTA [{POST FESTUM
TRINITATIS.
EVANGELIUM.{] SERMO 4.\) }]
(English Wycliffite Sermons, as above, p. 236).
2. `heading', `editor's comment' and `foreign language'
[} [\ (\DE DIE.\) \] }]
(Ælfric's De Temporibus Anni, as above, p.
2).
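Because the `text level' codes are always embedded and never overlap, their nesting can be checked with a simple stack. The following sketch is illustrative: `level_paths` is a hypothetical name, and the tokenizer assumes that the two-character sequences listed below never occur in running text with any other meaning.

```python
import re

OPENERS = {
    "(^": "font", "(\\": "foreign language", "(}": "runes",
    "[{": "emendation", "[\\": "editor's comment",
    "[^": "our comment", "[}": "heading",
}
CLOSERS = {
    "^)": "font", "\\)": "foreign language", "})": "runes",
    "{]": "emendation", "\\]": "editor's comment",
    "^]": "our comment", "}]": "heading",
}
TOKEN_RE = re.compile("|".join(re.escape(t) for t in [*OPENERS, *CLOSERS]))


def level_paths(text: str) -> list[tuple[str, ...]]:
    """Return the stack of open text levels at each opening code.

    Raises ValueError if a closing code does not match the innermost
    open level or if a level is left unclosed.
    """
    stack: list[str] = []
    paths: list[tuple[str, ...]] = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if token in OPENERS:
            stack.append(OPENERS[token])
            paths.append(tuple(stack))
        elif not stack or stack[-1] != CLOSERS[token]:
            raise ValueError(f"unexpected {token!r} at offset {match.start()}")
        else:
            stack.pop()
    if stack:
        raise ValueError(f"unclosed text level(s): {stack}")
    return paths
```

For the heading `[} [\ (\DE DIE.\) \] }]` the function reports three successively deeper levels: `heading', `editor's comment' and `foreign language'.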
Each text or group of comparable texts is introduced by a set of textual parameters consisting of 24 reference codes intended to help identify and describe the text (two additional codes appear in the Old English section of the Corpus). The codes also make it possible to execute computer searches through the material selectively, focusing only on those sections of the Corpus that fulfil a defined set of criteria.
The main principles for defining the values will be explained in Rissanen et al. (1993, cf. Note 3). Decisions on the code values are, of course, based on earlier scholarship and, in the last resort, the subjective views of the compilers. Misjudgments and errors in debatable matters of this kind are inevitable. We are most grateful for corrections and suggestions by the users of the Corpus.
The reference codes follow the COCOA format (cf. Oxford Concordance Program and Micro-OCP, Section 5.1.). The format can be easily converted to suit the format required by other concordance programs (cf. e.g. WordCruncher, Section 5.2.).11
The value of each reference code introduced by an angular bracket is valid until it is replaced by a new value (for details, cf. Hockey and Marriott 1984: 16-19; Hockey and Martin 1988: 17-19, cited in Note 11). In the majority of cases, if the value of one reference category (other than <S for `sample', <P for `page' or <R for `record') changes in the middle of a text, the whole set of references is repeated.
The maximum length for the value ascribed to a reference code is 20 characters (allowed by the COCOA format of the OCP, version 1.0., and the first level reference code in the WordCruncher, versions 4.1 and 4.30). The characters used for the reference code are in the upper case. Alternative or double values are indicated by a slash (/), cf. 3.1. The character X given as a value reads `irrelevant' or `not known'. Expansions for the abbreviations used will be given in Section 3.3.4., following an extract of a corpus text given as an example below.
The beginning of a text file typically looks as follows:
<B CMPOLYCH>
<Q ME3 NN HIST TREVISA>
<N POLYCHRONICON>
<A TREVISA JOHN>
<C ME3>
<O 1350-1420>
<M 1350-1420>
<K CONTEMP>
<D SL>
<V PROSE>
<T HISTORY>
<G TRANSL>
<F LATIN>
<W WRITTEN>
<X MALE>
<Y 40-60>
<H PROF>
<U X>
<E X>
<J X>
<I X>
<Z NARR NON-IMAG>
<S SAMPLE X>
[^TREVISA, JOHN.
POLYCHRONICON RANULPHI HIGDEN,
MONACHI CESTRENSIS, VOLS. VI, VIII.
ENGLISH TRANSLATIONS OF JOHN TREVISA AND OF
AN UNKNOWN WRITER OF THE FIFTEENTH CENTURY.
ROLLS SERIES, 41.
ED. J. R. LUMBY.
LONDON, 1876, 1882.
VI, PP. 209.14 - 231.7 (SAMPLE 1)
VIII, PP. 83.1 - 111.19 (SAMPLE 2)
VIII, PP. 347.1 - 352.13 (SAMPLE 3)^]
<S SAMPLE 1>
<P VI,209>
[} (\CAPITULUM VICESIMUM QUARTUM.\) }]
Leo +te emperour lete be +te enemyes of +te empere, and
werrede a+genst figures and ymages of holy seyntes. Pope
<P VI,211>
Gregory and Germanius of Constantynnoble wi+tstood hym
nameliche, as +te olde usage and custome wolde +tat is allowed
and apreeved by holy cherche, and seide +tat it is wor+ty and
medeful to do hem +te affecioun of worschippe. For we #
worschippe+t
in hem but God, [{and{] in worschippe of God and of
holy seyntes, +tat man have+t in mynde efte by suche ymages,
God allone schal be princepalliche worschipped, [{and after
hym creatures schal be i-worschipped{] in worschippe of hym.
Beda, (\libro 5=o=, capitulo 24=o=.\) +Tat +gere deide #
Withredus kyng
of Caunterbury, and Thobias bisshop of Rouchestre, +tat cou+te
Latyn and Grew as wel as his owne longage. (\Paulus,
libro 7=o=.\) +Tat +gere Sarasyns com to Constantynnoble and #
byseged
it +tre +gere, and took +tennes moche good and catel.
In the above passage the reference codes indicate that we are dealing with Polychronicon, a prose work of historical writing and non-imaginative narration translated from Latin into English in a Southern dialect by John Trevisa, a representative of the professional ranks, aged between 40 and 60. Both the original and the manuscript version date from the third sub-period of Middle English (1350-1420). The set of reference codes is followed by a bibliographical reference to the source text giving information on the volume, page and line references to the extracts selected.
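The rule that a reference code stays valid until a new value replaces it can be sketched as a small reader that attaches the current set of values to every text line. The sketch is not a COCOA implementation: `read_with_references` is a hypothetical name, and it assumes that each reference code stands alone on its line, as in the extract above.

```python
import re

# A reference line: "<", one upper-case letter, a space, the value, ">".
REFERENCE_RE = re.compile(r"^<([A-Z]) (.*)>$")


def read_with_references(lines):
    """Yield (codes, text_line) pairs for every line that is not a reference.

    `codes` is a snapshot of the reference values in force at that line.
    """
    current = {}
    for raw in lines:
        line = raw.rstrip("\n")
        match = REFERENCE_RE.match(line)
        if match:
            current[match.group(1)] = match.group(2)
        else:
            yield dict(current), line
```

Selecting only the lines whose snapshot satisfies some condition (for instance, text type HISTORY and a given value of <C) would then correspond to the selective searches described above.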
Preliminary Remarks
The values used for defining the textual parameters are listed below in the order the codes occur at the beginning of each text file. To help select only sections from the Corpus for computer searches, all the values appearing in the database will be given for each parameter (except for <B, <Q, <N, <A, <S, <P and <R) and for the three main periods distinguished.
(1) <B = `name of text file'
(2) <Q = `text identifier'
(3) <N = `name of text'
(4) <A = `author'
(5) <C = `part of corpus'
(6) <O = `date of original'
(7) <M = `date of manuscript'
(8) <K = `contemporaneity'
(9) <D = `dialect'
(10) <V = `verse' or `prose'
(11) <T = `text type'
(12) <G = `relationship to foreign original'
(13) <F = `foreign original'
(14) <W = `relationship to spoken language'
(15) <X = `sex of author'
(16) <Y = `age of author'
(17) <H = `social rank of author'
(18) <U = `audience description'
(19) <E = `participant relationship'
(20) <J = `interaction'
(21) <I = `setting'
(22) <Z = `prototypical text category'
(23) <S = `sample'
(24) <P = `page'
(25) <R = `record'
(1) <B = `name of text file'
The names of the 242 files follow MS-DOS conventions. Each file name begins with the character C (for `Corpus'), followed by O (for `Old English'), M (for `Middle English') or E (for `Early Modern English'). The file names reflect, by and large, the names of authors or texts in the Old and Middle English sections of the Corpus. In the Early Modern English section the file names are based on the systematic coverage of different text types.
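The file-name convention makes it possible to read the main period off the first two characters of the name. A minimal sketch (the function name `period_of` is hypothetical):

```python
PERIODS = {
    "O": "Old English",
    "M": "Middle English",
    "E": "Early Modern English",
}


def period_of(file_name: str) -> str:
    """Return the main period encoded in a Helsinki Corpus file name."""
    name = file_name.upper()
    if len(name) < 2 or name[0] != "C" or name[1] not in PERIODS:
        raise ValueError(f"not a Helsinki Corpus file name: {file_name!r}")
    return PERIODS[name[1]]
```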
(2) <Q = `text identifier'
The purpose of the `text identifier' is to sum up the main characteristics of the text in one code and permit identifying the source for an example retrieved from the Corpus (cf. Section 5). A full list of the <Q codes found in the Corpus is given in Appendix 1.
<Q   O2/4   NN    BIL   GDC>
     (a)    (b)   (c)   (d)
(a) `part of corpus'
(b) `prototypical text category'
(c) `text type'
(d) abbreviated title
Of these `part of corpus', `prototypical text category' and
`text type' conveniently repeat the information given by the
reference codes <C, <Z and <T, see (a) and (b)-(c) below.
For abbreviated titles, see (d) below.
(a) The values used to indicate `part of corpus' in the <Q code
are those listed in (5).
(b)-(c) The abbreviations used for the values of `prototypical
text category' and `text type' in the <Q code are as follows
(the code <T is further discussed in (11), and the code <Z
in (22)):
(d) Abbreviated titles are listed in alphabetical order in Part Three of this guide and can be used to look up the source references to the texts (cf. Section 5.).
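The four parts of a <Q value can be separated as sketched below. The sketch assumes that the first three parts never contain a space, so that everything after the third blank belongs to the abbreviated title; `TextIdentifier` and `parse_q` are hypothetical names.

```python
from typing import NamedTuple


class TextIdentifier(NamedTuple):
    part_of_corpus: str          # (a)
    prototypical_category: str   # (b)
    text_type: str               # (c)
    abbreviated_title: str       # (d)


def parse_q(value: str) -> TextIdentifier:
    """Split the value of a <Q code into its four parts."""
    fields = value.split(None, 3)
    if len(fields) != 4:
        raise ValueError(f"expected four parts in <Q value: {value!r}")
    return TextIdentifier(*fields)
```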
(3) <N = `name of text'
Names ascribed to texts reflect, by and large, the key words of the title (or author or type) of the text, e.g.
<N GREG DIAL C>
stands for `Gregory's Dialogues', MS C. Many `names of text' in the Old English section of the Corpus reflect those given in Healey and Venezky (1980, cf. Note 5); many of those adopted for the Middle English section are based on the title stencils given in the Middle English Dictionary (eds. H. Kurath, S. M. Kuhn etc., Ann Arbor, Michigan: University of Michigan Press, 1954-). In the Middle and Early Modern English sections the words LET TO stand for `letter to'.
(4) <A = `author'
The names of the authors, when known (in full), are given in the order `surname' - `first name', e.g.
<A WAERFERTH>
<A CHAUCER GEOFFREY>
(5) <C = `part of corpus'
The value for the sub-period represented by the text is ascribed as a combination of the original and manuscript versions (cf. codes <O and <M, below). When the values coincide, only one figure is given; when they differ, the value given for the original version precedes that given for the manuscript, separated by a slash (/). In printed texts the value <M is marked as irrelevant (X).
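The combination rule for <C values can be illustrated with a short sketch. It assumes values of the shape seen in this manual (a period prefix, one figure, and optionally a slash and a second figure, as in O2/4); `split_part_of_corpus` is a hypothetical name. Note that a single figure is returned for both versions, although for printed texts the manuscript date is irrelevant.

```python
import re

C_VALUE_RE = re.compile(r"^([A-Z]+)(\d)(?:/(\d))?$")


def split_part_of_corpus(value: str) -> tuple[str, str, str]:
    """Return (period prefix, sub-period of original, sub-period of manuscript)."""
    match = C_VALUE_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised <C value: {value!r}")
    period, original, manuscript = match.groups()
    # When only one figure is given, the two values coincide.
    return period, original, manuscript or original
```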
(6) <O = `date of original'
(7) <M = `date of manuscript'
(8) <K = `contemporaneity'
The overall time spans covering the dates of the original version and the manuscript of the source text are indicated by the codes <O and <M; contemporaneity of the two is specified by the code <K (within the time span of some 40 years). The value SAME in this code means that, as far as we know, the manuscript is the same as the original text (e.g. Ormulum).
      Old English    Middle English    EMod English
<O    -850           1150-1250         1500-1570
      850-950        1250-1350         1570-1640
      950-1050       1350-1420         1640-1710
      1050-1150      1420-1500
      X              X
<M    -850           1150-1250         X
      850-950        1250-1350
      950-1050       1350-1420
      1050-1150      1420-1500
                     X
<K    CONTEMP        CONTEMP           X
      NON-CONTEMP    NON-CONTEMP
      SAME           SAME
      X              X
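The time spans listed for <O can be used to find the span into which a given year falls. The sketch below is illustrative; `date_range` is a hypothetical name, and since adjacent spans share their boundary year, the sketch assigns a boundary year to the later span (an assumption not stated in the manual), except for 1710, which closes the last span.

```python
from bisect import bisect_right

# Upper limits of the eleven time spans, in chronological order.
UPPER_BOUNDS = [850, 950, 1050, 1150, 1250, 1350, 1420, 1500, 1570, 1640, 1710]
RANGES = [
    ("Old English", "-850"),
    ("Old English", "850-950"),
    ("Old English", "950-1050"),
    ("Old English", "1050-1150"),
    ("Middle English", "1150-1250"),
    ("Middle English", "1250-1350"),
    ("Middle English", "1350-1420"),
    ("Middle English", "1420-1500"),
    ("EMod English", "1500-1570"),
    ("EMod English", "1570-1640"),
    ("EMod English", "1640-1710"),
]


def date_range(year: int) -> tuple[str, str]:
    """Return (main period, time span) for a year up to 1710."""
    if year > UPPER_BOUNDS[-1]:
        raise ValueError(f"{year} lies outside the period covered")
    index = min(bisect_right(UPPER_BOUNDS, year), len(RANGES) - 1)
    return RANGES[index]
```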
(9) <D = `dialect'
Old English    Middle English    EMod English
A/X            EML               ENGLISH
Abbreviations: The elements of mixed dialects are separated by slashes. The final letter in Middle English dialect codings denotes the source of the definition: L = LALME (A Linguistic Atlas of Late Mediaeval English by Angus McIntosh, M. L. Samuels and Michael Benskin. Aberdeen: Aberdeen University Press, 1986); O = source other than LALME.
A = `Anglian'
EML, EMO = `East Midland'
ENGLISH = `Southern British standard'
(10) <V = `verse' or `prose'
The values VERSE and PROSE occur throughout the main three periods of the Corpus.
(11) <T = `text type'
Old English       | Middle English    | EMod English
LAW               | LAW               | LAW
DOCUM             | DOCUM             |
HANDB ASTRONOMY   | HANDB ASTRONOMY   | HANDB OTHER
SCIENCE ASTRONOMY |                   |
EDUC TREAT        |                   |
PHILOSOPHY        | PHILOSOPHY        | PHILOSOPHY
HOMILY            | HOMILY            |
SERMON            | SERMON            |
RULE              | RULE              |
REL TREAT         | REL TREAT         |
PREFACE/EPIL      | PREFACE/EPIL      |
PROC DEPOS        | PROC TRIAL        |
HISTORY           | HISTORY           | HISTORY
GEOGRAPHY         |                   |
TRAVELOGUE        | TRAVELOGUE        | TRAVELOGUE
DIARY PRIV        |                   |
BIOGR LIFE SAINT  | BIOGR LIFE SAINT  |
FICTION           | FICTION           | FICTION
ROMANCE           |                   |
DRAMA MYST        |                   |
LET PRIV          | LET PRIV          |
BIBLE             | BIBLE             | BIBLE
X                 | X                 |
(12) <G = `relationship to foreign original'
(13) <F = `foreign original'
     | Old English | Middle English | EMod English
<G   | GLOSS       |                |
<F   | LATIN       | LATIN          | LATIN
(14) <W = `relationship to spoken language'
Old English | Middle English | EMod English
X           | WRITTEN        | WRITTEN
Abbreviations: SCRIPT = `written to be spoken'
(15) <X = `sex of author'
(16) <Y = `age of author'
(17) <H = `social rank of author'
     | Old English | Middle English | EMod English
<X   | X           | MALE           | MALE
<Y   | X           | -20            | -20
<H   | X           | HIGH           | HIGH
Abbreviations:
HIGH PROF = `the author is moving from higher social ranks to professional ranks'
(18) <U = `audience description'
(19) <E = `participant relationship'
(20) <J = `interaction'
(21) <I = `setting'
     | Old English | Middle English | EMod English
<U   | X           | PROF           | PROF
<E   | X           | INT DOWN       | INT DOWN
<J   | X           | INTERACTIVE    | INTERACTIVE
<I   | X           | INFORMAL       | INFORMAL
Abbreviations:
PROF = `the work is intended for a professional audience'
(22) <Z = `prototypical text category'
Finally, the texts have been grouped in larger categories, which are presumed to reflect the continuity of the types of text represented throughout the history of English.
Old English | Middle English | EMod English
STAT        | STAT           | STAT
Abbreviations:
EXPOS = `expository'
(23) <S = `sample'
(24) <P = `page'
On the one hand, the category of `sample' has been applied as a flexible means of marking the different extracts drawn from one text. On the other hand, the category has been used to group several related texts or text extracts (e.g. charters fulfilling one and the same set of parameter values, letters written by the representatives of one and the same family etc.).
When confusion might arise, page numbers include volume references; columns are marked with C1 and C2, e.g.
<P I,64.C1>
reads `first volume, page 64, column 1'. When page numbers are not
indicated, quarto (or folio) numbers are referred to instead (R =
`recto', V = `verso'), e.g.,
<P E4R>
reads `quarto E4, recto'.
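The two kinds of `page' value just described can be read mechanically. The following Python sketch is an illustration only: the function name and the returned dictionary keys are hypothetical, and values that fit neither pattern are simply returned as None rather than interpreted.

```python
import re
from typing import Dict, Optional

# Optional volume in Roman numerals, page number, optional column C1/C2.
PAGE_RE = re.compile(
    r"^(?:(?P<volume>[IVXLCDM]+),\s*)?(?P<page>\d+)(?:\.C(?P<column>[12]))?$"
)
# Quarto or folio reference: signature (e.g. E4) followed by R or V.
LEAF_RE = re.compile(r"^(?P<leaf>[A-Z]+\d+)(?P<side>[RV])$")


def parse_page_ref(value: str) -> Optional[Dict[str, object]]:
    """Interpret the value of a <P ...> code, e.g. 'I,64.C1' or 'E4R'."""
    value = value.strip()
    match = PAGE_RE.match(value)
    if match:
        column = match.group("column")
        return {
            "volume": match.group("volume"),
            "page": int(match.group("page")),
            "column": int(column) if column else None,
        }
    match = LEAF_RE.match(value)
    if match:
        side = "recto" if match.group("side") == "R" else "verso"
        return {"leaf": match.group("leaf"), "side": side}
    return None  # pattern not covered by this sketch


print(parse_page_ref("I,64.C1"))
print(parse_page_ref("E4R"))
```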
(25) <R = `record'
The category `record' occurs in the Old English section of the Corpus only, as keyed in for the purposes of the Toronto Corpus (Healey and Venezky 1980, cf. Note 5).
At the moment the different versions of the Helsinki Corpus are distributed by the HIT Centre/Humanities Information Technologies Research Programme (Allégaten 27, N-5007 Bergen, Norway; fax: +47-55-589470; e-mail: icame@hit.uib.no) and the Oxford Text Archive (Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6NN, United Kingdom; fax: +44-1865-273275; e-mail: info@ota.ahds.ac.uk).
For tape and diskette formats offered, please consult the order forms of the distributors.
The following versions are available (four sections of this guide accompany all versions as text files, see above):
1. For mainframe use the material is available in two versions:
(1) 242 source files
(2) 1-file version, created by appending the 242 source files one after another
The tape distributed by the Oxford Text Archive contains the
1-file version of the Corpus (10 Mb in size; in ASCII or EBCDIC
code).
2. For microcomputer use the material is available in two main versions distributed by the HIT Centre only:
(1) Text files
The 1-file version and the 242 source files are distributed as
compressed files to be decompressed with a program that
accompanies the diskettes. The compressed files are organized
to follow the sub-section division of the Corpus.
(2) WordCruncher files for MS-DOS machines
To use the WordCruncher files of the Corpus, first acquire the
WordCruncher program (version 4.1 or 4.30) and make sure that
it runs on your machine. The WCView part of the program permits
carrying out searches on words and phrases; the WCIndex part of
the program is needed for re-indexing the text files into new
versions. The program can be obtained from Johnston and
Company (P.O. Box 446, American Fork, Utah 84003-0446, U.S.A.;
fax: +1-801-756-0242), or from the HIT Centre.
Three WordCruncher versions of the Corpus are available:
1-file version: the material compiled as one file
3-file version: the material organized into three main period files
11-file version: the material organized into eleven sub-period sections
For further details, see Section 4.2. and Section 5.2.
Moreover, several versions of the Corpus are included in the CD-ROM disk ICAME Collection of English Language Corpora available from the HIT Centre (text version for MS-DOS, Macintosh and Unix, and WordCruncher and TACT versions for MS-DOS).
How the 242 files occur in the larger units is explained in Section 4.2.
4.2. Larger Units
For mainframe use the 242 source files were appended to create the one-file version HKI (the order of the 242 files in the HKI file is shown in Section 2 and in Appendix 1).
Similarly, for the WordCruncher version the 242 files were compiled as one HKI file. In addition, mainly to make it easier to include only parts of the Corpus in WordCruncher searches, the following versions were compiled:
(1) eleven sub-period files:
HO1, HO2, HO3, HO4 = Old English sub-sections
HM1, HM2, HM3, HM4 = Middle English sub-sections
HE1, HE2, HE3 = Early Modern English sub-sections
(2) three main period files:
HCO = HO1 + HO2 + HO3 + HO4
HCM = HM1 + HM2 + HM3 + HM4
HCE = HE1 + HE2 + HE3
The following table summarizes how the material is organized (the period code is ascribed according to the reference code <C, cf. Section 3.3.4. (5)). The files under the title MF are available for mainframe use in ASCII or EBCDIC format; the files under the title T are available for microcomputer use; the files under the title WCr are available in WordCruncher format:
Sub-periods | 242 files (MF/T) | Larger units (WCr) | Larger units (WCr) | Larger units (MF/T/WCr)
-850        | CODOCU1          | HO1                | HCO                | HKI
850-950     | COLAW2           | HO2                | HCO                | HKI
950-1050    | COLAW3           | HO3                | HCO                | HKI
1050-1150   | COLAW4           | HO4                | HCO                | HKI
1150-1250   | CMPERIDI         | HM1                | HCM                | HKI
1250-1350   | CMDOCU2          | HM2                | HCM                | HKI
1350-1420   | CMDOCU3          | HM3                | HCM                | HKI
1420-1500   | CMLAW            | HM4                | HCM                | HKI
1500-1570   | CELAW1           | HE1                | HCE                | HKI
1570-1640   | CELAW2           | HE2                | HCE                | HKI
1640-1710   | CELAW3           | HE3                | HCE                | HKI
Abbreviations: MF = mainframe (ASCII/EBCDIC); T = text file for microcomputers; WCr = WordCruncher
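The relation between the sub-period files and the larger units, and the way a larger unit is created by appending files one after another, can be sketched in Python. This is an illustration under stated assumptions, not the procedure actually used by the compilers: the function names are hypothetical, and HCO is taken to consist of the four Old English sub-period files.

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable, List

# Larger units and the sub-period files they are assumed to contain.
LARGER_UNITS: Dict[str, List[str]] = {
    "HCO": ["HO1", "HO2", "HO3", "HO4"],
    "HCM": ["HM1", "HM2", "HM3", "HM4"],
    "HCE": ["HE1", "HE2", "HE3"],
}
# HKI holds all the material, in chronological order.
LARGER_UNITS["HKI"] = (
    LARGER_UNITS["HCO"] + LARGER_UNITS["HCM"] + LARGER_UNITS["HCE"]
)


def units_containing(sub_period_file: str) -> List[str]:
    """Return the larger units that include the given sub-period file."""
    return [unit for unit, members in LARGER_UNITS.items()
            if sub_period_file in members]


def append_files(sources: Iterable[Path], target: Path) -> None:
    """Create one file by appending the source files one after another."""
    with open(target, "wb") as out:
        for source in sources:
            with open(source, "rb") as part:
                shutil.copyfileobj(part, out)


print(units_containing("HM2"))
```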
5. Using Concordance Programs
This section illustrates the use of the Helsinki Corpus with two ready-made concordance program packages, the Oxford Concordance Program (OCP) and WordCruncher. Early versions of these programs were available or in preparation when the Corpus project was launched. Though the coding scheme adopted is largely tailored to fit these programs, applications with other programs and operating systems have also proved highly rewarding. A recent application is Lexa, a multi-purpose program package illustrated in Section 5.3.
5.1. Oxford Concordance Program
Given the COCOA format adopted for coding the textual parameters, the Helsinki Corpus text files lend themselves to OCP and Micro-OCP applications as such.
The command SELECT can be used to include only portions of text for processing. Information on the reference categories coded in the Helsinki Corpus is given in Section 3.3. above. The source texts used are listed in Part II (according to authors and names of texts) and Part III (according to abbreviated titles). Part III is intended for immediate identification of the source of an example drawn from the Corpus; Part II gives further information on the samples included from each text.
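For readers working outside OCP, the idea of selecting portions of text by reference category can be sketched in Python. This is not OCP's SELECT command, only a rough analogue: the function names are hypothetical, and it is assumed that each reference code stands on a line of its own in the form `<X value>`, X being a single capital letter.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# A COCOA-style reference line: "<", one capital letter, a space, the value, ">".
REFERENCE = re.compile(r"^<([A-Z]) (.*)>$")


def read_cocoa(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (current reference values, text line) for every text line."""
    current: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        match = REFERENCE.match(line)
        if match:
            # A reference stays in force until it is redefined.
            current[match.group(1)] = match.group(2)
        elif line:
            yield dict(current), line


def select(lines: Iterable[str], category: str, value: str) -> List[str]:
    """Keep only the text lines for which the category has the given value."""
    return [text for refs, text in read_cocoa(lines)
            if refs.get(category) == value]


sample = [
    "<Q E1 IS HANDO FITZH>",
    "<P 41>",
    "as dere, as if they were all together,",
    "<P 42>",
    "an other .C. hearynges,",
]
print(select(sample, "P", "41"))
```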
Please note the following practical points when using the Corpus with OCP:
- the values used to define the reference code categories are given in Section 3.3.4.
- the reference category <Q can be conveniently used
  - for obtaining information on the background of the source texts (sub-period, prototypical text category and text type), see Section 3.3.4.
  - for identifying the source text on the basis of the abbreviated titles listed in Part III, see the example below.
- the reference category <P can be used
  - for obtaining the page or like reference to the source text, see the example below.
- to make sure that all possible word forms of the items looked for will be taken into consideration,
  - check the word forms from kwic-concordances prepared with OCP or some other program, or use the WordCruncher version of the Corpus for this purpose.
  - check ASD, MED, OED or other basic reference works for possible variant spellings.
- notice that the characters declared as DIACRITICS in the OCP command file may help to sort out the output data but may also exclude occurrences not taken into consideration in the *ACTION section (e.g. if the superscript character is declared as a DIACRITIC, the form d=r= (for dr and Dr) is not found if only the form dr is given in the PICK command).
- runs may fail if too many types of COMMENTS are defined in the OCP command file when applied to the one-file version or other larger units of the Corpus (the term COMMENTS is used to refer to text which does not appear as headwords but is given in the context of other words). The remedy is usually to reduce the number of types of COMMENTS defined.
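The point about superscript coding can be made concrete with a small Python sketch, independent of OCP. It assumes that the equals sign follows each superscript character, as in the forms d=r= and y=e= cited in this manual; the function names are hypothetical.

```python
def strip_superscript_marks(form: str) -> str:
    """Remove the '=' that marks the preceding character as superscript."""
    return form.replace("=", "")


def same_word(form: str, target: str) -> bool:
    """Compare a Corpus form with a search form, ignoring superscript
    coding and the difference between capital and small letters."""
    return strip_superscript_marks(form).lower() == target.lower()


# Both spellings are found when the search form is plain 'dr'.
print([w for w in ["dr", "d=r=", "D=r=", "der"] if same_word(w, "dr")])
```

Normalizing the forms before comparison avoids missing occurrences that differ from the search form only in their superscript coding.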
By way of illustration, the following OCP file (for the mainframe version 2.30) makes a kwic-concordance of certain occurrences of the impersonal construction me thinks in Middle and Early Modern British English. The search is restricted to the instances written as two words and based on the -i- and -y- stems of the verb. Foreign language is defined as COMMENTS, and the contexts of the occurrences of each keyword are sorted according to what appears to the right of the keyword.
Example:
*input
references cocoa "<" to ">".
*words
alphabet "A=a +A=+a B=b C=c D=d +D=+d E=e +E=+e F=f
G=g +G=+g H=h I=i J=j K=k L=l +L=+l M=m N=n O=o P=p Q=q R=r
S=s T=t +T=+t U=u V=v W=w X=x Y=y Z=z 1 2 3 4 5 6 7 8 9 0
&".
*action
do concordance.
*format
layout length 80.
*go
The search carried out on the HKI file yields the following data (only the top of the output file can be given here; the relevant examples are printed in italics; the location of the `text identifier', abbreviated title and `page' codes is indicated above the first example):
`text identifier', ending in the abbreviated title (see Part Three) | `page' | context
M1 IR RELT VICES1  | 23       | +de-liker, +gif +du | me +din | 1
E2 XX COME SHAKESP | 43.C2    | I say, loue me: By
E2 NI FICT DELONEY | 73       | een ready # to tell
M3 IR HOM NHOM     | II, 205  | ws +tus changed se.
E1 XX COME STEVENSO | 14      | hyn taketh, And Tyb
E2 XX COME MIDDLET | 16       | me Calues Head now,
E1 NI FICT HARMAN  | 40       | lesse him for me!"
E3 EX SCIO HOOKE   | 13.5,116 | ve in it; # though,
Further searches on the construction should take into consideration instances written as one word (using the PICK WORDS command), forms with other stem vowels and so forth.
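A search of this kind, extended to forms written as one word, can also be sketched outside OCP. The Python fragment below is an illustration only: the function name is hypothetical, and the pattern covers just the -i- and -y- stems with the initial consonant written th, +t or +d, which is an assumption about the spellings involved rather than a complete list.

```python
import re
from typing import List, Tuple

# "me", optional space, th/+t/+d, stem vowel i or y, n, c or k, any ending.
ME_THINKS = re.compile(r"\bme\s*(?:th|\+t|\+d)[iy]n[ck]\S*", re.IGNORECASE)


def kwic(text: str, width: int = 20) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) triples,
    sorted according to what appears to the right of the keyword."""
    hits = []
    for match in ME_THINKS.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    hits.sort(key=lambda hit: hit[2].lower())
    return hits


text = "I say, loue me: By my troth me thinks it is so. And me thynketh a wonder."
for left, keyword, right in kwic(text):
    print(f"{left:>20} {keyword} {right}")
```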
5.2. WordCruncher
When compiling the WordCruncher (4.1 or 4.30) versions of the Corpus, the changes in the text files were kept to a minimum. Please note that
- to provide information on the source text and page references, the code <Q (`text identifier') was converted into the 1st-level code |Q (20 characters allowed) and the code <P (`page') into the 2nd-level code |P (eight characters allowed).
- as no 3rd-level codes (a number between 0 and 65535) were used, WordCruncher added the word `Heading' at the end of the reference outline.
- the files can, of course, be re-indexed using other reference codes when need be.
- the larger WordCruncher units are named as follows (cf. 4.2. above):
1-file version:
HKI (all material)
3-file version:
HCO (Old English)
HCM (Middle English)
HCE (Early Modern English)
11-file version:
HO1, HO2, HO3, HO4 (OE by sub-sections)
HM1, HM2, HM3, HM4 (ME by sub-sections)
HE1, HE2, HE3 (EModE by sub-sections)
Which version to use depends largely on the item(s) investigated and on the level of delicacy required when sorting out the output lists. When looking for high-frequency words, working through the HCO, HCM and HCE files one by one may facilitate the task of sorting out the variant forms. Working on low-frequency items, on the other hand, may be handiest when using the HKI file.
By way of illustration, instances of the conjunction as if were retrieved from the Early Modern English texts by combining the possible variant forms of as and if within the space of eight characters. The context was restricted to seven lines. The first four examples obtained run as follows (the location of the `text identifier', abbreviated title and `page' codes is indicated under the first example):
Example:
Computer Book: c:\HC\HCE.BYB
Reference List: as,ase,if,iff,yf,yfe,yff
in euerye quarter a London busshell, or there about. For
the small corne lyeth in the holowe and voyde places of
the greate beanes, and yet shall the greate beanes be solde
as dere, as if they were all together, or derer, as a man
may proue by a famylier ensample. Let a man bye
.C. hearynges, two hearynges for a penye, and an other
.C. hearynges, thre for a peny, and let hym sell these
(E1 IS HANDO FITZH 41:Heading)
`text identifier' = E1 IS HANDO FITZH (FITZH = abbreviated title, see Part Three); `page' = 41
but the Skinne and Musculus fleshe, for the
panicle vnderneth it is of Pericranium, and
the bone is of the Coronal bone. Howebeit there
it is made broade, as yf ther were a double bone,
whiche maketh the forme of the Browes. It is called
the Forhead or Front, from one Eare to the other, and
from the rootes of the Eares of the head before, vnto y=e=
(E1 EX SCIM VICARY 34:Heading)
that thyng semeth cheyfly to be desyred or wished, for the
#
cause
or loue, wherof any thing is desyred. As yf a man would ryde
for cause of helth, he desyreth not so much the mouing to
ryde,
as the effect of his helth. Therfore when that all thyngs be
(E1 XX PHILO BOETHCO 77:Heading)
thy selfe, for thou hast cast thy selfe into the worste
#
thynges.
Like as if thou shouldest loke vpon the foule erth and heuen
in
order (all outwarde thynges leyde apart for the tyme) then
it
(E1 XX PHILO BOETHCO 101:Heading)
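The combination search illustrated above can be approximated in Python as follows. This is not WordCruncher's own procedure: the function name is hypothetical, and `within the space of eight characters' is interpreted here, as an assumption, as the distance from the beginning of the as-form to the beginning of the if-form.

```python
import re
from typing import List, Tuple

AS_FORMS = {"as", "ase"}
IF_FORMS = {"if", "iff", "yf", "yfe", "yff"}
# Words may contain the Corpus codes '+' and '='.
TOKEN = re.compile(r"[A-Za-z+=]+")


def find_as_if(text: str, max_gap: int = 8) -> List[Tuple[int, str]]:
    """Return (position, matched string) for each as-form followed by an
    if-form beginning no more than max_gap characters later."""
    tokens = [(m.group(0).lower(), m.start()) for m in TOKEN.finditer(text)]
    hits = []
    for index, (word, start) in enumerate(tokens):
        if word not in AS_FORMS:
            continue
        for other, other_start in tokens[index + 1:]:
            if other_start - start > max_gap:
                break
            if other in IF_FORMS:
                hits.append((start, text[start:other_start + len(other)]))
                break
    return hits


print(find_as_if("solde as dere, as if they were all together"))
print(find_as_if("it is made broade, as yf ther were a double bone,"))
```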
5.3. Lexa
The software package Lexa consists of a set of programs intended to perform the following types of task:
(i) automatically tag a file or files grammatically. Tagging is cumulative and permits later revision and re-tagging. Manual tagging with user confirmation is also available.
(ii) generate a database of unique word forms from a group of input files (reverse dictionary generation is also catered for).
(iii) produce a concordance file from a set of input files (KWIC or KWOC type).
(iv) allow various types of information retrieval from any input texts.
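To make task (ii) concrete, the following Python sketch builds a list of unique word forms with their frequencies and a reverse dictionary (forms sorted by their endings). It does not reproduce Lexa's implementation or file formats; the function names and the definition of a word (letters plus the Corpus codes `+' and `=') are assumptions made for the illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

WORD = re.compile(r"[A-Za-z+=]+")


def word_forms(texts: Iterable[str]) -> Counter:
    """Count the unique word forms found in a group of input texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


def reverse_dictionary(forms: Iterable[str]) -> List[str]:
    """Sort the unique forms from the end of the word backwards."""
    return sorted(set(forms), key=lambda form: form[::-1])


counts = word_forms(["as dere, as if they were", "As yf a man would ryde"])
print(counts.most_common(3))
print(reverse_dictionary(counts))
```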
The main tagging program (lexa.exe), along with others in the set, works interactively from a desktop with pull-down menus and mouse support or in batch mode from the DOS command line. The programs have been adapted to deal with the Helsinki Corpus texts, e.g. they recognize comments and special coding in these and can skip (or include) them. The user profile (setup) can be stored in a configuration file and can thus be accessed repeatedly. A font module is included, which allows users to see original Old and Middle English symbols on screen and have these printed as well.
An install program is also included with which one can set up Lexa for one's own system; all programs are MS-DOS executable files which are machine-independent. To avail oneself of the font module, a system must have a VGA, EGA or Hercules Plus video adapter.
Included in the Lexa suite are a number of statistical options. These can be used in conjunction with any text files to carry out standard inferential statistical tests (chi-square test, Wilcoxon or Spearman tests, among many others). Lexical density and frequency tables can be generated, values can be ranked and various calculations later computed.
The above package has been published as three volumes with accompanying diskettes. Interested users of the Helsinki Corpus can obtain further information from the following address:
The HIT Centre/Humanities Information Technologies Research
Programme
Allégaten 27
N-5007 Bergen
Norway
For more details on this corpus processing software, see Appendix 3.
Notes to the Preface and Sections 1 to 5
1. Cf. G. Melchers, Studies in Yorkshire Dialects, Based on Recordings of 13 Dialect Speakers in the West Riding (I-II) (= Stockholm Theses in English, 9), Stockholm University, 1972.
2. For previous reports on the structure and compilation of the Corpus, cf. O. Ihalainen, M. Kytö and M. Rissanen, "The Helsinki Corpus of English Texts: Diachronic and Dialectal: Report on work in progress", Corpus Linguistics and Beyond. Proceedings of the Seventh International Conference on English Language Research on Computerized Corpora, ed. W. Meijs (Amsterdam: Rodopi), 1987: 21-32; M. Kytö and M. Rissanen, "The Helsinki Corpus of English Texts: classifying and coding the diachronic part", Corpus Linguistics, Hard and Soft. Proceedings of the Eighth International Conference on English Language Research on Computerized Corpora (= Language and Computers: Studies in Practical Linguistics, 2), eds. M. Kytö, O. Ihalainen and M. Rissanen (Amsterdam: Rodopi), 1988: 169-179; M. Rissanen, "Three problems connected with the use of diachronic corpora", ICAME Journal, 1989 13: 16-19; M. Kytö, "Progress report on the diachronic part of the Helsinki Corpus", idem, pp. 12-15 and "Introduction to the Use of the Helsinki Corpus of English Texts: Diachronic and Dialectal", Proceedings from The Stockholm Conference on the Use of Computers in Language Research and Teaching, September 7-9, 1989, ed. M. Ljung (Stockholm: English Department, University of Stockholm), 1990: 41-56; M. Kytö and M. Rissanen, "A Language in Transition: The Helsinki Corpus of English Texts", ICAME Journal, 1992 16: 7-27.
3. For basics, cf. e.g. U. Weinreich, W. Labov and M. I. Herzog, "Empirical foundations for a theory of language change", Directions for Historical Linguistics: A Symposium, eds. W. P. Lehmann and Y. Malkiel (Austin and London: University of Texas Press), 1968: 95-195; M. Rydén, An Introduction to the Historical Study of English Syntax (= Stockholm Studies in English, LI) (Stockholm: Almqvist and Wiksell International), 1979; S. Romaine, Socio-historical Linguistics, Its Status and Methodology (= Cambridge Studies in Linguistics, 34) (Cambridge, London etc.: Cambridge University Press), 1982; M. Rissanen, "Variation and the study of English historical syntax", Diversity and Diachrony (= Current Issues in Linguistic Theory, 53), ed. D. Sankoff (Amsterdam and Philadelphia: John Benjamins), 1986: 97-109. For a more detailed discussion, cf. Rissanen et al., Early English in the Computer Age: Explorations through the Helsinki Corpus (= Topics in English Linguistics, 11) (Berlin and New York: Mouton de Gruyter, 1993).
4. The word counts are obtained using a microcomputer program devised by Mr. Michał Jankowski (Adam Mickiewicz University, Poznan, Poland).
5. Cf. A. diPaolo Healey and R. L. Venezky, A Microfiche Concordance to Old English. The List of Texts and Index of Editions (Publications of the Dictionary of Old English, 1) (Toronto: Pontifical Institute of Mediaeval Studies), 1980.
6. Cf. O. Ihalainen, "Creating linguistic databases from machine-readable dialect texts", Methods in Dialectology. Proceedings of the Sixth International Conference Held at the University College of North Wales, 3rd-7th August 1987 (= Multilingual Matters, 48), ed. A. R. Thomas (Clevedon, Philadelphia: Multilingual Matters Ltd), 1988: 569-584; "A source of data for the study of English dialectal syntax: the Helsinki Corpus", Theory and Practice in Corpus Linguistics (= Language and Computers: Studies in Practical Linguistics 4), eds. J. Aarts and W. Meijs (Amsterdam and Atlanta, GA: Rodopi), 1990: 83-103.
7. Cf. R. Garside, "The CLAWS word-tagging system", The Computational Analysis of English. A Corpus-Based Approach, eds. R. Garside, G. Leech and G. Sampson (London and New York: Longman), 1987: 30-41.
8. The character descriptions are based on VAX EDT Reference Manual (VAX/VMS Volume 3A), Software Version VAX/VMS Version 4.0 by Digital Equipment Corporation, Maynard (Massachusetts), September 1984 (pp. A/1-6).
9. Owing to the problems encountered with the WordCruncher substring searches, the asterisk (*) used for `ash', `eth' (*a, *A, *d, *D) etc. in the pilot versions of the Corpus has been replaced by the plus sign (+) in the current version.
10. Two dashes in the examples cited indicate material deleted here.
11. Cf. WordCruncher. Text Indexing and Retrieval Software (Versions 4.1 and 4.30) (Provo, Utah: Brigham Young University and Electronic Text Corporation), 1987 and 1989; OCP = Oxford Concordance Program, cf. Users' Manual (Version 2), comps. S. Hockey and J. Martin (Oxford: Oxford University Computing Service), 1988; cf. also Users' Manual (Version 1.0), comps. S. Hockey and I. Marriott (Oxford: Oxford University Computing Service), 1984; Micro-OCP = a microcomputer implementation of OCP, comps. S. Hockey and J. Martin, Oxford University Computing Service (Oxford: Oxford University Press), 1988. For further applications, cf., e.g., TACT, User's Guide, Version 1.2 by J. Bradley (Toronto: University of Toronto Computing Services, 1990). A UNIX-based retrieval program devised for the diachronic part of the Corpus is being prepared at the Department of General Linguistics, University of Helsinki.