THE TAGGED LOB CORPUS
Users' Manual
by
Stig Johansson
in collaboration with
Eric Atwell
Roger Garside
Geoffrey Leech
Norwegian Computing Centre for the Humanities
Bergen, 1986
Preface
The tagged LOB Corpus is the result of cooperation among researchers at the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen. The principal members of the research team have been:
Lancaster:
Geoffrey Leech and Roger Garside (project leaders)
Eric Atwell
Ian Marshall
Oslo:
Stig Johansson (project leader)
Mette-Cathrine Jahr
Bergen:
Knut Hofland
The project was supported by the Social Science Research Council and the Norwegian Research Council for Science and the Humanities.
The section on computational aspects (4) is a revised version of: Geoffrey Leech, Roger Garside, and Eric Atwell, 'The Automatic Grammatical Tagging of the LOB Corpus,' ICAME News 7 (1983), pp. 13-33. The rest of the manual is the work of Stig Johansson, who could also draw on the information in Eric Atwell's Manual Pre-Edit Handbook (November 1981) and Manual Post-Edit Handbook (June 1982).
Stig Johansson    Oslo
Eric Atwell       Leeds
Roger Garside     Lancaster
Geoffrey Leech    Lancaster
Contents
1 The LOB Corpus
2 Tagged versions
3.1 An overview of the LOB tag set
3.2 Some differences between the LOB and Brown tag sets
3.3 Ditto tags
4.1 Pre-editing
4.2 Tag assignment
4.3 Tag selection
4.4 Idiom tagging
4.5 Post-editing
5 Differences between the original corpus and the tagged versions
5.1 Capitalisation
5.2 Punctuation marks and sentence/paragraph division
5.3 Contractions
5.4 Codes for abbreviations and 'non-English' words
5.5 Other differences
6 Principles in post-editing
7 Problem areas
7.1 Word division
7.2 Idioms
7.3 -ed forms
7.4 -ing forms
7.5 Auxiliaries
7.6 Nouns: number and case
7.7 Proper nouns
7.8 Adjectives
7.9 Adjective vs noun
7.10 Adverbs
7.11 Adverb vs adjective
7.12 Determiners/pronouns
7.13 Prepositions
7.14 Conjunctions
7.15 Conjunction vs preposition
7.16 WH-words
7.17 Numerals
7.18 Interjections
7.19 Abbreviations
7.20 Non-standard forms
7.21 Foreign words and expressions
7.22 Formulas and scientific symbols
7.23 Cited forms
7.24 Punctuation marks
7.25 Letters
8.1 Tapes and files
8.2 Records
8.3 Sorting
8.4 Example
8.5 Frequencies
8.6 Index to the KWIC concordance
9 Developments
Notes
References
Appendix 1: General flowchart of Tag Assignment Program
Appendix 2: Tagging decisions of APPLYHYPHEN
Appendix 3: Tagging decisions of APPLYWIC
Appendix 4: List of tags
Coding key
The following codes have been taken over from the original (untagged) LOB Corpus:
*@     degree symbol
*+     pound
*-     dash
*/     asterisk
*'     begin quote
**'    end quote
*[     begin comment tag
*]     end comment tag
*;     begin subscript
**;    end subscript
*:     begin superscript
**:    end superscript
*?1    macron on preceding character
*?2    acute accent on preceding character
*?3    grave accent on preceding character
*?4    tilde on preceding character
*?5    circumflex accent on preceding character
*?6    cedilla under preceding character
"      umlaut or diaeresis on preceding character
\0     abbreviation
For a full list of *? codes (=uncoded character), see Johansson et al (1978). The word-class tags are surveyed in Section 3 and Appendix 4. As regards other coding conventions, see Sections 2.6 (special information in the vertical version) and 5.2 (sentence and paragraph division).
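As an illustration of how the coding key might be used when processing corpus text, the following is a minimal Python sketch that locates codes in a line and reports their meaning. It covers only a subset of the codes listed above. The names LOB_CODES and find_codes and the sample strings in the usage note are hypothetical and are not part of the corpus distribution.

```python
import re

# Subset of the coding key above; descriptions follow the table.
# Illustrative only: the full corpus uses further codes (see the *? list).
LOB_CODES = {
    "*@": "degree symbol",
    "*+": "pound",
    "*-": "dash",
    "*/": "asterisk",
    "*;": "begin subscript",
    "**;": "end subscript",
    "*:": "begin superscript",
    "**:": "end superscript",
    "*?1": "macron on preceding character",
    "*?2": "acute accent on preceding character",
    "*?3": "grave accent on preceding character",
    "*?4": "tilde on preceding character",
    "*?5": "circumflex accent on preceding character",
    "*?6": "cedilla under preceding character",
}

# Try longer codes first so that "**;" is matched as one code
# rather than as a stray "*" followed by "*;".
_CODE_RE = re.compile(
    "|".join(re.escape(code) for code in sorted(LOB_CODES, key=len, reverse=True))
)


def find_codes(text):
    """Return a list of (offset, code, description) for each known code in text."""
    return [
        (match.start(), match.group(), LOB_CODES[match.group()])
        for match in _CODE_RE.finditer(text)
    ]
```

For example, on the invented string "H*;2**;O", find_codes reports a begin subscript code at offset 1 and an end subscript code at offset 4. Ordering the alternatives by length is the one design choice that matters here, since several codes share a prefix.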