THE TAGGED LOB CORPUS

Users' Manual

Stig Johansson

in collaboration with

Eric Atwell

Roger Garside

Geoffrey Leech

Norwegian Computing Centre for the Humanities

Bergen, 1986

Preface

The tagged LOB Corpus is the result of cooperation among researchers at the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen. The principal members of the research team have been:

Lancaster:
Geoffrey Leech and Roger Garside (project leaders)
Eric Atwell
Ian Marshall

Oslo:
Stig Johansson (project leader)
Mette-Cathrine Jahr

Bergen:
Knut Hofland

The project was supported by the Social Science Research Council and the Norwegian Research Council for Science and the Humanities.
The section on computational aspects (4) is a revised version of: Geoffrey Leech, Roger Garside, and Eric Atwell, 'The Automatic Grammatical Tagging of the LOB Corpus,' ICAME News 7 (1983), pp. 13-33. The rest of the manual is the work of Stig Johansson, who could also draw on the information in Eric Atwell's Manual Pre-Edit Handbook (November 1981) and Manual Post-Edit Handbook (June 1982).

Stig Johansson	Eric Atwell	Roger Garside	Geoffrey Leech
Oslo	Leeds	Lancaster	Lancaster

Contents

1 The LOB Corpus
2 Tagged versions

2.1 Description of tape and files - vertical version
2.2 Description of records - vertical version
2.3 Description of tape and files - horizontal version
2.4 Description of records - horizontal version
2.5 Reference code
2.6 Special information in the vertical version
2.7 Number of words
2.8 Sample text extract - vertical version
2.9 Sample text extract - horizontal version

3 The LOB tag set

3.1 An overview of the LOB tag set
3.2 Some differences between the LOB and Brown tag sets
3.3 Ditto tags

4 The LOB tagging suite

4.1 Pre-editing
4.2 Tag assignment
4.3 Tag selection
4.4 Idiom tagging
4.5 Post-editing

5 Differences between the original corpus and the tagged versions

5.1 Capitalisation
5.2 Punctuation marks and sentence/paragraph division
5.3 Contractions
5.4 Codes for abbreviations and 'non-English' words
5.5 Other differences

6 Principles in post-editing
7 Problem areas

7.1 Word division
7.2 Idioms
7.3 -ed forms
7.4 -ing forms
7.5 Auxiliaries
7.6 Nouns: number and case
7.7 Proper nouns
7.8 Adjectives
7.9 Adjective vs noun
7.10 Adverbs
7.11 Adverb vs adjective
7.12 Determiners/pronouns
7.13 Prepositions
7.14 Conjunctions
7.15 Conjunction vs preposition
7.16 WH-words
7.17 Numerals
7.18 Interjections
7.19 Abbreviations
7.20 Non-standard forms
7.21 Foreign words and expressions
7.22 Formulas and scientific symbols
7.23 Cited forms
7.24 Punctuation marks
7.25 Letters

8 KWIC concordance

8.1 Tapes and files
8.2 Records
8.3 Sorting
8.4 Example
8.5 Frequencies
8.6 Index to the KWIC concordance

9 Developments
Notes
References
Appendix 1: General flowchart of Tag Assignment Program
Appendix 2: Tagging decisions of APPLYHYPHEN
Appendix 3: Tagging decisions of APPLYWIC
Appendix 4: List of tags

Coding key

The following codes have been taken over from the original (untagged) LOB Corpus:

*@	degree symbol
*+	pound
*-	dash
*/	asterisk
*'	begin quote
**'	end quote
*[	begin comment tag
*]	end comment tag
*;	begin subscript
**;	end subscript
*:	begin superscript
**:	end superscript
*?1	macron on preceding character
*?2	acute accent on preceding character
*?3	grave accent on preceding character
*?4	tilde on preceding character
*?5	circumflex accent on preceding character
*?6	cedilla under preceding character
"	umlaut or diaeresis on preceding character
\0	abbreviation

For a full list of *? codes (=uncoded character), see Johansson et al (1978). The word-class tags are surveyed in Section 3 and Appendix 4. As regards other coding conventions, see Sections 2.6 (special information in the vertical version) and 5.2 (sentence and paragraph division).
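
The coding key above can be read as a simple lookup table. The following Python sketch is an illustration only and is not part of the LOB software: the names LOB_CODES and find_codes are hypothetical, the sample string is invented, and the quote and umlaut codes are left out for brevity. It assumes only that codes are matched longest first, so that an end code such as **; is not mistaken for the shorter begin code *;.

```python
import re

# Codes and descriptions copied from the coding key; the quote codes and
# the umlaut code are deliberately omitted from this sketch.
LOB_CODES = {
    "*@": "degree symbol",
    "*+": "pound",
    "*-": "dash",
    "*/": "asterisk",
    "*[": "begin comment tag",
    "*]": "end comment tag",
    "*;": "begin subscript",
    "**;": "end subscript",
    "*:": "begin superscript",
    "**:": "end superscript",
    "*?1": "macron on preceding character",
    "*?2": "acute accent on preceding character",
    "*?3": "grave accent on preceding character",
    "*?4": "tilde on preceding character",
    "*?5": "circumflex accent on preceding character",
    "*?6": "cedilla under preceding character",
    "\\0": "abbreviation",
}

# Try longer codes first, so that "**;" is matched whole rather than
# being skipped over and read as a later "*;".
_CODE_RE = re.compile(
    "|".join(re.escape(code) for code in sorted(LOB_CODES, key=len, reverse=True))
)


def find_codes(text):
    """Return (offset, code, description) for every known code in text."""
    return [
        (match.start(), match.group(), LOB_CODES[match.group()])
        for match in _CODE_RE.finditer(text)
    ]


# Invented sample string, for illustration only.
for hit in find_codes("H*;2**;O"):
    print(hit)
```

On the invented sample, the sketch reports a begin-subscript code at offset 1 and an end-subscript code at offset 4. A real reader for the corpus would also need the full list of *? codes given in Johansson et al (1978).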