MANUAL OF INFORMATION
TO ACCOMPANY
THE AUSTRALIAN CORPUS OF ENGLISH (ACE)
MACQUARIE UNIVERSITY
BY
PAM PETERS
WITH THE ASSISTANCE OF
ADAM SMITH
The Australian Corpus of English (ACE) was compiled in the Department of Linguistics at Macquarie University, NSW, Australia, from 1986 on. It was supported by a small grant in 1988-9 from the Australian Research Grants Council, and by a series of grants from Macquarie University. Other support came from the National Languages and Literacy Institute of Australia and the University of New South Wales. The project was conceived by Pam Peters, Peter Collins and David Blair, and was carried through with the help of a number of research assistants, notably Alison Moore, Elizabeth Green, Robert Jenkins, Catherine Martin, Diana Grace, Heather Middleton, Wendy Young and Adam Smith. Computational help and advice were provided by Harry Purvis and Steve Cassidy, and the project enjoyed continuous infrastructure support from Macquarie's Speech, Hearing and Language Research Centre.
Introduction
Rationale
Sampling procedures: overview
Subjects and genres included within categories
Sample size and Coding
Appendix 1: Corpus versions
Appendix 2: Published papers associated with the Corpus
Text extracts
ACE was the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. Interest in the differentiation between Australian, British and American English meant that a corpus modeled on the Brown and LOB corpora would provide ready comparisons. It would also serve as a strategic sample of current Australian English, and as a reference corpus for comparisons with more specialised, homogeneous corpora in Australia.
ACE matches the Brown and LOB corpora in most aspects of its structure and constituency, so that direct interdialectal comparisons can be made on a comparable range of printed genres. (The few small points of difference are outlined below, pp. 3 to 8.) Yet the desire to create an up-to-date corpus of Australian English prompted the decision not to match Brown and LOB chronologically, i.e. with data drawn from publications of the early 1960s. Instead, ACE consists of material from 1986. A time difference is therefore inherent in any regional intercomparisons with Brown and LOB, though that may itself be of considerable interest in showing the direction of influence in the latter part of this century. The twenty-five year difference in fact allowed us to match rather more categories of publishing than would have been possible had we attempted to create a retrospective corpus of Australian publications of the 60s (as LOB did). Independent southern hemisphere publishing has increased steadily since World War II, yet even in 1986 the range of locally published novels was limited and insufficient for the quota required by the Brown/LOB model. It was topped up with a higher proportion of extracts from short stories than was used in the model corpora. (See below, Table 4, p.5.)
Sampling procedures : overview
The prime objective in compiling ACE was to match the balance of genres represented in Brown and LOB, and to create a more or less equivalent set of 2000-word samples in each category. This provided quantitative targets in each of the fifteen categories of Brown and LOB, and the numbers of samples in the ACE categories A to J are closely matched with them, as shown below in Table 1. The fiction categories in ACE are slightly different in their constituency, for reasons explained below, p.8, but the total of fiction samples remains the same.
Table 1. Makeup of the three corpora
Category | ACE | Brown | LOB
A Press: reportage | 44 | 44 | 44
B Press: editorial | 27 | 27 | 27
C Press: reviews | 17 | 17 | 17
D Religion | 17 | 17 | 17
E Skills, trades, and hobbies | 38 | 36 | 38
F Popular lore | 44 | 48 | 44
G Belles lettres, biography, essays | 77 | 75 | 77
H Miscellaneous (government documents, foundation reports, industry reports, college catalogue, industry house organ) | 30 | 30 | 30
J Learned and scientific writings | 80 | 80 | 80
K General fiction | 29 | 29 | 29
L Mystery and detective fiction | 15 | 24 | 24
M Science fiction | 7 | 6 | 6
N Adventure and western fiction (bush) | 8 | 29 | 29
P Romance and love story | 15 | 29 | 29
R Humor | 15 | 9 | 9
S Historical fiction | 22 | - | -
W Women's fiction | 15 | - | -
Total | 500 | 500 | 500
Within each corpus category, the sampling procedures were mostly strategic rather than random, because of the felt need to match subgenres and subject areas where possible. In some categories, e.g. fiction, the corpus requirements were such that we sampled almost every Australian monograph published in that year, and so the representation in ACE is almost total. Where there was a choice, as with the selection of monographs in some nonfiction categories, we gave preference to those which were held in multiple libraries in several states, and therefore probably had more readers and more impact. Among the serials, both popular and scholarly, the selection was usually dictated by subject, to ensure a spread of interests and disciplines like the broad range captured by our predecessors.
Table 2. Sampling of Australian newspapers for categories A,B,C
( * indicates tabloid format, but not necessarily low-brow journalism.)
Newspaper | Circulation 1986 | A | B | C
National | | | |
The Australian | 134,000 | 1 | 1 | 1
Australian Financial Review* | 66,000 | 1 | 1 | -
National Times | 86,000 | 1 | 1 | -
Weekly Times* | 46,000 | 1 | 1 | -
New South Wales | | | |
Daily Mirror* | 296,000 | 3 | 1 | 1
Daily Telegraph | 265,000 | 2 | 1 | 1
The Sun* | 258,000 | 2 | 1 | 1
Sydney Morning Herald | 255,000 | 2 | 1 | 1
Sun-Herald* | 650,000 | 3 | 1 | 1
A.C.T. | | | |
Canberra Times | 45,000 | 1 | 1 | -
Victoria | | | |
The Age | 233,000 | 2 | 1 | 1
The Herald | 237,000 | 2 | 1 | 1
Sun News-Pictorial* | 549,000 | 5 | 1 | 1
Sunday Press* | 140,000 | 1 | 1 | -
Queensland | | | |
Courier-Mail | 217,000 | 2 | 1 | 1
Daily Sun* | 133,000 | 1 | 1 | 1
Telegraph* | 119,000 | 1 | 1 | 1
Sunday Sun* | 375,000 | 2 | 1 | 1
South Australia | | | |
Adelaide Advertiser | 211,000 | 2 | 1 | 1
The News* | 159,000 | 1 | 1 | 1
Sunday Mail* | 254,000 | 1 | 1 | -
West Australia | | | |
The West Australian* | 238,000 | 2 | 1 | 1
Daily News* | 98,000 | 1 | 1 | -
Sunday Times* | 251,000 | 1 | 1 | 1
Tasmania | | | |
The Mercury | 55,000 | 1 | 1 | -
Sunday Tasmanian | 40,000 | 1 | 1 | -
Northern Territory | | | |
Northern Territory News | 18,000 | 1 | 1 | -
Total Number of Samples | | 44 | 27 | 17
We also targeted both Sunday and weekly papers, but the predominance of Sunday papers in Australia means that ACE is closer to LOB in this respect, as shown in Table 3.
Table 3 Sampling of reportage, editorial matter and reviews from daily, weekly and Sunday newspapers in the three corpora.
Category | ACE | Brown | LOB
A Press: Reportage | | |
Daily | 33 | 33 | 33
Weekly | 2 | 11 | 4
Sunday | 9 | - | 7
Total | 44 | 44 | 44
B Press: Editorial | | |
Daily | 19 | 19 | 19
Weekly | 2 | 8 | 3
Sunday | 6 | - | 5
Total | 27 | 27 | 27
C Press: Reviews | | |
Daily | 14 | 14 | 8
Weekly | - | 3 | 4
Sunday | 3 | - | 5
Total | 17 | 17 | 17
The overall balance of samples from books/monographs to articles/short stories is shown in Table 4. Further details on sampling are discussed with the individual categories below.
Table 4 Monographs v. articles/short stories
Category | ACE | Brown | LOB
D: Religion | | |
Books | 7 | 7 | 9
Periodicals | 7 | 6 | 7
Tracts | 3 | 4 | 1
Total | 17 | 17 | 17
E: Skills, Trades and Hobbies | | |
Books | - | 2 | 5
Periodicals | 38 | 34 | 33
Total | 38 | 36 | 38
F: Popular Lore | | |
Books | 18 | 23 | 16
Periodicals | 26 | 25 | 28
Total | 44 | 48 | 44
G: Belles Lettres etc. | | |
Books | 38 | 38 | 41
Periodicals | 39 | 37 | 36
Total | 77 | 75 | 77
H: Miscellaneous | | |
Gov. Documents | 25 | 24 | 24
Foundation Reports | - | 2 | 2
Industry Reports | 2 | 2 | 2
Univ. catalogue | 1 | 1 | 1
Ind. House Organ | 2 | 1 | 1
Total | 30 | 30 | 30
J: Learned | | |
monographs | 47 | 41 | 35
articles | 33 | 39 | 45
Total | 80 | 80 | 80
K: General Fiction | | |
novels | 9 | 20 | 20
short stories | 20 | 9 | 9
Total | 29 | 29 | 29
L: Mystery/Detective | | |
novels | 10 | 20 | 21
short stories | 5 | 4 | 3
Total | 15 | 24 | 24
M: Science Fiction | | |
monographs | 2 | 3 | 3
short stories | 5 | 3 | 3
Total | 7 | 6 | 6
N: Adventure/Western (Bush) | | |
monographs | 4 | 15 | 15
short stories | 4 | 14 | 14
Total | 8 | 29 | 29
P: Romance/Love | | |
monographs | 6 | 14 | 16
short stories | 9 | 15 | 13
Total | 15 | 29 | 29
R: Humor | | |
monographs | 10 | 3 | 3
short stories | 5 | 6 | 6
Total | 15 | 9 | 9
S: Historical Fiction | | |
monographs | 15 | - | -
short stories | 7 | - | -
Total | 22 | - | -
W: Women's Fiction | | |
monographs | 8 | - | -
short stories | 7 | - | -
Total | 15 | - | -
Subjects and genres included within categories
Table 5: types of reporting represented in the three corpora
Category | ACE | Brown | LOB
A Press: Reportage | | |
Political | 14 | 14 | 13
Sports | 7 | 7 | 7
Society | - | 3 | 3
Spot News | 7 | 9 | 10
Financial | 7 | 4 | 4
Cultural | - | 7 | 7
Living | 9 | - | -
Table 6 Subjects represented in Categories E and F
Subject | ACE | LOB
E Skills, trades and hobbies | |
Homecraft, handyman | 7 | 5
Hobbies | 6 | 5
Music, dance | 3 | 3
Pets | 1 | 1
Sport | 4 | 4
Food, wine | 2 | 2
Travel | 2 | 2
Miscellaneous | 1 | 4
Trade, professional journals | 9 | 9
Farming | 3 | 3
F Popular lore | |
Popular politics, psychology, sociology | 15 | 22
Popular education | 3 | -
Personal development | 4 | -
Popular history | 8 | 8
Popular health, medicine | 3 | 3
Culture | 4 | 4
Miscellaneous | 7 | 7
Table 7 Genres included in Category G
Genre | ACE | LOB
G Belles lettres, biography, essays | |
Biography, memoirs | 35 | 35
Literary essays and criticism | 6 | 6
Arts | 9 | 9
General essays | 27 | 27
Table 8. Academic disciplines of Category J
Discipline | ACE | Brown | LOB
J Learned | | |
Natural Sciences | 12 | 12 | 12
Medicine | 5 | 5 | 5
Mathematics | 4 | 4 | 4
Soc. Sciences | 14 | 14 | 14
Pol. Science, Law, Education | 15 | 15 | 15
Humanities | 18 | 18 | 18
Technology and Engineering | 12 | 12 | 12
Sample size and Coding
Each sample is notionally 2000 words, the counts being done via WORD for WINDOWS 6, with the coding excluded. The samples contain a minimum of 2000 words, though most are a little more than that in order to conclude the sentence. A few, especially from the disciplines of mathematics and science, have a larger buffer because of the high proportion of formulae in them, which tended to fragment the discourse.
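The sample-length rule just described (at least 2000 words, extended to the end of the current sentence) can be sketched in code. This is an illustration only, not the procedure the compilers used: the function name `cut_sample`, the whitespace tokenisation, and the treatment of `.`, `!` and `?` as sentence ends are all assumptions.

```python
def cut_sample(text, min_words=2000):
    """Return the shortest prefix of `text` that has at least `min_words`
    whitespace-separated words and ends at a sentence boundary.

    Hypothetical helper: word and sentence boundaries are approximated.
    """
    words = text.split()
    if len(words) <= min_words:
        # Text is no longer than the minimum: keep all of it.
        return " ".join(words)
    end = min_words
    # Extend past the minimum until the last word kept closes a sentence
    # (closing quotes or brackets after the punctuation are ignored).
    while end < len(words) and not words[end - 1].rstrip("\"')").endswith((".", "!", "?")):
        end += 1
    return " ".join(words[:end])
```

A real extraction would also need to handle abbreviations and formulae, which this sketch would wrongly treat as sentence ends or ordinary words.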
Texts are in ASCII format, with each category and each sample prefaced by coded identification. The samples carry details of the sources from which they were obtained and individual headings or titles. Within the texts there is a limited amount of markup, for certain discrete elements such as bylines or formulae, and for certain nonalphabetic symbols, both of them in SGML-style codes. Mathematical and scientific symbols beyond those of the Greek alphabet were covered by a generic annotation (&symbol;). The markup <note></note> was used for a variety of extra-corpus material, both editorial comment and components of the text itself which stood outside the ongoing discourse, such as extended quotations, graphs or tables.
Format/Comment Coding
<section></section> at the start and end of each category
<title></title> around the title of each category
<sample></sample> around each sample
<subsample></subsample> around any subsample
<id></id> around the sample number
<source></source> around the name of the source from which the sample was taken
<h></h> around the heading or title of each sample/subsample
<bl></bl> around any bylines
<list></list> around extended lists
<note></note> to enclose any additional comments or text not to be included within the wordcount
<misc></misc> around unpunctuated or irregularly punctuated sections
* replacement of typographic or spelling errors in original e.g. assessment*assesement
+ replaces hyphen at line-break e.g. proces+sors
&formula; replacing any complex formula
&symbol; replacing any symbol not listed below individually
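For users who want to reproduce a word count with the coding excluded, the scheme above can be approximated by a short script. The following is a minimal sketch under stated assumptions, not the counting method used for the corpus (which relied on a word processor): the function name `count_words` is hypothetical, `<note>` elements are dropped with their content as the list indicates, and the choices to keep the text inside other tags and to discard entity codes such as `&formula;` are assumptions.

```python
import re

# Material inside <note>...</note> is excluded from the word count.
NOTE_RE = re.compile(r"<note>.*?</note>", re.DOTALL)
# Any remaining SGML-style tag such as <h>, </h>, <bl>, </sample>.
TAG_RE = re.compile(r"</?[a-z]+>")
# Entity-style codes such as &formula; or &symbol; (assumed not to count as words).
ENTITY_RE = re.compile(r"&[a-z]+;")

def count_words(sample):
    """Count whitespace-separated words after removing notes, tags and codes."""
    text = NOTE_RE.sub(" ", sample)   # drop notes together with their content
    text = TAG_RE.sub(" ", text)      # drop other tags but keep their content
    text = ENTITY_RE.sub(" ", text)   # drop entity codes
    return len(text.split())
```

Forms such as `assessment*assesement` and `proces+sors` contain no whitespace, so each is counted as a single word by this sketch.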
Symbol Coding
& & ε e
£ £ θ q
• η h
° ° ζ z
® ® λ l
¶ ¶ &caplambda; L
Ω W μ m
α a ρ r
β b σ s
γ g υ u
&capgamma; G ψ y
δ d &capomega; W
&capdelta; D
Corpus versions
ACE exists in two versions:
ACE I This is the full version, containing all 500 samples, available for interrogation via CD ROM or Internet connection
ACE II This reduced version includes 75% of ACE I, that is 375 samples available for unrestricted use. (The remaining 25% could not be copyright-cleared for use throughout the world.) The samples excluded are listed below:
E01 | E03 | E04 | E05 | E06 | E08 | E09 | E10 | E11
E14 | E20 | F06 | F09 | F10 | F13 | F14 | F17 | F21
F22 | F23 | F25 | F26 | F28 | F29 | F44 | G01 | G02
G03 | G04 | G05 | G07 | G08 | G09 | G10 | G14 | G17
G19 | G21 | G26 | G30 | G41 | G44 | G47 | G65 | G66
G69 | J01 | J02 | J07 | J10 | J12 | J13 | J15 | J22
J25 | J26 | J29 | J30 | J39 | J40 | J44 | J49 | J52
J54 | J55 | J56 | J60 | J62 | J63 | J65 | J72 | J78
K02 | K03 | K04 | K07 | K09 | K11 | K12 | K13 | K14
K15 | K16 | K18 | K19 | K21 | K22 | K24 | K26 | L02
L03 | L04 | L06 | L09 | M05 | N05 | N08 | P02 | P04
P05 | P12 | R04 | R06 | R07 | R12 | R15 | S01 | S02
S05 | S06 | S08 | S10 | S12 | S13 | S14 | S15 | S19
S21 | W01 | W03 | W04 | W06 | W09 | W10 | W11
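The relationship between the two versions can be checked mechanically. The sketch below derives the ACE II sample list from the ACE category sizes in Table 1 and the exclusions listed above. It assumes that sample identifiers consist of the category letter plus a two-digit running number starting at 01, as the listed identifiers suggest; the names `ACE_COUNTS`, `EXCLUDED` and `ace_ids` are hypothetical.

```python
# Number of samples per category in ACE, taken from Table 1.
ACE_COUNTS = {"A": 44, "B": 27, "C": 17, "D": 17, "E": 38, "F": 44,
              "G": 77, "H": 30, "J": 80, "K": 29, "L": 15, "M": 7,
              "N": 8, "P": 15, "R": 15, "S": 22, "W": 15}

# Samples excluded from ACE II, copied from the list above.
EXCLUDED = set("""
E01 E03 E04 E05 E06 E08 E09 E10 E11
E14 E20 F06 F09 F10 F13 F14 F17 F21
F22 F23 F25 F26 F28 F29 F44 G01 G02
G03 G04 G05 G07 G08 G09 G10 G14 G17
G19 G21 G26 G30 G41 G44 G47 G65 G66
G69 J01 J02 J07 J10 J12 J13 J15 J22
J25 J26 J29 J30 J39 J40 J44 J49 J52
J54 J55 J56 J60 J62 J63 J65 J72 J78
K02 K03 K04 K07 K09 K11 K12 K13 K14
K15 K16 K18 K19 K21 K22 K24 K26 L02
L03 L04 L06 L09 M05 N05 N08 P02 P04
P05 P12 R04 R06 R07 R12 R15 S01 S02
S05 S06 S08 S10 S12 S13 S14 S15 S19
S21 W01 W03 W04 W06 W09 W10 W11
""".split())

def ace_ids(counts):
    """Build identifiers under the assumed scheme: letter + two-digit number."""
    return [f"{cat}{n:02d}" for cat, total in counts.items()
            for n in range(1, total + 1)]

ace_i = ace_ids(ACE_COUNTS)                        # full version
ace_ii = [s for s in ace_i if s not in EXCLUDED]   # reduced version
```

With the counts in Table 1 and the 125 exclusions listed above, `ace_i` has 500 entries and `ace_ii` has 375, matching the figures given for the two versions.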
Published Papers Associated With The ACE Corpus
1. Peters, P. Towards a corpus of Australian English. ICAME JOURNAL No.11 (1987), 27-38. (ICAME = International Computer Archive of Modern English).
2. Collins, P. and Peters, P. The Australian corpus project in Corpus linguistics, hard and soft, ed. M Kyto et al. Amsterdam: Rodopi (1988), 103-120.
3. Collins, P. Computer corpora in English language research: a critical survey. AUSTRALIAN REVIEW OF APPLIED LINGUISTICS 10 i (1987), 1-19.
4. Peters, P., Collins, P., Blair, D. and Brierley, A. The Australian corpus project, findings on some functional variants in the Australian press. AUSTRALIAN REVIEW OF APPLIED LINGUISTICS 11 i (1988), 22-33.
5. Collins, P. The semantics of some modals in contemporary Australian English. AUSTRALIAN JOURNAL OF LINGUISTICS 8 (1988), 233-258.
6. Peters, P. and Fee, M. New configurations: the balance of British and American English features in Australian and Canadian English. AUSTRALIAN JOURNAL OF LINGUISTICS 9 (1989), 135-147.
7. Peters, P. The Australian corpus project: word punctuation in newspapers, in Frontiers of style: proceedings of Style Councils 87 and 88, ed. P.H. Peters. Sydney: Dictionary Research Centre, Macquarie University (1990) 72-79.
8. Peters, P., Purvis, H., Martin, C. and Jenkins, R. Word frequencies from the Macquarie corpus: the newspaper files. WORKING PAPERS OF THE SPEECH, HEARING AND LANGUAGE RESEARCH CENTRE, MACQUARIE UNIVERSITY (1990) 13-92.
9. Green, E. and Peters, P. The Australian corpus project and Australian English. ICAME JOURNAL no.15 (1991) 37-53.
10. Collins, P. The modals of obligation & necessity in Australian English, in English Corpus Linguistics, edd. Aijmer and Altenberg. London: Longman (1991) 145-165.
11. Peters, P. American & British English in Australian Usage, in Style on the move: proceedings of Style Council 92, ed. P.H. Peters. Sydney: Dictionary Research Centre, Macquarie University (1993) 20-27.
12. Peters, P. Corpus evidence on some points of usage, in J. Aarts et al. edd. English language corpora: design, analysis and exploitation Amsterdam: Rodopi (1993) pp. 247-256
13. Peters, P. American and British influence in Australian verb morphology, in U. Fries et al. edd. Creating and Using English Language Corpora Amsterdam: Rodopi (1994) pp. 149-158
14. Collins P. Get- passives in English World Englishes 15:1 (March 1996) pp. 43-56
15. Peters, P. Comparative insights into comparison World Englishes 15:1 (March 1996) pp.57-68
16. Peters, P. and Delbridge, A. Fowler's Legacy, in E. Schneider ed. Englishes Around The World vol. 2 Amsterdam: John Benjamins (1997) pp. 301-318