BY
S. V. SHASTRI
IN COLLABORATION WITH
C. T. PATILKULKARNI
GEETA S. SHASTRI
DEPARTMENT OF ENGLISH
SHIVAJI UNIVERSITY, KOLHAPUR
416004 INDIA
1986
TO
PROFESSOR S. K. VERMA
PREFACE
The present corpus of Indian English was conceived in
Lancaster in 1978 when the main author of this work was
researching under the supervision of Professor G.N. Leech. On his
return to India he started the project with an initial grant from
the Shivaji University in 1980, and carried it forward with
substantial financial assistance from the U.G.C. supplemented by
support from various other sources including personal funds.
We gratefully acknowledge the financial assistance of Shivaji
University and the University Grants Commission. We would like to
acknowledge the support given by Chhatrapati Shahu
Institutes Vasantdada Professional Computer School in the
form of access to their computer system at nominal charges,
thanks to the kindness of their EDP manager, S.M. Kori. We
would also like to thank the large number of people who have
worked on the project at various stages. Among them are:
V.V. Badve and P.R. Kher, who were associated with the
project in the initial stages of collection of samples;
J.A. Shinde and D.N. Kulkarni, who assisted in proofreading
some of the texts;
Geeta S. Shastri, for her secretarial assistance and,
together with C.T. Patilkulkarni, for sharing the bulk of
the pre-editing and proofreading of texts; and
Sumati Salunkhe, S.T. Shingate, B.N. Patil,
Raju Chougule, A.Y. Shinde and Ganesh Surve,
who helped with secretarial and proofreading assistance at
various stages.
Special thanks are due to Professor S.K. Desai, Head,
Department of English, Shivaji University, for his encouragement
and guidance also in his capacity as member of the Advisory
Committee. We thank the other members of the Committee, Professor Birje-Patil
of M.S. University of Baroda, Professor C.J. Daswani of
the University of Poona and Professor K. Subramanian of
Central Institute of English and Foreign Languages for their
unfailing support.
We are also grateful to Professor G.N. Leech,
University of Lancaster, Professor W. Nelson Francis and
Professor Henry Kucera, both of Brown University and
Professor Stig Johansson, University of Oslo for their
guidance from time to time throughout the duration of the
project, and to Prof. Ramesh Mohan, the former Director of
the CIEFL, and Professor R.N. Ghosh, also of the CIEFL, for
their continued support and help.
And in conclusion, we would like to thank all the copyright
holders, who allowed their texts to be included, free of charge,
in the corpus.
Introduction
A systematic and comprehensive description of Indian English is now overdue. Of the major national varieties of English, only American and British English have so far been described in some detail, though several other varieties have already been identified among the native speaker varieties. Side by side, some non-native varieties of English have also been tacitly recognized, among which Indian English is a major one.
Studies of Indian English so far have been confined mainly to aspects of spoken English, such as that by Bansal of CIEFL (Bansal 1969), and to methodological considerations, such as those by Kachru of Illinois University (Kachru 1961, 1979), although Kachru himself has written on many isolated areas of Indian English (Kachru 1965, 1975, 1981)1. Descriptions of aspects of Indian English, wherever they have appeared, have been based on selected or available samples, such as Desai (1974) and Kachru (1965, 1975), the largest one so far being Nihalani et al (1979). There is no gainsaying that a comprehensive description will have to be based on a standard corpus.
The present corpus of Indian Written English2 is comparable to the Brown and the LOB corpora. It is intended to serve as source material for comparative studies of American, British and Indian English, which in turn are expected to lead to a comprehensive description of Indian English.
Although the Indian Corpus is planned to be comparable to the Brown and the LOB corpora, there are some important differences dictated mainly by logistic and practical considerations. Firstly, as far as synchronicity is concerned there is a major departure: while the Brown and LOB corpora draw their samples from materials published in the calendar year 1961, the Indian corpus is drawn from materials published in the year 1978. This decision was made after consultation with the authors of the earlier corpora to make sure that comparability would not suffer much as a result.3 On the other hand, it is felt that the value of the Indian corpus is immensely enhanced, in general and in particular as a source for the description of Indian English, as the Indianness of Indian English is a post-Independence phenomenon and may have reached a discernible stage in the thirty years after Independence. It is argued in theory that in the same thirty years American and British English may not have undergone such changes. The number of texts, the weightage given to different genres of material and the sampling procedures are kept very close to those of the other two corpora. However, in one part, that of Imaginative Prose, there are differences in respect of the kinds of fiction and the proportion of texts representing books and those representing periodicals. It is not surprising to find that the amount and kind of imaginative writing in a second language situation such as that in India is very different from that in a first language situation such as the American or the British. In spite of including samples from all the available full-length novels, the proportion could not come anywhere near those of the LOB or the Brown Corpus. Some of us felt that the weightage should be reduced in order to reflect, at least in part, the real situation. But it was argued that this might adversely affect comparability and the value of the corpus for its projected purpose.
Sources and Sampling Techniques
The Indian Corpus is intended to be a representative corpus of sample texts printed and published in 1978. The texts were largely selected by a stratified random sampling process. The composition of texts in the Indian corpus as compared to those in the other two corpora is given in Table No. 1.
Table 1: The Basic Composition of American, British and Indian Corpora.
Text Categories | | No. of texts in each category | | |
 | | American | British | Indian |
A | Press: reportage | 44 | 44 | 44 |
B | Press: editorial | 27 | 27 | 27 |
C | Press: reviews | 17 | 17 | 17 |
D | Religion | 17 | 17 | 17 |
E | Skills, Trades and Hobbies | 36 | 38 | 38 |
F | Popular lore | 48 | 44 | 44 |
G | Belles Lettres | 75 | 77 | 70 |
H | Miscellaneous (Govt. documents, foundation reports, industry reports, college catalogue, industry house organ) | 30 | 30 | 37 |
J | Learned and scientific writings | 80 | 80 | 80 |
K | General fiction | 29 | 29 | 58 |
L | Mystery and detective fiction | 24 | 24 | 24 |
M | Science fiction | 6 | 6 | 2 |
N | Adventure (Western fiction) | 29 | 29 | 15 |
P | Romance and love story | 29 | 29 | 18 |
R | Humour | 9 | 9 | 9 |
Total | | 500 | 500 | 500 |
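As a quick arithmetic check on Table 1, the per-category counts can be summed for each corpus and the Indian figures compared with the British ones. The sketch below only transcribes the figures from the table; the names `TABLE_1`, `corpus_totals` and `indian_departures` are illustrative and are not part of the corpus documentation.

```python
# Counts transcribed from Table 1: category -> (American, British, Indian).
TABLE_1 = {
    "A": (44, 44, 44), "B": (27, 27, 27), "C": (17, 17, 17),
    "D": (17, 17, 17), "E": (36, 38, 38), "F": (48, 44, 44),
    "G": (75, 77, 70), "H": (30, 30, 37), "J": (80, 80, 80),
    "K": (29, 29, 58), "L": (24, 24, 24), "M": (6, 6, 2),
    "N": (29, 29, 15), "P": (29, 29, 18), "R": (9, 9, 9),
}


def corpus_totals(table):
    """Return the column sums as (American, British, Indian)."""
    return tuple(sum(row[i] for row in table.values()) for i in range(3))


def indian_departures(table):
    """Map each category to (Indian - British) where the two counts differ."""
    return {cat: ind - brit
            for cat, (_, brit, ind) in table.items()
            if ind != brit}


print(corpus_totals(TABLE_1))
print(indian_departures(TABLE_1))
```

The departures sum to zero, which is simply another way of saying that texts were moved between categories (G to H, and L-P fiction to K) while the overall size of 500 texts was kept.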
The sampling procedure followed is described below:
Books: While the compilers of the Brown and the LOB
corpora had at their disposal ready bibliographies from which
they could sample, we were handicapped in this regard. The Indian
National Bibliography (INB, monthly) lists all the publications
received in the National Library under the provisions of the
Delivery of Books (Public Libraries) Act of 1954 as amended by
Act No. 99 of 1956. But these issues take a long time to
appear, and in fact those for 1978 had not appeared even by the
end of 1979. So it was decided to compile a bibliography of our
own of books printed in English in 1978 which had already been
received in the National Library, Calcutta, up to December 1979, when
the work actually began.4 A
list of all such books was compiled from the Inward Register of the
National Library and this was used as the source bibliography for
the purpose of sampling. Again, we could not have recourse to
stratification by reference to the Dewey Decimal Classification, as
many of the books had not been processed by the library. So we
initially stratified the publications by manual inspection of
titles. It must be mentioned that such a procedure was
possible because of the limited number of publications in a
second language situation as compared to that in a native language situation such as the American or the British.
It was found by inspection of earlier entries that about 75%
of the publications of a particular year were received by the end
of the following year and the remainder kept trickling in for 3
or more years afterwards. In order to make up for whatever deficiency
might have been caused by our listing at the end of 1979, we
repeated the exercise at the end of 1980 and 1984 and
found that only about 10% more publications had arrived. It may
be mentioned in passing that our 140 texts from books covering
all the categories were sampled from over 1200 titles, which
amounts to about 8%. The sampling was done from the lists
separately with the help of a random number table. Whenever
a selected book was not accessible, the next
available on the shelf was selected. Whenever a sufficient number
of texts did not turn up in this process, other texts were
deliberately chosen to fill the category.
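The procedure for books (random draws within each manually formed stratum, with a fall-back to the next available title when the drawn one cannot be obtained) can be sketched roughly as follows. This is only an illustration under stated assumptions: a seeded pseudo-random generator stands in for the random number table, "next available on the shelf" is modelled as the next accessible entry further down the list, and the function name `sample_stratum` and the example titles are hypothetical.

```python
import random


def sample_stratum(titles, accessible, n_wanted, rng):
    """Draw up to n_wanted titles from one stratum (a list in register order).

    A position is drawn at random; if the title there is not accessible,
    the next accessible title further down the list is taken instead.
    Fewer than n_wanted titles are returned if the stratum runs out, in
    which case the remainder would have to be chosen deliberately.
    """
    chosen = []
    positions = list(range(len(titles)))
    rng.shuffle(positions)  # stands in for the random number table
    for pos in positions:
        if len(chosen) == n_wanted:
            break
        # Walk forward from the drawn position to the next usable title.
        for candidate in titles[pos:]:
            if candidate in accessible and candidate not in chosen:
                chosen.append(candidate)
                break
    return chosen


# Hypothetical stratum of 40 titles, two of which cannot be obtained.
rng = random.Random(1978)
titles = [f"title-{i:03d}" for i in range(1, 41)]
accessible = set(titles) - {"title-005", "title-017"}
picked = sample_stratum(titles, accessible, 5, rng)
print(picked)
```

The sketch makes visible why the result is not a random sample in the strict sense: both the forward walk and the deliberate filling of short categories depart from equal-probability selection.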
Government Documents: As in the case of books, no
catalogue of Government Publications in 1978 was available up to
the end of 1980 and therefore the same procedure of listing from
the Inward Register5 and
random sampling was followed. In this case the job was simplified
because most of the publications of the Govt. of India are in
English. Random sampling and filling out of the texts required
for various sub-categories was carried out on the lines followed in
the case of books.
It will be noticed that we have 37 texts in category H as
compared to 30 in the American and British corpora. The decision to alter
the number was influenced by the fact that there are in India two
types of Govt. Documents, the Union and the State ones. We have included
26 texts from the Union Government Documents and 6 texts from State Government
Documents. It may be mentioned also that the bulk of book
publication of proportional or even greater weightage could not
be seriously considered in view of our commitment to the original
design.
Press Materials: The sampling of newspapers and of particular issues of newspapers was carried out first. The 53 English newspapers received in the Central Library6, Bombay, were considered to be the universe, out of which all the 6 national papers were retained and 15 regional papers were selected to represent the different regions of India. From each of these newspapers, 16 daily issues and 4 Sunday editions were sampled with the help of a random number table. Several of the required issues were not in the files of the library, so, as usual, the next number was taken. From these the actual texts required were identified and the categories filled out.
Periodicals: Initially an attempt was made to compile
sample lists of periodicals from the Source Book Press in India
1977 and follow the same procedure for sampling texts as we had
done in the case of Books and Press Materials. However, it was
discovered that such a procedure tended not only to exclude most
or all of the most popular and well circulated
periodicals but threw up those that were either unheard
of or unavailable or had ceased to be
published. In the circumstances we took recourse to the other
procedure i.e. of treating the holdings of Central Library in
Bombay and the National Library in Calcutta as the universe from
which to draw the samples. This proved to be wise as far as
popular periodicals were concerned; but in the case of
learned journals we had to follow a different procedure.
On an impressionistic basis, it was decided to pick the
richest known libraries: the holdings of the Tata Institute of
Fundamental Research and the Indian Institute of Technology, Bombay,
were used for sampling materials pertaining to the sciences and
technology, and the holdings of the Tata Institute of Social Sciences
for materials pertaining to the social sciences, in addition to the
holdings of university libraries: Bombay, Poona, Shivaji and
Baroda. The actual procedure followed was to compile
category-wise select bibliographies and then sample the texts
required to fill out the categories.
From the foregoing description of selection procedures followed
in building the corpus it is clear that the corpus cannot be
claimed to be a "stratified random sample" of Indian
English in the strict statistical sense. We, like the builders of
the LOB corpus, were guided by our aims of ensuring maximum
comparability with the other two corpora and creating a truly
representative sample of edited and published Indian English
writings.
Category A (Press: reportage)
A01-06 | National | daily | Political |
A07-08 | " | " | Sports |
A09-11 | " | " | Society |
A12-15 | " | " | Spot news |
A16-17 | " | " | Financial |
A18-19 | National | weekly | Political |
A20-21 | " | Sunday | Sports |
A22 | " | " | Spot news |
A23 | " | " | Financial |
A24-26 | " | " | Social/cultural |
A27-31 | Regional | daily | Political |
A32-33 | " | " | Sports |
A34-37 | " | " | Spot news |
A38 | " | " | Financial |
A39-40 | " | " | Cultural |
A41 | " | weekly | Sports |
A42 | " | " | Society |
A43 | " | " | Spot news |
A44 | " | " | Cultural |
Category B (Press: editorial)
B01-06 | National | daily | Institutional editorial |
B07-08 | " | " | Personal editorial |
B09-11 | " | " | Letters to the editor |
B12-14 | " | Sunday | Institutional editorial |
B15 | " | " | Personal editorial |
B16 | " | " | Letters to the editor |
B17-22 | Regional | daily | Institutional editorial |
B23-24 | " | " | Letters to the editor |
B25-26 | " | Sunday | Institutional editorial |
B27 | " | " | Letters to the editor |
Category C (Press: reviews)
C01-06 | National | daily | Book, music, cinema, painting, folk art etc. |
C07-13 | " | Sunday | |
C14 | " | weekly | |
C15-16 | Regional | daily | |
C17 | " | Sunday |
Category D (Religion)
D01-08 | Books |
D09-17 | Periodicals and Journals |
Category E (Skills, trades and hobbies)
E01-05 | Homecraft, handyman |
E06-10 | Hobbies |
E11-13 | Music, dance |
E14 | Pets |
E15-18 | Sports |
E19-20 | Food |
E21-22 | Travel |
E23-26 | Miscellaneous |
E27-35 | Trade, professional journals |
E36-38 | Agriculture, farming |
Category F (Popular lore)
F01-22 | Popular | Politics, psychology, sociology |
F23-30 | Popular | History |
F31-33 | Popular | Health, medicine |
F34-37 | "Culture" | |
F38-44 | Miscellaneous |
Category G (Belles lettres, biography, essays)
G01-35 | Biography, memoirs |
G36-41 | Literary essays and criticism |
G42-50 | Arts |
G51-70 | General essays |
Category H (Miscellaneous)
H01-26 | Central Government documents | |
A H01-12 | Reports, department publications | |
B H13-14 | Acts | |
C H15-20 | Proceedings, debates | |
D H21-26 | Other Government documents | |
H27-32 | State Government documents | |
H33-37 | Industry reports, house organ, University catalogue. |
Category J (Learned and scientific writings)
J01-12 | Natural and physical sciences | |
J13-17 | Medicine | |
J18-21 | Mathematics | |
J22-35 | Social, behavioural sciences | |
A J22-25 | Psychology | |
B J26-30 | Sociology | |
C J31 | Demography | |
D J32-35 | Linguistics | |
J36-50 | Political Science, law, education, commerce | |
A J36-39 | Education | |
B J40-47 | Politics, economics, commerce | |
C J48-50 | Law | |
J51-68 | Humanities | |
A J51-55 | Philosophy | |
B J56-59 | History | |
C J60-63 | Literary criticism | |
D J64-66 | Art | |
E J67-68 | Music | |
J69-80 | Engineering and technology |
Category K (General fiction)
K01-12 | Novels |
K13-58 | Short stories |
Category L (Mystery and detective fiction)
L01-03 | Novels |
L04-06 | Short stories |
L07-11 | Novels |
L12-23 | Short stories |
L24 | Novel |
Category M (Science fiction)
M01-02 | Short stories |
Category N (Adventure and Western fiction)
N01 | Short story |
N02-04 | Novels |
N05-13 | Short stories |
N14 | Novel |
N15 | Short story |
Category P (Romance and love story)
P01-02 | Short stories |
P03 | Novel |
P04-07 | Short stories |
P08 | Novel |
P09-15 | Short stories |
P16 | Novel |
P17-18 | Short stories |
Category R (Humour)
R01-05 | Short stories |
R06 | Book |
R07-09 | Articles from periodicals |
The American (Brown), British (LOB) and Indian Corpora compared
Categories A-C: In terms of weighting between national and regional newspapers the Indian corpus conforms more closely to the British. This was deliberate, the more so because, as in the British situation, Indian newspapers show a clear-cut distinction between the national and the regional on the basis of both distribution and circulation figures. The proportion of texts drawn from the national and the regional papers is 62% to 38% in the Indian corpus as compared to 60% to 40% in the British (see table 2). As to the sub-categories of texts and their distribution over dailies, Sundays and weeklies, the correspondence between the two is even closer (see table 3), except that the two sub-categories society and cultural had to be collapsed into one, as no such hard and fast distinction could be observed in the reportage of the Indian Press. The other, very marginal, difference is that there are fewer personal editorials in the Indian corpus.
Category D: Religious prose is not divided into sub-categories in either the Brown or the LOB corpus; but in sampling texts the builders of the LOB corpus, after inspecting the Brown texts, decided to include stylistically heterogeneous texts ranging from learned to popular committed writing. The same procedure was followed in selecting texts for the Indian corpus, except that the sub-division tracts is unrepresented. The distribution of texts between books and periodicals is nearly maintained (see table 4).
Categories E-J: Sub-categories of texts in the Indian corpus have been matched almost perfectly with the LOB corpus. However, it was not always possible to match the individual texts in terms of the type of source, book or periodical, for materials in the three categories E, F and G. While books are over-represented in the case of category E, they are somewhat under-represented in the case of categories F and G (see table 4). As already stated, we deliberately altered the weighting between categories G and H: G has been reduced by seven texts and H increased by the same number. This was done to reflect the Indian situation, in which Government documents divide themselves into Central and State Government documents and their bulk far exceeds that of any other printed material in English except Press materials. This is reflected in the greater representation of government documents in the Indian corpus. Foundation reports are unrepresented (see table 4). J texts have been matched very closely in all three corpora; this is the only category which has a one-to-one correspondence of weighting to sub-categories (see table 4).
Categories K-R: This section of the corpus, Imaginative Prose, is the most mismatched. As already stated, a sufficient number of texts was simply not available; the possible consequences of this were discussed with experts in the field, and it was felt that comparability would suffer only marginally. We repeat some of the points here. The sub-categories K, L, M, N and P, all representing fiction, form a sort of cline from general fiction (K) through mystery and detective fiction (L), science fiction (M) and adventure and Western fiction (N) to romance and love story (P). The classification is based on theme and treatment and is very often bound to overlap, whereas the interest of the corpus compiler is style. In the process of sampling, it is quite possible for the portion of a work selected as text to run wide of the mark, especially in the case of novels. In view of all this, the sub-categories were defined negatively, that is, whatever was not, broadly speaking, L, M, N or P was considered to be K; and the selected texts, especially those from novels, were inspected and placed in the categories so defined. In the case of short stories the question did not arise. The fact remains that the number of texts is matched only in categories L and R. In the case of K we have double the number, i.e. 58 in place of 29; science fiction has only 2 as against 6; adventure only 15 in place of 29; and romance and love story only 18 in place of 29. Again, mystery and detective fiction is for the West largely detective fiction and mystery surrounding death, murder etc., but for the Indian it includes other kinds of mystery in the sense of the mysterious or miraculous. Similarly, in the case of adventure and Western fiction, there is nothing at all corresponding to Western fiction in India, so the sub-category consists wholly of adventure.
Now it must be stated that if the Imaginative Prose section in the Indian corpus is not on the face of it quite comparable to that of Brown or LOB, it is designed to be truly representative of Indian English.
Table No. 2 - Details of number of texts drawn from different newspapers.
No. | Name of the newspaper | Number of texts drawn | | |
 | | daily | Sunday/weekly | Total |
(National newspapers) | | | | |
1 | The Hindu, Madras | 9 | 2 | 11 |
2 | Economic Times, New Delhi | 1 | 2 | 3 |
3 | The Statesman, Calcutta | 5 | 1 | 6 |
4 | The Hindustan Times, New Delhi | 4 | 4 | 8 |
5 | The Times of India, Bombay | 11 | 8 | 19 |
6 | The Indian Express (various editions) | 3 | 3 | 6 |
Sub-totals | | 33 | 20 | 53 |
(Regional newspapers) | | | | |
1 | Business Standard, Calcutta | 3 | - | 3 |
2 | Deccan Herald, Bangalore | 2 | 1 | 3 |
3 | The Tribune, Chandigarh | 3 | 1 | 4 |
4 | National Herald, Lucknow | 1 | 1.5 | 2.5 |
5 | Searchlight, Patna | 1 | 1 | 2 |
6 | The Assam Tribune, Gauhati | 2 | - | 2 |
7 | Amrit Bazar Patrika, Calcutta | 1 | - | 1 |
8 | Deccan Chronicle, Secunderabad | 3 | - | 3 |
9 | The Western Times, Ahmedabad | 1 | - | 1 |
10 | Madhya Pradesh Chronicle, Bhopal | 2 | - | 2 |
11 | Nagpur Times, Nagpur | 2 | 0.5 | 2.5 |
12 | Navhind Times, Panaji | 1 | 0.5 | 1.5 |
13 | Northern India Patrika, Allahabad | - | 1.5 | 1.5 |
14 | Poona Herald, Poona | 3 | - | 3 |
15 | Blitz Weekly, Bombay | 3 | - | 3 |
Sub-totals | | 28 | 7 | 35 |
Grand Totals | | 61 | 27 | 88 |
Table 3. Categories A-C: The American, British and Indian corpora compared
 | American corpus | | | British corpus | | | | | Indian corpus | | | | |
 | | | | National | | Provincial | | | National | | Regional | | |
A. Press: Reportage | daily | weekly | total | daily | Sunday | daily | weekly 1) | total | daily | Sunday | daily | Sunday | total |
Political | 10 | 4 | 14 | 6 | 2 | 5 | - | 13 | 6 | 2 | 5 | - | 13 |
Sports | 5 | 2 | 7 | 2 | 2 | 2 | 1 | 7 | 2 | 2 | 2 | 1 | 7 |
Society | 3 | - | 3 | 2 | - | - | 1 | 3 | 3 | 3 | 2 | 2 | 10 2) |
Spot news | 7 | 2 | 9 | 4 | 1 | 4 | 1 | 10 | 4 | 1 | 4 | 1 | 10 |
Financial | 3 | 1 | 4 | 2 | 1 | 1 | - | 4 | 2 | 1 | 1 | - | 4 |
Cultural | 5 | 2 | 7 | 3 | 1 | 2 | 1 | 7 | - | - | - | - | - |
Total | | | 44 | | | | | 44 | | | | | 44 |
B. Press: Editorial | | | | | | | | | | | | | |
Institutional | 7 | 3 | 10 | 4 | 2 | 3 | 1 | 10 | 6 | 3 | 6 | 1 | 16 |
Personal | 7 | 3 | 10 | 4 | 2 | 3 | 1 | 10 | 2 | 1 | - | 1 | 4 |
Letters to the editor | 5 | 2 | 7 | 3 | 1 | 2 | 1 | 7 | 3 | 1 | 2 | 1 | 7 |
Total | | | 27 | | | | | 27 | | | | | 27 |
C. Press: reviews | 14 | 3 | 17 | 6 | 5 3) | 2 | 1 | 17 | 5 | 9 | 2 | 1 | 17 |
Total | | | 17 | | | | | 17 | | | | | 17 |
1) Including Provincial Sunday
2) Including "cultural"
3) The Times Literary Supplement and The Times Educational Supplement
Table 4 - Categories D-J: The American, British and Indian Corpora compared
 | American corpus | British corpus | Indian corpus | |
D. Religion | ||||
Books | 7 | 9 | 8 | |
Periodicals | 6 | 7 | 9 | |
Tracts | 4 | 1 | - | |
E. Skills, Trades and Hobbies | ||||
Books | 2 | 5 | 9 | |
Periodicals | 34 | 33 | 29 | |
F. Popular Lore | ||||
Books | 23 | 16 | 10 | |
Periodicals | 25 | 28 | 34 | |
G. Belles Lettres etc. | ||||
Books | 38 | 41 | 29 | |
Periodicals | 37 | 36 | 41 | |
H. Miscellaneous | ||||
Govt. Documents | 24 | 24 | 32 | |
Foundation Reports | 2 | 2 | - | |
Industry Reports | 2 | 2 | 2 | |
Univ. catalogue | 1 | 1 | 1 | |
Ind. House Organ | 1 | 1 | 2 | |
J. Learned | ||||
Natural Sciences | 12 | 12 | 12 | |
Medicine | 5 | 5 | 5 | |
Mathematics | 4 | 4 | 4 | |
Soc. Sciences | 14 | 14 | 14 | |
Pol. Science, Law, Education | 15 | 15 | 15 | |
Humanities | 18 | 18 | 18 | |
Technology and Engineering | 12 | 12 | 12 |
Table 5 - Categories K-R: The American, British and Indian corpora compared
 | American corpus | British corpus | Indian corpus | |
K. General Fiction | ||||
Novels | 20 | 20 | 12 | |
Short stories | 9 | 9 | 46 | |
L. Mystery and Detective Fiction | ||||
Novels | 20 | 21 | 9 | |
Short stories | 4 | 3 | 15 | |
M. Science Fiction | ||||
Novels | 3 | 3 | - | |
Short stories | 3 | 3 | 2 | |
N. Adventure and Western | ||||
Novels | 15 | 15 | 4 | |
Short stories | 14 | 14 | 11 | |
P. Romance and Love Story | ||||
Novels | 14 | 16 | 3 | |
Short stories | 15 | 13 | 15 | |
R. Humour | ||||
Novels | 3 | 3 | - | |
Short stories | - | - | 5 | |
Essays, etc. | 6 | 6 | 3 | |
Books | - | - | 1 |
Alphanumeric characters represent themselves. In the case of alphabet symbols, the letter represented is lower case unless otherwise specified:
A = a | B = b | C = c | Etc. |
1 = 1 | 2 = 2 | 3 = 3 | Etc. |
Letter preceded by * = the same letter (word initial Capital)
Letter preceded by ^ = the same letter (sentence initial Capital)
NB - both ^ and * are used when a sentence initial capital coincides with word initial capital.
*A = A Word initial only
^B = B Sentence initial only
^*J = J Word and sentence initial at the same time e.g. as in: John was blind.
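The capitalisation convention can be illustrated with a small decoder. This is a minimal sketch under several assumptions that go beyond the key itself: the sentence-initial marker is taken to be the caret (^) that appears in the sample print-out, a marker is assumed to stand immediately before its letter, the combined marker is assumed to be written ^*, and compound codes that begin with * (headings, typeshift, Greek letters and so on) are not handled at all. The function name `decode_case` is hypothetical.

```python
import re

# Assumed markers: "^" = sentence-initial capital, "*" = word-initial capital.
# Only a marker directly followed by a letter is treated as a capital flag.
_CAPITAL = re.compile(r"(\^\*?|\*)([A-Z])")


def decode_case(coded):
    """Lower-case coded text, restoring capitals flagged by ^ and/or *."""
    out = []
    pos = 0
    for m in _CAPITAL.finditer(coded):
        out.append(coded[pos:m.start()].lower())  # unflagged text is lower case
        out.append(m.group(2))                    # flagged letter stays capital
        pos = m.end()
    out.append(coded[pos:].lower())
    return "".join(out)


print(decode_case("^*JOHN WAS BLIND."))             # John was blind.
print(decode_case("^HE MET *JOHN IN *KOLHAPUR."))   # He met John in Kolhapur.
```

A real decoder would first have to strip or interpret the compound codes listed further on, since several of them also consist of * followed by a letter or digit.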
Other Characters:
* is reserved as a prefix for a compound coding symbol. When not preceded by *, all other characters represent themselves except for £, $, ^, ¬
that is to say:
1 | . = . | Full stop |
2 | : = : | Colon |
3 | ; = ; | Semi-colon |
4 | , = , | Comma |
5 | " = " | Double quotes: begin and end quotes are distinguished by a space before and after respectively |
6 | ' = ' | Single quotes (but not apostrophe): begin and end quotes are distinguished by a space before and after respectively |
7 | ? = ? | Question-mark |
8 | ! = ! | Mark of exclamation |
9 | - = - | Minus when separated by spaces on either side, hyphen when not so separated. |
10 | -- = -- | Dash |
11 | % = % | Per cent |
12 | & = & | (and) |
13 | ( = ( | Left brace |
14 | ) = ) | Right brace |
15 | + = + | Plus |
16 | / = / | Slash, oblique |
17 | [ = [ | Left bracket |
18 | ] = ] | Right bracket |
19 | @ = @ | At ( the rate of) |
20 | = = = | Equals |
21 | Space | = space |
22 | > = > | |
23 | < = < | |
24 | x = x | into (represents multiplied by when separated by spaces) |
25 | ¬ = | grammatically marked (always follows the word) |
26 | ^ = | sentence initial capital |
27 | * = | Word initial capital. (N.B. both ^ and * occur when word initial capital coincides with sentence initial capital) |
28 | £ = | begin non English word |
29 | $ = | new paragraph or new line |
Compound Coding
*' | = | apostrophe |
**< | = | begin major heading |
**> | = | end major heading |
*< | = | begin minor heading |
*> | = | end minor heading |
*@ | = | IBM ASCII 248 (degree symbol) |
*+ | = | £ (pound) |
*- | = | $ (dollar) |
*/ | = | * (asterisk) |
*# | = | end of corpus text |
**# | = | end of corpus |
*? | = | uncoded character (see below) |
**[ | = | begin comment tag |
**] | = | end comment tag |
*= | = | upper case Roman numeral |
**= | = | lower case Roman numeral |
*; | = | begin subscript |
**; | = | end subscript |
*: | = | begin superscript |
**: | = | end superscript |
*( | = | begin hybrid word/expression |
*) | = | end hybrid expression |
\0 | = | abbreviations. Sequences of abbreviations or initials are enclosed in *(0 *) |
*¬ | = | included sentence |
*% | = | %o (per thousand) |
Type-shift
*0 | = | end typeshift |
*1 | = | begin typeshift for citation |
*2 | = | begin capitalization |
*3 | = | begin typeshift for highlighting emphasis (including italics) |
*4 | = | begin Indian Language word |
*5 | = | begin Indian Language expression or passage |
*6 | = | end Indian Language expression or passage |
*7 | = | begin foreign word |
*8 | = | begin foreign expression |
*9 | = | end foreign expression |
*?0 | = | . (dot) under the preceding character (to indicate retroflex) |
*?00 | = | . (dot) on the preceding character (to indicate retroflex) |
*?1 | = | - (macron) on the preceding character (to indicate vowel length) |
*?2 | = | ´ (acute accent) on the preceding character (to indicate vowel length) |
*?3 | = | ` (grave accent) on the preceding character (to indicate vowel length) |
*?4 | = | ~ (tilde) on the preceding character (to indicate vowel length) |
(NB: The code *[ applies to languages that use the Roman alphabet. The following apply to all foreign languages irrespective of what script they use.)
Foreign language material in non-Roman alphabet transcribed in Roman script
*[1 | = | begin Assamese material |
*[2 | = | begin Bengali material |
*[3 | = | begin Gujarati material |
*[4 | = | begin Hindi material |
*[5 | = | begin Kannada material |
*[6 | = | begin Kashmiri material |
*[7 | = | begin Malayalam material |
*[8 | = | begin Marathi material |
*[9 | = | begin Oriya material |
*[10 | = | begin Punjabi material |
*[11 | = | begin Sanskrit material |
*[12 | = | begin Sindhi material |
*[13 | = | begin Tamil material |
*[14 | = | begin Telugu material |
*[15 | = | begin Urdu material |
Interpretive Codes:
I Apostrophe (*')
*' | = | apostrophe for possessive e.g. John's book = *JOHN*'S BOOK. Students' Union = STUDENTS*' UNION |
*'1 | = | contracted form of is e.g. John's coming = *JOHN*'1S COMING |
*'2 | = | contracted form of has e.g. John's been ill = *JOHN*'2S BEEN ILL |
*' | = | apostrophe for contracted form of had e.g. He'd done well = HE*'D DONE WELL |
*'1 | = | contracted form of would e.g. He'd stand for hours = HE*'1D STAND FOR HOURS |
*'3 | = | other contractions e.g. d'Estaing = D*'3*ESTAING; 'Tis a pity = *'3TIS A PITY |
*'4 | = | contraction of not e.g. don't = DON*'4T |
*'5 | = | abbreviation for minutes (degrees and minutes) e.g. 4° 30' = 4*@ 30*'5 |
*'6 | = | abbreviation for foot/feet e.g. 3' = 3*'6 |
*'7 | = | notation for glottal fricative |
II Left arrow (¬ )
(1) To as preposition is unmarked.
To as infinitive marker is marked with a left arrow e.g. to go = TO ¬ GO
(2) That as a subordinating conjunction is unmarked; that in any other function is marked e.g. that day = THAT DAY;
The man that you spoke of = THE MAN THAT ¬ YOU SPOKE OF
Greater than that of the other = GREATER THAN THAT ¬ OF THE OTHER
III Double Quotes ( " )
*" = inches e.g. 6' 3" = 6*'6 3*"
Greek letters are marked by a preceding **Y for lower case and **Z for upper case characters and are represented by Roman letters as follows:
Other notations
Mathematical symbols
**MN = mathematical notation e.g. S1 , T1
**MS = mathematical symbol e.g. = , ± , ≠ , →
**MF = mathematical conventional figure 106, 8-5
**ME = mathematical equation
The material and its organization.
1. The text of a sample starts with the first sentence or the
first section on the first page of the sampled text in the case
of books and at the beginning of the article in the case of
samples drawn from periodicals, journals and newspapers etc. and
ends with the sentence containing the 2,000th word.
2. Each corpus text is headed by a line in which the text number is indicated enclosed in comment tags (e.g. **[TXT.A01**] ) and is followed by a line again enclosed in comment tags indicating the number of words in that text (e.g. **[ No. of words = 02008**] ).
3. Headings are coded and included in the texts. The title of the book is often included in the texts; but there are some inconsistencies in this regard, as pre-editing was handled by various persons.
4. Sentences used as "tantalisers" etc. are also included, with blanket comment tags **[BEGIN LEADER COMMENT **] and **[END LEADER COMMENT **]. In this also there are some inconsistencies.
5. Extra-textual material such as maps, charts, diagrams, tables etc. is excluded and represented by descriptive tags.
6. As a rule footnotes are excluded and are represented by a descriptive comment tag **[FOOT NOTE **]. But in the case of texts in which the footnotes were even longer than the body of the text, they have been included.
7. Long foreign quotations and poetry quotations are excluded and represented by descriptive comment tags.
8. Mathematical equations, long formulas etc. are excluded and represented by mathematical symbol codes (see coding key).
9. The text categories are included in the order listed in the tables above.
10. The texts are preserved in card image form (80-character). The first 72 characters of each line contain the text of the sample and the last 7 characters indicate the unique location number. The 73rd character is always a blank.
11. One or more blank spaces (sometimes a whole blank line) separate two words. A word is orthographically defined as a character or sequence of characters surrounded by blank spaces. As in the first version of the Brown Corpus, words are often broken at the end of a line (i.e. at the 72nd character). If a word is thus broken, it is continued starting with the first column of the next line. If a word ends at the 72nd column, the first column of the next line is left blank.
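The card-image layout in items 10 and 11 can be read back into running words with very little code. The following is a minimal sketch, not a tool supplied with the corpus: the function names are hypothetical, and the split of the location number into a 4-character line number and a 3-character sample code is assumed from the print-out. Because a broken word continues in column 1 of the next line, and a word ending in column 72 is followed by a blank in column 1, simply concatenating the 72-character text fields restores the words.

```python
def split_card(line):
    """Split one 80-column card image into (text, line_no, sample_code)."""
    card = line.rstrip("\n").ljust(80)
    text = card[:72]        # columns 1-72: text of the sample
    location = card[73:80]  # columns 74-80: location number (column 73 is blank)
    return text, location[:4], location[4:]


def join_cards(lines):
    """Concatenate the text fields so that words broken at column 72 are rejoined."""
    return "".join(split_card(line)[0] for line in lines)


def words(lines):
    """Return the words, i.e. maximal runs of non-blank characters."""
    return join_cards(lines).split()


# Two made-up cards in which QUESTIONS is broken at column 72.
card1 = "X" * 68 + " QUE" + " " + "0060J52"
card2 = "STIONS THAT".ljust(72) + " " + "0070J52"
print(words([card1, card2])[1:])
```

Coding symbols such as `^` and `$` are still treated as ordinary characters here; stripping or interpreting them would be a separate step.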
A sample portion of a text as it appears in the print-out is reproduced below.
Sample of print out
**[TXT. J52**] | |
**<*3*NATURE, *MAN ; AND *GOD IN THE *4*VEDAS*O**>$*<31. *2 THE PROBLEM | 0010J52 |
OF CAUSATION*O*> $+3*2^ MAN IS+0 MOST CONCERNED WITH HIS ENVIRONMENT | 0020J52 |
; THE WORLD IN SPACE AND TIME. ^ HENCE, IT IS NATURAL THAT WHEN HE BE | 0030J52 |
COMES REFLECTIVE, HE WANTS TO_ UNDERSTAND THE NATURE OF THIS WORLD. ^ T | 0040J52 |
HE PHYSICAL WORLD SEEMS TO HIM THE PART AND PARCEL OF HIS LIFE. ^ WHE | 0050J52 |
N HE TRIES TO_ UNDERSTAND THE NATURE OF THE PHYSICAL WORLD, THE QUE | 0060J52 |
STIONS THAT COME UP ARE _ _WHO HAS CREATED THIS WORLD; WHAT ARE THE | 0070J52 |
CONSTITUENT ELEMENTS OUT OF WHICH IT IS CREATED AND HOW IT IS CREAT | 0080J52 |
ED? ^ IN OTHER WORDS, WE WANT TO_ KNOW ITS EFFICIENT CAUSE, THE MATE | 0090J52 |
RIAL CAUSE, AND THE PROCESS OF CREATION. $^ THUS THE PROBLEM OF CAUSA | 0100J52 |
TION IS THE PRIMARY QUESTION IN THE UNDERSTANDING OF THE PHYSICAL WO | 0110J52 |
RLD_ _ OR WHAT WE CALL *NATURE. ^ THE *4*VEDAS, AS IS KNOWN, ARE MORE | 0120J52 |
POETIC IN THEIR CONTENT THAN LOGICAL. ^ STILL ONE CAN TRACE CERTAIN I | 0130J52 |
MPORTANT IDEAS REGARDING CAUSATION BEHIND THE POETIC IMAGINATIONS. $ | 0140J52 |
^ THE PRINCIPLE OF CAUTION IN THE *4*VEDAS, THE EARLIEST LITERATURE | 0150J52 |
OF THE *HINDUS, SEEMS TO_ APPEAR IN THE CONCEPT OF *4*RTA. *4^ *RTA REP | 0160J52 |
RESENTS THE LAW, UNITY OR RIGHTNESS, UNDERLYING THE ORDERLINESS WE O | 0170J52 |
BSERVE IN THE WORLD.*4^ *RTA , LITERALLY MEANS THE “COURSE OF THINGS“ | 0180J52 |
^ THIS CONCEPTION SEEMS TO_ HAVE BEEN ORIGINALLY DERIVED FROM THE | 0190J52 |
REGULARITY OF THE MOVEMENTS OF THE HEAVENLY BODIES LIKE THE SUN, THE | 0200J52 |
E MOON, AND THE STARS, THE ALTERNATIONS OF DAY AND NIGHT AND OF THE | 0210J52 |
SEASONS. $^ IN THE *4*VEDAS, THERE ARE NO HYMNS ADDRESSED SPECIFICAL | 0220J52 |
LY TO *4*RTA, BUT BRIEF REFERENCES TO THE IMPORTANT CONCEPTS ARE FOU | 0230J52 |
ND REPEATEDLY IN THE HYMNS TO *4*VARUNA (WHO MANTAINS THE PHYSICAL O | 0240J52 |
RDER), *4*AGNI, *4*VISVEDEVAS \OETC. ^ THE FOLLOWING HYMN WILL ILLUST | 0250J52 |
RATE THE POINT: **[VERSE**] $^ GRADUALLY THE CONCEPT OF *4*RTA TAKES | 0260J52 |
A NEW MEANING FROM EXTERNAL PHYSICAL ORDER OR UNIFORMITY OF NATURE_ | 0270J52 |
_ IT ACQUIRES THE SIGNIFICANCE OF A MORAL ORDER. ^ THE WHOLE WORLD WA | 0280J52 |
The corpus is available at cost to bona fide researchers in India from the Department of English, Shivaji University, Kolhapur, and is being made available to bona fide researchers outside India through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway. The material is available on magnetic tape in the following format:
1. The tape has no label.
2. There are 24 files on the tape containing the entire material as shown below:
Sr. No. of the file | Texts contained | Sr. No. of the file | Texts contained
1 | A01 - A22 | 13 | H01 - H20
2 | A23 - A44 | 14 | H21 - H37
3 | B01 - B27 | 15 | J01 - J27
4 | C01 - C17 | 16 | J28 - J54
5 | D01 - D17 | 17 | J55 - J80
6 | E01 - E20 | 18 | K01 - K29
7 | E21 - E38 | 19 | K30 - K58
8 | F01 - F22 | 20 | L01 - L24
9 | F23 - F44 | 21 | M01 - M02
10 | G01 - G25 | 22 | N01 - N15
11 | G26 - G50 | 23 | P01 - P18
12 | G51 - G70 | 24 | R01 - R09
3. Texts are separated from one another by a blank record in the file.
4. The text is divided into 80 character lines, as follows :
a) Text : 72 characters
b) Location number : 8 characters
- 1 space ( character No. 73 )
- 4 characters line number e. g. 0010
- 3 characters sample code e. g. A01
5. Each tape record (block) contains 10 lines, except that the last block in each file may contain fewer than 10.
6. The character code used is ASCII and the material is recorded in 9-track, 1600 fpi density.
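The fixed-length layout described in items 3 to 6 could be unpacked roughly as follows. This is a hedged sketch under two assumptions that the manual does not state: that a tape file has already been copied to disk as a flat ASCII byte string with no line terminators, and that the "blank record" between texts shows up as one or more all-blank 80-character lines. The function names are hypothetical.

```python
LINE_LEN = 80  # 72 text characters + 8-character location field


def iter_lines(raw):
    """Yield 80-character lines from the raw ASCII bytes of one tape file."""
    data = raw.decode("ascii")
    for start in range(0, len(data), LINE_LEN):
        yield data[start:start + LINE_LEN].ljust(LINE_LEN)


def split_texts(raw):
    """Group lines into texts, treating all-blank lines as separators (assumed)."""
    texts, current = [], []
    for line in iter_lines(raw):
        if line.strip():
            current.append(line)
        elif current:
            texts.append(current)
            current = []
    if current:
        texts.append(current)
    return texts


# Made-up data: two lines of one text, a blank block, one line of the next text.
line_a = "FIRST LINE".ljust(72) + " " + "0010A01"
line_b = "SECOND LINE".ljust(72) + " " + "0020A01"
line_c = "NEXT TEXT".ljust(72) + " " + "0010A02"
raw = (line_a + line_b + " " * 800 + line_c).encode("ascii")
print(len(split_texts(raw)))
```

Since blocks hold a whole number of 80-character lines, block boundaries need no special handling once the file is a single byte string.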
Permission to use materials under copyright was sought through a form letter sent under certificate of posting. We are glad to say that most copyright holders responded promptly. Reminders were sent to those who had not responded for over three months; these gave them the option of not answering the letter if they had no objection to our using the materials.
Individual acknowledgements are made in the notes on text extracts. In the case of those who have not responded so far, no such acknowledgement appears.
Bansal, R. K. 1969. The intelligibility of Indian English. Monograph No. 4, CIEFL, Hyderabad.
Desai, S. K. 1974. Experimentation with language in Indian Writing in English (Fiction). Monograph of the Dept. of English, Shivaji University, Kolhapur.
Francis, W. N. and Henry Kucera. 1964. Rev. 1979. Manual of information to accompany A Standard Corpus of Present-Day Edited American English. Dept. of Linguistics, Brown University, Providence, R.I.
Johansson, Stig, G. N. Leech and Helen Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English. Dept. of English, University of Oslo, Oslo.
Kachru, B. B. 1961. An analysis of some features of Indian English: A study in linguistic method. Unpublished Edinburgh thesis.
----------- 1965. The Indianness in Indian English. Word 21: 391-410.
----------- 1975. Lexical innovations in South-East Asia. International Journal of the Sociology of Language. Vol. 4. Mouton, The Hague.
----------- 1979. The new Englishes and old models. Newsletter, January 1979, CIEFL, Hyderabad.
----------- 1981. The pragmatics of non-native varieties of English. In Larry Smith (ed.) English for cross-cultural communication. Macmillan. 15-39.
Nihalani, Paroo, R. K. Tongue and Priya Hosali. 1979. Indian and British English: A handbook of usage and pronunciation. O.U.P., New Delhi.
Shastri, S. V. 1978. English word-meanings and their American and Indian variants: a study of six lexical items. Unpublished Lancaster University thesis.
ADDENDA
1. Coding key:
Some details of the coding key described in this manual are irrelevant to users of the corpus in the form in which it is now being made available. The original version was in a 64-character set, i.e. the text was in all capitals; the present version is in a 96-character set, i.e. the text is in upper and lower case. Hence the asterisk (*) used as a code to indicate word-initial capitals everywhere in the text is irrelevant. Similarly, the code for all capitals (*2 .0) is also irrelevant, as capitals appear as capitals and lower case letters as lower case letters. However, the code for a sentence-initial capital (^) has been retained, though it is redundant. The asterisk as a word-initial code has also been retained when a word with an initial capital occurs at the beginning of a sentence. This has been done to facilitate machine processing of the corpus texts.
2. The material and its organisation:
The record length in the present version of the corpus is 100 characters and not 80 as in the original version. The unique location number of seven characters has been shifted to the beginning of the line and the text begins at the 9th character of each line. Only one blank space separates two words, except that any number of blank spaces may occur at the end of a line.
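For the revised layout just described, a record could be taken apart as in the sketch below. The function names are hypothetical, and two points are assumptions rather than statements of this manual: that the seven-character location number still consists of a 4-character line number followed by a 3-character sample code, and that words are no longer split across records, so that records can be joined with a single blank.

```python
def split_record(line):
    """Split one 100-character record into (line_no, sample_code, text)."""
    record = line.rstrip("\n").ljust(100)
    location = record[:7]          # characters 1-7: location number
    text = record[8:100].rstrip()  # text begins at the 9th character
    return location[:4], location[4:], text


def running_text(lines):
    """Join record texts with single blanks (assumes no word is split across records)."""
    return " ".join(t for _, _, t in map(split_record, lines) if t)


print(split_record("0010A03 **<*3 Creeping Detente In Africa**>"))
print(running_text(["0181A03 of", "0190A03 Africa."]))
```

Trailing blanks are dropped from each record, in line with the remark that any number of blank spaces may occur at the end of a line.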
3. Basic technical information:
The file organization remains unchanged, but each line contains 100 characters and each block contains 80 lines, except that the last block in any file may contain fewer than 80 lines. The rest of the technical details remain unchanged.
The coding for Greek letters stands modified as follows:
*Y to mark lower case letters and *Z to mark upper case letters of the Greek alphabet.
The coding for mathematical symbols also stands modified as follows:
*Mn = mathematical notation
*Ms = mathematical symbol
*Mf = conventional mathematical figure
*Me = mathematical equation.
**[ text. a03** ] | |
0010A03 | **<*3 Creeping Detente In Africa**> $"^ *DETENTE, " said \0 Dr Bruno |
0020A03 | Kreiski, Chancellor of Austria (which alongwith Switzerland and |
0030A03 | Sweden is one of the three official neutral States in Europe ) in his |
0040A03 | address to the Royal Institute of International Affairs in London |
0050A03 | on July 4, "is not the consequence of sublime human insight but simply |
0060A03 | a result of a state of military balance". ^ This realistic definition |
0070A03 | of a state of relationship between the Soviet bloc and Western Europe, |
0080A03 | \0US and Canada, which has been widely criticised in the West |
0090A03 | as tattered by developments in Africa ( and Afghanistan and South Yemen ) |
0100A03 | explains the about-turn in Western policy on Angola that_ appears |
0110A03 | to_ be taken place now quietly and even secretively. $^ It was just a |
0120A03 | month ago, following the incursion of Katangan exiles to the mineral rich |
0130A03 | Shaba province of Zaire and the massacre of whites in Kolwezi, |
0140A03 | that Western Europe, backed by the \0US, were planning the establishment |
0150A03 | of a pan-African force ( armed and founded by the West ) to_ protect |
0160A03 | states threatened by "Soviet-Cuban" ventures. ^ President d*3Estaing |
0170A03 | of France, after his French Legionnaires repelled the Katangans and |
0180A03 | rescued the surviving whites in Kolwezi was hailed as "the \Gendarme |
0181A03 | of |
0190A03 | Africa." The \0US later supplied transport planes to_ ferry the units |
0200A03 | formed from Morocco, Senghor and some other former French colonies |
0210A03 | to the Shaba Province. ^ Meetings were held in Brussels at which the western |
0220A03 | countries considered how to _ strengthen the economy and the security |
0230A03 | forces of President Mobuto. It appeared that the detente was to_ give |
0240A03 | place to an east-west confrontation in Africa; that the Western hawks |
0250A03 | were prevailing over the doves among them the British Prime Minister |
0260A03 | Callaghan and some of his OEEC colleagues notably Holland and |
0270A03 | Denmark. $*<&*3Grave Concern*>$^ The developing situation today projects |
0280A03 | a completely different picture and is generating grave concern to |
N.B. £ symbol appears as \ ( back slash ) in this printout.
FOOTNOTES
1 Kachru's samples are drawn almost entirely from "creative writings" (although he has his own reasons for doing so), while Nihalani et al. is based on "available" samples.
2 The idea of compiling a parallel corpus of Indian English suggested itself when the present author was doing a comparative study of some lexical items in American, British and Indian English in Lancaster in 1977. He used the Brown and the LOB Corpora for the American and British English meanings, but had to make do with available samples for the Indian English meanings (Shastri, 1978).
3 Personal communication.
4 We are thankful to the Director, National Library, Calcutta, and particularly to Dr. M. N. Nagraja, Dy. Librarian, Miss Anima Das of the Processing Section, and M/s V. Kotnala and A. B. Roy of the Reprography Division for their assistance in carrying out this job.
5 We are thankful to Miss Chitra Mallik for assistance in carrying out this job.
6 We are thankful to the Librarian and particularly to Mr. Vaitee for agreeing to hand over these issues to us at a cost.