KOLHAPUR CORPUS

MANUAL OF INFORMATION
TO ACCOMPANY
THE KOLHAPUR CORPUS OF
INDIAN ENGLISH, FOR USE WITH
DIGITAL COMPUTERS

BY
S. V. SHASTRI
IN COLLABORATION WITH
C. T. PATILKULKARNI
GEETA S. SHASTRI

DEPARTMENT OF ENGLISH
SHIVAJI UNIVERSITY, KOLHAPUR
416004 INDIA

1986

TO
PROFESSOR S. K. VERMA

PREFACE

The present corpus of Indian English was conceived in Lancaster in 1978 when the main author of this work was researching under the supervision of Professor G.N. Leech. On his return to India he started the project with an initial grant from Shivaji University in 1980, and carried it forward with substantial financial assistance from the U.G.C., supplemented by support from various other sources including personal funds.
We gratefully acknowledge the financial assistance of Shivaji University and the University Grants Commission. We would like to acknowledge the support given by Chhatrapati Shahu Institute’s Vasantdada Professional Computer School in the form of access to their computer system at nominal charges, thanks to the kindness of their EDP manager, S.M. Kori. We would also like to thank the large number of people who have worked on the project at various stages. Among them: V.V. Badve and P.R. Kher, who were associated with the project in the initial stages of collection of samples. J.A. Shinde and D.N. Kulkarni, who assisted in proofreading some of the texts.
Geeta S. Shastri for her secretarial assistance and together with C.T. Patilkulkarni for sharing the bulk of pre-editing and proofreading of texts.
Sumati Salunkhe, S.T. Shingate, B.N. Patil, Raju Chougule, A.Y. Shinde and Ganesh Surve, who helped with secretarial and proofreading assistance at various stages.
Special thanks are due to Professor S.K. Desai, Head, Department of English, Shivaji University, for his encouragement and guidance, also in his capacity as member of the Advisory Committee. We thank the other members of the Committee, Professor Birje-Patil of M.S. University of Baroda, Professor C.J. Daswani of the University of Poona and Professor K. Subramanian of Central Institute of English and Foreign Languages, for their unfailing support.
We are also grateful to Professor G.N. Leech, University of Lancaster, Professor W. Nelson Francis and Professor Henry Kucera, both of Brown University, and Professor Stig Johansson, University of Oslo, for their guidance from time to time throughout the duration of the project, and to Prof. Ramesh Mohan, the former Director of the CIEFL, and Professor R.N. Ghosh, also of the CIEFL, for their continued support and help.
And in conclusion, we would like to thank all the copyright holders, who allowed their texts to be included, free of charge, in the corpus.

Kolhapur December, 1986. S. V. Shastri

CONTENTS

Introduction

Sources and Sampling Techniques

Distribution of the material

The American, British and Indian Corpora compared

Coding key

The material and its organization

Basic Technical Information

Note on copyright

References

List of Text Extracts

Introduction

A systematic and comprehensive description of Indian English is now overdue. Of the major national varieties of English, only American and British English have so far been described in some detail, though several other varieties have already been identified among the native speaker varieties. Side by side, some non-native varieties of English have also been tacitly recognized, among which Indian English is a major one.

Studies of Indian English so far have been confined mainly to aspects of spoken English, such as Bansal (1969) of CIEFL, and to methodological considerations, such as Kachru (1961, 1979) of Illinois University, although Kachru himself has written on many isolated areas of Indian English (Kachru 1965, 1975, 1981)¹. Descriptions of aspects of Indian English, wherever they have appeared, have been based on selected or ‘available’ samples, such as Desai (1974) and Kachru (1965, 1975), the largest one so far being Nihalani et al. (1979). There is no gainsaying that a comprehensive description will have to be based on a standard corpus.

The present corpus of Indian Written English ² is comparable to the Brown and the LOB corpora. It is intended to serve as source material for comparative studies of American, British and Indian English, which in turn are expected to lead to a comprehensive description of Indian English.

Although the Indian Corpus is planned to be comparable to the Brown and the LOB corpora, there are some important differences dictated mainly by logistic and practical considerations. Firstly, as far as synchronicity is concerned there is a major departure in that, while the Brown and LOB corpora draw their samples from the materials published in the calendar year 1961, the Indian corpus is drawn from materials published in the year 1978. This decision was made after consultation with the authors of the earlier corpora to make sure that comparability would not suffer much as a result.³ On the other hand, it is felt that the value of the Indian corpus is immensely enhanced, in general and in particular as a source for the description of Indian English, as the Indianness of Indian English is a post-Independence phenomenon and may have reached a discernible stage in the thirty years after Independence. It is argued in theory that in the same thirty years American and British English may not have undergone such changes. The number of texts, the weightage given to different genres of material and the sampling procedures are kept very close to the other two corpora. However, in one part, that of Imaginative Prose, there are differences in respect of the kinds of fiction and the proportion of texts representing books and those representing periodicals. It is not surprising to find that the amount and kind of imaginative writing in a second language situation such as that in India is very different from that in a first language one such as the American or the British situation. In spite of including samples from all the available full length novels, the proportion could not come anywhere near those of the LOB or the Brown Corpus. Some of us felt that the weightage should be reduced in order to reflect at least in part the real situation. But it was argued that this might adversely affect comparability and the value of the corpus for its projected purpose.

Sources and Sampling Techniques

The Indian Corpus is intended to be a representative corpus of sample texts printed and published in 1978. The texts were largely selected by a stratified random sampling process. The composition of texts in the Indian corpus as compared to those in the other two corpora is given in Table No. 1.

Table 1: The Basic Composition of American, British and Indian Corpora.

Text Categories		No. of texts in each category
		American Corpus	British Corpus	Indian Corpus
A	Press: reportage	44	44	44
B	Press: editorial	27	27	27
C	Press: reviews	17	17	17
D	Religion	17	17	17
E	Skills, Trades and Hobbies	36	38	38
F	Popular lore	48	44	44
G	Belles Lettres	75	77	70
H	Miscellaneous (Govt. Documents, foundation reports, industry reports, College catalogue, industry house organ).	30	30	37
J	Learned and scientific writings	80	80	80
K	General fiction	29	29	58
L	Mystery and detective fiction	24	24	24
M	Science fiction	6	6	2
N	Adventure (Western fiction)	29	29	15
P	Romance and love story	29	29	18
R	Humour	9	9	9
Total		500	500	500
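The column totals in Table 1 can be cross-checked mechanically. The following Python sketch is offered only as an illustration and is not part of the corpus distribution; the names TABLE_1, column_totals and indian_differences are hypothetical, and the figures are transcribed from the table above.

```python
# Counts per category transcribed from Table 1,
# in the order (American, British, Indian).
TABLE_1 = {
    "A": (44, 44, 44),
    "B": (27, 27, 27),
    "C": (17, 17, 17),
    "D": (17, 17, 17),
    "E": (36, 38, 38),
    "F": (48, 44, 44),
    "G": (75, 77, 70),
    "H": (30, 30, 37),
    "J": (80, 80, 80),
    "K": (29, 29, 58),
    "L": (24, 24, 24),
    "M": (6, 6, 2),
    "N": (29, 29, 15),
    "P": (29, 29, 18),
    "R": (9, 9, 9),
}


def column_totals(table):
    """Sum each corpus column across all text categories."""
    return tuple(sum(row[i] for row in table.values()) for i in range(3))


def indian_differences(table):
    """Return Indian-minus-British counts for categories that differ."""
    return {
        category: indian - british
        for category, (_, british, indian) in table.items()
        if indian != british
    }
```

Each column should sum to 500, and the categories reported by indian_differences are the ones where the Indian design departs from the British one (G, H and the fiction categories K, M, N and P).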

The sampling procedure followed is described below:

Books: While the compilers of the Brown and the LOB corpora had at their disposal ready bibliographies from which they could sample, we were handicapped in this regard. The Indian National Bibliography (INB, monthly) lists all the publications received in the National Library under the provisions of the Delivery of Books (Public Libraries) Act of 1954 as amended by Act No. 99 of 1956. But these issues take a long time to appear, and in fact those for 1978 had not appeared even by the end of 1979. So it was decided to compile a bibliography of our own of books printed in English in 1978 which had already been received in the National Library, Calcutta up to December 1979, when the work actually began.⁴ A list of all such books was compiled from the Inward Register of the National Library and this was used as the source bibliography for the purpose of sampling. Again, we could not have recourse to stratification by reference to the Dewey Decimal classification, as many of the books had not been processed by the library. So we stratified the publications by manual inspection of titles initially. It must be mentioned that such a procedure was possible because of the limited number of publications in the second language situation as compared to that in a native language situation such as the American and the British.
It was found by inspection of earlier entries that about 75 % of the publications of a particular year was received by the end of the following year and the remaining kept trickling in for 3 or more years later. In order to make up for whatever deficiency might have been caused by our listing at the end of 1979 we repeated the exercise once again at the end of 1980 and 1984 and found that only about 10% more publications had arrived. It may be mentioned in passing that our 140 texts from books covering all the categories were sampled from over 1200 titles which amounts to about 8%. The sampling was done from the lists separately with the help of a random number table. Needless to say, whenever a selected book was not accessible the next available on the shelf was selected. Whenever sufficient number of texts did not turn up in this process, other texts were deliberately chosen to fill the category.
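The book sampling procedure described above (a random draw within each stratum, falling back to the next available title when the drawn one is inaccessible) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the compilers' actual procedure: the function name sample_stratum and the is_accessible callback are hypothetical, and a seeded pseudo-random generator stands in for the printed random number table.

```python
import random


def sample_stratum(titles, n_wanted, is_accessible, rng):
    """Pick up to n_wanted accessible titles from one stratum.

    A position is drawn at random; if that title is inaccessible or
    already taken, the next usable title further down the list is
    taken instead. Fewer than n_wanted titles may be returned, in
    which case the caller has to fill the category deliberately.
    """
    chosen = []
    taken = set()
    # Visit positions in random order (stand-in for a random number table).
    for start in rng.sample(range(len(titles)), len(titles)):
        if len(chosen) == n_wanted:
            break
        # Walk forward from the drawn position to the next usable title.
        for pos in range(start, len(titles)):
            if pos not in taken and is_accessible(titles[pos]):
                taken.add(pos)
                chosen.append(titles[pos])
                break
    return chosen
```

A call such as sample_stratum(religion_titles, 8, on_shelf, random.Random(1978)) would then draw the book texts for one category, where religion_titles and on_shelf are placeholders for a stratum list and an availability check.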

Government Documents: As in the case of books, no catalogue of Government Publications in 1978 was available up to the end of 1980 and therefore the same procedure of listing from the Inward Register ⁵ and random sampling was followed. In this case the job was simplified because most of the publications of the Govt. of India are in English. Random sampling and filling out of the texts required for various sub-categories was carried out on the lines followed in the case of books.
It will be noticed that we have 37 texts in category H as compared to 30 in the American and British corpora. The decision to alter the number was influenced by the fact that there are in India two types of Govt. Documents, the Union and the State ones. We have included 26 texts from the Union Government Documents and 6 texts from State Government Documents. It may be mentioned also that the bulk of book publication of proportional or even greater weightage could not be seriously considered in view of our commitment to the original design.

Press Materials: The sampling of newspapers and particular issues of newspapers was first carried out. The 53 English newspapers received in the Central Library ⁶, Bombay were considered to be the universe, out of which all the 6 national papers were retained and 15 regional papers were selected to represent the different regions of India. Then, from each newspaper, 16 daily issues and 4 Sunday editions were sampled with the help of a random number table. Of the required issues several were not in the files of the library, so as usual the next number was taken. From these the actual texts required were identified and the categories filled out.
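The drawing of 16 daily issues and 4 Sunday editions per newspaper can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this manual: a "daily issue" is taken to mean a Monday to Saturday issue, a seeded pseudo-random generator replaces the random number table, and the function name sample_issue_dates is hypothetical. The substitution of the next number for issues missing from the library files is not modelled.

```python
import random
from datetime import date, timedelta


def sample_issue_dates(year, n_daily, n_sunday, rng):
    """Draw issue dates for one newspaper in the given year.

    Returns two sorted lists: Monday-to-Saturday dates and Sunday dates.
    """
    first = date(year, 1, 1)
    # Generate 366 candidate days and keep those inside the year.
    days = [first + timedelta(days=i) for i in range(366)]
    days = [d for d in days if d.year == year]
    # date.weekday() returns 6 for Sunday.
    sundays = [d for d in days if d.weekday() == 6]
    weekdays = [d for d in days if d.weekday() != 6]
    daily = sorted(rng.sample(weekdays, n_daily))
    sunday = sorted(rng.sample(sundays, n_sunday))
    return daily, sunday
```

For the design described above the call would be sample_issue_dates(1978, 16, 4, random.Random(seed)), repeated once per selected newspaper.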

Periodicals: Initially an attempt was made to compile sample lists of periodicals from the Source Book Press in India 1977 and follow the same procedure for sampling texts as we had done in the case of Books and Press Materials. However, it was discovered that such a procedure tended not only to exclude most or all of the most ‘popular’ and well circulated periodicals but also threw up those that were either ‘unheard of’ or ‘unavailable’ or had ceased to be published. In the circumstances we took recourse to the other procedure, i.e. of treating the holdings of the Central Library in Bombay and the National Library in Calcutta as the universe from which to draw the samples. This proved to be wise as far as ‘popular periodicals’ were concerned; but in the case of learned journals we had to follow a different procedure.
On an impressionistic basis, it was decided to pick the richest known libraries: the holdings of the Tata Institute of Fundamental Research and the Indian Institute of Technology, Bombay were used for sampling materials pertaining to Sciences and Technology, and the holdings of the Tata Institute of Social Sciences for materials pertaining to social sciences, in addition to the holdings of University libraries (Bombay, Poona, Shivaji and Baroda). The actual procedure followed was to compile category-wise select bibliographies and then sample the texts required to fill out the categories.
From the foregoing description of selection procedures followed in building the corpus it is clear that the corpus cannot be claimed to be a "stratified random sample" of Indian English in the strict statistical sense. We, like the builders of the LOB corpus were guided by our aims - of ensuring maximum comparability with the other two corpora and creating a truly representative sample of edited and published Indian English writings.

Distribution of the material

The distribution of texts over different categories and the matching of individual texts have been kept closer to the LOB than to the Brown corpus. The widest difference is to be found in the weighting given to categories in the section, Imaginative Prose. This was inevitable as the available texts in the categories L to P were short of even the number required! And the weighting given to short stories as against full length novels is also the result of the same handicap.
However, in the case of all the other categories the differences are very marginal as can be seen from the following break up:

Category A (Press: reportage)

A01-06	National	daily	Political
A07-08	"	"	Sports
A09-11	"	"	Society
A12-15	"	"	Spot news
A16-17	"	"	Financial
A18-19	National	weekly	Political
A20-21	"	Sunday	Sports
A22	"	"	Spot news
A23	"	"	Financial
A24-26	"	"	Social/cultural
A27-31	Regional	daily	Political
A32-33	"	"	Sports
A34-37	"	"	Spot news
A38	"	"	Financial
A39-40	"	"	Cultural
A41	"	weekly	Sports
A42	"	"	Society
A43	"	"	Spot news
A44	"	"	Cultural

Category B (Press: editorial)

B01-06	National	daily	Institutional editorial
B07-08	"	"	Personal editorial
B09-11	"	"	Letters to the editor
B12-14	"	Sunday	Institutional editorial
B15	"	"	Personal editorial
B16	"	"	Letters to the editor
B17-22	Regional	daily	Institutional editorial
B23-24	"	"	Letters to the editor
B25-26	"	Sunday	Institutional editorial
B27	"	"	Letters to the editor

Category C (Press: reviews)

C01-06	National	daily	Book, music, cinema, painting, folk art etc.
C07-13	"	Sunday
C14	"	weekly
C15-16	Regional	daily
C17	"	Sunday

The names of newspapers and the details of texts drawn from each are shown in Table No. 2.

Category D (Religion)

D01-08	Books
D09-17	Periodical and Journals

Category E (Skills, trades and hobbies)

E01-05	Homecraft, handyman
E06-10	Hobbies
E11-13	Music, dance
E14	Pets
E15-18	Sports
E19-20	Food
E21-22	Travel
E23-26	Miscellaneous
E27-35	Trade, professional journals
E36-38	Agriculture, farming

Category F (Popular lore)

F01-22	Popular	Politics, psychology, sociology
F23-30	Popular	History
F31-33	Popular	Health, medicine
F34-37	"Culture"
F38-44	Miscellaneous

Category G (Belles lettres, biography, essays)

G01-35	Biography, memoirs
G36-41	Literary essays and criticism
G42-50	Arts
G51-70	General essays

Category H (Miscellaneous)

H01-26	Central Government document
	A H01-12	Reports, department publications
	B H13-14	Acts
	C H15-20	Proceedings, debates
	D H21-26	Other Government documents
H27-32	State Government documents
H33-37	Industry reports, house organ, University catalogue.

Category J (Learned and scientific writings)

J01-12	Natural and physical sciences
J13-17	Medicine
J18-21	Mathematics
J22-35	Social, behavioural sciences
	A J22-25	Psychology
	B J26-30	Sociology
	C J31	Demography
	D J32-35	Linguistics
J36-50	Political Science, law, education, commerce
	A J36-39	Education
	B J40-47	Politics, economics, commerce
	C J48-50	Law
J51-68	Humanities
	A J51-55	Philosophy
	B J56-59	History
	C J60-63	Literary criticism
	D J64-66	Art
	E J67-68	Music
J69-80	Engineering and technology

Category K (General fiction)

K01-12	Novels
K13-58	Short stories

Category L (Mystery and detective fiction)

L01-03	Novels
L04-06	Short stories
L07-11	Novels
L12-23	Short stories
L24	Novel

Category M (Science fiction)

M01-02	Short stories

Category N (Adventure)

N01	Short story
N02-04	Novels
N05-13	Short stories
N14	Novel
N15	Short story

Category P (Romance and love story)

P01-02	Short stories
P03	Novel
P04-07	Short stories
P08	Novel
P09-15	Short stories
P16	Novel
P17-18	Short stories

Category R (Humour)

R01-05	Short stories
R06	Book
R07-09	Articles from periodicals

The American (Brown), British (LOB) and Indian Corpora compared

Categories A-C: In terms of weighting between national and regional newspapers the Indian corpus conforms more closely to the British. This has been deliberate more so because, like the British situation, Indian newspapers have a clear-cut distinction between the national and regional on the basis of both distribution and circulation figures. The proportion of texts drawn from the National and the Regional papers is 62% to 38% in the Indian corpus as compared to the British 60% to 40% (see table 2). As to the sub-categories of texts and their distribution over dailies, Sundays and weeklies, there is even more correspondence between the two (see table 3), except that the two sub-categories society and cultural had to be collapsed into one as no such hard and fast distinction could be observed in the newspaper reportage of the Indian Press. The other very marginal difference is that there are fewer personal editorials in the Indian corpus.

Category D: In terms of subcategories Religious prose is not classified either in Brown or in LOB corpus; but while sampling texts the builders of LOB corpus have by inspecting the Brown texts arrived at a decision to include ‘stylistically heterogenous texts ranging from learned to popular committed writing’. The same procedure was followed in selecting texts for the Indian corpus except that the sub-division ‘tracts’ is unrepresented. The distribution of texts from books and periodicals is nearly maintained (see table 4).

Categories E-J: Sub-categories of texts in the Indian corpus have been matched almost perfectly with the LOB corpus. However, it was not always possible to match the individual texts in terms of the type of source, book or periodical for materials in the three categories E, F and G. While books are over-represented in the case of category E, they are somewhat under-represented in the case of categories F and G (see table 4). As already stated earlier we deliberately altered the weighting between category G and H. G has been reduced by seven texts and H increased by the same number. This was done to reflect the Indian situation in which, firstly, Government documents divide themselves into Central and State govt. documents and the bulk of those far exceeds any other printed material in English except Press materials. This has been reflected in the greater representation of government documents in the Indian corpus. And foundation report is unrepresented (see table 4). J texts have been matched very closely in all the three corpora. This is the only category which has one to one correspondence of weighting to sub-categories (see table 4).

Categories K-R: This section of the corpus, that is Imaginative Prose, is maximally mismatched. As already stated, a sufficient number of texts was simply not available; the possible consequences of this were discussed with the experts in the field and it was felt that comparability would suffer only marginally. We repeat some of the points here. The sub-categories K, L, M, N and P, all representing fiction, form a sort of cline from general fiction (K), through mystery and detective (L), science fiction (M), adventure and Western fiction (N), to romance and love story (P). The classification is based on the theme/treatment and very often is bound to be overlapping; and the interest of the corpus compiler is ‘style’. In the process of sampling, it is quite possible for the selected portion of the work as text to run wide of the mark, especially in the case of novels. In view of all this, firstly, the sub-categories were defined negatively, that is, whatever was not broadly speaking L, M, N or P was considered to be K; and the selected texts, especially from novels, were inspected and placed in the categories so defined. In the case of short stories the question did not arise. The fact remains that the number of texts is matched only in the categories L and R. In the case of K we have double the number, i.e. 58 in place of 29; science fiction only 2 as against 6; adventure only 15 in place of 29; and romance and love story only 18 in place of 29. Again, mystery and detective is for the West largely detective and mystery surrounding death, murder etc., but for the Indian it includes other kinds of mystery in the sense of ‘mysterious’ or miraculous. Similarly in the case of adventure and Western fiction, there is nothing at all corresponding to ‘western fiction’ in India. So the sub-category is wholly comprised of ‘adventure’.

Now it must be stated that if the Imaginative Prose section in the Indian corpus is not on the face of it quite comparable to that of Brown or LOB, it is designed to be truly representative of Indian English.

Table No. 2 - Details of number of texts drawn from different newspapers.

No.	Name of the newspaper (National newspapers)	Number of texts drawn
		daily	Sunday/ weekly	Total
1	The Hindu, Madras	9	2	11
2	Economic Times, New Delhi	1	2	3
3	The Statesman, Calcutta	5	1	6
4	The Hindustan Times, New Delhi	4	4	8
5	The Times of India, Bombay	11	8	19
6	The Indian Express (various editions)	3	3	6
Sub-totals		33	20	53

(Regional newspaper)
1	Business Standard, Calcutta	3	-	3
2	Deccan Herald Bangalore	2	1	3
3	The Tribune, Chandigarh	3	1	4
4	National Herald, Lucknow	1	1.5	2.5
5	Searchlight, Patna	1	1	2
6	The Assam Tribune, Gauhati	2	-	2
7	Amrita Bazar Patrika, Calcutta	1	-	1
8	Deccan Chronicle, Secunderabad	3	-	3
9	The Western Times, Ahmedabad	1	-	1
10	Madhya Pradesh Chronicle, Bhopal	2	-	2
11	Nagpur Times, Nagpur	2	0.5	2.5
12	Navhind Times, Panaji	1	0.5	1.5
13	Northern India Patrika, Allahabad	-	1.5	1.5
14	Poona Herald, Poona	3	-	3
15	Blitz Weekly, Bombay	3	-	3
	Sub-totals	28	7	35
	Grand Totals	61	27	88

Table 3. Categories A-C: The American, British and Indian corpora compared

	American corpus			British corpus								Indian corpus
				National			Provincial					National			Regional
A. Press: Reportage	daily	weekly		daily	Sunday		daily		weekly 1)			daily		Sunday	daily		Sunday
Political	10	4	14	6	2		5		-		13	6		2	5		-	13
Sports	5	2	7	2	2		2		1		7	2		2	2		1	7
Society	3	-	3	2	-		-		1		3	3		3	2		2	10 2)
Spot news	7	2	9	4	1		4		1		10	4		1	4		1	10
Financial	3	1	4	2	1		1		-		4	2		1	1		-	4
Cultural	5	2	7	3	1		2		1		7							-
Total			44	Total							44	Total						44

B. Press: Editorial
Institutional	7	3	10	4		2		3		1	10	6	3		6	1		16
Personal	7	3	10	4		2		3		1	10	2	1		-	1		4
Letters to the editor	5	2	7	3		1		2		1	7	3	1		2	1		7
Total			27	Total							27	Total						27

C. Press: reviews	14	3	17	6		5 3)		2		1	17	5	9		2	1		17
Total			17	Total							17	Total						17

1) Including Provincial Sunday

2) Including "cultural"

3) The Times Literary Supplement and The Times Educational Supplement

Table 4- Categories D-J: The American, British and Indian Corpora compared

		American corpus	British corpus	Indian corpus
D. Religion
	Books	7	9	8
	Periodicals	6	7	9
	Tracts	4	1	-
E. Skills, Trades and Hobbies
	Books	2	5	9
	Periodicals	34	33	29
F. Popular Lore
	Books	23	16	10
	Periodicals	25	28	34
G. Belles Lettres etc.
	Books	38	41	29
	Periodicals	37	36	41
H. Miscellaneous
	Govt. Documents	24	24	32
	Foundation Reports	2	2	-
	Industry Reports	2	2	2
	Univ. catalogue	1	1	1
	Ind. House Organ	1	1	2
J. Learned
	Natural Sciences	12	12	12
	Medicine	5	5	5
	Mathematics	4	4	4
	Soc. Sciences	14	14	14
	Pol. Science, Law, Education	15	15	15
	Humanities	18	18	18
	Technology and Engineering	12	12	12

Table 5 - Categories K-R: The American, British and Indian corpora compared

		American corpus	British corpus	Indian corpus
K. General Fiction
	Novels	20	20	12
	Short stories	9	9	46
L. Mystery and Detective Fiction
	Novels	20	21	9
	Short stories	4	3	15
M. Science Fiction
	Novels	3	3	-
	Short stories	3	3	2
N. Adventure and Western
	Novels	15	15	4
	Short stories	14	14	11
P. Romance and Love Story
	Novels	14	16	3
	Short stories	15	13	15
R. Humour
	Novels	3	3	-
	Short stories	-	-	5
	Essays, etc.	6	6	3
	Books	-	-	1

Coding Key:

Alphanumeric characters represent themselves. In the case of alphabet symbols, the letter represented is lower case unless otherwise specified:

A = a	B = b	C = c	Etc.
1 = 1	2 = 2	3 = 3	Etc.

Letter preceded by * = the same letter (word initial Capital)

Letter preceded by ^ = the same letter (sentence initial Capital)

NB - both ^ and * are used when a sentence initial capital coincides with word initial capital.

*A = A Word initial only

^B = B Sentence initial only

^*J = J Word and sentence initial at the same time e.g. as in: John was blind.
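The capitalisation rules above lend themselves to a small decoder. The Python sketch below is illustrative only: it assumes that the sentence-initial marker is the `^` character seen in the sample print-out, that `^` may be separated from its letter by a space (as in that print-out), and that `*` acts as a capital marker only when it stands directly before a letter. The function name decode_case is hypothetical, and all compound codes are passed through untouched.

```python
def decode_case(coded):
    """Restore capitalisation from `*` and `^` markers.

    Letters are lowered unless a pending marker asks for a capital.
    Any `*` that is not directly followed by a letter is assumed to
    belong to a compound code and is copied through unchanged.
    """
    out = []
    capitalise_next = False
    for i, ch in enumerate(coded):
        if ch == "^":
            # Sentence-initial marker: capitalise the next letter found.
            capitalise_next = True
        elif ch == "*" and i + 1 < len(coded) and coded[i + 1].isalpha():
            # Word-initial marker standing directly before a letter.
            capitalise_next = True
        elif ch.isalpha():
            out.append(ch.upper() if capitalise_next else ch.lower())
            capitalise_next = False
        else:
            out.append(ch)
    return "".join(out)
```

Under these assumptions decode_case("^*JOHN WAS BLIND.") gives "John was blind.", matching the example in the key.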

Other Characters:

* is reserved as a prefix for a compound coding symbol. When not preceded by *, all other characters represent themselves except for £, $, ^, ¬

that is to say:

1	. = .	Full stop
2	: = :	Colon
3	; = ;	Semi-colon
4	, = ,	Comma
5	" = "	Double quotes: begin and end quotes are distinguished by a space before and after respectively
6	‘ = ‘	Single quotes (but not apostrophe) begin and end quotes are distinguished by a space before and after respectively
7	? = ?	Question-mark
8	! = !	Mark of exclamation
9	- = -	Minus when separated by spaces on either side, hyphen when not so separated.
10	-- = --	Dash
11	% = %	Per cent
12	& = &	(and)
13	( = (	Left brace
14	) = )	Right brace
15	+ = +	Plus
16	/ = /	Slash, oblique
17	[ = [	Left bracket
18	] = ]	Right bracket
19	@ = @	At ( the rate of)
20	= = =	Equals
21	Space	= space
22	> = >
23	< = <
24	x = x	into (represents ‘multiplied by’ when separated by spaces)
25	¬ =	grammatically marked (always follows the word)
26	^ =	sentence initial capital
27	* =	Word initial capital. (N.B. both ^ and * occur when word initial capital coincides with sentence initial capital)
28	£ =	begin non English word
29	$ =	new paragraph or new line

Compound Coding

*’		=		apostrophe
**<		=		begin major heading
**>		=		end major heading
*<		=		begin minor heading
*>		=		end minor heading
*@		=		IBM ASCII 248 (degree symbol)
*+		=		£ (pound)
*-		=		$ (dollar)
*/		=		* (asterisk)
*#		=		end of corpus text
**#		=		end of corpus
*?		=		uncoded character (see below)
**[		=		begin comment tag
**]		=		end comment tag
*=		=		upper case Roman numeral
**=		=		lower case Roman numeral
*;		=		begin subscript
**;		=		end subscript
*:		=		begin superscript
**:	=		end superscript
*(	=		begin hybrid word/expression
*)	=		end hybrid expression
\0	=		abbreviations. Sequence of abbreviations or initials are enclosed in (0 )
*¬	=		Included sentence
*%	=		%o (per thousand)

Type-shift

*0	=	end typeshift
*1	=	begin typeshift for citation
*2	=	begin capitalization
*3	=	begin typeshift for highlighting emphasis (including italics)
*4	=	begin Indian Language word
*5	=	begin Indian Language expression or passage
*6	=	end Indian Language expression or passage
*7	=	begin foreign word
*8	=	begin foreign expression
*9	=	end foreign expression
*?0	=	. (dot) under the preceding character (to indicate retroflex)
*?00	=	. (dot) on the preceding character (to indicate retroflex)
*?1	=	-(macron) on the preceding character (to indicate vowel length)
*?2	=	´ (acute accent) on the preceding character (to indicate vowel length)
*?3	=	‘ (grave accent) on the preceding character (to indicate vowel length)
*?4	=	~ (tilde) on the preceding character (to indicate vowel length)

(NB: The code *[ applies to languages that use the Roman alphabet. The following apply to all foreign languages irrespective of what script they use.)

Foreign language material in non-Roman alphabet transcribed in Roman script

*[1	=	begin Assamese material
*[2	=	begin Bengali material
*[3	=	begin Gujarati material
*[4	=	begin Hindi material
*[5	=	begin Kannada material
*[6	=	begin Kashmiri material
*[7	=	begin Malayalam material
*[8	=	begin Marathi material
*[9	=	begin Oriya material
*[10	=	begin Punjabi material
*[11	=	begin Sanskrit material
*[12	=	begin Sindhi material
*[13	=	begin Tamil material
*[14	=	begin Telugu material
*[15	=	begin Urdu material
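The language tags listed above can be turned into a lookup table. The following Python sketch is a hedged illustration: the names LANGUAGE_CODES and languages_in are hypothetical, the pattern assumes that a tag is written as `*[` followed immediately by one or two digits, and a preceding `*` is taken to signal a comment tag (`**[`) rather than a language tag. A tag number followed by text that itself begins with a digit would be ambiguous and is not handled.

```python
import re

# Tag numbers and languages as listed in the coding key.
LANGUAGE_CODES = {
    1: "Assamese", 2: "Bengali", 3: "Gujarati", 4: "Hindi",
    5: "Kannada", 6: "Kashmiri", 7: "Malayalam", 8: "Marathi",
    9: "Oriya", 10: "Punjabi", 11: "Sanskrit", 12: "Sindhi",
    13: "Tamil", 14: "Telugu", 15: "Urdu",
}

# `*[` plus one or two digits, not preceded by another `*`.
LANGUAGE_TAG = re.compile(r"(?<!\*)\*\[(\d{1,2})")


def languages_in(coded):
    """List the languages named by the `*[n` tags found in a coded string."""
    found = []
    for match in LANGUAGE_TAG.finditer(coded):
        number = int(match.group(1))
        if number in LANGUAGE_CODES:
            found.append(LANGUAGE_CODES[number])
    return found
```

Such a helper could be used to count how often material from each Indian language occurs in a text category.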

Interpretive Codes:

I Apostrophe *’

*’	=	apostrophe for possessive e.g. John’s book = JOHN*’S BOOK. Students’ Union = STUDENTS*’ UNION
*'1	=	Contracted form of is e.g. John's coming = JOHN*'1S COMING
*'2	=	Contracted form of has e.g. John's been ill = JOHN*'2S BEEN ILL
*'	=	apostrophe for contracted form of had e.g. He'd done well = HE*'D DONE WELL
*'1	=	Contracted form of would e.g. He'd stand for hours = HE*'1D STAND FOR HOURS.
*'3	=	Other contractions e.g. d'Estaing = D*'3ESTAING 'Tis a pity = *'3TIS A PITY
*'4	=	Contraction of not e.g. don't = DON*'4T
*'5	=	abbreviation for minutes (degree & mins) e.g. 4° 30' = 4*@ 30*'5
*'6	=	abbreviation for foot/feet e.g. 3' = 3*'6
*'7	=	notation for glottal fricative

II Left arrow (¬ )

(1) To as preposition is unmarked

To as infinitive marker is marked with a left arrow e.g. to go = TO ¬ GO

(2) That as a subordinating conjunction is unmarked; that as any other is marked e.g. that day = THAT DAY;

The man that you spoke of = THE MAN THAT ¬ YOU SPOKE OF

Greater than that of the other = GREATER THAN THAT ¬ OF THE OTHER

III Double Quotes ( " )

*" = inches e.g. 6’ 3" = 6*’6 3*"

Greek letters are marked by a preceding **Y for lower case and **Z for upper case characters and represented by Roman alphabets as follows:

Other notations

Mathematical symbols

**MN = mathematical notation e.g. S_{1 ,}T1

**MS = mathematical symbol e.g. =, ±, ≠, →

**MF = mathematical conventional figure 10⁶, 8^-5

**ME = mathematical equation

The material and its organization.

1. The text of a sample starts with the first sentence or the first section on the first page of the sampled text in the case of books, and at the beginning of the article in the case of samples drawn from periodicals, journals, newspapers etc., and ends with the sentence containing the 2,000th word.

2. Each corpus text is headed by a line in which the text number is indicated enclosed in comment tags (e.g. **[TXT.A01**] ) and is followed by a line again enclosed in comment tags indicating the number of words in that text (e.g. **[ No. of words = 02008**] ).

3. Headings are coded and included in the texts. The title of the book is often included in the texts; but there are some inconsistencies in this regard as pre-editing was handled by various persons.

4. Sentences used as "tantalisers" etc. are also included with blanket comment tags **[BEGIN LEADER COMMENT**] and **[END LEADER COMMENT**]. In this also there are some inconsistencies.

5. Extra textual material such as maps, charts, diagrams, tables etc. is excluded and represented by descriptive tags.

6. As a rule footnotes are excluded and are represented by a descriptive comment tag **[FOOT NOTE**]. But in the case of texts in which the footnotes were even longer than the body of the text, they have been included.

7. Long foreign quotations and poetry quotations are excluded and represented by descriptive comment tags.

8. Mathematical equations, long formulas etc. are excluded and represented by mathematical symbol codes (see coding key).

9. The text categories are included in the order listed in the tables above.

10. The texts are preserved in card image form (80-character). The first 72 characters of each line contain the text of the sample and the last 7 characters indicate the unique location number. The 73rd character is always a blank.

11. One or more blank spaces (sometimes a whole blank line) separate two words. A word is orthographically defined as a character or sequence of characters surrounded by blank spaces. As in the first version of the Brown Corpus, words are often broken at the end of a line (i.e. at the 72nd character). A word thus broken is continued starting in the first column of the next line. If a word ends at the 72nd column, the first column of the next line is left blank.
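The layout described in points 10 and 11 can be illustrated with a minimal Python sketch. It assumes each record is available as a string of up to 80 characters with no embedded line terminator; the function names (split_card, running_text, words) are illustrative and are not part of the corpus distribution.

```python
TEXT_WIDTH = 72     # columns 1-72 hold the corpus text
RECORD_WIDTH = 80   # full card image


def split_card(record):
    """Split one card image into (text, location_number)."""
    padded = record.rstrip("\r\n").ljust(RECORD_WIDTH)
    text = padded[:TEXT_WIDTH]           # columns 1-72
    location = padded[73:RECORD_WIDTH]   # columns 74-80; column 73 is blank
    return text, location


def running_text(records):
    """Concatenate the 72-column text fields without inserting anything.

    A word broken at column 72 rejoins automatically, and a word that ends
    exactly at column 72 stays separate because the next line then starts
    with a blank.
    """
    return "".join(split_card(r)[0] for r in records)


def words(records):
    """Orthographic words: character sequences delimited by blanks."""
    return running_text(records).split()
```

Because the text fields are joined with nothing in between, no special handling of line ends is needed; splitting on blanks then yields words in the orthographic sense defined in point 11.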

A sample portion of a text as it appears in the print-out is reproduced below.

Sample of print-out

[TXT. J52]
*<3NATURE, MAN ; AND GOD IN THE 4VEDASO*>$<31. *2 THE PROBLEM	0010J52
OF CAUSATIONO> $+3*2^ MAN IS+0 MOST CONCERNED WITH HIS ENVIRONMENT	0020J52
; THE WORLD IN SPACE AND TIME. ^ HENCE, IT IS NATURAL THAT WHEN HE BE	0030J52
COMES REFLECTIVE, HE WANTS TO_ UNDERSTAND THE NATURE OF THIS WORLD. ^ T	0040J52
HE PHYSICAL WORLD SEEMS TO HIM THE PART AND PARCEL OF HIS LIFE. ^ WHE	0050J52
N HE TRIES TO_ UNDERSTAND THE NATURE OF THE PHYSICAL WORLD, THE QUE	0060J52
STIONS THAT COME UP ARE _ _WHO HAS CREATED THIS WORLD; WHAT ARE THE	0070J52
CONSTITUENT ELEMENTS OUT OF WHICH IT IS CREATED AND HOW IT IS CREAT	0080J52
ED? ^ IN OTHER WORDS, WE WANT TO_ KNOW ITS EFFICIENT CAUSE, THE MATE	0090J52
RIAL CAUSE, AND THE PROCESS OF CREATION. $^ THUS THE PROBLEM OF CAUSA	0100J52
TION IS THE PRIMARY QUESTION IN THE UNDERSTANDING OF THE PHYSICAL WO	0110J52
RLD_ _ OR WHAT WE CALL NATURE. ^ THE 4*VEDAS, AS IS KNOWN, ARE MORE	0120J52
POETIC IN THEIR CONTENT THAN LOGICAL. ^ STILL ONE CAN TRACE CERTAIN I	0130J52
MPORTANT IDEAS REGARDING CAUSATION BEHIND THE POETIC IMAGINATIONS. $	0140J52
^{^} THE PRINCIPLE OF CAUTION IN THE 4VEDAS, THE EARLIEST LITERATURE	0150J52
OF THE HINDUS, SEEMS TO_ APPEAR IN THE CONCEPT OF 4RTA. 4^ *RTA REP	0160J52
RESENTS THE LAW, UNITY OR RIGHTNESS, UNDERLYING THE ORDERLINESS WE O	0170J52
BSERVE IN THE WORLD.4^ RTA , LITERALLY MEANS THE ´COURSE OF THINGS´	0180J52
^ THIS CONCEPTION SEEMS TO_ HAVE BEEN ORIGINALLY DERIVED FROM THE	0190J52
REGULARITY OF THE MOVEMENTS OF THE HEAVENLY BODIES LIKE THE SUN, THE	0200J52
E MOON, AND THE STARS, THE ALTERNATIONS OF DAY AND NIGHT AND OF THE	0210J52
SEASONS. $^ IN THE 4VEDAS, THERE ARE NO HYMNS ADDRESSED SPECIFICAL	0220J52
LY TO 4RTA, BUT BRIEF REFERENCES TO THE IMPORTANT CONCEPTS ARE FOU	0230J52
ND REPEATEDLY IN THE HYMNS TO 4VARUNA (WHO MANTAINS THE PHYSICAL O	0240J52
RDER), 4AGNI, 4VISVEDEVAS \OETC. ^ THE FOLLOWING HYMN WILL ILLUST	0250J52
RATE THE POINT: [VERSE] $^ GRADUALLY THE CONCEPT OF ± 4± RTA TAKES	0260J52
A NEW MEANING FROM EXTERNAL PHYSICAL ORDER OR UNIFORMITY OF NATURE_	0270J52
_ IT ACQUIRES THE SIGNIFICANCE OF A MORAL ORDER. ^ THE WHOLE WORLD WA	0280J52

Basic Technical Information.

The corpus is available at cost to bona fide researchers in India from the Department of English, Shivaji University, Kolhapur, and is being made available to bona fide researchers outside India through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway. The material is available on magnetic tape in the following format:

1. The tape has no label.

2. There are 24 files on the tape containing the entire material as shown below:

Sr. No. of the file	Texts contained	Sr. No. of the file	Texts contained
1	A01 – A22	13	H01 – H20
2	A23 – A44	14	H21 – H37
3	B01 – B27	15	J01 – J27
4	C01 – C17	16	J28 – J54
5	D01 – D17	17	J55 – J80
6	E01 – E20	18	K01 – K29
7	E21 – E38	19	K30 – K58
8	F01 – F22	20	L01 – L24
9	F23 – F44	21	M01 – M02
10	G01 – G25	22	N01 – N15
11	G26 – G50	23	P01 – P18
12	G51 – G70	24	R01 – R09
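For locating a given text on the tape, the table can be transcribed into a small lookup, as in the Python sketch below. The names TAPE_FILES and tape_file_for are illustrative only; the ranges are copied from the table and add up to 500 texts.

```python
# (file number, category letter, first text, last text), copied from the table
TAPE_FILES = [
    (1, "A", 1, 22), (2, "A", 23, 44), (3, "B", 1, 27), (4, "C", 1, 17),
    (5, "D", 1, 17), (6, "E", 1, 20), (7, "E", 21, 38), (8, "F", 1, 22),
    (9, "F", 23, 44), (10, "G", 1, 25), (11, "G", 26, 50), (12, "G", 51, 70),
    (13, "H", 1, 20), (14, "H", 21, 37), (15, "J", 1, 27), (16, "J", 28, 54),
    (17, "J", 55, 80), (18, "K", 1, 29), (19, "K", 30, 58), (20, "L", 1, 24),
    (21, "M", 1, 2), (22, "N", 1, 15), (23, "P", 1, 18), (24, "R", 1, 9),
]


def tape_file_for(sample_code):
    """Return the serial number of the tape file holding a text such as 'J52'."""
    category = sample_code[0].upper()
    number = int(sample_code[1:])
    for file_no, cat, first, last in TAPE_FILES:
        if cat == category and first <= number <= last:
            return file_no
    raise ValueError("no tape file contains text " + sample_code)
```

For example, tape_file_for("J52") gives 16, the file that holds J28 to J54.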

3. Within each file, texts are separated from one another by a blank record.

4. The text is divided into 80-character lines, as follows:

a) Text: 72 characters

b) Location number: 8 characters, made up of

1 space (character No. 73)

4 characters line number, e.g. 0010

3 characters sample code, e.g. A01

5. Each tape record (block) contains 10 lines, except that the last block in each file may contain fewer than 10.

6. The character code used is ASCII and the material is recorded in 9-track, 1600 fpi density.
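As an illustration of points 4 and 5, the following Python sketch splits one tape block into its 80-character lines and decodes the location number. It assumes that a block has already been read into memory as bytes and that lines are stored back to back with no terminators between them; neither detail is stated above. The function names are illustrative.

```python
LINE_WIDTH = 80        # characters per line
LINES_PER_BLOCK = 10   # a full block is therefore 800 characters


def lines_in_block(block):
    """Split one ASCII tape block (bytes) into its 80-character lines."""
    text = block.decode("ascii")
    if len(text) % LINE_WIDTH != 0:
        raise ValueError("block length is not a multiple of 80")
    if len(text) > LINE_WIDTH * LINES_PER_BLOCK:
        raise ValueError("block holds more than 10 lines")
    return [text[i:i + LINE_WIDTH] for i in range(0, len(text), LINE_WIDTH)]


def parse_location(line):
    """Return (line_number, sample_code), or None for a blank record."""
    field = line[72:80]          # 1 blank + 4-character number + 3-character code
    if not field.strip():
        return None              # blank record separating two texts
    return int(field[1:5]), field[5:8]
```

A short final block is handled by the same splitting step, since only the number of complete lines differs.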

Note on copyright.

Permission to use materials under copyright was sought through a form letter sent under certificate of posting. We are glad to say that most copyright holders responded promptly. Reminders were sent to those who had not responded for over three months; these gave them the option of not answering the letter if they had no objection to our using the materials.

Individual acknowledgements are made in the notes on text extracts. In the case of those who have not responded so far no such acknowledgement appears.

References

Bansal, R. K. 1969. The intelligibility of Indian English. Monograph No. 4, CIEFL, Hyderabad.

Desai, S. K. 1974. Experimentation with language in Indian Writing in English (Fiction). Monograph of the Dept. of English, Shivaji University, Kolhapur.

Francis, W. N. and Henry Kucera. 1964. Rev. 1979. Manual of information to accompany A standard Corpus of present-day edited American English. Dept. of Linguistics, Brown University, Providence, R.I.

Johansson, Stig, G. N. Leech and Helen Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen corpus of British English. Dept. of English, University of Oslo, Oslo.

Kachru, B. B. 1961. An analysis of some features of Indian English: A study in linguistic method. Unpublished Edinburgh thesis.

----------- 1965. The Indianness in Indian English. Word 21: 391-410.

----------- 1975. Lexical innovations in South-East Asia. International Journal of the Sociology of Language. Vol. 4. Mouton, The Hague.

----------- 1979. The new Englishes and old models. Newsletter, January 1979, CIEFL, Hyderabad.

----------- 1981. The pragmatics of non-native varieties of English. In Smith, Larry (ed.) English for cross-cultural communication. Macmillan, 15-39.

Nihalani, Paroo, R. K. Tongue and Priya Hosali. 1979. Indian and British English: A handbook of usage and pronunciation. O.U.P., New Delhi.

Shastri, S. V. 1978. English word-meanings and their American and Indian variants – a study of six lexical items. Unpublished Lancaster University thesis.

ADDENDA

1. Coding key:

Some details of the coding key described in this manual are irrelevant to users of the corpus in the form in which it is now being made available. The original version used a 64-character set, i.e. the text was in all capitals; the present version uses a 96-character set, i.e. the text is in upper and lower case. Hence the asterisk (*) used as a code to indicate word-initial capitals throughout the text is irrelevant. Similarly, the code for all capitals (*2….0) is irrelevant, as capitals now appear as capitals and lower case letters as lower case letters. However, the code for the sentence-initial capital (^) has been retained, though it is redundant. The asterisk as a word-initial code has also been retained when a word with an initial capital occurs at the beginning of a sentence. This has been done to facilitate machine processing of the corpus texts.

2. The material and its organisation:

The record length in the present version of the corpus is 100 characters and not 80 as in the original version. The unique location number of seven characters has been shifted to the beginning of the line, and the text begins at the 9th character of each line. Only one blank space separates two words, except that any number of blank spaces may occur at the end of a line.
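A minimal Python sketch of reading a record in the revised layout follows. It assumes that the 8th character is a single separator (it shows as a tab in the print-out sample in these addenda) and that the location number keeps its make-up of a 4-character line number followed by a 3-character sample code. The function name split_record is illustrative.

```python
RECORD_WIDTH = 100   # record length in the present version


def split_record(record):
    """Return (line_number, sample_code, text) for one 100-character record."""
    padded = record.rstrip("\r\n").ljust(RECORD_WIDTH)
    location = padded[:7]                    # characters 1-7
    text = padded[8:RECORD_WIDTH].rstrip()   # text starts at character 9
    return location[:4], location[4:7], text
```

Trailing blanks are stripped from the text field, since any number of them may occur at the end of a line.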

3. Basic technical information:

The file organization remains unchanged, but each line contains 100 characters and each block contains 80 lines, except that the last block in any file may contain fewer than 80 lines. The rest of the technical details remain unchanged.

The coding for Greek letters stands modified as follows:

*Y to mark lower case letters and *Z to mark upper case letters of the Greek alphabet.

The coding for mathematical symbols also stands modified as follows:

*Mn = mathematical notation

*Ms = mathematical symbol

*Mf = conventional mathematical figure

*Me = mathematical equation.
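For users working from the original coding key, the correspondence between the old and the modified codes can be written as a small mapping. The Python sketch below is only an illustration and not the procedure actually used to convert the corpus. It assumes that the two asterisks are always followed immediately by the code letters, and it may wrongly match other sequences that happen to begin the same way.

```python
import re

# Original code -> modified code, as listed in the coding key and these addenda
OLD_TO_NEW = {
    "**Y": "*Y", "**Z": "*Z",
    "**MN": "*Mn", "**MS": "*Ms", "**MF": "*Mf", "**ME": "*Me",
}

# Two asterisks followed by one of the known code letters
_OLD_CODE = re.compile(r"\*\*(?:M[NSFE]|[YZ])")


def modernise_codes(text):
    """Rewrite original Greek and mathematical codes in their modified form."""
    return _OLD_CODE.sub(lambda m: OLD_TO_NEW[m.group(0)], text)
```

Comment tags such as **[TXT.A01**] are left untouched, because the pattern requires one of the listed code letters directly after the asterisks.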

	[ text. a03 ]
0010A03	*<3 Creeping Detente In Africa*> $"^ DETENTE, " said \0 Dr Bruno
0020A03	Kreiski, Chancellor of Austria (which alongwith Switzerland and
0030A03	Sweden is one of the three official neutral States in Europe ) in his
0040A03	address to the Royal Institute of International Affairs in London
0050A03	on July 4, "is not the consequence of sublime human insight but simply
0060A03	a result of a state of military balance". ^ This realistic definition
0070A03	of a state of relationship between the Soviet bloc and Western Europe,
0080A03	\0US and Canada, which has been widely criticised in the West
0090A03	as tattered by developments in Africa ( and Afghanistan and South Yemen )
0100A03	explains the about-turn in Western policy on Angola that_ appears
0110A03	to_ be taken place now quietly and even secretively. $^ It was just a
0120A03	month ago, following the incursion of Katangan exiles to the mineral rich
0130A03	Shaba province of Zaire and the massacre of whites in Kolwezi,
0140A03	that Western Europe, backed by the \0US, were planning the establishment
0150A03	of a pan-African force ( armed and founded by the West ) to_ protect
0160A03	states threatened by "Soviet-Cuban" ventures. ^ President d*’3Estaing
0170A03	of France, after his French Legionnaires repelled the Katangans and
0180A03	rescued the surviving whites in Kolwezi was hailed as "the \Gendarme
0181A03	of
0190A03	Africa." The \0US later supplied transport planes to_ ferry the units
0200A03	formed from Morocco, Senghor and some other former French colonies
0210A03	to the Shaba Province. ^ Meetings were held in Brussels at which the western
0220A03	countries considered how to _ strengthen the economy and the security
0230A03	forces of President Mobuto. It appeared that the detente was to_ give
0240A03	place to an east-west confrontation in Africa; that the Western hawks
0250A03	were prevailing over the doves among them the British Prime Minister
0260A03	Callaghan and some of his OEEC colleagues notably Holland and
0270A03	Denmark. $<&3Grave Concern*>$^ The developing situation today projects
0280A03	a completely different picture and is generating grave concern to

N.B. The £ symbol appears as \ (backslash) in this printout.

FOOTNOTES

¹Kachru's samples are drawn almost entirely from "creative writings", although he has his own reasons for doing so, while Nihalani et al. is based on "available" samples.

²The idea of compiling a parallel corpus of Indian English suggested itself when the present author was doing a comparative study of some lexical items in American, British and Indian English in Lancaster in 1977. He used the Brown and the LOB Corpora for the American and British English meanings, but had to make do with available samples for the Indian English meanings (Shastri, 1978).

³Personal Communication.

⁴We are thankful to the Director, National Library, Calcutta and particularly to Dr. M. N. Nagraja, Dy. Librarian, Miss Anima Das of the Processing Section, and M/s V. Kotnala and A. B. Roy of the Reprography Division for their assistance in carrying out this job.

⁵ We are thankful to Miss Chitra Mallik for assistance in carrying out this job.

⁶We are thankful to the Librarian and particularly to Mr. Vaitee for agreeing to hand over these issues to us at a cost.

LIST OF TEXT EXTRACTS

A B C D E F G H J K L M N P R
