Kristine Hasund & Gisle Andersen
Dept. of English, University of Bergen, Sydnesplass 7, N-5007 Bergen, Norway
KEYWORD: TACTweb search program
AFFILIATION: Dept. of English, University of Bergen
E-MAIL: gisle.andersen@eng.uib.no
FAX NUMBER: +47 55 58 94 55
PHONE NUMBER: +47 55 58 31 50
The Bergen Corpus of London Teenage Language (COLT) is the first large English corpus focusing on the speech of teenagers. It was collected in 1993 and consists of the spoken language of 13- to 17-year-old boys and girls from different boroughs of London.
The aim of the COLT project is to compile a 500,000-word corpus of spoken teenage language and to make it available to students of English at the University of Bergen, as well as to language researchers world-wide.
This poster presents the use of TACTweb on the COLT corpus. TACTweb, which connects the text-retrieval program TACT to the World Wide Web, enables the user to search in a database of spoken conversations for the location of words, word combinations and word formation patterns. In the COLT database, TACTweb is applied to give the distribution of an item in relation to certain non-linguistic variables.
Searches in the corpus are made possible through the indexing of the texts in the database. The COLT database has the following indices:
Four different types of display systems are available for searches in the corpus:
KWIC - Key Words In Context
A KWIC display lists all the occurrences of a word with one line of context. Figure 1 shows an example: the occurrences of the word "Peter".
The number in parentheses in the top line shows the total number of occurrences of "Peter" in the entire corpus. The numbers at the front of each line give the reference number, and then the turn number where the word can be found. The target word appears in the middle of the line. Clicking on the target word shows the full text, which allows a closer study of each occurrence.
The KWIC display allows the user to quickly browse a large number of occurrences to see how a particular word is used, or to search for a word which has many occurrences.
peter (50)
B132403 i=12   Without th= without the stupid Peter interrupts | | |w1
B132404 i=6    Our registrations are funny [Peter!] | | |w13 [Yeah]
B132404 i=13   | | |w1 No. | | |w14 Don't Peter, it hurts. | | |w12
B132407 i=2    way! Aargh! | | |w4 Come on Peter let's go and get on
B132503 i=7    <unclear>. | | |w1 Oh. Is that Peter? Oh! He's got school?
Figure 1
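The KWIC layout described above can be approximated in a few lines of Python. This is only an illustrative sketch, not TACTweb's own implementation: the function name `kwic`, the `(reference, turn, text)` tuple layout and the `width` parameter are all assumptions made for the example, and the two sample turns are taken from Figure 1.

```python
import re

def kwic(turns, target, width=30):
    """Return one formatted line per occurrence of `target`.

    `turns` is a list of (reference, turn_number, text) tuples. This layout
    is a hypothetical stand-in for the indexed COLT texts.
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    lines = []
    for ref, turn, text in turns:
        for m in pattern.finditer(text):
            # Up to `width` characters of context on either side of the hit.
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            # Right-align the left context so the target words line up.
            lines.append(f"{ref} i={turn:<4} {left:>{width}}{m.group()}{right}")
    return lines

sample = [
    ("B132404", 13, "Don't Peter, it hurts."),
    ("B132407", 2, "Come on Peter let's go and get on"),
]
for line in kwic(sample, "peter"):
    print(line)
```

Padding the left context to a fixed width is what places the target word in the same column on every line, so that a long list of occurrences can be scanned quickly.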
Variable Context Display
Whereas the KWIC display gives only one
line of context, the Variable Context Display allows the user to control the
amount of context in which a word is to be displayed. For example, one can ask
for the word "Peter" to be displayed in a context of 3 lines before
and 3 lines after the occurrence:
... I'm not giving you half! ... See you should be taping everything, that's what I am. Let's just really bore them! ... Peter, I think you're the most wonderful boy in the whole entire world.
----------------------------------------------------------
B132901 i=333
|w1 Here, i= i= this one, they all the same. They gave us all the same thing. It's Peter, er Grace, big Anthony, some other people, oh yeah, Josie's doing it, er who else, what's the I dunno the other girls'
----------------------------------------------------------
B133101 i=27
Distribution
This display allows the user to search for the occurrence of a word as it is distributed across the variables speaker identity, age, gender, socio-economic group, location, setting, occupation, and number of participants. Here is an example of how the word "shit" is distributed according to age:
10 |  2|*
11 |  2|*
12 | 14|*******
13 | 65|*********************************
14 | 75|**************************************
15 | 81|*****************************************
16 | 92|**********************************************
17 | 33|*****************
18 |  9|*****
19 |  2|*
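A distribution display of this kind amounts to counting the occurrences of a word per value of a variable and drawing one bar per value. The Python sketch below illustrates the idea under stated assumptions: the names `distribution` and `render`, the `(value, text)` input layout and the scale of one asterisk per two occurrences are choices made for the example, not documented TACTweb behaviour, and the sample utterances are invented.

```python
from collections import Counter
import math
import re

def distribution(turns, target):
    """Count occurrences of `target` per value of a speaker variable.

    `turns` is a list of (value, text) pairs, e.g. (speaker age, utterance).
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    counts = Counter()
    for value, text in turns:
        counts[value] += len(pattern.findall(text))
    return counts

def render(counts, per_star=2):
    """Draw one text bar per value; `per_star` is an assumed scale."""
    return [f"{value} | {n:>2}|{'*' * math.ceil(n / per_star)}"
            for value, n in sorted(counts.items())]

# Invented sample turns, keyed by speaker age.
turns = [(14, "Oh shit! Shit."), (16, "shit"), (14, "no way")]
for row in render(distribution(turns, "shit")):
    print(row)
```

The same two functions would serve for any of the other variables (gender, location, setting and so on), since only the key attached to each turn changes.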
Word List
The Word List display gives a list of all the words that
match a particular pattern. For instance, it is possible to produce a list of
all words ending in a particular letter or sequence of letters. This is
particularly useful for a researcher who is interested in the productivity of
certain morphemes, such as -able:
unavailable (1)
unbelievable (4)
uncomfortable (2)
unfuckingtouchable (2)
unreliable (1)
unscrewable (1)
unsociable (1)
untouchable (1)
up-gradable (2)
vulnerable (4)
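A pattern-based word list such as the one above can be sketched as a frequency count over all word forms that end in a given string. The following Python example is a minimal illustration, not the TACTweb query mechanism: the function name `word_list`, the simple tokenising pattern and the sample sentence are assumptions made for the example.

```python
from collections import Counter
import re

def word_list(text, suffix):
    """Return (word, frequency) pairs for words ending in `suffix`, sorted
    alphabetically. Hyphenated forms are kept as single words."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w.endswith(suffix))
    return sorted(counts.items())

# Invented sample text.
sample_text = ("Totally unbelievable, that is unbelievable. "
               "He is so unreliable and it's up-gradable.")
for word, freq in word_list(sample_text, "able"):
    print(f"{word} ({freq})")
```

Note that a plain string match of this kind also returns words in which the ending is not a productive suffix (for example "table"), so the resulting list still needs to be inspected by the researcher.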
The purpose of the poster presentation is to demonstrate these and other facilities, focusing on TACTweb as a useful tool for the linguistic researcher. Moreover, an overview of ongoing research will be given.