| Developing Guidelines and Ensuring Consistency for Chinese Text Annotation | ||
| Xia Fei
(Department of Computer and Information Science, University of
Pennsylvania, Philadelphia, PA 19104, USA, fxia@linc.cis.upenn.edu) Palmer Martha (Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, mpalmer@linc.cis.upenn.edu) Xue Nianwen (Linguistics Department, University of Delaware, Newark, DE 19716, USA, xueniwen@UDel.Edu) Okurowski Mary Ellen (US Department of Defense, Ft. Meade, MD 20755, USA, meokuro@super.org) Kovarik John (US Department of Defense, Ft. Meade, MD 20755, USA, kovariks@worldnet.att.net) Chiou Fu-Dong (Linguistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA, chioufd@linc.cis.upenn.edu) Huang Shizhe (East Asian Studies Program, Haverford College, Haverford, PA 19041, USA, shuang@haverford.edu) Kroch Tony (Linguistics Department, University of Pennsylvania, Philadelphia, PA 19104, USA, kroch@linc.cis.upenn.edu) Marcus Mitch (Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, mitch@linc.cis.upenn.edu) |
||
| With growing interest in Chinese Language Processing, numerous NLP tools (e.g. word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and comparisons are therefore difficult. As a first step towards addressing this issue, we have been preparing a 100-thousand-word bracketed corpus since late 1998 and plan to release it to the public in summer 2000. In this paper, we address several challenges in building the corpus, namely creating annotation guidelines, ensuring annotation accuracy and maintaining a high level of community involvement. | ||
| Keywords: Annotation Guidelines, Bracketed Corpus (Treebank), Chinese Language Processing, Quality Control | ||
| LREC2000 Proceedings: Session WO1 - Corpus Tagging, pages 3-10 | ||
| Files: 287.ps, 287.pdf | ||
|
|
||
| Using Machine Learning Methods to Improve Quality of Tagged Corpora and Learning Models | ||
| Matsumoto Yuji
(Graduate School of Information Science, Nara Institute of Science and
Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan, matsu@is.aist-nara.ac.jp) Yamashita Tatsuo (Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan, tatuo-yg@is.aist-nara.ac.jp) |
||
| Corpus-based learning methods for natural language processing now provide a consistent way to achieve systems with good performance. A number of statistical learning models have been proposed and are used in most of the tasks that used to be handled by rule-based systems. Once learning systems reach a level competitive with manually constructed systems, both large-scale training corpora and good learning models are of great importance. In this paper, we first argue that the main hindrances to the improvement of corpus-based learning systems are inconsistencies or errors in the training corpus and defects in the learning model. We then show that some machine learning methods are useful for effectively identifying the sources of error in the training corpus. Finally, we discuss how the various types of errors should be handled so as to improve the learning environment. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO1 - Corpus Tagging, pages 11-16 | ||
| Files: 211.ps, 211.pdf | ||
|
|
||
| Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers | ||
| Zavrel Jakub (CNTS
/ Language Technology Group, University of Antwerp, Universiteitsplein
1, 2610 Wilrijk, Belgium, zavrel@uia.ua.ac.be) Daelemans Walter (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, daelem@uia.ua.ac.be) |
||
| This paper describes a new method, COMBI-BOOTSTRAP, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. COMBI-BOOTSTRAP uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that COMBI-BOOTSTRAP: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample. | ||
| Keywords: Combining Systems, Machine Learning, Reuse of Resources, Tagging | ||
| LREC2000 Proceedings: Session WO1 - Corpus Tagging, pages 17-20 | ||
| Files: 155.ps, 155.pdf | ||
|
|
||
| Something Borrowed, Something Blue: Rule-based Combination of POS Taggers | ||
| Borin Lars
(Department of Linguistics, Uppsala University, Box 527, SE–751 20
Uppsala, SWEDEN, Lars.Borin@ling.uu.se) |
||
| Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part-of-speech (POS) taggers can be combined to better cope with text material of a type on which they were not trained, and for which there are no readily available training corpora. We indicate, using freely available taggers for German (although the method we describe is not language-dependent), how such taggers can be combined by using linguistically motivated rules so that the tagging accuracy of the combination exceeds that of the best of the individual taggers. | ||
| Keywords: Knowledge-Rich NLP, Machine Learning, Multilingual Corpora, Parallel Corpora, POS Tagging | ||
| LREC2000 Proceedings: Session WO1 - Corpus Tagging, pages 21-26 | ||
| Files: 158.ps, 158.pdf | ||
|
|
||
| Determining the Tolerance of Text-handling Tasks for MT Output | ||
| White John
(Litton PRC, 1500 PRC Drive, McLean, Virginia, USA, white_john@prc.com) Doyon Jennifer (Litton PRC, 1500 PRC Drive, McLean, Virginia, USA, doyon_jennifer@prc.com) Talbott Susan (Litton PRC, 1500 PRC Drive, McLean, Virginia, USA, talbott_susan@prc.com) |
||
| With the explosion of the internet and access to increased amounts of information provided by international media, the need to process this abundance of information in an efficient and effective manner has become critical. The importance of machine translation (MT) in the stream of information processing has become apparent. With this new demand on the user community comes the need to assess an MT system before adding such a system to the user’s current suite of text-handling applications. The MT Functional Proficiency Scale project has developed a method for ranking the tolerance of a variety of information processing tasks to possibly poor MT output. This ranking allows for the prediction of an MT system’s usefulness for particular text-handling tasks. | ||
| Keywords: Evaluation, Exercise(s), Machine Translation, Text-Handling tasks, Topic(s) of Interest, Users | ||
| LREC2000 Proceedings: Session EO1 - Evaluation of Machine Translation, pages 29-32 | ||
| Files: 139.ps, 139.pdf | ||
|
|
||
| Evaluating Translation Quality as Input to Product Development | ||
| Bohan Niamh
(Sail Labs GmbH, Balanstr. 57, D-81541 Munchen, nbohan@ireland.com) Breidt Elisabeth (Sail Labs GmbH, Balanstr. 57, D-81541 Munchen, elisabeth.breidt@sail-labs.de) Volk Martin (Department of Computer Science, University of Zürich, Winterthurerstr. 190, CH-8057 Zürich, volk@ifi.unizh.ch) |
||
| In this paper we present a corpus-based method to evaluate the translation quality of machine translation (MT) systems. We start with a shallow analysis of a large corpus and gradually focus the attention on the translation problems. The method constitutes an efficient way to identify the most important grammatical and lexical weaknesses of an MT system and to guide development towards improved translation quality. The evaluation described in the paper was carried out as a cooperation between an MT technology developer, Sail Labs, and the Computational Linguistics group at the University of Zürich. | ||
| Keywords: Corpora, Evaluation, Machine Translation, Rating Scales, Translation Quality | ||
| LREC2000 Proceedings: Session EO1 - Evaluation of Machine Translation, pages 33-38 | ||
| Files: 136.ps, 136.pdf | ||
|
|
||
| An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research | ||
| Nießen Sonja
(Lehrstuhl für Informatik VI, RWTH Aachen – University of
Technology, D-52056 Aachen, Germany, niessen@informatik.rwth-aachen.de) Och Franz Josef (Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen, Germany, och@informatik.rwth-aachen.de) Leusch Gregor (Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen, Germany, och@informatik.rwth-aachen.de) Ney Hermann (Lehrstuhl für Informatik VI, RWTH Aachen – University of Technology, D-52056 Aachen, Germany, ney@informatik.rwth-aachen.de) |
||
| In this paper we present a tool for the evaluation of translation quality. First, the typical requirements of such a tool in the framework of machine translation (MT) research are discussed. We define evaluation criteria which are more adequate than pure edit distance and we describe how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using our tool and the corresponding graphical user interface. | ||
| Keywords: Evaluation, Machine Translation | ||
| LREC2000 Proceedings: Session EO1 - Evaluation of Machine Translation, pages 39-46 | ||
| Files: 278.ps, 278.pdf | ||
|
|
||
| Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium | ||
| Cieri Christopher
(Linguistic Data Consortium, University of Pennsylvania, Philadelphia,
Pennsylvania, USA, ccieri@ldc.upenn.edu) Liberman Mark (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, myl@ldc.upenn.edu) |
||
| The Linguistic Data Consortium (LDC) is a non-profit consortium of universities, companies and government research laboratories that supports education, research and technology development in language-related disciplines by collecting or creating, distributing and archiving language resources, including data and accompanying tools, standards and formats. LDC was founded in 1992 with a grant from the Defense Advanced Research Projects Agency (DARPA) to the University of Pennsylvania as host organization. LDC publication and distribution activities are self-supporting through membership fees and data sales, while new data creation is supported primarily by grants from DARPA and the National Science Foundation. Recent developments in the creation and use of language resources demand new roles for international data centers. Since our report at the last Language Resources and Evaluation Conference in Granada in 1998, LDC has observed growth in the demand for language resources along multiple dimensions: larger corpora with more sophisticated annotation in a wider variety of languages are used in an increasing number of language-related disciplines. There is also increased demand for reuse of existing corpora. Most significantly, small research groups are taking advantage of advances in microprocessor technology, data storage and internetworking to create their own corpora. This has led to the birth of new annotation practices whose very variety creates barriers to data sharing. This paper will describe recent LDC efforts to address emerging issues in the creation and distribution of language resources. | ||
| Keywords: Annotation, Data Centers, Data Collection and Distribution, Language Resources, Reuse, Standards and Tools | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 49-56 | ||
| Files: 209.ps, 209.pdf | ||
|
|
||
| The Establishment of Motorola's Human Language Data Resource Center: Addressing the Criticality of Language Resources in the Industrial Setting | ||
| Talley Jim
(Motorola Labs, Human Interface Laboratory, 7700 W. Parmer Ln., MD:
PL26; Austin, TX 78729; USA, James_Talley@email.mot.com) |
||
| Within the human language technology (HLT) field it is widely understood that the availability (and effective utilization) of voluminous, high quality language resources is both a critical need and a critical bottleneck in the advancement and deployment of cutting edge HLT applications. Recently formed (inter-)national human language resource (HLR) consortia (e.g., LDC, ELRA,...) have made great strides in addressing this challenge by distributing a rich array of pre-competitive HLRs. However, HLT application commercialization will continue to demand that HLRs specific to target products (and complementary to consortially available resources) be created. In recognition of the general criticality of HLRs, Motorola has recently formed the Human Language Data Resource Center (HLDRC) to streamline and leverage our HLR creation and utilization efforts. In this paper, we use the specific case of the Motorola HLDRC to help examine the goals and range of activities which fall into the purview of a company-internal HLR organization, look at ways in which such an organization differs from (and is similar to) HLR consortia, and explore some issues with respect to implementation of a wholly within-company HLR organization like the HLDRC. | ||
| Keywords: Corporate Usage of HLR, Formation of Company (Motorola) Organization, Industrial Human Language Resource Center | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 57-62 | ||
| Files: 260.ps, 260.pdf | ||
|
|
||
| A Platform for Dutch in Human Language Technologies | ||
| D'Halleweyn
Elisabeth (Nederlandse Taalunie, Postbus 10595, 2501 HN Den Haag,
The Netherlands, Edhalleweyn@ntu.nl, http://www.taalunie.org) Dewallef Erwin (Nederlandse Taalunie, c/o Ministry of the Flemish Community, Science and Innovation Administration, Boudewijnlaan 30, B-1000 Brussel, Belgium, Erwin.dewallef@wim.vlaanderen.be, http://www.innovatie.vlaanderen.be) Beeken Jeannine (Nederlandse Taalunie, Postbus 10595, 2501 HN Den Haag, The Netherlands, jeannine.beeken@skynet.be) |
||
| As ICT increasingly forms a part of our daily life, it becomes more and more important that all citizens can make use of their native languages in all communicative situations. For the development of successful applications and products for Dutch, basic provisions are required. The development of the basic material that is lacking is an expensive undertaking which exceeds the capacity of the individuals involved. Collaboration between the various agents (policy, knowledge infrastructure and industry) in the Netherlands and Flanders is required. The existence of the Dutch Language Union (Nederlandse Taalunie) facilitates this co-operation. The responsible ministers decided to set up a Dutch-Flemish platform for Dutch in Human Language Technologies. The purpose of the platform is the further construction of an adequate digital language infrastructure for Dutch, so that industry can develop the applications required to guarantee that citizens in Holland and Flanders can use their own language in their communication within the information society and that the Dutch language area remains a full player in a multilingual Europe. This paper will show some of the efforts that have been taken | ||
| Keywords: Binational Policies, Evaluation, Legal Issues, Maintenance, Organisational Issues, Priorities | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 63-66 | ||
| Files: 348.ps, 348.pdf | ||
|
|
||
| Recent Developments within the European Language Resources Association (ELRA) | ||
| Choukri Khalid
(European Language Resources Association (ELRA) & European Language
resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin,
75013 Paris France, choukri@elda.fr) Mance Audrey (European Language Resources Association (ELRA) & European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, mance@elda.fr) Mapelli Valérie (European Language Resources Association (ELRA) & European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, mapelli@elda.fr) |
||
| The main and most visible achievement of ELRA is the growth of its catalogue. The ELRA catalogue as of April 2000 lists 111 speech resources, 50 monolingual lexica, 113 multilingual lexica, 24 written corpora and 275 terminological databases. However, many Language Resources (LRs) still need to be identified and/or produced. To this end, ELRA is active in promoting and funding the co-production of new LRs through several calls for proposals. As for the value of ELRA as a distributor of language resources, the statistics from the past two years speak for themselves. The 1999 fiscal report showed a rise, with the sale of 217 LRs (122 for research and 95 for commercial purposes, with speech databases representing nearly 45%), compared to 180 LRs in 1998 (90 for research and 90 for commercial purposes, with speech databases representing nearly 65%) and 33 in 1997. The other visible action of ELRA is its membership drive: since its foundation, ELRA has attracted an increasing number of members (from 63 in 1995 to 95 in 1999). This article is updated from a paper presented at Eurospeech'99. | ||
| Keywords: Distribution, ELDA, ELRA, Information Dissemination, Language Resources, Legal Issues, Membership, Partnership, Surveys, Validation | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 67-72 | ||
| Files: 377.ps, 377.pdf | ||
|
|
||
| COCOSDA - a Progress Report | ||
| Campbell Nick (ATR
Spoken Language Translation Research Laboratories, Kyoto, Japan, nick@slt.atr.co.jp) |
||
| This paper presents a review of the activities of COCOSDA, the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output. COCOSDA has a history of innovative actions which spawn national and regional consortia for the co-operative development of speech corpora and for the promotion of research in related topics. COCOSDA has recently undergone a change of organisation in order to meet the developing needs of the speech- and language-processing technologies and this paper summarises those changes. | ||
| Keywords: Assessment Techniques, Speech Databases, Speech Input/Output, Standardisation | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 73-76 | ||
| Files: 364.ps, 364.pdf | ||
|
|
||
| Survey of Language Engineering Needs: a Language Resources Perspective | ||
| Allen Jeffrey
(European Language Resources Association (ELRA) & European Language
resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin,
75013 Paris France, jeff@elda.fr) Choukri Khalid (European Language Resources Association (ELRA) & European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, choukri@elda.fr) |
||
| This paper describes the current state of an on-going survey that aims at determining the needs of users with respect to available and potentially available Language Resources (LRs). Following market monitoring strategies that have been outlined within the Language Resources- Packaging and Production project (LRsP&P LE4-8335), the main objective of this survey is to provide concrete figures for developing a more reliable and workable business plan for the European Language Resources Association (ELRA) and its Distribution Agency (ELDA), and to determine investment plans for sponsoring the production of new resources. | ||
| Keywords: Domains, ELDA, ELRA, Language Engineering, Language Requirements, LR Domains, Multi-Media and Multi-Modal Language Resources, Natural Language Processing, Questionnaires, Speech Processing, Statistics, Surveys, User Needs | ||
| LREC2000 Proceedings: Session SO1 - Data Centers / Major Projects, pages 77-84 | ||
| Files: 317.ps, 317.pdf | ||
|
|
||
| Building a Treebank for French | ||
| Abeillé Anne
(TALaNa, Université Paris 7, 75251 Paris cedex 05, FRANCE,
abeille@linguist.jussieu.fr) Clément Lionel (TALaNa, Université Paris 7, 75251 Paris cedex 05, FRANCE, clement@linguist.jussieu.fr) Kinyon Alexandra (University of Pennsylvania, Philadelphia, USA, kiyon@linguist.jussieu.fr) |
||
| Very few gold standard annotated corpora are currently available for French. We present an ongoing project to build a reference treebank for French starting with a tagged newspaper corpus of 1 million words (Abeillé et al., 1998), (Abeillé and Clément, 1999). Similarly to the Penn TreeBank (Marcus et al., 1993), we distinguish an automatic parsing phase followed by a second phase of systematic manual validation and correction. Similarly to the Prague treebank (Hajicova et al., 1998), we rely on several types of morphosyntactic and syntactic annotations for which we define extensive guidelines. Our goal is to provide a theory-neutral, surface-oriented, error-free treebank for French. Similarly to the Negra project (Brants et al., 1999), we annotate both constituents and functional relations. | ||
| Keywords: Corpus Annotation, Corpus Linguistics, Parsing, Shallow Parsing, Tagging, Treebank | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 87-94 | ||
| Files: 230.ps, 230.pdf | ||
|
|
||
| Semantico-syntactic Tagging of Very Large Corpora: the Case of Restoration of Nodes on the Underlying Level | ||
| Hajičová
Eva (Faculty of Mathematics and Physics, Charles University,
Malostranské náměstí 25, 1180 Praha 1,
Czechia, hajicova@ufal.mff.cuni.cz) Sgall Petr (Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 1180 Praha 1, Czechia, sgall@ufal.mff.cuni.cz) |
||
| The Prague Dependency Treebank has been conceived of as a semi-automatic three-layer annotation system, in which the layers of morphemic and 'analytic' (surface-syntactic) tagging are followed by the layer of tectogrammatical tree structures. Two types of deletions are recognized: (i) those licensed by the grammatical properties of the given sentence, and (ii) those possible only if the preceding context exhibits certain specific properties. Within group (i), either the position itself in the sentence structure is determined but its lexical setting is 'free' (as e.g. with a deleted subject in Czech as a pro-drop language), or both the position and its 'filler' are determined. Group (ii) reflects the typological differences between English and Czech; the rich morphemics of the latter is more favorable for deletions. Several steps of the tagging procedure are carried out automatically, but most of the restoration of deleted nodes still has to be done 'manually'. If nodes depending on the node being restored are also deleted, they are restored only if they function as arguments or obligatory adjuncts. The large set of annotated utterances will make it possible to check and amend the present results, including through the application of statistical methods. Theoretical linguistics will be able to check its descriptive framework; the degree of automation of the procedure will then be raised, and the treebank will be useful for a wide variety of tasks in language processing. | ||
| Keywords: Corpus, Deletions, Dependency, Syntax | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 95-98 | ||
| Files: 18.ps, 18.pdf | ||
|
|
||
| Building a Treebank for Italian: a Data-driven Annotation Schema | ||
| Bosco Cristina (Dipartimento
di Informatica, Università di Torino, c.so Svizzera 185, 10149,
Torino (Italy), bosco@di.unito.it) Lombardo Vincenzo (DISTA – Università del Piemonte Orientale “A. Avogadro”, c.so Borsalino 54, 15100 Alessandria, Italy, Centro di Scienza Cognitiva – Università di Torino, via Lagrange 3, 10123 Torino, Italy, vincenzo@di.unito.it) Vassallo Daniela (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149, Torino (Italy), vassallo@di.unito.it) Lesmo Leonardo (Dipartimento di Informatica, Università di Torino, c.so Svizzera 185, 10149, Torino (Italy), lesmo@di.unito.it) |
||
| Many natural language researchers are currently turning their attention to treebank development and trying to achieve accuracy and corpus data coverage in their representation formats. This paper presents a data-driven annotation schema developed for an Italian treebank, ensuring data coverage and consistency in the annotation of linguistic phenomena. The schema is a dependency-based format centered upon the notion of predicate-argument structure, augmented with traces to represent discontinuous constituents. The treebank development involves an annotation process performed by a human annotator helped by an interactive parsing tool that incrementally builds a syntactic representation of the sentence. To increase the syntactic knowledge of this parser, a specific data-driven strategy has been applied. We describe the cyclical development of the annotation schema, highlighting the richness and flexibility of the format, and we present some representational issues. | ||
| Keywords: Annotation Schema, Corpus, Dependency Format, Italian, Treebank | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 99-106 | ||
| Files: 220.ps, 220.pdf | ||
|
|
||
| A Treebank of Spanish and its Application to Parsing | ||
| Moreno Antonio (Universidad
de Málaga, F. Filosofía y Letras, Campus de Teatinos,
29071 Málaga, Spain, amo@uma.es) Grishman Ralph (Department of Computer Science, New York University, U.S.A, grishman@cs.nyu.edu) López Susana (Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, Spain, susana@maria.lllf.uam.es) Sánchez Fernando (Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid, Spain, fernando@maria.lllf.uam.es) Sekine Satoshi (Department of Computer Science, New York University, U.S.A, sekine@cs.nyu.edu) |
||
| This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction. | ||
| Keywords: Grammar Acquisition, Parsing, Spanish, Syntax, Treebank | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 107-112 | ||
| Files: 66.ps, 66.pdf | ||
|
|
||
| Shallow Parsing and Functional Structure in Italian Corpora | ||
| Delmonte Rodolfo
(Ca' Garzoni-Moro, San Marco 3417, Università "Ca' Foscari",
30124 - VENEZIA, Tel. 39-41-2578464/52/19, E-mail: delmont@unive.it,
Website: http://byron.cgm.unive.it) |
||
| In this paper we argue in favour of an integration between statistically and syntactically based parsing by presenting data from a study of a 500,000 word corpus of Italian. Most papers present statistically based approaches to tagging. None of the statistically based analyses, however, produces an accuracy level comparable to the one obtained by means of linguistic rules [1]. Their data, of course, refer strictly to English, with the exception of [2, 3, 4]. As to Italian, we argue that purely statistically based approaches are inefficient, mainly because of the great sparsity of the tag distribution (50% or less of unambiguous tags when punctuation is subtracted from the total count). In addition, the level of homography is also very high: readings per word are 1.7, compared to 1.07 computed for English by [2] with a similar tagset. The current work includes a syntactic shallow parser and an ATN-like grammatical function assigner that automatically classifies previously manually verified tagged corpora. In a preliminary experiment with an automatic tagger, we obtained 99.97% accuracy on the training set and 99.03% on the test set using combined approaches; results from statistical tagging alone are well below 95% even on the training set, and the same applies to syntactic tagging. As to the shallow parser and GF-assigner, we shall report on a first preliminary experiment on a manually verified subset of 10,000 words. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 113-120 | ||
| Files: 82.ps, 82.pdf | ||
|
|
||
| An XML-based Representation Format for Syntactically Annotated Corpora | ||
| Mengel Andreas (IMS,
University of Stuttgart, Azenbergstr.12, D-70174 Stuttgart, mengel@ims.uni-stuttgart.de) Lezius Wolfgang (IMS, University of Stuttgart, Azenbergstr.12, D-70174 Stuttgart, lezius@ims.uni-stuttgart.de) |
||
| This paper discusses a general approach to the description and encoding of linguistic corpora annotated with hierarchically structured syntactic information. A general format can be motivated by the variety and incompatibility of existing annotation formats. By using XML as a representation format the theoretical and technical problems encountered can be overcome. | ||
| Keywords: Annotation, Syntax, XML | ||
| LREC2000 Proceedings: Session WO2 - Treebanks, pages 121-126 | ||
| Files: 59.ps, 59.pdf | ||
|
|
||
| Modern Greek Corpus Taxonomy | ||
| Mikros George
(Institute for Language and Speech Processing, Epidavrou & Artemidos
6, 151 25 Maroussi, Greece, gmikros@ilsp.gr) Carayannis George (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, gcara@ilsp.gr) |
||
| The aim of this paper is to explore the way in which different kinds of linguistic variables can be used to discriminate text type in 240 preclassified press texts. Owing to its past diglossic status, Modern Greek (MG) exhibits extensive variation in written texts across all linguistic levels, and this variation can be exploited in text categorization tasks. The research presented uses Discriminant Function Analysis (DFA) as a text categorization method and explores the way different variable groups contribute to text type discrimination. | ||
| Keywords: Corpus Analysis, Discriminant Function Analysis, Language Variation, Statistical Linguistics, Stylistic Analysis, Text Categorization | ||
| LREC2000 Proceedings: Session WO3 - Corpus Categorisation, pages 129-134 | ||
| Files: 351.ps, 351.pdf | ||
|
|
||
| Automatic Style Categorisation of Corpora in the Greek Language | ||
| Tambouratzis George
(Institute for Language and Speech Processing, Epidavrou & Artemidos
6, 151 25 Maroussi, Greece, giorg_t@ilsp.gr) Markantonatou Stella (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, marks@ilsp.gr) Hairetakis Nikolaos (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, nhaire@ilsp.gr) Carayannis George (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, gcara@ilsp.gr) |
||
| In this article, a system is proposed for the automatic style categorisation of text corpora in the Greek language. This categorisation is based to a large extent on the type of language used in the text, for example whether the language used is representative of formal Greek or not. To arrive at this categorisation, the highly inflectional nature of the Greek language is exploited. For each text, a vector of both structural and morphological characteristics is assembled. Categorisation is achieved by comparing this vector to given archetypes using a statistically based method. Experimental resu | ||
| Keywords: Automated Style Categorisation, Grammatical Rules, Greek Language, Masking-and-Matching Technique, Morphological Processing | ||
| LREC2000 Proceedings: Session WO3 - Corpus Categorisation, pages 135-140 | ||
| Files: 301.ps, 301.pdf | ||
|
|
||
| TyPTex: Inductive Typological Text Classification by Multivariate Statistical Analysis for NLP Systems Tuning/Evaluation | ||
| Folch Helka
(UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS
/ ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, folch@ens-fcl.fr) Heiden Serge (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, slh@ens-fcl.fr) Habert Benoît (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, habert@limsi.fr) Fleury Serge (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, fleury@ens-fcl.fr) Illouz Gabriel (LIMSI, CNRS / Université Paris Sud, Orsay, France, illouz@limsi.fr) Lafon Pierre (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, lafon@ens-fcl.fr) Nioche Julien (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, nioche@ens-fcl.fr) Prévost Sophie (UMR8503 : Analyses de corpus linguistiques, usages et traitements, CNRS / ENS Fontenay/Saint-Cloud, 92211 Saint-Cloud, France, prevost@ens-fcl.fr) |
||
| The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or ''profiling'' within the ELRA benchmark called ''Contribution to the construction of contemporary French corpora'', based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model which allows, on the one hand, flexible annotation of the corpus with the output of NLP and statistical tools and, on the other hand, retracing the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations. | ||
| Keywords: Corpus Based Linguistics, Evaluation, Multidimensional Statistics, Natural Language Processing, Software Architecture, Tagging, Text Typology | ||
| LREC2000 Proceedings: Session WO3 - Corpus Categorisation, pages 141-148 | ||
| Files: 254.ps, 254.pdf | ||
|
|
||
| Language Resources as by-Product of Evaluation: The MULTITAG Example | ||
| Paroubek Patrick
(LIMSI - CNRS, Batiment 508 Universite Paris XI, 91403 Orsay Cedex,
Email: pap@limsi.fr URL: http://www.limsi.fr/Individus/pap) |
||
| In this paper, we show how the paradigm of evaluation can function as a producer of high-quality, low-cost validated language resources. First the paradigm of evaluation is presented and the main points of its history are recalled, from the first deployment that took place in the USA during the DARPA/NIST evaluation campaigns up to the latest efforts in Europe (SENSEVAL2/ROMANSEVAL2, CLEF, CLASS etc.). Then the principle behind the method used to produce high-quality validated language resources at low cost from the by-products of an evaluation campaign is explained. It was inspired by the ROVER experiments (Recognizer Output Voting Error Reduction) performed during speech recognition evaluation campaigns in the USA and consists of combining the outputs of the participating systems with a simple voting strategy to obtain higher performance results. Here we make a link with the existing strategies for system combination studied in machine learning. As an illustration we describe how the MULTITAG project, funded by CNRS, has built from the by-products of the GRACE evaluation campaign (a French Part-Of-Speech tagging system evaluation campaign) a corpus of around 1 million words, annotated with a fine-grained tagset derived from the EAGLES and MULTEXT projects. A brief presentation of the state of the art in Part-Of-Speech (POS) tagging and of the problem posed by its evaluation is given at the beginning; then the corpus itself is presented along with the procedure used to produce and validate it. In particular, the cost reduction brought by using this method instead of more classical methods is presented, and its generalization to other control tasks is discussed in the conclusion. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO4 - Reusability Issues, pages 151-154 | ||
| Files: 353.ps, 353.pdf | ||
|
|
||
| Enabling Resource Sharing in Language Generation: an Abstract Reference Architecture | ||
| Cahill Lynne
(Information Technology Research Institute, University of Brighton,
Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Doran Christy (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Evans Roger (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Kibble Rodger (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Mellish Chris (Division of Informatics, University of Edinburgh, 80 South Bridge, Edinburgh, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Paiva D. (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Reape Mike (Division of Informatics, University of Edinburgh, 80 South Bridge, Edinburgh, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Scott Donia (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) Tipper Neil (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http://www.itri.brighton.ac.uk/projects/rags) |
||
| The RAGS project aims to develop a reference architecture for natural language generation, to facilitate modular development of NLG systems as well as evaluation of components, systems and algorithms. This paper gives an overview of the proposed framework, describing an abstract data model with five levels of representation: Conceptual, Semantic, Rhetorical, Document and Syntactic. We report on a re-implementation of an existing system using the RAGS data model. | ||
| Keywords: Architectures, Natural Language Generation, Reusability | ||
| LREC2000 Proceedings: Session WO4 - Reusability Issues, pages 155-160 | ||
| Files: 244.ps, 244.pdf | ||
|
|
||
| Experiences of Language Engineering Algorithm Reuse | ||
| Gambäck Björn
(Information and Language Engineering Group, Swedish Institute of
Computer Science, Box 1263, S 164 29 Kista, Sweden, gamback@sics.se) Olsson Fredrik (Information and Language Engineering Group, Swedish Institute of Computer Science, Box 1263, S 164 29 Kista, Sweden, fredriko@sics.se) |
||
| Traditionally, the level of reusability of language processing resources within the research community has been very low. Most of the recycling of linguistic resources has been concerned with reuse of data, e.g., corpora, lexica, and grammars, while algorithmic resources have far too seldom been shared between different projects and institutions. As a consequence, researchers who are willing to reuse somebody else's processing components have been forced to invest major efforts into issues of integration, inter-process communication, and interface design. In this paper, we discuss the experiences drawn from the svensk project regarding the reusability of language engineering software, as well as some of the challenges for the research community which are prompted by them. Their main characteristics can be laid out along three dimensions: technical/software challenges, linguistic challenges, and `political' challenges. In the end, the unavoidable conclusion is that it definitely is time to bring more aspects of engineering into the Computational Linguistic community! | ||
| Keywords: Language Processing Resource Reusability, LE Platforms | ||
| LREC2000 Proceedings: Session WO4 - Reusability Issues, pages 161-166 | ||
| Files: 151.ps, 151.pdf | ||
|
|
||
| Dialogue and Prompting Strategies Evaluation in the DEMON System | ||
| Lavelle Carine-Alexia
(Institut de Recherche en Informatique de Toulouse, Université
Paul Sabatier, 118, route de Narbonne, 31062 Toulouse, France, lavelle@irit.fr) De Calmès Martine (Institut de Recherche en Informatique, Université Paul Sabatier, 118,route de Narbonne, 31062 Toulouse Cedex France, decalmes@irit.fr) Pérennou Guy (Institut de Recherche en Informatique, Université Paul Sabatier, 118,route de Narbonne, 31062 Toulouse Cedex France, perennou@irit.fr) |
||
| In order to improve the usability and efficiency of dialogue systems, a major issue is adapting them better to their intended users. This requires a good knowledge of users’ behaviour when interacting with a dialogue system. In this regard, we based our evaluations of the dialogue and prompting strategies of our system on how they influence users’ answers. In this paper we describe the measure we used to evaluate the effect of the size of the welcome prompt and a set of measures we defined to evaluate three different confirmation strategies. We then describe five criteria we used to evaluate the complexity of the system’s questions and their effect on users’ answers. The overall aim is to design a set of metrics that could be used to automatically decide which of the possible prompts at a given state in a dialogue should be uttered. | ||
| Keywords: Prompting Strategy, Spoken Dialogue Systems, Usability | ||
| LREC2000 Proceedings: Session SO2 - Dialogue Evaluation Methods, pages 169-176 | ||
| Files: 38.ps, 38.pdf | ||
|
|
||
| Predictive Performance of Dialog Systems | ||
| Bonneau-Maynard H.
(LIMSI-CNRS, BP 133 91403 Orsay cedex, FRANCE, hbm@limsi.fr) Devillers L. (LIMSI-CNRS, BP 133 91403 Orsay cedex, FRANCE, devil@limsi.fr) Rosset S. (LIMSI-CNRS, BP 133 91403 Orsay cedex, FRANCE, rosset@limsi.fr) |
||
| This paper reports some of our experiments on predictive performance measures for dialog systems. Experimenting with dialog systems is often very costly because user trials must be carried out, so it is clearly advantageous when evaluation can be carried out automatically. It would be helpful if, for each application, we were able to measure system performance with an objective cost function. This performance function can be used for making predictions about the future evolution of the systems without user interaction. Using the PARADISE paradigm, a performance function derived from the relative contribution of various factors is first obtained for one system developed at LIMSI: PARIS-SITI (a kiosk for tourist information retrieval in Paris). A second experiment with PARIS-SITI with a new test population confirms that the most important predictors of user satisfaction are understanding accuracy, recognition accuracy and number of user repetitions. Furthermore, similar spoken dialog features appear as important features for the Arise system (a train timetable telephone information system). We also explore different ways of measuring user satisfaction. We then discuss the introduction of subjective factors in the predictive coefficients. | ||
| Keywords: Dialog System, Evaluation, Performance Measures | ||
| LREC2000 Proceedings: Session SO2 - Dialogue Evaluation Methods, pages 177-182 | ||
| Files: 303.ps, 303.pdf | ||
|
|
||
| A Methodology for Evaluating Spoken Language Dialogue Systems and Their Components | ||
| Bernsen Niels Ole
(Natural Interactive Systems Laboratory, Science Park 10, 5230 Odense
M, Denmark, nob@nis.sdu.dk) Dybkjær Laila (Natural Interactive Systems Laboratory, Science Park 10, 5230 Odense M, Denmark, laila@nis.sdu.dk) |
||
| As spoken language dialogue systems (SLDSs) proliferate in the market place, the issue of SLDS evaluation has come to attract wide interest from research and industry alike. Yet it is only recently that spoken dialogue engineering researchers have come to face SLDS evaluation in its full complexity. This paper presents results of the European DISC project concerning technical evaluation and usability evaluation of SLDSs and their components. The paper presents a methodology for complete and correct evaluation of SLDSs and components together with a generic evaluation template for describing the evaluation criteria needed. | ||
| Keywords: Best Practice, Completeness, Correctness, Evaluation, Spoken Dialogue Systems | ||
| LREC2000 Proceedings: Session SO2 - Dialogue Evaluation Methods, pages 183-188 | ||
| Files: 135.ps, 135.pdf | ||
|
|
||
| Developing and Testing General Models of Spoken Dialogue System Performance | ||
| Walker Marilyn
(AT&T Labs - Research, 180 Park Ave, Florham Park, N.J. 07932,
U.S.A., walker@research.att.com) Kamm Candace (AT&T Labs - Research, 180 Park Ave, Florham Park, N.J. 07932, U.S.A., cak@research.att.com) Boland Julie (AT&T Labs - Research, 180 Park Ave, Florham Park, N.J. 07932, U.S.A., boland@louisiana.edu) |
||
| The design of methods for performance evaluation is a major open research issue in the area of spoken language dialogue systems. This paper presents the PARADISE methodology for developing predictive models of spoken dialogue performance, and shows how to evaluate the predictive power and generalizability of such models. To illustrate the methodology, we develop a number of models for predicting system usability (as measured by user satisfaction), based on the application of PARADISE to experimental data from two different spoken dialogue systems. We compare both linear and tree-based models. We then measure the extent to which the models generalize across different systems, different experimental conditions, and different user populations, by testing models trained on a subset of the corpus against a test set of dialogues. The results show that the models generalize well across the two systems, and are thus a first approximation towards a general performance model of system usability. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SO2 - Dialogue Evaluation Methods, pages 189-196 | ||
| Files: 349.ps, 349.pdf | ||
|
|
||
| A Framework for Cross-Document Annotation | ||
| Day David (The
MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, http://www.mitre.org/technology/nlp,
day@mitre.org) Goldschen Alan (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, http://www.mitre.org/technology/nlp, alang@mitre.org) Henderson John (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, http://www.mitre.org/technology/nlp, jhndrsn@mitre.org) |
||
| We introduce a cross-document annotation toolset that serves as a corpus-wide knowledge base for linguistic annotations. This implemented system is designed to address the unique cognitive demands placed on human annotators who must relate information that is expressed across document boundaries. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO5 - Corpus Tools, pages 199-204 | ||
| Files: 201.ps, 201.pdf | ||
|
|
||
| Providing Internet Access to Portuguese Corpora: the AC/DC Project | ||
| Santos Diana (SINTEF
Telecom and Informatics, Postboks 1024 Blindern, N-0314 Oslo, Norway,
Diana.Santos@informatics.sintef.no) Bick Eckhard (SINTEF Telecom and Informatics, Postboks 1024 Blindern, N-0314 Oslo, Norway, lineb@hum.au.dk) |
||
| In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do portugues) with regard to providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilizacao de Corpora, roughly ''Access and Availability of Corpora''), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still subject to regular improvements, we focus on the process of tagging and parsing the underlying corpora, using a Constraint Grammar parser for Portuguese. | ||
| Keywords: Constraint Grammar, Corpora, Language Resource Creation, Parsing, Web Interfaces | ||
| LREC2000 Proceedings: Session WO5 - Corpus Tools, pages 205-210 | ||
| Files: 85.ps, 85.pdf | ||
|
|
||
| Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results | ||
| Poesio Massimo
(University of Edinburgh, HCRC and Informatics, Massimo.Poesio@ed.ac.uk) |
||
| We are annotating a corpus with information relevant to discourse entity realization, and especially the information needed to decide which type of NP to use. The corpus is being used to study correlations between NP type and certain semantic or discourse features, to evaluate hand-coded algorithms, and to train statistical models. We report on the development of our annotation scheme, the problems we have encountered, and the results obtained so far. | ||
| Keywords: Anaphora, Corpus Annotation, Empirical Methods, Evaluation, Generation, Referential Expressions | ||
| LREC2000 Proceedings: Session WO5 - Corpus Tools, pages 211-218 | ||
| Files: 193.ps, 193.pdf | ||
|
|
||
| Using Few Clues Can Compensate the Small Amount of Resources Available for Word Sense Disambiguation | ||
| de Loupy Claude
(Laboratoire Informatique d’Avignon, B.P. 1228, Agroparc, 339 chemin
des Meinajaries, 84911 Avignon, cedex 9, FRANCE, claude.de.loupy@lia.univ-avignon.fr) El-Bèze Marc (Laboratoire Informatique d’Avignon, B.P. 1228, Agroparc, 339 chemin des Meinajaries, 84911 Avignon, cedex 9, FRANCE, marc.elbeze@lia.univ-avignon.fr) |
||
| Word Sense Disambiguation (WSD) is considered one of the most difficult tasks in Natural Language Processing. Probabilistic methods have shown their efficiency in many NLP tasks, but they require a training phase, and very few resources are available for WSD. This paper aims at showing how to make the most of size-limited resources in order to partially overcome the knowledge acquisition bottleneck. Experiments are performed within the SENSEVAL test framework in order to evaluate the advantage of a lemmatized or stemmed context over an original context (inflected forms as they are observed in the raw text). Then, we measure the precision improvement (about 6 %) when looking at the inflected form of the word to be disambiguated. Lastly, we show that it is possible to reduce the ambiguity if the word to be disambiguated has a particular inflected form or occurs as part of a compound. | ||
| Keywords: Probabilistic Method, Size-Limited Resources, Word Sense Disambiguation | ||
| LREC2000 Proceedings: Session WO5 - Corpus Tools, pages 219-224 | ||
| Files: 350.ps, 350.pdf | ||
|
|
||
| Learning Verb Subcategorization from Corpora: Counting Frame Subsets | ||
| Zeman Daniel (Ústav
formální a aplikované lingvistiky, Univerzita Karlova, Praha) Sarkar Anoop (Department of Computer and Information Science, University of Pennsylvania, Philadelphia) |
||
| We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88 % accuracy on unseen parsed text. | ||
| Keywords: Corpus, Frames, Subcategorization, Syntax, Valency, Verb | ||
| LREC2000 Proceedings: Session WO6 - Acquisition of Lexical Information, pages 227-234 | ||
| Files: 145.ps, 145.pdf | ||
|
|
||
| Tuning Lexicons to New Operational Scenarios | ||
| Basili Roberto
(University of Rome Tor Vergata, Department of Computer Science, Systems
and Production, Via di Tor Vergata 110, 00133 Roma (Italy), basili@info.uniroma2.it) Pazienza Maria Teresa (University of Rome Tor Vergata, Department of Computer Science, Systems and Production, Via di Tor Vergata 110, 00133 Roma (Italy), pazienza@info.uniroma2.it) Vindigni Michele (University of Rome Tor Vergata, Department of Computer Science, Systems and Production, Via di Tor Vergata 110, 00133 Roma (Italy), vindigni@info.uniroma2.it) Zanzotto Fabio Massimo (University of Rome Tor Vergata, Department of Computer Science, Systems and Production, Via di Tor Vergata 110, 00133 Roma (Italy), zanzotto@info.uniroma2.it) |
||
| In this paper the role of the lexicon within typical application tasks based on NLP is analysed. A large-scale semantic lexicon is studied within the framework of an NLP application. The coverage of the lexicon with respect to the target domain and a (semi)automatic tuning approach have been evaluated. The impact of a corpus-driven inductive architecture aiming to compensate for gaps in lexical information is thus measured and discussed. | ||
| Keywords: Event Recognition, Induction, Lexical Acquisition, Lexical Tuning, Lexicon, Word Sense Disambiguation | ||
| LREC2000 Proceedings: Session WO6 - Acquisition of Lexical Information, pages 235-240 | ||
| Files: 330.ps, 330.pdf | ||
|
|
||
| A Flexible Infrastructure for Large Monolingual Corpora | ||
| Quasthoff Uwe
(Leipzig University, Computer Science Institute, NLP Dept.,
Augustusplatz 10/11, 04109 Leipzig, Germany, quasthoff@informatik.uni-leipzig.de) Wolff Christian (Leipzig University, Computer Science Institute, NLP Dept., Augustusplatz 10/11, 04109 Leipzig, Germany, wolff@informatik.uni-leipzig.de) |
||
| In this paper we describe a flexible and portable infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed on the basis of a sentence-based text segmentation algorithm. We describe the entry structure of the corpus database as well as various query types and tools for information extraction. Among them, the extraction and usage of sentence-based word collocations is discussed in detail. Finally we give an overview of different applications for this language resource. A WWW interface allows for public access to most of the data and information extraction tools (http://wortschatz.uni-leipzig.de). | ||
| Keywords: Collocations, Information Extraction, Monolingual Corpora, Web Search | ||
| LREC2000 Proceedings: Session WO6 - Acquisition of Lexical Information, pages 241-246 | ||
| Files: 226.ps, 226.pdf | ||
|
|
||
| Automatic Generation of Dictionary Definitions from a Computational Lexicon | ||
| Labropoulou Penny
(Institute for Language and Speech Processing, Epidavrou & Artemidos
6, 151 25 Maroussi, Greece, penny@ilsp.gr) Mantzari Elena (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, elena@ilsp.gr) Papageorgiou Harris (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, xaris@ilsp.gr) Gavrilidou Maria (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, maria@ilsp.gr) |
||
| This paper presents an automatic Generator of dictionary definitions for concrete entities, based on information extracted from a Computational Lexicon (CL) containing semantic information. The aim of the adopted approach, which combines NLG techniques with the exploitation of the formalised and systematic lexical information stored in the CL, is to produce well-formed dictionary definitions free from the shortcomings of traditional dictionaries. The architecture of the system is presented, focusing on the adaptation of the NLG techniques to the specific application requirements, and on the interface between the CL and the Generator. Emphasis is placed on the appropriateness of the CL for the application purposes. | ||
| Keywords: Dictionary Definitions, Natural Language Generation, Semantic Lexicon, SIMPLE Computational Lexicon, XML Representation | ||
| LREC2000 Proceedings: Session WO6 - Acquisition of Lexical Information, pages 247-254 | ||
| Files: 306.ps, 306.pdf | ||
|
|
||
| MHATLex: Lexical Resources for Modelling the French Pronunciation | ||
| Pérennou Guy
(Institut de Recherche en Informatique, Université Paul Sabatier,
118,route de Narbonne, 31062 Toulouse Cedex France, perennou@irit.fr) De Calmès Martine (Institut de Recherche en Informatique, Université Paul Sabatier, 118,route de Narbonne, 31062 Toulouse Cedex France, decalmes@irit.fr) |
||
| The aim of this paper is to introduce MHATLex, a set of lexical resources and an environment intended for speech and text processing. Particular attention is paid to pronunciation modelling, which can be used in automatic speech processing as well as in the phonological/phonetic description of languages. We introduce a pronunciation model, the MHAT model (Markovian Harmonic Adaptation and Transduction), which copes with free and context-dependent variants. We also present the MHATLex resources. They include 500,000 inflected forms and tools allowing the generation of various lexicons through phonological tables. Finally, some illustrations of the use of MHATLex in ASR are shown. | ||
| Keywords: ASR, Lexical Resources, Pronunciation Model | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 257-264 | ||
| Files: 37.ps, 37.pdf | ||
|
|
||
| PLEDIT - A New Efficient Tool for Management of Multilingual Pronunciation Lexica and Batchlists | ||
| Vlaj Damjan
(Research and Studies Center, University of Maribor, Razlagova 22, 2000
Maribor, Slovenia, demjan.vlaj@uni-mb.si) Kaiser Janez (Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, Slovenia, janez.kaizer@uni-mb.si) Wilhelm Ralph (Siemens AG, ZT IK 5, Otto Hahn Ring 6, 81739 Munich, Germany, ralph.wilhelm@mch.siemens.de) Ziegenhain Ute (Siemens AG, ZT IK 5, Otto Hahn Ring 6, 81739 Munich, Germany, ute.ziegenhain@mchp.siemens.de) |
||
| The program tool PLEDIT - Pronunciation Lexica Editor - has been created for efficient handling of pronunciation lexica and batchlists. PLEDIT is designed as a GUI, which incorporates tools for fast and efficient management of pronunciation lexica and batchlists. The tool is written in Tcl/Tk/Tix and can thus be easily ported to different platforms. PLEDIT supports three lexicon format types: the Siemens, SpeechDat and CMU lexicon formats. PLEDIT provides full editing capability for lexica and batchlists and supports work with multilingual resources. Some functions have been built in as external programs written in the C programming language. With these external programs, higher speed and efficiency of PLEDIT have been achieved. | ||
| Keywords: Graphic User Interface, Language Resources, Pronunciation Lexica Editor, Pronunciation Lexicon | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 265-268 | ||
| Files: 53.ps, 53.pdf | ||
|
|
||
| Object-oriented Access to the Estonian Phonetic Database | ||
| Meister Einar
(Institute of Cybernetics, Tallinn Technical University, Akadeemia tee
21, Tallinn 12618, Estonia, Einar.Meister@ioc.ee) Eek Arvo (Institute of Cybernetics, Tallinn Technical University, Akadeemia tee 21, Tallinn 12618, Estonia, Einar.Meister@ioc.ee) Altosaar Toomas (Acoustics Lab., Helsinki University of Technology, Otakaari 5A, SF-02150 Espoo, Finland, Toomas.Altosaar@hut.fi) Vainio Martti (Department of Phonetics, University of Helsinki, P.O.Box 35, FIN-0014 Helsinki, Finland, Martti.Vainio@helsinki.fi) |
||
| The paper introduces the Estonian Phonetic Database developed at the Laboratory of Phonetics and Speech Technology of the Institute of Cybernetics at the Tallinn Technical University, and its integration into QuickSig – an object-oriented speech processing environment developed at the Acoustics Laboratory of the Helsinki University of Technology. Methods of database access are discussed, relations between different speech units – sentences, words, phonemes – are defined, examples of predicate functions are given to perform searches for different contexts, and the advantage of an object-oriented paradigm is demonstrated. The introduced approach has been proven to be a flexible research environment allowing studies to be performed in a more efficient way. | ||
| Keywords: Database Access, Object-Oriented Approach, Phonetic Database, Phonetic Knowledge, Speech Processing | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 269-272 | ||
| Files: 128.ps, 128.pdf | ||
|
|
||
| A French Phonetic Lexicon with Variants for Speech and Language Processing | ||
| de Mareüil
Philippe Boula (LIMSI-CNRS, BP 133, F-91403 Orsay cedex, mareuil@limsi.fr) d'Alessandro Christophe (LIMSI-CNRS, BP 133, F-91403 Orsay cedex, cda@limsi.fr) Yvon François (ENST – Dept. Informatique, et Réseaux, 46 rue Barrault, F-75634 Paris cedex 13, yvon@enst.fr) Aubergé Véronique (ICP – Univ. Stendhal, 1180 av. Centrale, BP25, F-38040 Grenoble cedex 9, auberge@icp.inpg.fr) Vaissière Jacqueline (ILPGA – Univ. Paris III, 19 rue des Bernardins, F-75 005 Paris, jvaiss@msh-paris.fr) Amelot Angélique (ILPGA – Univ. Paris III, 19 rue des Bernardins, F-75 005 Paris, jvaiss@msh-paris.fr) |
||
| This paper reports on a project aiming at the semi-automatic development of a large orthographic-phonetic lexicon for French, based on the Multext dictionary. It details the various stages of the project, with an emphasis on the methodological and design aspects. Information regarding the lexicon’s content is also given, together with a description of interface tools which should facilitate its exploitation. | ||
| Keywords: French Phonetic Lexicon, Pronunciation Variants | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 273-276 | ||
| Files: 133.ps, 133.pdf | ||
|
|
||
| A Computational Platform for Development of Morphologic and Phonetic Lexica | ||
| Rojc Matej
(Faculty of Electrical Engineering and Computer Science, University of
Maribor, Smetanova 17, 2000 Maribor, matej.rojc@uni-mb.si) Kačič Zdravko (Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, kacic@uni-mb.si) |
||
| Statistical approaches in speech technology, whether based on statistical language models, trees, hidden Markov models or neural networks, represent the driving forces for the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents the system architecture for rapid construction of morphologic and phonetic lexica for the Slovenian language. The integrated graphic user interface focuses on morphologic and phonetic aspects of the Slovenian language and gives experts good performance in terms of analysis time. | ||
| Keywords: Grapheme-to-Phoneme Conversion, Lexica, Morphology, Text Processing | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 277-282 | ||
| Files: 175.ps, 175.pdf | ||
|
|
||
| An Optimised FS Pronunciation Resource Generator for Highly Inflecting Languages | ||
| Gibbon Dafydd (Fakultät
für Linguistik und Literaturwissenschaft, Universität
Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de) Quirino Simões Ana Paula (CSLI, Stanford University, CA 94305-4115, USA, aquirino@stanford.edu) Matthiesen Martin (Lingsoft, Inc., Tehtaankatu 27-29 D, FIN-00150 Helsinki, Finland) |
||
| We report on a new approach to grapheme-phoneme transduction for large-scale German spoken language corpus resources using explicit morphotactic and graphotactic models. Finite state optimisation techniques are introduced to reduce lexicon development and production time, with a speed increase factor of 10. The motivation for this tool is the problem of creating large pronunciation lexica for highly inflecting languages using morphological out of vocabulary (MOOV) word modelling, a subset of the general OOV problem of non-attested word forms. A given spoken language system which uses fully inflected word forms performs much worse with highly inflecting languages (e.g. French, German, Russian) for a given stem lexicon size than with less highly inflecting languages (e.g. English) because of the ‘morphological handicap’ (ratio of stems to inflected word forms), which for German is about 1:5. However, the problem is worse for current speech recogniser development techniques, because a specific corpus never contains all the inflected forms of a given stem. Non-attested MOOV forms must therefore be ‘projected’ using a morphotactic grammar, plus table lookup for irregular forms. Enhancement with statistical methods is possible for regular forms, but does not help much with large, heterogeneous technical vocabularies, where extensive manual lexicon construction is still used. The problem is magnified by the need for defining pronunciation variants for inflected word forms; we also propose an efficient solution to this problem. | ||
| Keywords: Finite State Technologies, Grapheme-Phoneme Conversion, Morphology, Morphophonology, Pronunciation, xfst | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 283-290 | ||
| Files: 251.ps, 251.pdf | ||
|
|
||
| Design Methodology for Bilingual Pronunciation Dictionary | ||
| Kim Jong-mi
(Department of English Language and Literature, Kangwon National
University, Hyoja 2 Dong, Chuncheon City, S. Korea, kimjm@kangwon.ac.kr) |
||
| This paper presents the design methodology for a bilingual pronunciation dictionary for sound reference usage, which reflects cross-linguistic, dialectal, first language (L1) interfered, biological and allophonic variations. The design methodology features 1) comprehensive coverage of allophonic variation, 2) concise data entry composed of a balanced distribution of dialects, genders, and ages of speakers, 3) bilingual data coverage including L1-interfered speech, and 4) eurhythmic arrangements of the recording material for temporal regularity. The recording consists of a three-way comparison of 1) English sounds spoken by native English speakers, 2) Korean sounds spoken by native Korean speakers, and 3) English sounds spoken by Korean speakers. This paper also presents 1) the quality controls and 2) the structure and format of the data. This “sound-based” bilingual dictionary is intended for 1) cross-linguistic and acoustic research, 2) application to speech recognition, synthesis and translation, and 3) foreign language learning including exercises. | ||
| Keywords: Bilingual Pronunciation Dictionary, Design Methodology, English and Korean, English Pronunciation, Korean Pronunciation, Pronunciation Dictionary, SORIDA | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 291-296 | ||
| Files: 269.ps, 269.pdf | ||
|
|
||
| Labeling of Prosodic Events in Slovenian Speech Database GOPOLIS | ||
| Mihelič France
(Faculty of Electrical Engineering, University of Ljubljana, Tržaška
25, 1001 Ljubljana, Slovenia, mihelicf@fe.uni-lj.si) Gros Jerneja (Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1001 Ljubljana, Slovenia, nejka@fe.uni-lj.si) Nöth Elmar (Lehrstuhl für Mustererkennung (Informatik 5), Universität Erlangen-Nürnberg, Martensstrasse 3, 91058 Erlangen, BRD, noeth@informatik.uni-erlangen.de) Warnke Volker (Lehrstuhl für Mustererkennung (Informatik 5), Universität Erlangen-Nürnberg, Martensstrasse 3, 91058 Erlangen, BRD, warnke@informatik.uni-erlangen.de) |
||
| The paper describes prosodic annotation procedures of the GOPOLIS Slovenian speech database and methods for automatic classification of different prosodic events. Several statistical parameters concerning duration and loudness of words, syllables and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach. The obtained knowledge on Slovenian prosody can be used in Slovenian speech recognition and understanding for automatic prosodic event determination and in Slovenian speech synthesis for prosody prediction. | ||
| Keywords: Labeling, Prosodic Boundaries, Prosody, Speech Database, Word Accent | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 297-300 | ||
| Files: 292.ps, 292.pdf | ||
|
|
||
| Regional Pronunciation Variants for Automatic Segmentation | ||
| Beringer Nicole
(Institut fuer Phonetik und Sprachliche Kommunikation, Schellingstr. 3,
D-80799 Muenchen, Germany, beringer@phonetik.uni-muenchen.de) Neff Marcia (Institut fuer Phonetik und Sprachliche Kommunikation, Schellingstr. 3, D-80799 Muenchen, Germany, maneff@phonetik.uni-muenchen.de) |
||
| The goal of this paper is to create an extended rule corpus with approximately 2300 phonetic rules which model segmental variation of regional variants of German. The phonetic rules express at a broad-phonetic level phenomena of phonetic reduction in German that occurs within words and across word boundaries. In order to get an improvement in automatic segmentation of regional speech variants, these rules are clustered and implemented depending on regional specification in the Munich Automatic Segmentation System. | ||
| Keywords: ASR, German Dialectal Regions, MAUS, Regional Pronunciation Rules, Variants | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 301-306 | ||
| Files: 307.ps, 307.pdf | ||
|
|
||
| Le Programme Compalex (COMPAraison LEXicale) | ||
| Ndamba Josué
(B.P. 1486, Brazzaville – Congo, GRELI – CONGO, Brazzaville,
jondamba@yahoo.fr) Bayamboussa Jean Silence (B.P. 1486, Brazzaville – Congo, GRELI – CONGO, Brazzaville, jondamba@yahoo.fr) |
||
| Keywords: | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 307-310 | ||
| Files: 281.ps, 281.pdf | ||
|
|
||
| Perceptual Evaluation of Text-to-Speech Implementation of Enclitic Stress in Greek | ||
| Fotinea
Stavroula-Evita (Institute for Language and Speech Processing,
Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, evita@ilsp.gr) Protopapas Athanassios (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece) Dimitriadis Dimitris (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece) Carayannis George (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, gcara@ilsp.gr) |
||
| This paper presents a perceptual evaluation of a text to speech (TTS) synthesizer in Greek with respect to acoustic registration of enclitic stress and related naturalness and intelligibility. Based on acoustical measurements and observations of naturally recorded utterances, the corresponding output of a commercially available formant-based speech synthesizer was altered and the results were subjected to perceptual evaluation. Pitch curve, intensity, and duration of the syllable bearing enclitic stress, were acoustically manipulated, while a phonetically identical phrase contrasting only in stress served as control stimulus. Ten listeners judged the perceived naturalness and preference (in pairs) and the stress pattern of each variant of a base phrase. It was found that intensity modification adversely affected perceived naturalness while increasing perceived stress prominence. Duration modification had no appreciable effect. Pitch curve modification tended to produce an improvement in perceived naturalness and preference but the results failed to achieve statistical significance. The results indicated that the current prosodic module of the speech synthesizer reflects a good balance between prominence of stress assignment, intelligibility, and naturalness. | ||
| Keywords: Perceptual Evaluation, Prosody, Speech Synthesis, Stress, Text-To-Speech | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 311-314 | ||
| Files: 48.ps, 48.pdf | ||
|
|
||
| Etude et Evaluation de la Di-Syllabe comme Unité Acoustique pour le Système de Synthèse Arabe PARADIS | ||
| Chenfour N. (Faculté
des Sciences de Fès, chenfour@yahoo.fr) Benabbou A. (Faculté des Sciences et Techniques Fès, abenabbou@yahoo.fr) Mouradi A. (ENSIAS Rabat, mouradi@ensias.um5soussi.ac.ma) |
||
| The study presented in this article is part of the development of a text-to-speech synthesis system for Arabic. Our system, PARADIS, is based on the concatenation of di-syllables, with TD-PSOLA as the synthesis technique. We present the rationale for choosing the di-syllable as the concatenation unit for the synthesizer and its contribution to synthesis quality. The di-syllable considerably improves synthesis quality and reduces problems of temporal discontinuity during concatenation. However, several problems arise from the considerable size of the set of di-syllables and from their adaptation to prosodic models, which are usually associated with the syllable as the rhythmic unit. We therefore describe the principle we relied on to reduce the number of di-syllables. We then present the procedure we developed for the automatic generation and labeling of the di-syllable dictionary. To this end, we chose logatoms whose forms are particularly well suited to automating the generation of the logatom corpus and to automatic segmentation. In addition, we present a technique for organizing the acoustic dictionary that is well adapted to the form of the Arabic di-syllable. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 315-320 | ||
| Files: 32.ps, 32.pdf | ||
|
|
||
| Design of Optimal Slovenian Speech Corpus for Use in the Concatenative Speech Synthesis System | ||
| Rojc Matej
(Faculty of Electrical Engineering and Computer Science, University of
Maribor, Smetanova 17, 2000 Maribor, matej.rojc@uni-mb.si) Kačič Zdravko (Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, kacic@uni-mb.si) |
||
| This paper presents the development of a Slovenian speech corpus for use in the concatenative speech synthesis system being developed at the University of Maribor, Slovenia. The emphasis is on maximising the usefulness of the defined speech corpus for concatenation purposes. The usefulness of a speech corpus depends heavily on the corresponding text and can be increased if appropriate text is chosen. In our approach, detailed statistics of the text corpora were computed in order to select sentences rich in non-uniform units such as monophones, diphones and triphones. | ||
| Keywords: Grapheme-to-Phoneme Conversion, Non-Uniform Units, Text Processing | ||
| LREC2000 Proceedings: Session SP1 - Phonetic Issues and Speech Synthesis, pages 321-326 | ||
| Files: 177.ps, 177.pdf | ||
|
|
||
| The Bank of Swedish | ||
| Gellerstam Martin
(Språkbanken, Dept. for Swedish, Göteborg University, Box
200, SE–405 30, Sweden, svemg@svenska.gu.se) Cederholm Yvonne (Språkbanken, Dept. for Swedish, Göteborg University, Box 200, SE–405 30, Sweden, sveyc@svenska.gu.se) Rasmark Torgny (Språkbanken, Dept. for Swedish, Göteborg University, Box 200, SE–405 30, Sweden, svetr@svenska.gu.se) |
||
| The Bank of Swedish is described: affiliation, organisation, linguistic resources and tools. A point is made of the close connection between lexical research and corpus data, the broad textual coverage from Modern Swedish to Old Swedish, the official status of the organisation and its connection to Göteborg University. The relation to the broader scope of the comprehensive Language Database of Swedish is discussed. A few current issues of the Bank of Swedish are presented: parallel corpora production, the construction of a Swedish morphology database and sense tagging of text corpora. Finally, the updating of the Bank of Swedish concordance system is mentioned. | ||
| Keywords: Alignment, Corpora Workbench, Language Bank, Lexical Database, Parallel Corpora, Part-of-Speech Tagging | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 329-334 | ||
| Files: 300.ps, 300.pdf | ||
|
|
||
| The Multi-layer Language Knowledge Base of Chinese NLP | ||
| Junfeng Hu (The
Institute of Computational Linguistics, Dept. of Computer Science and
Technology, Peking University, Beijing, 100871, P.R. China, hujf@pku.edu.cn) Shiwen Yu (The Institute of Computational Linguistics, Dept. of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China, yusw@pku.edu.cn) |
||
| This paper introduces the effort to build a multi-layer knowledge base for Chinese NLP that combines list-based, rule-based and corpus-based language information. Different kinds of information are designed to solve the different kinds of problems encountered in Chinese NLP. The whole knowledge base is designed with theoretical consistency and can easily be put into practice in application systems. | ||
| Keywords: Grammatical Dictionary, Language Knowledge, Large-Scale Corpus, Word Formation Rules, Word Segmentation | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 335-340 | ||
| Files: 29.ps, 29.pdf | ||
|
|
||
| Producing LRs in Parallel with Lexicographic Description: the DCC project | ||
| Soler i Bou Joan
(Institut d’Estudis Catalans, Carme, 47, 08001 Barcelona, SPAIN,
jsoler@iec.es) |
||
| This paper is a brief presentation of some aspects of the most important lexicographical project that is being carried out in Catalonia: the DCC (Dictionary of Contemporary Catalan) project. After making a general description of the aims of the project, the specific goal of my contribution is to present the general strategy of our lexicographical description, consisting in the production of an electronic dictionary able to be the common repository from which we will obtain different derived products (the human dictionary, among them). My concern is to show to which extent human and computer lexicography can share descriptions, and the results of lexicographic work can be taken as a language resource in this new perspective. I will present different aspects and criteria of our dictionary, taking the different layers (morphology, syntax, semantics) as a guideline. | ||
| Keywords: Lexical Description, Lexicography, Lexicon, Syntactic Patterns | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 341-346 | ||
| Files: 112.ps, 112.pdf | ||
|
|
||
| Some Language Resources and Tools for Computational Processing of Portuguese at INESC | ||
| Wittmann Luzia (Instituto
de Engenharia de Sistemas e Computadores, Rua Alves Redol, 9 - Apartado
13069 - 1000-029 LISBOA - Portugal, luzia.wittmann@inesc.pt) Ribeiro Ricardo Daniel (Instituto de Engenharia de Sistemas e Computadores, Rua Alves Redol, 9 - Apartado 13069 - 1000-029 LISBOA - Portugal, ricardo.ribeiro@inesc.pt) Pêgo Tânia (Instituto de Engenharia de Sistemas e Computadores, Rua Alves Redol, 9 - Apartado 13069 - 1000-029 LISBOA - Portugal, tania.pego@inesc.pt) Batista Fernando (Instituto de Engenharia de Sistemas e Computadores, Rua Alves Redol, 9 - Apartado 13069 - 1000-029 LISBOA - Portugal, fernando.batista@inesc.pt) |
||
| In the last few years automatic processing tools and studies based on corpora have become of great importance for the community. The possibility of evaluating and developing such tools and studies depends on the availability of language resources. For the Portuguese language in its several national varieties these resources are not enough to meet the community's needs. In this paper some valuable resources are presented, such as a multifunctional lexicon, general-purpose lexicons for European and Brazilian Portuguese and corpus processing tools. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 347-350 | ||
| Files: 257.ps, 257.pdf | ||
|
|
||
| Screffva: A Lexicographer's Workbench | ||
| Mills Jon
(University of Luton, Vicarage Street, Luton, Bedfordshire, LU1 3JU, UK,
jon.mills@luton.ac.uk) |
||
| This paper describes the implementation of Screffva, a computer system written in Prolog that employs a parallel corpus for the automatic generation of bilingual dictionary entries. Screffva provides a lemmatised interface between a parallel corpus and its bilingual dictionary. The system has been trialled with a parallel corpus of Cornish-English bitext. Screffva is able to retrieve any given segment of text, and uniquely identifies lexemes and the equivalences that exist between the lexical items in a bitext. Furthermore the system is able to cope with discontinuous multiword lexemes. The system is thus able to find glosses for individual lexical items or to produce longer lexical entries which include part-of-speech, glosses and example sentences from the corpus. The corpus is converted to a Prolog text database and lemmatised. Equivalents are then aligned. Finally Prolog predicates are defined for the retrieval of glosses, part-of-speech and example sentences to illustrate usage. Lexemes, including discontinuous multiword lexemes, are uniquely identified by the system and indexed to their respective segments of the corpus. Insofar as the system is able to identify specific translation equivalents in the bitext, the system provides a much more powerful research tool than existing concordancers such as ParaConc, WordSmith, XCorpus and Multiconcord. The system is able to automatically generate a bilingual dictionary which can be exported and used as the basis for a paper dictionary. Alternatively the system can be used directly as an electronic bilingual dictionary. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 351-354 | ||
| Files: 159.ps, 159.pdf | ||
|
|
||
| The Concede Model for Lexical Databases | ||
| Erjavec Tomaž
(Dept. for Intelligent Systems, Jožef Stefan Institute, Ljubljana,
Slovenia, tomaz.erjavecg@ijs.si) Evans Roger (Information Technology Research Institute, University of Brighton, Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk, http:/www.itri.brighton.ac.uk/projects/rags) Ide Nancy (Department of Computer Science, Vassar College, Poughkeepsie, NY 12604-0520 USA, ide@cs.vassar.edu) Kilgarriff Adam (ITRI, University of Brighton, Brighton, England, adam@itri.bton.ac.uk) |
||
| The value of language resources is greatly enhanced if they share a common markup with an explicit minimal semantics. Achieving this goal for lexical databases is difficult, as large-scale resources can realistically only be obtained by up-translation from pre-existing dictionaries, each with its own proprietary structure. This paper describes the approach we have taken in the Concede project, which aims to develop compatible lexical databases for six Central and Eastern European languages. Starting with sample entries from original presentation-oriented electronic representations of dictionaries, we transformed the data into an intermediate TEI-compatible representation to provide a common baseline for evaluating and comparing the dictionaries. We then developed a more restrictive encoding, formalised as an XML DTD with a clearly-defined semantic interpretation. We present this DTD and discuss a sample conversion from TEI, together with an application which hyperlinks an HTML representation of the dictionary to on-line concordancing over a corpus. | ||
| Keywords: Dictionary, Lexical Database, TEI, Up-Translation, XML | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 355-362 | ||
| Files: 335.ps, 335.pdf | ||
|
|
||
| Automatically Expansion of Thesaurus Entries with a Different Thesaurus | ||
| Kashioka Hideki
(ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai
Seika-cho Soraku-gun Kyoto 619-0288 Japan, kashioka@slt.atr.co.jp) Shirai Satosi (ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai Seika-cho Soraku-gun Kyoto 619-0288 Japan, shirai@slt.atr.co.jp) |
||
| We propose a method for expanding the entries in a thesaurus using a different thesaurus constructed with another concept. This method constructs a mapping table between the concept codes of these two different thesauri. Then, almost all of the entries of the latter thesaurus are assigned the concept codes of the former thesaurus with the mapping table between them. To confirm whether this method is effective or not, we construct a mapping table between the “Kadokawa-shin-ruigo” thesaurus (hereafter, “ShinRuigo”) and “Nihongo-goitaikei” (hereafter, “Goitaikei”), and assign about 350 thousand entries with the mapping table. About 10% of the entries cannot be assigned automatically. It is shown that this method can save cost in expanding a thesaurus. | ||
| Keywords: Automatically Category Code Assignment, Expanding Entry, Thesaurus | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 363-366 | ||
| Files: 142.ps, 142.pdf | ||
|
|
||
| Electronic Language Resources for Polish: POLEX, CEGLEX and GRAMLEX | ||
| Vetulani Zygmunt
(Adam Mickiewicz University, Department of Computer Linguistics and
Artificial Intelligence, ul. Matejki 48/49, PL-60769 Poznań, Poland,
http://main.amu.edu.pl/~vetulani, vetulani@amu.edu.pl) |
||
| We present theoretical results and resources obtained within three projects: the national project POLEX, the Copernicus 1 Project CEGLEX (1032) and the Copernicus Project GRAMLEX (632). Morphological resources obtained within these projects help to fill the gap on the map of available electronic language resources for Polish. After a short presentation of some common methodological bases defined within the POLEX project, we proceed to present the methodology and data obtained in the CEGLEX and GRAMLEX projects. The intention of the Polish language part of CEGLEX was to test formats proposed by the GENELEX project against Polish data. The aim of the GRAMLEX project was to create corpus-based morphological resources for Polish. GRAMLEX refers directly to the morphological part of the CEGLEX project. Large samples of data presented here are accessible at http://main.amu.edu.pl/~zlisi/projects.htm. | ||
| Keywords: Dictionary Formats, Electronic Dictionaries, NLP Tools, Polish Morphology, Resources | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 367-374 | ||
| Files: 62.ps, 62.pdf | ||
|
|
||
| Turkish Electronic Living Lexicon (TELL): A Lexical Database | ||
| Inkelas Sharon
(University of California at Berkeley, Department of Linguistics,
Berkeley, CA 94720, USA, inkelas@socrates.berkeley.edu) Küntay Aylin (Koç University, Department of Psychology, Istinye, Istanbul 80860, Turkey, akuntay@ku.edu.tr) Orgun C. Orhan (University of California at Davis, Department of Linguistics, Davis, CA 95616-8177, ocorgun@ucdavis.edu) Sprouse Ronald (University of California at Berkeley, Department of Linguistics, Berkeley, CA 94720, USA, Ronald@uclink.berkeley.edu) |
||
| The purpose of the TELL project is to create a database of Turkish lexical items which reflects actual speaker knowledge, rather than the normative and phonologically incomplete dictionary representations on which most of the existing phonological literature on Turkish is based. The database, accessible over the internet, should greatly enhance phonological, morphological, and lexical research on the language. The current version of TELL consists of the following components: • Some 15,000 headwords from the 2d and 3d editions of the Oxford Turkish-English dictionary, orthographically represented. • Proper names, including 175 place names from a guide of Istanbul, and 5,000 place names from a telephone area code directory of Turkey. • Phonemic transcriptions of the pronunciations of the same headwords and place names embedded in various morphological contexts. (Eliciting suffixed forms along with stems exposes any morphophonemic alternations that the headwords in question are subject to.) • Etymological information, garnered from a variety of etymological sources. • Roots for a number of morphologically complex headwords. The paper describes the construction of the current structure of the TELL database, points out potential questions that could be addressed by putting the database into use, and specifies goals for the next phase of the project. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 375-382 | ||
| Files: 86.ps, 86.pdf | ||
|
|
||
| Tools for the Generation of Morphological Entries in Dictionaries | ||
| Viks Ülle
(Institute of the Esthonian Language, Rooskrantsi 6, EE 10119 Tallinn,
Esthonia, ylle@eki.ee) |
||
| The lexicographer's tool introduced in the report represents a semiautomatic system to generate the section of morphological information for Estonian words in dictionary entries. Estonian is a language with a complicated morphology featuring (1) rich inflection and (2) marked and diverse morpheme variation, applying both to stems and formatives. The kernel of the system is a rule-based automatic morphology with separate program modules for every linguistic subsystem such as syllabification, recognition of part of speech and type of inflection, stem variation, morpheme and allomorph combinatorics. The modules function as rule interpreters applying formal grammars in an editable text format. The system enables generation of the following: (1) part of speech, (2) type of inflection, (3) inflected forms, (4) morphonological marking: degree of quantity, morpheme boundaries (stem+formative, component boundaries in compounds), (5) morphological references for inflected forms considerably different from the headword. The system is configurable, so that the inflected forms to be generated, the style of morphonological marking and the criteria for reference selection are all up to the user to choose. Full automation of the system application is restricted mainly by morphological homonymy. | ||
| Keywords: Formal Descriptions, Grammatical Data, Linguistic Software, Rule-Based Morphology, Traditional Dictionaries | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 383-388 | ||
| Files: 338.ps, 338.pdf | ||
|
|
||
| Design and Construction of Knowledge base for Verb using MRD and Tagged Corpus | ||
| Chae Young-Soog
(Korea Terminology Research Center for Language and Knowledge
Engineering, Department of Computer Science, Korea Advanced Institute of
Science and Technology, 373-1 Kusong-dong Yusong-gu Taejon 305-701
Korea, yschae@korterm.kaist.ac.kr) Choi Key-Sun (Korea Terminology Research Center for Language and Knowledge Engineering, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong Yusong-gu Taejon 305-701 Korea, kschoi@korterm.kaist.ac.kr) |
||
| This paper presents a procedure for building a syntactic knowledge base. The study aims to construct basic sentence patterns automatically, using the POS-tagged corpus in the balanced KAIST corpus and an electronic dictionary for Korean, and to construct a syntactic knowledge base with specific information added to the lexicographer's analysis. The work process is summarised as follows: 1) extraction of characteristic verbs, targeting the high-frequency verbs in the KAIST corpus; 2) construction of sentence patterns from the case frame structure of each verb extracted from the MRD; 3) determination of the noun categories of each sentence pattern through KCP examples; 4) semantic classification of the selected verbs according to the classified sentence patterns; 5) description of the hyper concept for individual noun categories; 6) addition of Japanese translations for each noun and verb. | ||
| Keywords: Case Frame, Knowledge Base, MRD, Tagged Corpus, Verb | ||
| LREC2000 Proceedings: Session WP1 - Lexicon, pages 389-394 | ||
| Files: 237.ps, 237.pdf | ||
|
|
||
| Recruitment Techniques for Minority Language Speech Databases: Some Observations | ||
| Jones Rhys James
(Speech and Image Processing Research Group, University of Wales,
Singleton Park, Swansea SA2 8PP, Wales, UK, R.J.Jones@swansea.ac.uk) Mason John S. (Speech and Image Processing Research Group, University of Wales, Singleton Park, Swansea SA2 8PP, Wales, UK, J.S.D.Mason@swansea.ac.uk) Helliker Louise (BT Advanced Communications Engineering, Adastral Park, Martlesham Heath, Ipswich, Suffolk IP5 3RE, UK, louise.helliker@bt.com) Pawlewski Mark (BT Advanced Communications Engineering, Adastral Park, Martlesham Heath, Ipswich, Suffolk IP5 3RE, UK, mark.pawlewski@bt.com) |
||
| This paper describes the collection efforts for SpeechDat Cymru, a 2000-speaker database for Welsh, a minority language spoken by about 500,000 of the Welsh population. The database is part of the SpeechDat(II) project. General database details are discussed insofar as they affect recruitment strategies, and likely differences between minority language spoken language resource (SLR) and general SLR collection are noted. Individual recruitment techniques are then detailed, with an indication of their relative successes and relevance to minority language SLR collection generally. It is observed that no one technique was sufficient to collect the entire database, and that those techniques involving face-to-face recruitment by an individual closely involved with the database collection produced the best yields for effort expended. More traditional postal recruitment techniques were less successful. The experiences during collection underlined the importance of utilising enthusiastic recruiters, and taking advantage of the speaker networks present in the community. | ||
| Keywords: Minority Languages, Recruitment, Speech Databases, SpeechDat, Welsh | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 397-402 | ||
| Files: 167.ps, 167.pdf | ||
|
|
||
| Enhancing Speech Corpus Resources with Multiple Lexical Tag Layers | ||
| Witt Andreas (Fakultät
für Linguistik und Literaturwissenschaft, Universität
Bielefeld, witt@lili.uni-bielefeld.de, Postfach 10 01 31, 33501
Bielefeld, Germany) Lüngen Harald (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, luengen@spectrum.uni-bielefeld.de, Postfach 10 01 31, 33501 Bielefeld, Germany) Gibbon Dafydd (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Postfach 100 131, D–33501 Bielefeld, Germany, gibbon@spectrum.uni-bielefeld.de) |
||
| We describe a general two-stage procedure for re-using a custom corpus for spoken language system development involving a transformation from character-based markup to XML, and DSSSL stylesheet-driven XML markup enhancement with multiple lexical tag trees. The procedure was used to generate a fully tagged corpus; alternatively, with greater economy of computing resources, it can be employed as a parametrised ‘tagging on demand’ filter. The implementation will shortly be released as a public resource together with the corpus (German spoken dialogue, about 500k word form tokens) and lexicon (about 75k word form types). | ||
| Keywords: DSSSL, Morphology, Speech Corpora, Speech Lexica, Text Technology, XML | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 403-408 | ||
| Files: 183.ps, 183.pdf | ||
|
|
||
| What are Transcription Errors and Why are They made? | ||
| Oppermann Daniela
(Institute of Phonetics and Speech Communication, Schellingstr. 3, 80799
Munich, Germany, daniela.oppermann@phonetik.uni-muenchen.de) Burger Susanne (Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, USA, University of Karlsruhe, Germany, sburger@cs.cmu.edu) Weilhammer Karl (Institute of Phonetics and Speech Communication, Schellingstr. 3, 80799 Munich, Germany, karl.weilhammer@phonetik.uni-muenchen.de) |
||
| In recent work we compared transcriptions of German spontaneous dialogues of the VERBMOBIL corpus to ascertain differences between transcribers and quality. A better understanding of where and what kind of inconsistencies occur will help us to improve the working environment for transcribers, to reduce the effort on correction passes, and will finally result in better transcription quality. The results show that transcribers have different levels of perception of spontaneous speech phenomena, mainly prosodic phenomena such as pauses in speech and lengthening. During the correction pass 80% of these labels had to be inserted. Additionally, the annotation of non-grammatical phrases and pronunciation comments seems to need a better explanation in the convention manual. Here the correcting transcribers had to change 20% of the annotations. | ||
| Keywords: Annotation Errors, Data-Collection, Spontaneous Speech, Transcription | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 409-414 | ||
| Files: 205.ps, 205.pdf | ||
|
|
||
| Quality Control in Large Annotation Projects Involving Multiple Judges: The Case of the TDT Corpora | ||
| Strassel Stephanie
(Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104,
USA, strassel@ldc.upenn.edu) Graff David (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, graff@ldc.upenn.edu) Martey Nii (Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104, USA, nmartey@ldc.upenn.edu) Cieri Christopher (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, ccieri@ldc.upenn.edu) |
||
| The Linguistic Data Consortium at the University of Pennsylvania has recently been engaged in the creation of large-scale annotated corpora of broadcast news materials in support of the ongoing Topic Detection and Tracking (TDT) research project. The TDT corpora were designed to support three basic research tasks: segmentation, topic detection, and topic tracking in newswire, television and radio sources from English and Mandarin Chinese. The most recent TDT corpus, TDT3, added two tasks, story link and first story detection. Annotation of the TDT corpora involved a large staff of annotators who produced millions of human judgements. As with any large corpus creation effort, quality assurance and inter-annotator consistency were a major concern. This paper reports the quality control measures adopted by the LDC during the creation of the TDT corpora, presents techniques that were utilized to evaluate and improve the consistency of human annotators for all annotation tasks, and discusses aspects of project administration that were designed to enhance annotation consistency. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 415-422 | ||
| Files: 212.ps, 212.pdf | ||
|
|
||
| A New Methodology for Speech Corpora Definition from Internet Documents | ||
| Vaufreydaz D. (Laboratoire
CLIPS-IMAG, équipe GEOD, Université Joseph Fourier, Campus
scientifique, B.P. 53, 38041 Grenoble cedex 9, France,
Dominique.Vaufreydaz@imag.fr) Bergamini C. (Laboratoire CLIPS-IMAG, équipe GEOD, Université Joseph Fourier, Campus scientifique, B.P. 53, 38041 Grenoble cedex 9, France, Carole.Bergamini@imag.fr) Serignat J.F. (Laboratoire CLIPS-IMAG, équipe GEOD, Université Joseph Fourier, Campus scientifique, B.P. 53, 38041 Grenoble cedex 9, France, Jean-Francois.Serignat@imag.fr) Besacier L. (Laboratoire CLIPS-IMAG, équipe GEOD, Université Joseph Fourier, Campus scientifique, B.P. 53, 38041 Grenoble cedex 9, France, Laurent.Besacier@imag.fr) Akbar M. (Laboratoire CLIPS-IMAG, équipe GEOD, Université Joseph Fourier, Campus scientifique, B.P. 53, 38041 Grenoble cedex 9, France, Mohammad.Akbar@imag.fr) |
||
| In this paper, a new methodology for speech corpora definition from internet documents is described, in order to record a large speech database dedicated to the training and testing of acoustic models for speech recognition. In the first section, the Web robot which is in charge of collecting Web pages from the Internet is presented; then the mechanism for filtering web text into French sentences is explained. Some information about the corpus organization (90% for training and 10% for test) is given. In the third section, the phoneme distribution of the corpus is presented and a comparison is made with other French language studies. Finally, tools and planning for recording the speech database with more than one hundred speakers are described. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 423-426 | ||
| Files: 235.ps, 235.pdf | ||
|
|
||
| Many Uses, Many Annotations for Large Speech Corpora: Switchboard and TDT as Case Studies | ||
| Graff David
(Linguistic Data Consortium, University of Pennsylvania, Philadelphia,
Pennsylvania, USA, graff@ldc.upenn.edu) Bird Steven (LDC, 3615 Market Street, Suite 200, Philadelphia, PA, 19104-2608, USA, sb@unagi.cis.upenn.edu) |
||
| This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 427-434 | ||
| Files: 282.ps, 282.pdf | ||
|
|
||
| SLR Validation: Present State of Affairs and Prospects | ||
| van den Heuvel Henk
(SPEX, Nijmegen, Netherlands, e-mail: H.v.d.Heuvel@let.kun.nl) Boves Lou (SPEX, Nijmegen, Netherlands) Choukri Khalid (European Language Resources Association (ELRA) &, European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, choukri@elda.fr) Goddijn Simo (Speech Processing Expertise Centre (SPEX), Department of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands, s.goddijn@let.kun.nl) Sanders Eric (SPEX, Nijmegen, Netherlands) |
||
| This paper deals with the quality evaluation (validation) and improvement of Spoken Language Resources (SLR). We discuss a number of aspects of SLR validation. We review the work done so far in this field. The most important validation check points and our view on their rank order are listed. We propose a strategy for validation and improvement of SLR that is presently considered at the European Language Resources Association, ELRA. And finally, we show some of our future plans in these directions. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 435-440 | ||
| Files: 39.ps, 39.pdf | ||
|
|
||
| On the Usage of Kappa to Evaluate Agreement on Coding Tasks | ||
| Di Eugenio Barbara
(Electrical Engineering and Computer Science, University of Illinois at
Chicago, Chicago, IL 60607, USA, dieugeni@eecs.uic.edu) |
||
| In recent years, the Kappa coefficient of agreement has become the de facto standard to evaluate intercoder agreement in the discourse and dialogue processing community. Together with the adoption of this standard, researchers have adopted one specific scale to evaluate Kappa values, the one proposed in (Krippendorff, 1980). In this paper, I highlight some issues that should be taken into account when evaluating Kappa values. Finally, I speculate on whether Kappa could be used as a measure to evaluate a system’s performance. | ||
| Keywords: Annotated Corpora, Intercoder Reliability | ||
| LREC2000 Proceedings: Session SP2 - Spoken Language Resources Issues from Construction to Validation, pages 441-444 | ||
| Files: 206.ps, 206.pdf | ||
|
|
||
| A Word-level Morphosyntactic Analyzer for Basque | ||
| Aduriz I. (Universidad
de Barcelona, Gran Vía de las Cortes Catalanas, 585, E-08007
Barcelona) Agirre E. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Aldezabal I. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Arregi X. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Arriola J. M. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Artola X. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Gojenola K. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Maritxalar A. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Sarasola K. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Urkia M. (UZEI, Aldapeta 20 , E-20009 Donostia, Basque Country, jipgogak@si.ehu.es) |
||
| This work presents the development and implementation of a full morphological analyzer for Basque, an agglutinative language. Several problems (phrase structure inside word-forms, noun ellipsis, multiplicity of values for the same feature and the use of complex linguistic representations) have forced us to go beyond the morphological segmentation of words, and to include an extra module that performs a full morphosyntactic parsing of each word-form. A unification-based word-level grammar has been defined for that purpose. The system has been integrated into a general environment for the automatic processing of corpora, using TEI-conformant SGML feature structures. | ||
| Keywords: Agglutinative Languages, Morphology, Morphosyntax | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 447-452 | ||
| Files: 44.ps, 44.pdf | ||
|
|
||
| Interactive Corpus Annotation | ||
| Brants Thorsten
(Computational Linguistics, Saarland University, 66041 Saarbrücken,
Germany, brants@coli.uni-sb.de) Plaehn Oliver (Computational Linguistics, Saarland University, 66041 Saarbrücken, Germany, plaehn@coli.uni-sb.de) |
||
| We present an easy-to-use graphical tool for syntactic corpus annotation. This tool, Annotate, interacts with a part-of-speech tagger and a parser running in the background. The parser incrementally suggests single phrases bottom-up based on cascaded Markov models. A human annotator confirms or rejects the parser’s suggestions. This semi-automatic process facilitates a very rapid and efficient annotation. | ||
| Keywords: Annotation Tools, Combination of Automatic and Manual Annotation, Corpus Annotation | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 453-460 | ||
| Files: 334.ps, 334.pdf | ||
|
|
||
| Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model | ||
| Shirai Kiyoaki
(Department of Computer Science, Graduate School of Information Science
and Engineering, Tokyo Institute of Technology, kshirai@cl.cs.titech.ac.jp) Tanaka Hozumi (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, tanaka@cl.cs.titech.ac.jp) Tokunaga Takenobu (Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, take@cl.cs.titech.ac.jp) |
||
| In this paper, we propose a method to construct a tree-annotated corpus, when a certain statistical parsing system exists and no tree-annotated corpus is available as training data. The basic idea of our method is to sequentially annotate plain text inputs with syntactic trees using a parser with a statistical language model, and iteratively retrain the statistical language model over the obtained annotated trees. The major characteristics of our method are as follows: (1) in the first step of the iterative learning process, we manually construct a tree-annotated corpus from which the statistical language model is initialized, and (2) at each step of the parse tree annotation process, we use both syntactic statistics obtained from the iterative learning process and lexical statistics pre-derived from existing language resources, to choose the most probable parse tree. | ||
| Keywords: Human Intervention, Iterative Learning, Statistical Language Model, Tree-Annotated Corpus | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 461-466 | ||
| Files: 341.ps, 341.pdf | ||
|
|
||
| A Robust Parser for Unrestricted Greek Text | ||
| Boutsis Sotiris
(Institute for Language and Speech Processing, Artemidos 6 &
Epidavrou, 151 25 Maroussi, Greece, sboutsis@ilsp.gr) Prokopidis Prokopis (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25 Maroussi, Greece, prokopis@ilsp.gr) Giouli Voula (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25, Athens, Greece, tel: +301 6875300, fax: +301 6854270, voula@ilsp.gr) Piperidis Stelios (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25, Athens, Greece, tel: +301 6875300, fax: +301 6854270, spip@ilsp.gr) |
||
| In this paper we describe a method for the efficient parsing of real-life Greek texts at the surface syntactic level. A grammar consisting of non-recursive regular expressions describing Greek phrase structure has been compiled into a cascade of finite state transducers used to recognize syntactic constituents. The implemented parser lends itself to applications where large scale text processing is involved, and fast, robust, and relatively accurate syntactic analysis is necessary. The parser has been evaluated against a corpus of ca. 34,000 words of financial and news texts and achieved promising precision and recall scores. | ||
| Keywords: Finite State Transducers, Greek, Partial Parsing | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 467-474 | ||
| Files: 174.ps, 174.pdf | ||
|
|
||
| Automatic Assignment of Grammatical Relations | ||
| Lesmo Leonardo (Dipartimento
di Informatica, Università di Torino, c.so Svizzera 185, 10149,
Torino (Italy), lesmo@di.unito.it) Lombardo Vincenzo (DISTA – Università del Piemonte Orientale “A. Avogadro”, c.so Borsalino 54, 15100 Alessandria, Italy, Centro di Scienza Cognitiva – Università di Torino, via Lagrange 3, 10123 Torino, Italy, vincenzo@di.unito.it) |
||
| This paper presents a method for the assignment of grammatical relation labels in a sentence structure. The method has been implemented in the software tool AGRA (Automatic Grammatical Relation Assigner), which is part of a project for the development of a treebank of Italian sentences, and a knowledge base of Italian subcategorization frames. The annotation schema implements a notion of underspecification that arranges grammatical relations in a hierarchy from generic to specific; the software tool works with hand-coded rules, which apply heuristic knowledge (on syntactic and semantic cues) to distinguish between complements and modifiers. | ||
| Keywords: Adjuncts, Complements, Dependency Tree, Grammatical Relations, Subcategorization | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 475-482 | ||
| Files: 218.ps, 218.pdf | ||
|
|
||
| Resources for Lexicalized Tree Adjoining Grammars and XML Encoding: TagML | ||
| Bonhomme Patrice
(LORIA, BP 239, F-54506 Vandoeuvre-lès-Nancy, bonhomme@loria.fr) Lopez Patrice (DFKI GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, lopez@dfki.de) |
||
| This work addresses both practical and theoretical purposes for the encoding and the exploitation of linguistic resources for feature-based Lexicalized Tree Adjoining Grammars (LTAG). The main goals of these specifications are the following: 1. Define a recommendation by way of an XML (Bray et al., 1998) DTD or schema (Fallside, 2000) for encoding LTAG resources in order to exchange grammars, share tools and compare parsers. 2. Exploit XML, its features and the related recommendations for the representation of complex and redundant linguistic structures based on a general methodology. 3. Study the resource organisation and the level of generalisation which are relevant for a lexicalized tree grammar. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 483-490 | ||
| Files: 182.ps, 182.pdf | ||
|
|
||
| CLinkA A Coreferential Links Annotator | ||
| Orăsan
Constantin (School of Humanities, Languages and Social Sciences,
Stafford Street, University of Wolverhampton, Wolverhampton, WV1 1SB,
United Kingdom, in6093@wlv.ac.uk) |
||
| The annotation of coreferential chains in a text is a difficult task, which requires a lot of concentration. Given its complexity, without an appropriate tool it is very difficult to produce high quality coreferentially annotated corpora. In this paper we discuss the requirements for developing a tool to help the human annotator in this task. The annotation scheme used by our program is derived from the one proposed by the MUC-7 Coreference Task Annotation, but is not restricted to that one. Using a very simple language, the user is able to define his/her own annotation scheme. The tool has a user-friendly interface and is language and platform independent. | ||
| Keywords: Anaphora Resolution, Coreference Annotation, Coreferential Chain, Evaluation, MUC-7, Multilingual Tool, User Interface | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 491-496 | ||
| Files: 179.ps, 179.pdf | ||
|
|
||
| Coreference in Annotating a Large Corpus | ||
| Hajičová
Eva (Faculty of Mathematics and Physics, Charles University,
Malostranské náměstí 25, 1180 Praha 1,
Czechia, hajicova@ufal.mff.cuni.cz) Panevová Jarmila (Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 1180 Praha 1, Czechia, panevova@ufal.mff.cuni.cz) Sgall Petr (Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 1180 Praha 1, Czechia, sgall@ufal.mff.cuni.cz) |
||
| The Prague Dependency Treebank (PDT) is a part of the Czech National Corpus, annotated with disambiguated structural descriptions representing the meaning of every sentence in its environment. To achieve that aim, it is necessary i.a. to make explicit (at least some basic) coreferential relations within the sentence boundaries and also beyond them. The PDT scenario includes both automatic and 'manual' procedures; among the former type, there is one that concerns coreference, indicating the lemma of the subject in a specific attribute of the label belonging to a node for a reflexive pronoun, and assigning the deleted nodes in coordinated constructions the lemmas of their counterparts in the given construction. 'Manual' operations restore nodes for the deleted items mostly as pronouns. The distinction between grammatical and textual coreference is reflected. In order to get a possibility of handling textual coreference, specific attributes reflect the linking of sentences to each other and to the context of situation, and the development of the degrees of activation of the 'stock of shared knowledge' will be registered in so far as they are derivable from the use of nouns in subsequent utterances in a discourse. | ||
| Keywords: Coreference, Corpus, Dependency, Syntax | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 497-500 | ||
| Files: 19.ps, 19.pdf | ||
|
|
||
| FAST - Towards a Semi-automatic Annotation of Corpora | ||
| Barbu Cătălina
(School of Humanities, Languages and Social Sciences, University of
Wolverhampton, Stafford Road, Wolverhampton, UK, in6465@wlv.ac.uk) |
||
| As the use of annotated corpora in natural language processing applications increases, we are aware of the necessity of having flexible annotation tools that would not only support manual annotation, but also enable us to perform post-editing on a text which has already been automatically annotated using a separate processing tool, and even to interact with the tool during the annotation process. In practice, we have been confronted with the problem of converting the output of different tools to SGML format while preserving the previous annotation, as well as with the difficulty of manually post-editing an annotated text. It has occurred to us that designing an interface between an annotation tool and any automatic tool would not only provide an easy way of taking advantage of the automatic annotation, but would also allow easier interactive manual editing of the results. FAST was designed as a manual tagger that can also be used in conjunction with automatic tools for speeding up the human annotation. | ||
| Keywords: Annotation tool, Corpus, Corpus Annotation | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 501-506 | ||
| Files: 130.ps, 130.pdf | ||
|
|
||
| Layout Annotation in a Corpus of Patient Information Leaflets | ||
| Bouayad-Agha Nadjet
(Information Technology Research Institute, University of Brighton,
Lewes Road, Brighton BN2 4GJ, UK, nadjet@itri.bton.ac.uk) |
||
| We discuss the problems and issues that arose during the development of a procedure for annotating layout in a corpus of Patient Information Leaflets. We show how the genre of the corpus as well as the aim of the annotation influenced the annotation scheme. We also describe the automatic annotation procedure. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP2 - Corpus Annotation, pages 507-510 | ||
| Files: 234.ps, 234.pdf | ||
|
|
||
| Designing a Tool for Exploiting Bilingual Comparable Corpora | ||
| Bennison Peter
(Dublin City University, Dublin 9, Ireland, pbenn@hotmail.com) Bowker Lynne (Dublin City University, Dublin 9, Ireland, lynne.bowker@dcu.ie) |
||
| Translators have a real need for a tool that will allow them to exploit information contained in bilingual comparable corpora. ExTrECC is designed to be a semi-automatic tool that processes bilingual comparable corpora and presents a translator with a list of potential equivalents (in context) of the search term. The task of identifying translation equivalents in a non-aligned, non-translated corpus is a difficult one, and ExTrECC makes use of a number of techniques, some of which are simple and others more sophisticated. The basic design of ExTrECC (graphical user interface, architecture, algorithms) is outlined in this paper. | ||
| Keywords: Bilingual Comparable Corpora, Computer-Assisted Translation Tools, Corpus Design, ExTrECC, Term Extraction, Translation Equivalents | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 513-516 | ||
| Files: 20.ps, 20.pdf | ||
|
|
||
| A Word Sense Disambiguation Method Using Bilingual Corpus | ||
| Jie Zheng
(Institute for Language and Speech Processing, Department of Automation,
Tsinghua University, P.R.China, zhj@mail.au.tsinghua.edu.cn) Yuhang Mao (Institute for Language and Speech Processing, Department of Automation, Tsinghua University, P.R.China) |
||
| This paper proposes a word sense disambiguation (WSD) method using a bilingual corpus in an English-Chinese machine translation system. A mathematical model is constructed to disambiguate words in terms of context phrasal collocation. A rule learning algorithm is proposed, and an algorithm for applying the learned rules is also provided, which can increase the recall ratio. Finally, an experimental analysis of the algorithm is given. Its application gives an increase of 10% in precision. | ||
| Keywords: Machine Translation, Natural Language Processing, Word Sense Tagging | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 517-522 | ||
| Files: 15.ps, 15.pdf | ||
|
|
||
| Building the Croatian-English Parallel Corpus | ||
| Tadić Marko
(Department of general linguistics and oriental studies, Faculty of
philosophy, University of Zagreb, Ivana, Croatia, marko.tadic@ffzg.hr) |
||
| The contribution gives a survey of procedures and formats used in building the Croatian-English parallel corpus which is being collected in the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After a quick survey of existing English-Croatian parallel corpora, the article deals with the procedures involved in text conversion and text encoding, particularly the alignment. There are several recent suggestions for alignment encoding and they are elaborated. Preliminary statistics on the numbers of S and W elements in each language are given at the end of the article. | ||
| Keywords: Alignment, Corpus Linguistics, Croatian, English, Parallel Corpus, XCES, XML | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 523-530 | ||
| Files: 119.ps, 119.pdf | ||
|
|
||
| A Parallel Corpus of Italian/German Legal Texts | ||
| Gamper Johann
(European Academy Bolzano, Scientific Area “Language and Law”,
Weggensteinstr. 12a, 39100 Bozen, Italy, jgamper@eurac.edu) |
||
| This paper presents the creation of a parallel corpus of Italian and German legal documents which are translations of one another. The corpus, which contains approximately 5 million words, is primarily intended as a resource for (semi-)automatic terminology acquisition. The guidelines of the Corpus Encoding Standard have been applied for encoding structural information, segmentation information, and sentence alignment. Since the parallel texts have a one-to-one correspondence on the sentence level, building a perfect sentence alignment is rather straightforward. As a result, the corpus also constitutes a valuable testbed for the evaluation of alignment algorithms. The paper discusses the intended use of the corpus, the various phases of corpus compilation, and basic statistics. | ||
| Keywords: CES, Corpus Encoding, Parallel Corpus | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 531-538 | ||
| Files: 140.ps, 140.pdf | ||
|
|
||
| Lexical and Translation Equivalence in Parallel Corpora | ||
| Váradi Tamás
(Linguistics Institute, Hungarian Academy of Sciences, H-1014 Budapest
Színház u 5-9, varadi@nytud.hu) |
||
| In the present paper we intend to investigate to what extent the use of parallel corpora can help to eliminate some of the difficulties noted with bilingual dictionaries. The particular issues addressed are the bidirectionality of translation equivalence, the coverage of multiword units, and the amount of implicit knowledge presupposed on the part of the user in interpreting the data. Three lexical items belonging to different word classes were chosen for analysis: the noun "head", the verb "give" and the preposition "with". George Orwell's novel 1984, which is available in English-Hungarian sentence-aligned form, was used as source material. It is argued that the analysis of translation equivalents displayed in sets of concordances with aligned sentences in the target language holds important implications for bilingual lexicography and automatic word alignment methodology. | ||
| Keywords: Multilingual Alignment, Parallel Corpora, Sense Discrimination | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 539-544 | ||
| Files: 122.ps, 122.pdf | ||
|
|
||
| Some Technical Aspects about Aligning Near Languages | ||
| de Yzaguirre Lluís
(Institute for Applied Linguistics, Universitat Pompeu Fabra, La Rambla,
30-32. 08002, Barcelona, Spain, de_yza@upf.es) Ribas Marta (Institute for Applied Linguistics, Universitat Pompeu Fabra, La Rambla, 30-32. 08002, Barcelona, Spain) Vivaldi Jordi (Institute for Applied Linguistics, Universitat Pompeu Fabra, Rambla Santa Mònica, 30, 08002 Barcelona, Spain, jorge.vivaldi@info.upf.es) Cabré M. Teresa (Institute for Applied Linguistics, Universitat Pompeu Fabra, Rambla Santa Mònica, 30, 08002 Barcelona, Spain, teresa.cabre@trad.upf.es) |
||
| IULA at UPF has developed an aligner that benefits from corpus processing results to produce an accurate and robust alignment, even with noisy parallel corpora. It compares lemmata and part-of-speech tags of analysed texts, but it has two main characteristics. First, it apparently works only for near languages, and second, it requires morphological taggers for the compared languages. These two characteristics prevent this technique from being used for arbitrary pairs of languages. Whenever it is applicable, a high quality of results is achieved. | ||
| Keywords: Lemma and Part-of-Speech Based Alignment, Sentence Alignment | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 545-548 | ||
| Files: 186.ps, 186.pdf | ||
|
|
||
| Cairo: An Alignment Visualization Tool | ||
| Smith Noah A.
(University of Maryland, College Park, Maryland, USA, nasmith@cs.umd.edu) Jahr Michael E. (Stanford University, Stanford, California, USA, mjahr@stanford.edu) |
||
| While developing a suite of tools for statistical machine translation research, we recognized the need for a visualization tool that would allow researchers to examine and evaluate specific word correspondences generated by a translation system. We developed Cairo to fill this need. Cairo is a free, open-source, portable, user-friendly, GUI-driven program written in Java that provides a visual representation of word correspondences between bilingual pairs of sentences, as well as relevant translation model parameters. This program can be easily adapted for visualization of correspondences in bi-texts based on probability distributions. | ||
| Keywords: Java, Statistical Machine Translation, Visualization, Word Alignment | ||
| LREC2000 Proceedings: Session WP3 - Multilingual Corpora, pages 549-552 | ||
| Files: 58.ps, 58.pdf | ||
|
|
||
| GREEK ToBI: A System for the Annotation of Greek Speech Corpora | ||
| Arvaniti Amalia
(Department of Foreign Languages and Literatures, University of Cyprus,
P.O. Box 20537, Nicosia 1678, Cyprus, amalia@ucy.ac.cy) Baltazani Mary (Department of Linguistics, UCLA, 405 Hilgard Avenue, Los Angeles, CA 90095-1543, USA) |
||
| Greek ToBI is a system for the annotation of (Standard) Greek spoken corpora, that encodes intonational, prosodic and phonetic information. It is used to develop a large and publicly available database of prosodically annotated utterances for research, engineering and educational purposes. Greek ToBI is based on the system developed for American English (ToBI), but includes novel features (“tiers”) designed to address particularities of Greek prosody that merit annotation, such as stress and juncture. Thus Greek ToBI includes five tiers: the Tone Tier shows the intonational analysis of the utterance; the Prosodic Words Tier is a phonetic transcription; the Break Index Tier shows indices of cohesion; the Words Tier gives the text in romanization; the Miscellaneous Tier is used to encode other relevant information (e.g., disfluency or pitch-halving). The development of GRToBI is largely based on the transcription and analysis of a corpus of spoken Greek, that includes data from several speakers and speech styles, but also draws on existing quantitative research on Greek prosody. | ||
| Keywords: Annotation, Greek, Intonation, Prosody, Spoken Corpora, ToBI | ||
| LREC2000 Proceedings: Session SO3 - Speech Synthesis, pages 555-562 | ||
| Files: 7.ps, 7.pdf | ||
|
|
||
| EULER: an Open, Generic, Multilingual and Multi-platform Text-to-Speech System | ||
| Dutoit Thierry (Faculté
Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment
Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65
374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: fdutoit@tcts.fpms.ac.be) Bagein Michel (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: bagein@tcts.fpms.ac.be) Malfrère Fabrice (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: bagein@tcts.fpms.ac.be) Pagel Vincent (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: pagel@tcts.fpms.ac.be) Ruelle Alain (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: ruelle@tcts.fpms.ac.be) Tounsi Nawfal (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: tounsig@tcts.fpms.ac.be) Wynsberghe Dominique (Faculté Polytechnique de Mons, Circuits Theory and Signal Processing Lab, Bâtiment Multitel, Parc Initialis, av. Copernic, B7000 Mons, BELGIUM, Tel: +32 65 374733 Fax: +32 65 374729, Web: http://tcts.fpms.ac.be, Email: tounsig@tcts.fpms.ac.be) |
||
| The aim of the collaborative project presented in this paper is to obtain a set of highly modular Text-To-Speech synthesizers for as many voices, languages and dialects as possible, free for use in non-commercial and non-military applications. This project is an extension of the MBROLA project: MBROLA is a speech synthesizer, freely distributed for non-commercial purposes, which uses diphone databases provided by users (19 languages in year 2000). Euler extends this idea to whole TTS systems by providing a backbone structure (MLC) and several generic algorithms for POS tagging, grapheme-to-phoneme conversion, and prosody generation. To demonstrate the potential of the architecture and draw developers’ interest, we provide a full EULER-based TTS in French and in Arabic. Euler currently runs on Windows and Linux, and it is an open project: many of its components (and certainly its kernel) are provided as GNU C++ sources. It also incorporates, as much as possible, components and data derived from other TTS-related projects. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SO3 - Speech Synthesis, pages 563-566 | ||
| Files: 41.ps, 41.pdf | ||
|
|
||
| POSCAT: A Morpheme-based Speech Corpus Annotation Tool | ||
| Kim Byeongchang
(Department of Computer Science & Engineering, Pohang University of
Science & Technology, Pohang, 790-784, South Korea, bckim@nlp.postech.ac.kr) Cha Jeongwon (Department of Computer Science & Engineering, Pohang University of Science & Technology, Pohang, 790-784, South Korea, jwcha@nlp.postech.ac.kr) Lee Geunbae (Department of Computer Science & Engineering, Pohang University of Science & Technology, Pohang, 790-784, South Korea, gblee@nlp.postech.ac.kr) Lee Jin-seok (Department of Computer Science & Engineering, Pohang University of Science & Technology, Pohang, 790-784, South Korea, wolfpack@nlp.postech.ac.kr) |
||
| As more and more speech systems require linguistic knowledge to accommodate various levels of applications, corpora that are tagged with linguistic annotations as well as signal-level annotations are highly recommended for the development of today’s speech systems. Among the linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech corpora for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. Considering the above demands, we have developed a single unified speech corpus annotation tool that enables corpus builders to link linguistic annotations to signal-level annotations using a morphological analyzer and a POS tagger as basic morpheme-based linguistic engines. Our tool integrates a syntactic analyzer, phrase break detector, grapheme-to-phoneme converter and automatic phonetic aligner together. Each engine automatically annotates its own linguistic and signal knowledge, and interacts with the corpus developers to revise and correct the annotations on demand. All the linguistic/phonetic engines were developed and merged with an interactive visualization tool in a client-server network communication model. The corpora that can be constructed using our annotation tool are multi-purpose and applicable to both speech recognition and text-to-speech (TTS) systems. Finally, since the linguistic and signal processing engines and user interactive visualization tool are implemented within a client-server model, the system loads can be reasonably distributed over several machines. | ||
| Keywords: Client-Server Model, Linguistic Annotation, Morpheme-Based Annotation, Signal-Level Annotation, Speech Corpus Annotation Tool | ||
| LREC2000 Proceedings: Session SO3 - Speech Synthesis, pages 567-572 | ||
| Files: 224.ps, 224.pdf | ||
|
|
||
| A Strategy for the Syntactic Parsing of Corpora: from Constraint Grammar Output to Unification-based Processing | ||
| Badia Toni (Institut
Universitari de Lingüística Aplicada, Universitat Pompeu
Fabra, La Rambla, 30-32, 08002 Barcelona, Spain, badia toni@trad.upf.es) Egea Àngels (Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, La Rambla, 30-32, 08002 Barcelona, Spain, angels@slc.ub.es) |
||
| This paper presents a strategy for syntactic analysis based on the combination of two different parsing techniques: lexical syntactic tagging and phrase structure syntactic parsing. The basic proposal is to take advantage of the good results on lexical syntactic tagging to improve the overall performance of unification-based parsing. The syntactic functions attached to every word by the lexical syntactic tagging are used as head features in the unification-based grammar, and form the basis for the grammar rules. | ||
| Keywords: Processing, Syntactic Analysis | ||
| LREC2000 Proceedings: Session WO7 - Syntactic Parsing, pages 575-582 | ||
| Files: 111.ps, 111.pdf | ||
|
|
||
| Learning Preference of Dependency between Japanese Subordinate Clauses and its Evaluation in Parsing | ||
| Utsuro Takehito
(Department of Information and Computer Sciences, Toyohashi University
of Technology, Tenpaku-cho, Toyohashi, 441-8580, Japan, utsuro@ics.tut.ac.jp) |
||
| (Utsuro et al., 2000) proposed a statistical method for learning dependency preference of Japanese subordinate clauses, in which the scope embedding preference of subordinate clauses is exploited as a useful information source for disambiguating dependencies between subordinate clauses. Following (Utsuro et al., 2000), this paper presents detailed results of evaluating the proposed method by comparing it with several closely related existing techniques and shows that the proposed method outperforms those existing techniques. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO7 - Syntactic Parsing, pages 583-590 | ||
| Files: 213.ps, 213.pdf | ||
|
|
||
| An Open Source Grammar Development Environment and Broad-coverage English Grammar Using HPSG | ||
| Copestake Ann (CSLI,
Ventura Hall, Stanford University, Stanford, CA 94305-4115, USA, aac@csli.stanford.edu) Flickinger Dan (CSLI, Ventura Hall, Stanford University, Stanford, CA 94305-4115, USA, danf@csli.stanford.edu) |
||
| The LinGO (Linguistic Grammars Online) project's English Resource Grammar and the LKB grammar development environment are language resources which are freely available for download for any purpose, including commercial use (see http://lingo.stanford.edu). Executable programs and source code are both included. In this paper, we give an outline of the LinGO English grammar and LKB system, and discuss the ways in which they are currently being used. The grammar and processing system can be used independently or combined to give a central component which can be exploited in a variety of ways. Our intention in writing this paper is to encourage more people to use the technology, which supports collaborative development on many levels. | ||
| Keywords: Grammars, LR Tools, Monolingual LR | ||
| LREC2000 Proceedings: Session WO7 - Syntactic Parsing, pages 591-598 | ||
| Files: 371.ps, 371.pdf | ||
|
|
||
| Controlled Bootstrapping of Lexico-semantic Classes as a Bridge between Paradigmatic and Syntagmatic Knowledge: Methodology and Evaluation | ||
| Allegrini Paolo
(Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa
56010 - ITALY, allegrip@ilc.pi.cnr.it) Montemagni Simonetta (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, simo@ilc.pi.cnr.it) Pirrelli Vito (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, vito@ilc.pi.cnr.it) |
||
| Semantic classification of words is a highly context-sensitive and somewhat moving target, hard to deal with and even harder to evaluate on an objective basis. In this paper we suggest a step-wise methodology for automatic acquisition of lexico-semantic classes and delve into the non-trivial issue of how results should be evaluated against a top-down reference standard. | ||
| Keywords: Analogy-Based Acquisition, Functional Annotation, Semantic Classification | ||
| LREC2000 Proceedings: Session WO8 - Acquisition of Semantic Information, pages 601-608 | ||
| Files: 99.ps, 99.pdf | ||
|
|
||
| Automatic Extraction of Semantic Similarity of Words from Raw Technical Texts | ||
| Thanopoulos
Aristomenis (Wire Communications Laboratory, Electrical &
Computer Engineering Dept., University of Patras, 265 00 Rion, Patras,
Greece, aristom@wcl.ee.upatras.gr) Fakotakis Nikos (Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 261 10 Rion, Patras, Greece, fakotaki@wcl.ee.upatras.gr) Kokkinakis George (Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 261 10 Rion, Patras, Greece, gkokkin@wcl.ee.upatras.gr) |
||
| In this paper we address the problem of extracting semantic similarity relations between lexical entities based on context similarities as they appear in specialized text corpora. Only general-purpose linguistic tools are utilized in order to achieve portability across domains and languages. Lexical context is extended beyond immediate adjacency but is still confined by clause boundaries. Morphological and collocational information is employed in order to make the most of the contextual data. The extracted semantic similarity relations are transformed into semantic clusters, which are a primitive form of a domain-specific term thesaurus. | ||
| Keywords: Corpus Processing, Lexical Semantics, NLP, Word Clustering | ||
| LREC2000 Proceedings: Session WO8 - Acquisition of Semantic Information, pages 609-614 | ||
| Files: 302.ps, 302.pdf | ||
|
|
||
| Abstraction of the EDR Concept Classification and its Effectiveness in Word Sense Disambiguation | ||
| Kazuhiro Kimura
(Human Interface Laboratory, Corporate Research & Development
Center, TOSHIBA, Komukai-Toshiba-cho, Saiwai-ku, KAWASAKI 212-8582
JAPAN, kazu.kimura@toshiba.co.jp) Hideki Hirakawa (Human Interface Laboratory, Corporate Research & Development Center, TOSHIBA, Komukai-Toshiba-cho, Saiwai-ku, KAWASAKI 212-8582 JAPAN, hideki.hirakawa@toshiba.co.jp) |
||
| The relation between the degree of abstraction of a concept and the explanation capability (validity and coverage) of conceptual description, which is the constraint held between concepts, is clarified experimentally by performing an operation called concept abstraction. This procedure chooses a certain set of lower-level concepts in a concept hierarchy and maps the set to one or more upper-level (abstract) concepts. We used three abstraction techniques, a flat depth, a flat size, and a flat probability method, to set the degree of abstraction. Taking these methods and degrees as parameters, we applied concept abstraction to the EDR Concept Classifications and performed a word sense disambiguation test. The test set and the disambiguation knowledge were extracted as co-occurrence expressions from the EDR Corpora. Through the test, we found that the flat probability method gives the best result. We also carried out an evaluation by comparing the abstracted hierarchy with that of human introspection and found that the flat size method gives the results most similar to those of humans. These results help clarify the appropriate level of detail of a concept for a given application of a concept hierarchy. | ||
| Keywords: Concept Abstraction, Concept Description, Concept Hierarchy, Word Sense Disambiguation | ||
| LREC2000 Proceedings: Session WO8 - Acquisition of Semantic Information, pages 615-622 | ||
| Files: 75.ps, 75.pdf | ||
|
|
||
| Where Opposites Meet. A Syntactic Meta-scheme for Corpus Annotation and Parsing Evaluation | ||
| Lenci Alessandro
(Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa
56010 - ITALY, lenci@ilc.pi.cnr.it) Montemagni Simonetta (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, simo@ilc.pi.cnr.it) Pirrelli Vito (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, vito@ilc.pi.cnr.it) Soria Claudia (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, soria@ilc.pi.cnr.it) |
||
| The paper describes the use of FAME, a functional annotation meta-scheme for comparison and evaluation of syntactic annotation schemes, i) as a flexible yardstick in multi-lingual and multi-modal parser evaluation campaigns and ii) for corpus annotation. We show that FAME complies with a variety of non-trivial methodological requirements, and has the potential for being effectively used as an “interlingua” between different syntactic representation formats. | ||
| Keywords: Construction of Linguistic Resources, Parsing Evaluation, Syntactic Annotation of Corpora | ||
| LREC2000 Proceedings: Session EO2 - Evaluation of Tools, pages 625-632 | ||
| Files: 98.ps, 98.pdf | ||
|
|
||
| A Comparison of Summarization Methods Based on Task-based Evaluation | ||
| Hajime Mochizuki
(School of Information Science, Japan Advanced Institute of Science and
Technology, Tatsunokuchi, Ishikawa 923-1292, Japan, E-mail: motizuki@jaist.ac.jp) Manabu Okumura (School of Information Science, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa 923-1292, Japan, E-mail: oku@jaist.ac.jp) |
||
| A task-based evaluation scheme has been adopted as a new method of evaluation for automatic text summarization systems. It evaluates the performance of a summarization system in a given task, such as information retrieval and text categorization. This paper compares ten different summarization methods based on information retrieval tasks. In order to evaluate the system performance, the subjects’ speed and accuracy are measured in judging the relevance of texts using summaries. We also analyze the similarity of summaries in order to investigate the similarity of the methods. Furthermore, we analyze what factors can affect evaluation results, and describe the problems that arose from our experimental design, in order to establish a better evaluation scheme. | ||
| Keywords: Automatic Text Summarization, Information Retrieval, Task-Based Evaluation | ||
| LREC2000 Proceedings: Session EO2 - Evaluation of Tools, pages 633-640 | ||
| Files: 14.ps, 14.pdf | ||
|
|
||
| Evaluation of TRANSTYPE, a Computer-aided Translation Typing System: A Comparison of a Theoretical- and a User-oriented Evaluation Procedures | ||
| Langlais Philippe
(Laboratoire RALI, Université de Montréal, Montreal,
Canada, felipe@iro.umontreal.ca) Sauvé Sébastien (RALI / DIRO, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal (Québec), Canada, H3C 3J7, http://www-rali.iro.umontreal.ca) Foster George (RALI / DIRO, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal (Québec), Canada, H3C 3J7, http://www-rali.iro.umontreal.ca) Macklovitch Elliott (Laboratoire RALI, Université de Montréal, Montreal, Canada, macklovi@iro.umontreal.ca) Lapalme Guy (RALI / DIRO, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal (Québec), Canada, H3C 3J7, http://www-rali.iro.umontreal.ca) |
||
| We describe and compare two protocols, one theoretical and the other in situ, for evaluating the TRANSTYPE system, a target-text mediated interactive machine translation prototype which predicts in real time the words of the ongoing translation. | ||
| Keywords: Evaluation Protocols, In-situ Evaluation, Interactive Machine Translation, Statistical Machine Translation | ||
| LREC2000 Proceedings: Session EO2 - Evaluation of Tools, pages 641-648 | ||
| Files: 34.ps, 34.pdf | ||
|
|
||
| The Cost258 Signal Generation Test Array | ||
| Bailly Gérard
(ICP - UMR CNRS no5009, INPG & U3, 46 av. Félix Viallet,
38031 Grenoble CEDEX, France) Banga Eduardo R. (ETSI Telecomunicacion, Campus Universitario, Universidad de Vigo, 36200 Vigo, Spain) Monaghan Alex (National Centre for Language Technology, Dublin City University, Dublin 9, Ireland) Rank Erhard (INTHF, Vienna University of Technology, Gusshausstrasse 25/E389, A-1040 Vienna, Austria) |
||
| This paper describes a benchmark for Analysis-Modification-Synthesis Systems (AMSS) that are back-ends of all concatenative speech synthesis systems. After introducing the motivations and principles underlying this initiative, we present here a first anonymous objective evaluation comparing the performance of 5 such AMSS. | ||
| Keywords: Analysis-Modification-Synthesis systems, Objective evaluation, Server, Speech | ||
| LREC2000 Proceedings: Session SO4 - Speech Synthesis Evaluation, pages 651-654 | ||
| Files: 1.ps, 1.pdf | ||
|
|
||
| Guidelines for Japanese Speech Synthesizer Evaluation | ||
| Itahashi Shuichi
(Institute of Information Sciences and Electronics, University of Tsukuba,
1-1-1 Tennodai, Tsukuba, 305-8573, Japan, itahashi@milab.is.tsukuba.ac.jp) |
||
| Speech synthesis technology is one of the most important elements required for better human interfaces for communication and information systems. This paper describes the "Guidelines for Speech Synthesis System Performance Evaluation Methods" created by the Speech Input/Output Systems Expert Committee of the Japan Electronic Industry Development Association (JEIDA). JEIDA has been investigating speech synthesizer evaluation methods since 1993 and previously reported the provisional version of the guidelines. The guidelines comprise six chapters: General rules, Text analysis evaluation, Syllable articulation test, Word intelligibility test, Sentence intelligibility test, and Overall quality evaluation. | ||
| Keywords: Evaluation, Japanese Text Analysis, Speech, Synthesizer | ||
| LREC2000 Proceedings: Session SO4 - Speech Synthesis Evaluation, pages 655-660 | ||
| Files: 77.ps, 77.pdf | ||
|
|
||
| Perception and Analysis of a Reiterant Speech Paradigm: a Functional Diagnostic of Synthetic Prosody | ||
| Rilliard Albert
(Institut de la Communication Parlée, 1180, Av. Centrale - 38040
Grenoble Cedex 9, rilliard@icp.inpg.fr) Aubergé Véronique (ICP – Univ. Stendhal, 1180 av. Centrale, BP25, F-38040 Grenoble cedex 9, auberge@icp.inpg.fr) |
||
| A set of perception experiments, using reiterant speech, were designed to carry out a diagnostic of the segmentation/hierarchisation linguistic function of prosody. The prosodic parameters of F0, syllabic duration and intensity of the stimuli used during this experiment were extracted. Several dissimilarity measures (correlation, root-mean-square distance and mutual information) were used to match the results of the subjective experiment. This comparison of the listeners’ perception with acoustic parameters is intended to underline the acoustic keys used by listeners to judge the adequacy of prosody to perform a given linguistic function. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SO4 - Speech Synthesis Evaluation, pages 661-664 | ||
| Files: 94.ps, 94.pdf | ||
|
|
||
| Looking for Errors: A Declarative Formalism for Resource-adaptive Language Checking | ||
| Bredenkamp Andrew
(Deutsches Forschungszentrum Künstliche Intelligenz (DFKI) GmbH,
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany, andrewb@dfki.de) Crysmann Berthold (Deutsches Forschungszentrum Künstliche Intelligenz (DFKI) GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany, crysmann@dfki.de) Petrea Mirela (Deutsches Forschungszentrum Künstliche Intelligenz (DFKI) GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany, mirela@dfki.de) |
||
| The paper describes a phenomenon-based approach to grammar checking, which draws on the integration of different shallow NLP technologies, including morphological and POS taggers, as well as probabilistic and rule-based partial parsers. We present a declarative specification formalism for grammar checking and controlled language applications which greatly facilitates the development of checking components. | ||
| Keywords: Controlled Languages, Grammar Checking, Shallow Processing | ||
| LREC2000 Proceedings: Session WO9 - Applications in the Written Area, pages 667-674 | ||
| Files: 299.ps, 299.pdf | ||
|
|
||
| An Architecture for Document Routing in Spanish: Two Language Components, Pre-processor and Parser | ||
| Rojo Guillermo
(Dept. of Spanish Language, University of Santiago de Compostela, Burgo
das Nacións, s/n., E-15771 Santiago de Compostela, Spain, fegrojo@usc.es) Álvarez Maria Concepción (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, femcal@usc.es) Alvariño Pilar (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, fepili@usc.es) Gil Adelaida (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, iagilma@usc.es) Santalla María Paula (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, fempsr@usc.es) Sotelo Susana (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, fesdocio@usc.es) |
||
| This paper describes the language components of a system for Document Routing in Spanish. The system identifies terms relevant for classification within the documents involved by means of natural language processing techniques. These techniques are based on the isolation and normalization of the syntactic units considered relevant for the classification, especially noun phrases, but also other constituents built around verbs, adverbs, pronouns or adjectives. After a general introduction to the research project, the second Section relates our approach to other previous and current approaches, and the third describes the corpora used for evaluating the system. The linguistic analysis architecture, including pre-processing and two different levels of syntactic analysis, is described in the fourth and fifth Sections, while the last is dedicated to a comparative analysis of the results obtained from processing the corpora introduced in the third Section. Certain future developments of the system are also discussed in that Section. | ||
| Keywords: Document Routing, Information Retrieval, Parsing, Syntactic Normalization | ||
| LREC2000 Proceedings: Session WO9 - Applications in the Written Area, pages 675-682 | ||
| Files: 91.ps, 91.pdf | ||
|
|
||
| Extraction of Unknown Words Using the Probability of Accepting the Kanji Character Sequence as One Word | ||
| Shinnou Hiroyuki
(Ibaraki University Dept. of Systems Engineering, 4-12-1 Nakanarusawa,
Hitachi, Ibaraki, 216-8511, Japan, shinnou@nlp.dse.ibaraki.ac.jp) Ikeya Masanori (Ibaraki University Dept. of Systems Engineering, 4-12-1 Nakanarusawa, Hitachi, Ibaraki, 216-8511, Japan, ikeya@nlp.dse.ibaraki.ac.jp) |
||
| In this paper, we propose a method to extract unknown words composed of two or three kanji characters from Japanese text. Generally, an unknown word composed of kanji characters is segmented into other words by morphological analysis. Moreover, the appearance probability of each segmented word is small. Based on these features, we define a measure of accepting a sequence of two or three kanji characters as an unknown word. We can also find some segmentation patterns of unknown words. By applying our measure to kanji character sequences that have these patterns, we can extract unknown words. In the experiment, the F-measure for extraction of unknown words composed of two and three kanji characters was about 0.7 and 0.4 respectively. Our method does not need the frequency of a word in the training corpus to judge whether it is an unknown word. Therefore, our method has the advantage that low-frequency unknown words can be extracted. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO9 - Applications in the Written Area, pages 683-688 | ||
| Files: 79.ps, 79.pdf | ||
|
|
||
| An Experiment of Lexical-Semantic Tagging of an Italian Corpus | ||
| Corazzari Ornella
(Consorzio Pisa Ricerche, P.za A. D’Ancona, 1 - 56100 Pisa, corazzar@ilc.pi.cnr.it) Calzolari Nicoletta (Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY, glottolo@ilc.pi.cnr.it) Zampolli Antonio (Istituto di Linguistica Computazionale, CNR, parole@ilc.pi.cnr.it) |
||
| Semantically tagged corpora are becoming an important and urgent need for training and evaluation in a large number of applications. They are also the natural application and accompaniment of semantic lexicons, for which they constitute both a useful testbed to evaluate their adequacy and a repository of corpus examples for the attested senses. It is therefore essential that sound criteria are defined for their construction and that a specific methodology is set up for the treatment of the various semantic phenomena relevant to this level of description. In this paper we present some observations and results concerning an experiment in manual lexical-semantic tagging of a small Italian corpus performed within the framework of the ELSNET project. The ELSNET experimental project should be considered a feasibility study. It is part of a preparatory and training phase that started with the Romanseval/Senseval experiment (Calzolari et al., 1998) and culminates in the lexical-semantic annotation of larger quantities of text, such as the syntactic-semantic Treebank which is going to be annotated within an Italian National Project (SI-TAL). Indeed, the results of the ELSNET experiment have been of utmost importance for the definition of the technical guidelines for the lexical-semantic level of description of the Treebank. | ||
| Keywords: Semantic Tagging, Sense Tagging, Word-Sense Disambiguation | ||
| LREC2000 Proceedings: Session WO10 - Semantic Annotation of Corpora, pages 691-698 | ||
| Files: 60.ps, 60.pdf | ||
|
|
||
| Semantic Tagging for the Penn Treebank | ||
| Palmer Martha
(Department of Computer and Information Science, University of
Pennsylvania, Philadelphia, PA 19104, USA, mpalmer@linc.cis.upenn.edu) Trang Dang Hoa (University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA, USA, htd@linc.cis.upenn.edu) Rosenzweig Joseph (University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA, USA, josephr@linc.cis.upenn.edu) |
||
| This paper describes the methodology that is being used to augment the Penn Treebank annotation with sense tags and other types of semantic information. Inspired by the results of SENSEVAL, and the high inter-annotator agreement that was achieved there, similar methods were used for a pilot study of 5000 words of running text from the Penn Treebank. Using the same techniques of allowing the annotators to discuss difficult tagging cases and to revise WordNet entries if necessary, comparable inter-annotator agreement rates have been achieved. The criteria for determining appropriate revisions and ensuring clear sense distinctions are described. We are also using hand correction of automatic predicate argument structure information to provide additional thematic role labeling. | ||
| Keywords: Inter-Annotator Agreement, Predicate-Argument Structure, Sense Distinctions, Training Data | ||
| LREC2000 Proceedings: Session WO10 - Semantic Annotation of Corpora, pages 699-704 | ||
| Files: 197.ps, 197.pdf | ||
|
|
||
| A Step toward Semantic Indexing of an Encyclopedic Corpus | ||
| Alcouffe Philippe
(Hachette Multimédia, 11 rue de Cambrai, 75019 Paris, France,
philippe.alcouffe@hachette-multimedia.fr) Gacon Nicolas (Hachette Multimédia, 11 rue de Cambrai, 75019 Paris, France, nicolas.gacon@hachette-multimedia.fr) Roux Claude (Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France, roux@xrce.xerox.com) Segond Frédérique (Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France, segond@xrce.xerox.com) |
||
| This paper investigates a method for extracting and acquiring knowledge from linguistic resources. In particular, we propose an NLP-based architecture for building a semantic network out of an online XML encyclopedic corpus. The general application underlying this work is a question-answering system on proper nouns within an encyclopedia. | ||
| Keywords: Encyclopedia, Extraction, Knowledge, Question-Answering, Robust Parsing, Semantic Network, XML | ||
| LREC2000 Proceedings: Session WO10 - Semantic Annotation of Corpora, pages 705-710 | ||
| Files: 161.ps, 161.pdf | ||
|
|
||
| Obtaining Predictive Results with an Objective Evaluation of Spoken Dialogue Systems: Experiments with the DCR Assessment Paradigm | ||
| Antoine Jean-Yves
(EQUIPAGE team, VALORIA, Université de Bretagne Sud, IUP Vannes,
r. Y. Mainguy, F-56000 Vannes, France., Email : Jean-Yves.Antoine@univ-ubs.fr) Siroux Jacques (CORDIAL team, IRISA-LLI, ENSSAT, 6 r. de Kerampont, F-22305 Lannion, France., Email : siroux@enssat.fr) Caelen Jean (CLIPS-IMAG, BP 53, F-38041 Grenoble Cedex 9, France., Email : Jean.Caelen@univ-ubs.fr @imag.fr) Villaneau Jeanne (EQUIPAGE team, VALORIA, Université de Bretagne Sud, IUP Vannes, r. Y. Mainguy, F-56000 Vannes, France., Email : berthele@univ-ubs.fr) Goulian Jérôme (EQUIPAGE team, VALORIA, Université de Bretagne Sud, IUP Vannes, r. Y. Mainguy, F-56000 Vannes, France., Email : Jerome.Goulian@univ-ubs.fr) Ahafhaf Mohamed (CLIPS-IMAG, BP 53, F-38041 Grenoble Cedex 9, France., Email : Mohamed.Ahafhaf@univ-ubs.fr @imag.fr) |
||
| The DCR methodology is a framework that proposes a generic and detailed evaluation of spoken dialogue systems. We have already detailed (Antoine et al., 1998) the theoretical bases of this paradigm. In this paper, we present some experimental results on spoken language understanding that show the feasibility and the reliability of the DCR evaluation as well as its ability to provide a detailed diagnosis of the system’s behaviour. Finally, we highlight the extension of the DCR methodology to dialogue management. | ||
| Keywords: DCR Evaluation, Dialogue, Genericity, Multi-Criteria Diagnosis, Spoken Language Understanding | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 713-720 | ||
| Files: 36.ps, 36.pdf | ||
|
|
||
| Lessons Learned from a Task-based Evaluation of Speech-to-Speech Machine Translation | ||
| Levin Lori
(Language Technologies Institute, Carnegie Mellon University,
Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Bartlog Boris (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Font Llitjos Ariadna (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Gates Donna (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Lavie Alon (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, alavie@cs.cmu.edu, www.is.cs.cmu.edu) Wallace Dorcas (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Watanabe Taro (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Woszczyna Monika (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) |
||
| For several years we have been conducting Accuracy Based Evaluations (ABE) of the JANUS speech-to-speech MT system (Gates et al., 1997), which measure the quality and fidelity of translation. Recently we have begun to design a Task Based Evaluation (TBE) for JANUS (Thomas, 1999), which measures goal completion. This paper describes what we have learned by comparing the two types of evaluation. Both evaluations (ABE and TBE) were conducted on a common set of user studies in the semantic domain of travel planning. | ||
| Keywords: Speech-to-Speech Machine Translation, Task-Based Evaluation | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 721-724 | ||
| Files: 215.ps, 215.pdf | ||
|
|
||
| Galaxy-II as an Architecture for Spoken Dialogue Evaluation | ||
| Polifroni Joseph
(Spoken Language Systems Group, MIT Laboratory for Computer Science, 545
Technology Square, Cambridge, MA 02139 USA, joe@lcs.mit.edu) Seneff Stephanie (Spoken Language Systems Group, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139 USA, seneff@lcs.mit.edu) |
||
| The GALAXY-II architecture, comprised of a centralized hub mediating the interaction among a suite of human language technology servers, provides both a useful tool for implementing systems and also a streamlined way of configuring the evaluation of these systems. In this paper, we discuss our ongoing efforts in evaluation of spoken dialogue systems, with particular attention to the way in which the architecture facilitates the development of a variety of evaluation configurations. We furthermore propose two new metrics for automatic evaluation of the discourse and dialogue components of a spoken dialogue system, which we call “user frustration” and “information bit rate.” | ||
| Keywords: Evaluation, Spoken Dialogue Architectures, Spoken Dialogue Systems | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 725-730 | ||
| Files: 116.ps, 116.pdf | ||
|
|
||
| Issues in the Evaluation of Spoken Dialogue Systems - Experience from the ACCeSS Project | ||
| Brey Thomas
(University of Regensburg, Universitätsstr. 31, D-93040 Regensburg,
Germany, Thomas.Brey@sprachlit.uni-regensburg.de) Hanrieder Gerhard (Temic Speech Processing, Soeflinger Strasse 100, D-89077 Ulm, Germany, Gerhard.Hanrieder@temic.com) Heisterkamp Paul (DaimlerChrysler AG, Wilhelm-Runge Strasse, D-89081 Ulm, Germany, paul.heisterkamp@daimlerchrysler.com) Hitzenberger Ludwig (University of Regensburg, Universitätsstr. 31, D-93040 Regensburg, Germany, Ludwig.Hitzenberger@sprachlit.uni-regensburg.de) Regel-Brietzmann Peter (DaimlerChrysler AG, Wilhelm-Runge Strasse, D-89081 Ulm, Germany, peter.regel-brietzmann@daimlerchrysler.com) |
||
| We describe the framework and present detailed results of an evaluation of 1,500 dialogues recorded during a three-month field trial of the ACCeSS Dialogue System. The system routed incoming calls to agents of a call-center and handled about 100 calls per day. | ||
| Keywords: Call-Center Applications, Concept Accuracy, Evaluation, Metrics, Spoken Language Dialogue Systems, Success Rate | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 731-734 | ||
| Files: 162.ps, 162.pdf | ||
|
|
||
| Evaluation for Darpa Communicator Spoken Dialogue Systems | ||
| Walker Marilyn
(AT&T Labs - Research, 180 Park Ave, Florham Park, N.J. 07932,
U.S.A., walker@research.att.com) Hirschman Lynette (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730 USA, lynette@mitre.org) Aberdeen John (The MITRE Corporation, 202 Burlington Rd., Bedford, MA 01730, U.S.A., aberdeen@mitre.org) |
||
| The overall objective of the DARPA COMMUNICATOR project is to support rapid, cost-effective development of multi-modal speech-enabled dialogue systems with advanced conversational capabilities, such as plan optimization, explanation and negotiation. In order to make this a reality, we need to find methods for evaluating the contribution of various techniques to the users’ willingness and ability to use the system. This paper reports on the approach to spoken dialogue system evaluation that we are applying in the COMMUNICATOR program. We describe our overall approach, the experimental design, the logfile standard, and the metrics applied in the experimental evaluation planned for June of 2000. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 735-742 | ||
| Files: 191.ps, 191.pdf | ||
|
|
||
| Evaluation of a Dialogue System Based on a Generic Model that Combines Robust Speech Understanding and Mixed-initiative Control | ||
| Diaz Verdejo J.E.
(Dpto. Electrónica y Tecnología de Computadores,
Universidad de Granada, 18071 Granada, España (Spain), Tel.:
+34-958-243193, FAX: +34-958-243230, E-mail: jedv@hal.ugr.es) López-Cózar R. (Dpto. Electrónica y Tecnología de Computadores, Universidad de Granada, 18071 Granada, España (Spain), Tel.: +34-958-243193, FAX: +34-958-243230, E-mail: ramon@hal.ugr.es) Rubio A.J. (Dpto. Electrónica y Tecnología de Computadores, Universidad de Granada, 18071 Granada, España (Spain), Tel.: +34-958-243193, FAX: +34-958-243230, E-mail: rubio@hal.ugr.es) De la Torre A. (Dpto. Electrónica y Tecnología de Computadores, Universidad de Granada, 18071 Granada, España (Spain), Tel.: +34-958-243193, FAX: +34-958-243230, E-mail: atv@hal.ugr.es) |
||
| This paper presents a generic model to combine robust speech understanding and mixed-initiative dialogue control in spoken dialogue systems. It relies on the use of semantic frames to conceptually store user interactions, a frame-unification procedure to deal with partial information, and a stack structure to handle initiative control. This model has been successfully applied in a dialogue system being developed at our lab, named SAPLEN, which aims to deal with the telephone-based product orders and queries of fast food restaurants’ clients. In this paper we present the dialogue system and describe the new model, together with the results of a preliminary evaluation of the system concerning recognition time, word accuracy, implicit recovery and speech understanding. Finally, we present the conclusions and indicate possibilities for future work. | ||
| Keywords: Evaluation, Speech Generation, Speech Recognition, Speech Understanding, Spoken Dialogue Systems | ||
| LREC2000 Proceedings: Session SO5 - Evaluation of Dialogue, pages 743-748 | ||
| Files: 101.ps, 101.pdf | ||
|
|
||
| Automatic Extraction of English-Chinese Term Lexicons from Noisy Bilingual Corpora | ||
| Le Sun (Open
Systems & Chinese Information Processing Center, Institute of
Software, Chinese Academy of Sciences, Beijing 100080, P. R. China.,
lesun@sonata.iscas.ac.cn) Youbing Jin (Open Systems & Chinese Information Processing Center, Institute of Software, Chinese Academy of Sciences, Beijing 100080, P. R. China., ybjin@sonata.iscas.ac.cn) Lin Du (Open Systems & Chinese Information Processing Center, Institute of Software, Chinese Academy of Sciences, Beijing 100080, P. R. China., ldu@sonata.iscas.ac.cn) Yufang Sun (Open Systems & Chinese Information Processing Center, Institute of Software, Chinese Academy of Sciences, Beijing 100080, P. R. China., yfsun@sonata.iscas.ac.cn) |
||
| This paper describes our system, which is designed to extract English-Chinese term lexicons from noisy, complex bilingual corpora and to use them as a translation lexicon for checking sentence alignment results. The noisy bilingual corpora are first aligned by our improved length-based statistical approach, which can partly detect sentence omissions and insertions. A term extraction system is then used to obtain term translation lexicons from the roughly aligned corpora. Next, the statistical approach is used to align the corpora again. Finally, we filter the noisy bilingual texts and obtain nearly perfectly aligned corpora. | ||
| Keywords: Bilingual Corpora Processing, Sentence Alignment, Term Extraction | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 751-756 | ||
| Files: 208.ps, 208.pdf | ||
|
|
||
| Chinese-English Semantic Resource Construction | ||
| Dorr Bonnie J.
(Institute for Advanced Computer Studies, University of Maryland,
bonnie@umiacs.umd.edu) Levow Gina-Anne (Institute for Advanced Computer Studies, University of Maryland, gina@umiacs.umd.edu) Lin Dekang (Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2H1, lindek@cs.ualberta.ca) Thomas Scott (Institute for Advanced Computer Studies, University of Maryland, scthmas@umiacs.umd.edu) |
||
| We describe an approach to large-scale construction of a semantic lexicon for Chinese verbs. We leverage three existing resources: a classification of English verbs called EVCA (English Verbs Classes and Alternations) (Levin, 1993), a Chinese conceptual database called HowNet (Zhendong, 1988c; Zhendong, 1988b; Zhendong, 1988a) (http://www.how-net.com), and a large machine-readable dictionary called Optilex. The resulting lexicon is used for determining appropriate word senses in applications such as machine translation and cross-language information retrieval. | ||
| Keywords: Chinese-English Lexicons, Lexical Acquisition, Machine Translation, Semantic Resource Construction, Thematic Roles | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 757-760 | ||
| Files: 327.ps, 327.pdf | ||
|
|
||
| Towards A Universal Tool For NLP Resource Acquisition | ||
| Sheremetyeva
Svetlana (Computing Research Laboratory, New Mexico State
University, Las Cruces, NM 88003 USA, lana@crl.nmsu.edu) Nirenburg Sergei (Computing Research Laboratory, New Mexico State University, Las Cruces, NM 88003 USA, sergei@crl.nmsu.edu) |
||
| This paper describes an approach to developing a universal tool for eliciting, from a non-expert human user, knowledge about any language L. The purpose of this elicitation is the rapid development of NLP systems. The approach is illustrated with the example of the syntax module of the Boas knowledge elicitation system for a quick ramp-up of a standard transfer-based machine translation system from L into English. The preparation of knowledge for the MT system is carried out in two stages: the acquisition of descriptive knowledge about L, and the use of that descriptive knowledge to derive operational knowledge for the system. Boas guides the acquisition process using data-driven, expectation-driven and goal-driven methodologies. | ||
| Keywords: Knowledge Elicitation, Language Resource, Syntax | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 761-768 | ||
| Files: 28.ps, 28.pdf | ||
|
|
||
| Acquisition of Linguistic Patterns for Knowledge-based Information Extraction | ||
| Harabagiu Sanda M.
(Department of Computer Science and Engineering, Southern Methodist
University, Dallas, TX 75275-0122, U.S.A., sanda@renoir.seas.smu.edu) Maiorano Steven J. (Department of Computer Science and Engineering, Southern Methodist University, Dallas, TX 75275-0122, U.S.A., steve@renoir.seas.smu.edu) |
||
| In this paper we present a new method of automatic acquisition of linguistic patterns for Information Extraction, as implemented in the CICERO system. Our approach combines lexico-semantic information available from the WordNet database with collocating data extracted from training corpora. Due to the open-domain nature of the WordNet information and the immediate availability of large collections of texts, our method can be easily ported to open-domain Information Extraction. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 769-776 | ||
| Files: 347.ps, 347.pdf | ||
|
|
||
| Using Lexical Semantic Knowledge from Machine Readable Dictionaries for Domain Independent Language Modelling | ||
| Demetriou George
(Department of Computer Science, University of Sheffield, 211 Portobello
Street, Sheffield S1 4DP, United Kingdom, G.Demetriou@dcs.shef.ac.uk) Atwell Eric (School of Computer Studies, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, United Kingdom, eric@scs.leeds.ac.uk) Souter Clive (School of Computer Studies, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, United Kingdom, cs@scs.leeds.ac.uk) |
||
| Machine Readable Dictionaries (MRDs) have been used in a variety of language processing tasks including word sense disambiguation, text segmentation, information retrieval and information extraction. In this paper we describe the utilization of semantic knowledge acquired from an MRD for language modelling tasks in relation to speech recognition applications. A semantic model of language has been derived using the dictionary definitions in order to compute the semantic association between the words. The model is capable of capturing phenomena of latent semantic dependencies between the words in texts and reducing the language ambiguity by a considerable factor. The results of experiments suggest that the semantic model can improve the word recognition rates in “noisy-channel” applications. This research provides evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling. | ||
| Keywords: Language Modelling, Lexical Semantics, Machine Readable Dictionaries, Speech Recognition | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 777-782 | ||
| Files: 357.ps, 357.pdf | ||
|
|
||
| ItalWordNet: a Large Semantic Database for Italian | ||
| Roventini Adriana
(Istituto di Linguistica Computazionale, CNR, Area della Ricerca di
Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY,
adriana@ilc.pi.cnr.it) Alonge Antonietta (Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY, antoalonge@libero.it) Calzolari Nicoletta (Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY, glottolo@ilc.pi.cnr.it) Magnini Bernardo (Istituto per la Ricerca Scientifica e Tecnologica, I-38050, Povo, Trento, magnini@irst.itc.it) Bertagna Francesca (Consorzio Pisa Ricerche, Via S. Maria 40, Pisa 56100 - ITALY, F.Bertagna@ilc.pi.cnr.it) |
||
| The focus of this paper is on the work we are carrying out to develop a large semantic database within an Italian national project, SI-TAL, which aims to produce a set of integrated (compatible) resources and tools for the automatic processing of the Italian language. Within SI-TAL, ItalWordNet is the reference lexical resource, which will contain information on about 130,000 word senses grouped into synsets. This lexical database is not being created ex novo, but by extending and revising the Italian lexical wordnet built in the framework of the EuroWordNet project. In this paper we first describe how the lexical coverage of our wordnet is being extended by adding adjectives, adverbs and proper nouns, plus a terminological subset belonging to the economic and financial domain. The relevant changes that these extensions entail, both in the linguistic model and in the data structure, are then illustrated. In particular we discuss i) the new semantic relations identified to encode information on adjectives and adverbs and ii) the new architecture including the terminological subset. | ||
| Keywords: Lexical Resources, Lexical Semantic Networks | ||
| LREC2000 Proceedings: Session WO11 - Mono-Multilingual Lexicon Acquisition and Building, pages 783-790 | ||
| Files: 129.ps, 129.pdf | ||
|
|
||
| An Open Architecture for the Construction and Administration of Corpora | ||
| Orăsan
Constantin (School of Humanities, Languages and Social Sciences,
Stafford Street, University of Wolverhampton, Wolverhampton, WV1 1SB,
United Kingdom, in6093@wlv.ac.uk) Krishnamurthy Ramesh (Computational Linguistics Group, School of Humanities, Languages and Social Sciences, R.Krishnamurthy@wlv.ac.uk, University of Wolverhampton, Stafford Street, Wolverhampton, WV1 1SB, United Kingdom) |
||
| The use of language corpora for a variety of purposes has increased significantly in recent years. General corpora are now available for many languages, but research often requires more specialized corpora. The rapid development of the World Wide Web has greatly improved access to data in electronic form, but research has tended to focus on corpus annotation, rather than on corpus building tools. Therefore many researchers are building their own corpora, solving problems independently, and producing project-specific systems which cannot easily be re-used. This paper proposes an open client-server architecture which can service the basic operations needed in the construction and administration of corpora, but allows customisation by users in order to carry out project-specific tasks. The paper is based partly on recent practical experience of building a corpus of 10 million words of Written Business English from webpages, in a project which was co-funded by ELRA and the University of Wolverhampton. | ||
| Keywords: Client-Server, Copyright, Corpora, Corpus Administration, Corpus Building, Modular Programming | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 793-800 | ||
| Files: 176.ps, 176.pdf | ||
|
|
||
| Corpus Resources and Minority Language Engineering | ||
| McEnery Tony
(Department of Linguistics, Lancaster University, Bailrigg, Lancaster,
LA1 4YT, UK, mcenery@comp.lancs.ac.uk) Baker Paul (Department of Linguistics, Lancaster University, Bailrigg, Lancaster, LA1 4YT, UK, mcenery@comp.lancs.ac.uk) Burnard Lou (Oxford University Computing Services, 13 Banbury Road, Oxford, OX2 6NN, UK) |
||
| Low-density languages are typically viewed as those for which few language resources are available. Work relating to low-density languages is becoming a focus of increasing attention within language engineering (e.g. Charoenporn, 1997, Hall and Hudson, 1997, Somers, 1997, Nirenburg and Raskin, 1998, Somers, 1998). However, much work related to low-density languages is still in its infancy, or worse, work is blocked because the resources needed by language engineers are not available. In response to this situation, the MILLE (Minority Language Engineering) project was established by the Engineering and Physical Sciences Research Council (EPSRC) in the UK to discover what language corpora should be built to enable language engineering work on non-indigenous minority languages in the UK, most of which are typically low-density languages. This paper summarises some of the major findings of the MILLE project. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 801-806 | ||
| Files: 187.ps, 187.pdf | ||
|
|
||
| Towards a Query Language for Annotation Graphs | ||
| Bird Steven (LDC,
3615 Market Street, Suite 200, Philadelphia, PA, 19104-2608, USA, sb@unagi.cis.upenn.edu) Buneman Peter (Department of Computer Science, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104, USA) Tan Wang-Chiew (Department of Computer Science, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104, USA) |
||
| The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 807-814 | ||
| Files: 194.ps, 194.pdf | ||
|
|
||
| Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis | ||
| Cunningham Hamish
(Department of Computer Science and Institute for Language, Speech and
Hearing, University of Sheffield, UK, hamish@dcs.shef.ac.uk) Bontcheva Kalina (Department of Computer Science and Institute for Language, Speech and Hearing, University of Sheffield, UK, kalina@dcs.shef.ac.uk) Tablan Valentin (Department of Computer Science and Institute for Language, Speech and Hearing, University of Sheffield, UK, valyt@dcs.shef.ac.uk) Wilks Yorick (Department of Computer Science and Institute for Language, Speech and Hearing, University of Sheffield, UK, yorick@dcs.shef.ac.uk) |
||
| This paper presents a taxonomy of previous work on infrastructures, architectures and development environments for representing and processing Language Resources (LRs), corpora, and annotations. This classification is then used to derive a set of requirements for a Software Architecture for Language Engineering (SALE). The analysis shows that a SALE should address common problems and support typical activities in the development, deployment, and maintenance of LE software. The results will be used in the next phase of construction of an infrastructure for LR production, distribution, and access. | ||
| Keywords: Architecture, Development Environment, Framework, Language Engineering, Language Resources Infrastructure | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 815-824 | ||
| Files: 170.ps, 170.pdf | ||
|
|
||
| XCES: An XML-based Encoding Standard for Linguistic Corpora | ||
| Ide Nancy
(Department of Computer Science, Vassar College, Poughkeepsie, NY
12604-0520 USA, ide@cs.vassar.edu) Bonhomme Patrice (LORIA, BP 239, F-54506 Vandoeuvre-lès-Nancy, bonhomme@loria.fr) Romary Laurent (LORIA (CNRS, INRIA), Campus Scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy FRANCE, romary@loria.fr) |
||
| The Corpus Encoding Standard (CES) is a part of the EAGLES Guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES) that provides a set of encoding standards for corpus-based work in natural language processing applications. We have instantiated the CES as an XML application called XCES, based on the same data architecture, which comprises a primary encoded text and "standoff" annotation in separate documents. Conversion to XML enables use of some of the more powerful mechanisms provided in the XML framework, including the XSLT Transformation Language, XML Schemas, and support for inter-resource reference together with an extensive path syntax for pointers. In this paper, we describe the differences between the CES and XCES DTDs and demonstrate how XML mechanisms can be used to select from and manipulate annotated corpora encoded according to XCES specifications. We also provide a general overview of XML and the XML mechanisms that are most relevant to language engineering research and applications. | ||
| Keywords: Corpus Encoding, Data Architectures, XML | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 825-830 | ||
| Files: 172.ps, 172.pdf | ||
|
|
||
| The American National Corpus: A Standardized Resource for American English | ||
| Macleod Catherine
(Computer Science Department, New York University, New York, New York
10003-6806, macleod@cs.nyu.edu) Ide Nancy (Department of Computer Science, Vassar College, Poughkeepsie, NY 12604-0520 USA, ide@cs.vassar.edu) Grishman Ralph (Department of Computer Science, New York University, U.S.A, grishman@cs.nyu.edu) |
||
| At the first conference on Language Resources and Evaluation, Granada 1998, Charles Fillmore, Nancy Ide, Daniel Jurafsky, and Catherine Macleod proposed creating an American National Corpus (ANC) that would compare with the British National Corpus (BNC) both in balance and in size (one hundred million words). This paper reports on the progress made over the past two years in launching the project. At present, the ANC project is well underway, with commitments for support and contribution of texts from a number of publishers world-wide. | ||
| Keywords: Corpus, Corpus Architecture, Standards | ||
| LREC2000 Proceedings: Session WO12 - Language Resources: Infrastructural Issues, pages 831-834 | ||
| Files: 196.ps, 196.pdf | ||
|
|
||
| Accessibility of Multilingual Terminological Resources - Current Problems and Prospects for the Future | ||
| Budin Gerhard
(University of Vienna, Department of Translation and Interpretation,
1190 Vienna, Austria, gerhard.budin@univie.ac.at) Melby Alan K. (Brigham Young University, Department of Linguistics, Provo, Utah, USA, akm@byu.edu) |
||
| In this paper we analyse the various problems in making multilingual terminological resources available to users. Different levels of diversity and incongruence among such resources are discussed. Previous standardization efforts are reviewed. As a solution to the lack of co-ordination and compatibility among an increasing number of ‘standard’ interchange formats, a higher level of integration is proposed for the purpose of terminology-enabled knowledge sharing. The family of formats currently being developed in the SALT project is presented as a contribution to this solution. | ||
| Keywords: Accessibility, Data interchange, Multilingual Terminological Resources, SALT | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 837-844 | ||
| Files: 283.ps, 283.pdf | ||
|
|
||
| Terminology in Korea: KORTERM | ||
| Choi Key-Sun
(Korea Terminology Research Center for Language and Knowledge
Engineering, Department of Computer Science, Korea Advanced Institute of
Science and Technology, 373-1 Kusong-dong Yusong-gu Taejon 305-701
Korea, kschoi@korterm.kaist.ac.kr) Chae Young-Soog (Korea Terminology Research Center for Language and Knowledge Engineering, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong Yusong-gu Taejon 305-701 Korea, yschae@korterm.kaist.ac.kr) |
||
| KORTERM (Korea Terminology Research Center for Language and Knowledge Engineering) was set up in late August 1998 under the auspices of the Ministry of Culture and Tourism in Korea. Its major mission is to construct terminology resources and to unify, harmonize and standardize them. This mission is naturally linked to general language engineering and knowledge engineering tasks, including the construction of domain-specific corpora, ontologies, wordnets and electronic dictionaries, as well as language engineering products such as information extraction and machine translation. The organization is located at KAIST (Korea Advanced Institute of Science and Technology), a national university under the Ministry of Science and Technology. KORTERM is the sole representative for terminology standardization and research in relation to Infoterm. | ||
| Keywords: Knowledge, Language Engineering, Research Center, Terminology, Terminology Resource | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 845-848 | ||
| Files: 276.ps, 276.pdf | ||
|
|
||
| ARC A3: A Method for Evaluating Term Extracting Tools and/or Semantic Relations between Terms from Corpora | ||
| Jouis Christophe
(CAVI - Censier, University Paris Sorbonne Nouvelle - Paris
III, 13, rue Santeuil, F-75251 Paris CEDEX, FRANCE, Phone: +33 (0)
1 45 87 42 74, Fax: +33 (0) 1 45 87 41 73, E-mail:
Christophe.Jouis@univ-paris3.fr) ARC A3 (ARC A3 Consortium, e-mail: aupelf-a3@univ-lille3.fr) |
||
| This paper describes an ongoing project evaluating Natural Language Processing (NLP) systems. The aim of this project is to test software capabilities in automatic or semi-automatic extraction of terminology from French corpora in order to build tools used in NLP applications. We are putting forward a strategy based on qualitative evaluation. The idea is to submit the results to specialists (i.e. field specialists, terminologists and/or knowledge engineers). The research we are conducting is sponsored by the "Association des Universités Francophones" (AUF), an international organisation whose mission is to promote the dissemination of French as a scientific medium. The software systems submitted to this evaluation were developed by French, Canadian and US research institutions (National Scientific Research Centre and universities) and/or companies: CNRS (France), XEROX, and LOGOS Corporation among others. | ||
| Keywords: Corpus, Evaluation, Information Retrieval, Method, Terminology Extraction | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 849-854 | ||
| Files: 247.ps, 247.pdf | ||
|
|
||
| Use of Greek and Latin Forms for Term Detection | ||
| Estopà Rosa
(Institute for Applied Linguistics, Universitat Pompeu Fabra, Rambla
Santa Mònica, 30, 08002 Barcelona, Spain, rosa.estopa@trad.upf.es) Vivaldi Jordi (Institute for Applied Linguistics, Universitat Pompeu Fabra, Rambla Santa Mònica, 30, 08002 Barcelona, Spain, jorge.vivaldi@info.upf.es) Cabré M. Teresa (Institute for Applied Linguistics, Universitat Pompeu Fabra, Rambla Santa Mònica, 30, 08002 Barcelona, Spain, teresa.cabre@trad.upf.es) |
||
| It is well known that many languages make use of neo-classical compounds, and that some domains with a very long tradition, such as medicine, make intensive use of such morphemes. This phenomenon has been widely studied for different languages, with the common result that a relatively small number of morphemes allows a large number of specialised terms to be produced. We believe that the use of such morphological knowledge may help a term detector to discover very specialised terms. In this paper we propose a module, to be included in a term extractor, devoted specifically to detecting terms that include neo-classical compounds. We describe this module as well as the results obtained from it. | ||
| Keywords: Neo-Classical Compounds, Term Detection, Terminology | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 855-860 | ||
| Files: 55.ps, 55.pdf | ||
|
|
||
| Automatically Augmenting Terminological Lexicons from Untagged Text | ||
| Demetriou George
(Department of Computer Science, University of Sheffield, 211 Portobello
Street, Sheffield S1 4DP, United Kingdom, G.Demetriou@dcs.shef.ac.uk) Gaizauskas Robert (Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK, R.Gaizauskas@dcs.shef.ac.uk) |
||
| Lexical resources play a crucial role in language technology but lexical acquisition can often be a time-consuming, laborious and costly exercise. In this paper, we describe a method for the automatic acquisition of technical terminology from domain restricted texts without the need for sophisticated natural language processing tools, such as taggers or parsers, or text corpora annotated with labelled cases. The method is based on the idea of using prior or seed knowledge in order to discover co-occurrence patterns for the terms in the texts. A bootstrapping algorithm has been developed that identifies patterns and new terms in an iterative manner. Experiments with scientific journal abstracts in the biology domain indicate an accuracy rate for the extracted terms ranging from 58% to 71%. The new terms have been found useful for improving the coverage of a system used for terminology identification tasks in the biology domain. | ||
| Keywords: Bootstrapping Methods, Information Extraction, Terminology Lexicons | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 861-868 | ||
| Files: 320.ps, 320.pdf | ||
|
|
||
| Creating and Using Domain-specific Ontologies for Terminological Applications | ||
| Maynard Diana
(Dept. of Computer Science, Sheffield University, Regent Court, 211
Portobello Rd, Sheffield S1 4DP, U.K., D.Maynard@dcs.shef.ac.uk) Ananiadou Sophia (Computer Science, School of Sciences, University of Salford, Newton Building, Salford, M5 4WT, U.K., S.Ananiadou@salford.ac.uk) |
||
| Huge volumes of scientific databases and text collections are constantly becoming available, but their usefulness is at present hampered by their lack of uniformity and structure. There is therefore an overwhelming need for tools to facilitate the processing and discovery of technical terminology, in order to make processing of these resources more efficient. Both NLP and statistical techniques can provide such tools, but they would benefit greatly from the availability of suitable lexical resources. While information resources do exist in some areas of terminology, these are not designed for linguistic use. In this paper, we investigate how one such resource, the UMLS, is used for terminological acquisition in the TRUCKS system, and how other domain-specific resources might be adapted or created for terminological applications. | ||
| Keywords: Lexical Resources, NLP, Ontology, Terminology | ||
| LREC2000 Proceedings: Session TO1 - Terminology, pages 869-874 | ||
| Files: 22.ps, 22.pdf | ||
|
|
||
| SALA: SpeechDat across Latin America. Results of the First Phase | ||
| Moreno Asunción
(Universitat Politècnica de Catalunya, Jordi Girona 1-3 08034
Barcelona, SPAIN, http://gps-tsc.upc.es/veu, asuncion@tsc.upc.es) Comeyne Robrecht (Lernout & Hauspie, Ieper, Belgium) Haslam Keith (Vocalis, Cambridge, UK) van den Heuvel Henk (SPEX, Nijmegen, Netherlands, e-mail: H.v.d.Heuvel@let.kun.nl) Höge Harald (Siemens AG, München, Germany) Horbach Sabine (Philips, Aachen, Germany, CSELT, Torino, Italy) Micca Giorgio (CSELT, Via G. Reiss Romoli 274, 10148 Torino, Italia, giorgio.micca@cselt.it) |
||
| The objective of the SALA (SpeechDat across Latin America) project is to record large SpeechDat-like databases to train telephone speech recognisers for any country in Latin America. The SALA consortium is composed of several European companies (CSELT, Italy; Lernout & Hauspie, Belgium; Philips, Germany; Siemens AG, Germany; Vocalis, U.K.) and universities (UPC, Spain; SPEX, The Netherlands). This paper gives an overview of the project, introduces the definition of the databases, shows the dialectal distribution in the countries where recordings take place and gives information about validation issues, current status and practical experiences in recruiting and annotating such large databases in Latin America. | ||
| Keywords: Latin America, Oral Databases, Spanish and Portuguese, Speech Recognition, Telephone Speech | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 877-882 | ||
| Files: 10.ps, 10.pdf | ||
|
|
||
| SPEECON - Speech Data for Consumer Devices | ||
| Siemund Rainer
(Philips Speech Processing) Höge Harald (Siemens AG, München, Germany) Kunzmann Siegfried (IBM) Marasek Krzysztof (SONY) |
||
| SPEECON, launched in February 2000, is a project focusing on collecting linguistic data for speech recogniser training. Put into action by an industrial consortium, it promotes the development of voice controlled consumer applications such as television sets, video recorders, audio equipment, toys, information kiosks, mobile phones, palmtop computers and car navigation kits. During the lifetime of the project, scheduled to last two years, partners will collect speech data for 18 languages or dialectal zones, including most of the languages spoken in the EU. Attention will also be devoted to research into the environment of the recordings, which are, like the typical surroundings of CE applications, at home, in the office, in public places or in moving vehicles. The following pages will give a brief overview of the workplan for the months to come. | ||
| Keywords: Acoustic Adaptation, Consumer Electronics, Data Collection, Information Society, Speech Recognition | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 883-886 | ||
| Files: 63.ps, 63.pdf | ||
|
|
||
| The Spoken Dutch Corpus. Overview and First Evaluation | ||
| Oostdijk Nelleke
(Dept. of Language and Speech, University of Nijmegen, P.O. Box 9103,
6500 HD Nijmegen, The Netherlands, n.oostdijk@let.kun.nl) |
||
| In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, its aims, structure and organization. It then goes on to discuss the considerations - both methodological and practical - that have played a role in the design of the corpus as well as in its compilation and annotation. The paper concludes with an account of the data that are available in the first release of the first part of the corpus that came out on March 1st, 2000. | ||
| Keywords: Annotation, Corpus Design, Dutch (spoken), Evaluation, Spoken Language Corpora | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 887-894 | ||
| Files: 110.ps, 110.pdf | ||
|
|
||
| SPEECHDAT-CAR. A Large Speech Database for Automotive Environments | ||
| Moreno Asunción
(Universitat Politècnica de Catalunya, Jordi Girona 1-3 08034
Barcelona, SPAIN, http://gps-tsc.upc.es/veu, asuncion@tsc.upc.es) Lindberg Børge (Center for PersonKommunikation (CPK), Aalborg, Denmark) Draxler Christoph (IPSK of the University of Munich) Richard Gaël (Lernout & Hauspie , France) Choukri Khalid (European Language Resources Association (ELRA) &, European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, choukri@elda.fr) Euler Stephan (Robert Bosch GmbH Germany) Allen Jeffrey (European Language Resources Association (ELRA) &, European Language resources - Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris France, jeff@elda.fr) |
||
| The aims of the SpeechDat-Car project are to develop a set of speech databases to support training and testing of multilingual speech recognition applications in the car environment. As a result, a total of ten (10) equivalent and similar resources will be created. The 10 languages are Danish, British English, Finnish, Flemish/Dutch, French, German, Greek, Italian, Spanish and American English. For each language 600 sessions will be recorded (from at least 300 speakers) in seven characteristic environments (low speed, high speed with audio equipment on, etc.). This paper gives an overview of the project with a focus on the production phases (recording platforms, speaker recruitment, annotation and distribution). | ||
| Keywords: Car environment, GSM signals, Multilingual, Oral Databases, Speech Recognition | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 895-900 | ||
| Files: 373.ps, 373.pdf | ||
|
|
||
| Creation of Spoken Hebrew Databases | ||
| Rannon Tami (NSC,
Natural Speech Communication Ltd., 33 Lazarov ST., P.O. Box 5212,
Rishon-LeZion 75150, Israel, tami_r@nsc.co.il) Golani Ofra (NSC, Natural Speech Communication Ltd., 33 Lazarov ST., P.O. Box 5212, Rishon-LeZion 75150, Israel, ofrag@nsc.co.il) Goren Anat (NSC, Natural Speech Communication Ltd., 33 Lazarov ST., P.O. Box 5212, Rishon-LeZion 75150, Israel, anattr@nsc.co.il) Shammass Sherrie (NSC, Natural Speech Communication Ltd., 33 Lazarov ST., P.O. Box 5212, Rishon-LeZion 75150, Israel, shaunie@nsc.co.il) Moyal Ami (NSC, Natural Speech Communication Ltd., 33 Lazarov ST., P.O. Box 5212, Rishon-LeZion 75150, Israel, amym@nsc.co.il) |
||
| Two Spoken Hebrew databases were collected over fixed telephone lines at NSC - Natural Speech Communication. Their creation was based on the SpeechDat model, and they represent the first comprehensive spoken database in Modern Hebrew that can be successfully applied to the teleservices industry. The speakers are a representative sample of Israelis, based on sociolinguistic factors such as age, gender, years of education and country of origin. The database includes digit sequences, natural numbers, money amounts, time expressions, dates, spelled words, application words and phrases for teleservices (e.g., call, save, play), phonetically rich words, phonetically rich sentences, and names. Both read speech and spontaneous speech were elicited. | ||
| Keywords: Hebrew, Semitic Language SR, Speech Recognition, Spoken Database, Telephony Applications | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 901-904 | ||
| Files: 52.ps, 52.pdf | ||
|
|
||
| Spoken Portuguese: Geographic and Social Varieties | ||
| Bettencourt Gonçalves
José (Centro de Linguística da Universidade de Lisboa,
Av. 5 de Outubro, 85 - 6º, 1050-050 LISBOA, Portugal,
jose.bettencourt@clul.ul.pt) Veloso Rita (Centro de Linguística da Universidade de Lisboa, Av. 5 de Outubro, 85 - 6º, 1050-050 LISBOA, Portugal, rita.veloso@clul.ul.pt) |
||
| The main goal of the Spoken Portuguese: Geographic and Social Varieties project is the teaching of Portuguese as a foreign language. The idea is to provide a collection of authentic spoken texts and to make it user-friendly. Therefore, a selection of spontaneous oral data was made, using either already compiled material or material recorded for this purpose. The final corpus is a representative sample that includes European, Brazilian and African Portuguese, as well as Macau and East-Timor Portuguese. In order to deliver a functional product, the Linguistics Center of Lisbon University developed sound/text alignment software. The final result is a CD-ROM collection that contains 83 text files, 83 sound files and 83 files produced by the sound/text alignment tool. This independence between sound and text files allows the CD-ROM user to use the collection for purposes other than the educational one. | ||
| Keywords: Language Teaching, Listening and Understanding, Portuguese Corpus, Portuguese Varieties, Spoken Portuguese | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 905-908 | ||
| Files: 71.ps, 71.pdf | ||
|
|
||
| Orthographic Transcription of the Spoken Dutch Corpus | ||
| Goedertier Wim
(Electronics and Information Systems (ELIS), University Gent,
Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium, odul@elis.rug.ac.be) Goddijn Simo (Speech Processing Expertise Centre (SPEX), Department of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands, s.goddijn@let.kun.nl) Martens Jean-Pierre (Electronics and Information Systems (ELIS), University Gent, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium, martens@elis.rug.ac.be) |
||
| This paper focuses on the specification of the orthographic transcription task in the Spoken Dutch Corpus, the problems encountered in making that specification and the evaluation experiments that were carried out to assess the transcription efficiency and the inter-transcriber consistency. It is stated that the role of the orthographic transcriptions in the Spoken Dutch Corpus is twofold: on the one hand, the transcriptions are important for future database users, on the other hand they are indispensable to the development of the corpus itself. The main objectives of the transcription task are the following: (1) to obtain a verbatim transcription that can be made with a minimum level of interpretation of the utterances; (2) to obtain an alignment of the transcription to the speech signal on the level of relatively short chunks; (3) to obtain a transcription that is useful to researchers working in several research areas and (4) to adhere to international standards for existing large speech corpora. In designing the transcription protocol and transcription procedure it was attempted to establish the best compromise between consistency, accuracy and usability of the output and efficiency of the transcription task. For example, the transcription procedure always consists of a first transcription cycle and a verification cycle. Some efficiency and consistency statistics derived from pilot experiments with several students transcribing the same material are presented at the end of the paper. In these experiments the transcribers were also asked to record the amount of time they spent on the different audio files, and to report difficulties they encountered in performing their task. | ||
| Keywords: Orthographic Transcription, Speech Corpora, Spoken Dutch, Spoken Language Resources | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 909-914 | ||
| Files: 87.ps, 87.pdf | ||
|
|
||
| Development of Acoustic and Linguistic Resources for Research and Evaluation in Interactive Vocal Information Servers | ||
| Bernardis Giulia
(Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland,
Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP),
Martigny, Switzerland, giulia@idiap.ch) Bourlard Hervé (Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP), Martigny, Switzerland) Rajman Martin (Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland) Chappelier Jean-Cédric (Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland) |
||
| This paper describes the setting up of a resource database for research and evaluation in the domain of interactive vocal information servers. This resource development work took place in a research project aiming at the development of an advanced speech recognition system for the automatic processing of telephone directory requests and was performed on the basis of the Swiss-French Polyphone database (collected in the framework of the European SpeechDat project). Due to the unavailability of a properly orthographically transcribed, consistently labeled and tagged database of unconstrained speech (together with its associated lexicon) for the targeted area, we first concentrated on the annotation and structuring of the spoken request data in order to make it usable for lexical and linguistic modeling and for the evaluation of recognition results. A baseline speech recognition system was then trained on the newly developed resources and tested. Preliminary recognition experiments showed a relative improvement of 46% in Word Error Rate (WER) compared to the results previously obtained with a very similar baseline system working on the inconsistent natural speech database that was originally available. | ||
| Keywords: Knowledge Extraction, Named Entity Tagging, Orthographic Labeling, Speech Data Annotation, Unconstrained Speech Recognition | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 915-920 | ||
| Files: 90.ps, 90.pdf | ||
|
|
||
| Development and Evaluation of an Italian Broadcast News Corpus | ||
| Federico Marcello
(ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050
Povo, Trento, Italy) Giordani Dimitri (ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy) Coletti Paolo (ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy) |
||
| This paper reports on the development and evaluation of an Italian broadcast news corpus at ITC-irst, under a contract with the European Language resources Distribution Agency (ELDA). The corpus consists of 30 hours of recordings transcribed and annotated with conventions similar to those adopted by the Linguistic Data Consortium for the DARPA HUB-4 corpora. The corpus will be completed and released to ELDA by April 2000. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 921-924 | ||
| Files: 95.ps, 95.pdf | ||
|
|
||
| Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts | ||
| Cieri Christopher
(Linguistic Data Consortium, University of Pennsylvania, Philadelphia,
Pennsylvania, USA, ccieri@ldc.upenn.edu) Graff David (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, graff@ldc.upenn.edu) Liberman Mark (Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, myl@ldc.upenn.edu) Martey Nii (Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104, USA, nmartey@ldc.upenn.edu) Strassel Stephanie (Linguistic Data Consortium, 3615 Market Street, Philadelphia, PA 19104, USA, strassel@ldc.upenn.edu) |
||
| This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segmentation of the news streams into individual stories, detection of new topics, identification of the first story to discuss any topic, tracking of all stories on selected topics and detection of links among stories discussing the same topics. The corpora contain English and Chinese broadcast television and radio, newswires, and text from web sites devoted to news. For each source there are texts or text intermediaries; for the broadcast stories the audio is also available. Each broadcast is also segmented to show start and end times of all news stories. LDC staff have defined news topics in the corpora and annotated each story to indicate its relevance to each topic. The end products are massive, richly annotated corpora available to support research and development in information retrieval, topic detection and tracking, information extraction and message understanding, directly or after additional annotation. This paper will describe the corpora created for TDT including sources, collection processes, formats, topic selection and definition, annotation, distribution and project management for large corpora. | ||
| Keywords: Annotation, Data Collection and Distribution, Information Retrieval, Language Resources, Topic Detection and Tracking | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 925-930 | ||
| Files: 210.ps, 210.pdf | ||
|
|
||
| Live Lexicons and Dynamic Corpora Adapted to the Network Resources for Chinese Spoken Language Processing Applications in an Internet Era | ||
| Chien Lee-Feng
(Institute of Information Science, Academia Sinica, Taipei, Taiwan,
Republic of China, lfchien@iis.sinica.edu.tw) Lee Lin-Shan (Institute of Information Science, Academia Sinica, Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan, Republic of China, lsl@iis.sinica.edu.tw) |
||
| In the future network era, a huge volume of information on all subject domains will be readily available via the network. Also, all the network information is dynamic, ever-changing and exploding. Furthermore, many spoken language processing applications will have to deal with the content of the network information, which is dynamic. This means dynamic lexicons, language models and so on will be required. In order to cope with such a new network environment, automatic approaches for the collection, classification, indexing, organization and utilization of the linguistic data obtainable from the networks for language processing applications will be very important. On the one hand, high performance spoken language technology can hopefully be developed based on such dynamic linguistic data on the network. On the other hand, it is also necessary that such spoken language technology can be intelligently adapted to the content of the dynamic and ever-changing network information. Some basic concepts for live lexicons and dynamic corpora adapted to the network resources have been developed for Chinese spoken language processing applications and are briefly summarized in this paper. Although the major considerations here are for the Chinese language, the concepts may equally apply to other languages. | ||
| Keywords: Dynamic Corpora, Internet, Live Lexicon | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 931-936 | ||
| Files: 214.ps, 214.pdf | ||
|
|
||
| Shallow Discourse Genre Annotation in CallHome Spanish | ||
| Ries Klaus
(Interactive System Labs, www.is.cs.cmu.edu, ries@cs.cmu.edu, University
of Karlsruhe, Karlsruhe, Germany) Levin Lori (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, lsl@cs.cmu.edu) Levin Lori (Interactive System Labs, www.is.cs.cmu.edu, lsl@gcs.cmu.edu, Carnegie Mellon University, Pittsburgh, PA, USA) Valle Liza (Interactive System Labs, www.is.cs.cmu.edu, lsl@cs.cmu.edu, Carnegie Mellon University, Pittsburgh, PA, USA) Lavie Alon (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA, alavie@cs.cmu.edu, www.is.cs.cmu.edu) Waibel Alex (Interactive System Labs, www.is.cs.cmu.edu, ahw@cs.cmu.edu, University of Karlsruhe, Karlsruhe, Germany) |
||
| The classification of speech genre is not yet an established task in language technologies. However, we believe that it is a task that will become fairly important as large amounts of audio (and video) data become widely available. The technological capability to easily transmit and store all human interactions in audio and video could have a radical impact on our social structure. The major open question is how this information can be used in practical and beneficial ways. As a first approach to this question we are looking at issues involving information access to databases of human-human interactions. Classification by genre is a first step in the process of retrieving a document out of a large collection. In this paper we introduce a local notion of speech activities that exist side-by-side in conversations that belong to a speech genre: while the genre of CallHome Spanish is personal telephone calls between family members, the actual instances of these calls contain activities such as storytelling, advising, interrogation and so forth. We present experimental work on the detection of those activities using a variety of features. We have also observed that a limited number of distinguished activities can be defined that describe most of the activities in this database in a precise way. | ||
| Keywords: Discourse, Genre, Neural Networks, Speech, Speech Act | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 937-942 | ||
| Files: 228.ps, 228.pdf | ||
|
|
||
| Issues in Design and Collection of Large Telephone Speech Corpus for Slovenian Language | ||
| Kačič
Zdravko (Faculty of Electrical Engineering and Computer Science,
University of Maribor, Smetanova 17, 2000 Maribor, kacic@uni-mb.si) Horvat Bogomir (University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia, bogo.horvat@uni-mb.si) Zögling Aleksandra (University of Maribor, Research and Study Centre, Razlagova 22, 2000 Maribor, Slovenia, sandra.zogling@uni-mb.si) |
||
| In this paper, different issues in the design, collection and evaluation of a large vocabulary telephone speech corpus of the Slovenian language are discussed. The database is composed of three text corpora containing 1530 different sentences. It contains read speech of 82 speakers, where each speaker read on average more than 200 sentences and 21 speakers also read a text passage of 90 sentences. An initial manual segmentation and labeling of the speech material was performed, and the automatic segmentation was carried out on that basis. The database should facilitate the development of speech recognition systems to be used in dictation tasks over the telephone. Until now the database has been used mostly for isolated digit recognition tasks and word spotting. | ||
| Keywords: Continuous Speech Recognition over the Telephone, Language Resources, Speech Databases, Speech Dictation Task | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 943-946 | ||
| Files: 246.ps, 246.pdf | ||
|
|
||
| Spontaneous Speech Corpus of Japanese | ||
| Maekawa Kikuo
(The National Language Research Institute, 3-9-14 Nishiga’oka, Kita-ku,
Tokyo 115-8620 Japan, kikuo@kokken.go.jp) Koiso Hanae (The National Language Research Institute, 3-9-14 Nishiga’oka, Kita-ku, Tokyo 115-8620 Japan, koiso@kokken.go.jp) Furui Sadaoki (Tokyo Institute of Technology, 2-12-1, Ookayama, Meguro-ku, Tokyo 152-8552 Japan, furui@furui.cs.titech.ac.jp) Isahara Hitoshi (Communications Research Laboratory, 588-2, Iwaoka, Nishi-ku, Kobe 651-2401 Japan, isahara@crl.go.jp) |
||
| Design issues of a spontaneous speech corpus are described. The corpus under compilation will contain 800-1000 hours of spontaneously uttered Common Japanese speech and morphologically annotated transcriptions. Also, segmental and intonation labeling will be provided for a subset of the corpus. The primary application domain of the corpus is speech recognition of spontaneous speech, but we plan to make it useful for natural language processing and phonetic/linguistic studies as well. | ||
| Keywords: Intonation Labeling, Japanese, Speech Recognition, Spontaneous Speech | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 947-952 | ||
| Files: 262.ps, 262.pdf | ||
|
|
||
| Corpora of Slovene Spoken Language for Multi-lingual Applications | ||
| Gros Jerneja
(Faculty of Electrical Engineering, University of Ljubljana, Tržaška
25, 1001 Ljubljana, Slovenia, nejka@fe.uni-lj.si) Mihelič France (Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1001 Ljubljana, Slovenia, mihelicf@fe.uni-lj.si) Dobrišek Simon (Faculty of Electrical Engineering, University of Ljubljana, Laboratory of Artificial Perception, Tržaška 25, 1000 Ljubljana, Slovenia, simond@fe.uni-lj.si) Erjavec Tomaž (Dept. for Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, tomaz.erjavecg@ijs.si) Žganec Mario (Masterpoint R&D, Baznikova 40, 1000 Ljubljana, Slovenia, Mario@masterpoint.si) |
||
| The domain of spoken language technologies ranges from speech input and output systems to complex understanding and generation systems, including multimodal systems of widely differing complexity (such as automatic dictation machines) and multilingual systems (for example automatic dialogue and translation systems). The definition of standards and evaluation methodologies for such systems involves the specification and development of highly specific spoken language corpus and lexicon resources, and measurement and evaluation tools (EAGLES Handbook 1997). This paper presents the MobiLuz spoken resources of the Slovene language, which will be made freely available for research purposes in speech technology and linguistics. | ||
| Keywords: Annotation Tools, Continuous Speech, Diphone Inventory, Speech Corpus, Spoken Commands | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 953-956 | ||
| Files: 288.ps, 288.pdf | ||
|
|
||
| The ISLE Corpus of Non-Native Spoken English | ||
| Menzel Wolfgang
(Universität Hamburg, Fachbereich Informatik, Vogt-Kölln-Strasse
30, 22527 Hamburg, Germany, menzel@informatik.uni-hamburg.de) Atwell Eric (School of Computer Studies, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, United Kingdom, eric@scs.leeds.ac.uk) Bonaventura Patrizia (Universität Hamburg, Fachbereich Informatik, Vogt-Kölln-Strasse 30, 22527 Hamburg, Germany, pbonaven@informatik.uni-hamburg.de) Herron Daniel (Universität Hamburg, Fachbereich Informatik, Vogt-Kölln-Strasse 30, 22527 Hamburg, Germany, herron@informatik.uni-hamburg.de) Howarth Peter (University of Leeds, Woodhouse Lane, Leeds LS2 9JT, Great Britain, p.a.howarth@leeds.ac.uk) Morton Rachel (Entropic Cambridge Research Labs, Compass House, 80-82 Newmarket Road, Cambridge, CB1 4LD, Great Britain, rim@entropic.co.uk) Souter Clive (School of Computer Studies, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, United Kingdom, cs@scs.leeds.ac.uk) |
||
| For the purpose of developing pronunciation training tools for second language learning a corpus of non-native speech data has been collected, which consists of almost 18 hours of annotated speech signals spoken by Italian and German learners of English. The corpus is based on 250 utterances selected from typical second language learning exercises. It has been annotated at the word and the phone level, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments. The data has been used to develop and evaluate several diagnostic components, which can be used to produce corrective feedback of unprecedented detail to a language learner. | ||
| Keywords: Non-Native Speech, Pronunciation Training, Speech Corpus Annotation, Speech Corpus Design, Speech Recognition | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 957-964 | ||
| Files: 313.ps, 313.pdf | ||
|
|
||
| Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition | ||
| Nakamura Satoshi
(ATR Spoken Language Translation Labs 2-2, Hikaridai Seikacho Kyoto
619-0288, Japan, nakamura@slt.atr.co.jp) Hiyane Kazuo (Mitsubishi Research Institute 2-3-6, Otemachi Chiyoda Tokyo 100-8141, Japan, hiya@mri.co.jp) Asano Futoshi (Electrotechnical Laboratory 1-1-4, Umezono Tsukuba Ibaraki 305, Japan, asano@etl.go.jp) Nishiura Takanobu (Nara Institute of Science and Technology 8916-5 Takayama Ikoma Nara 630-0101, Japan, takano-n@is.aist-nara.ac.jp) Yamada Takeshi (Tsukuba University 1-1-1, Tennodai Tsukuba Ibaraki 305, Japan, takeshi@is.tsukuba.ac.jp) |
||
| This paper reports on a project for the collection of sound scene data. Sound scene data is necessary for studies such as sound source localization, sound retrieval, sound recognition and hands-free speech recognition in real acoustical environments. There are many kinds of sound scenes in real environments. A sound scene is characterized by its sound sources and room acoustics. The number of combinations of sound sources, source positions and rooms is huge in real acoustical environments. However, the sound in these environments can be simulated by convolution of the isolated sound sources and impulse responses. As isolated sound sources, a hundred kinds of non-speech sounds and speech sounds are collected. The impulse responses are collected in various acoustical environments. In this paper, the progress of our sound scene database project and its application to environment sound recognition are described. | ||
| Keywords: Dry sound source, Environment sound recognition, Hands-free speech recognition, Impulse response, Microphone array, Real environments, Sound Scene | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 965-968 | ||
| Files: 356.ps, 356.pdf | ||
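The simulation idea stated in the abstract above (a sound scene obtained by convolving isolated sound sources with impulse responses) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `convolve` and `mix_scene`, the representation of signals as plain lists of samples, and the toy values are all hypothetical.

```python
def convolve(dry, impulse_response):
    """Full discrete convolution of a dry source with a room impulse response."""
    out = [0.0] * max(len(dry) + len(impulse_response) - 1, 0)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            # Each input sample excites a delayed, scaled copy of the response.
            out[i + j] += x * h
    return out


def mix_scene(sources):
    """Sum several (dry_signal, impulse_response) pairs into one simulated signal.

    Each pair stands for one source at one position in one room (an assumption
    of this sketch); the scene is the sample-wise sum of the rendered sources.
    """
    rendered = [convolve(dry, ir) for dry, ir in sources]
    scene = [0.0] * max((len(r) for r in rendered), default=0)
    for r in rendered:
        for k, v in enumerate(r):
            scene[k] += v
    return scene


# Toy example: a two-sample source in a room with one echo, plus a click
# recorded with an ideal (unit) response.
scene = mix_scene([([1.0, 2.0], [1.0, 0.5]), ([1.0], [1.0])])
```

The nested loop is the direct form of the convolution sum; for signals of realistic length an FFT-based method would normally be used instead.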
|
|
||
| The Influence of Scenario Constraints on the Spontaneity of Speech. A Comparison of Dialogue Corpora | ||
| Weilhammer Karl
(Institute of Phonetics and Speech Communication, Schellingstr. 3, 80799
Munich, Germany, karl.weilhammer@phonetik.uni-muenchen.de) Oppermann Daniela (Institute of Phonetics and Speech Communication, Schellingstr. 3, 80799 Munich, Germany, daniela.oppermann@phonetik.uni-muenchen.de) Burger Susanne (Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, USA, University of Karlsruhe, Germany, sburger@cs.cmu.edu) |
||
| In this article we compare two large-scale dialogue corpora recorded in different settings. The main differences are unrestricted turn-taking vs. a push-to-talk button and a complex vs. a simple negotiation task. In our investigation we found that vocabulary, durations of turns, words and sounds, as well as prosodic features, are influenced by differences in the setting. | ||
| Keywords: Dialogue Corpora, Spontaneous Speech, Turn-Taking, Variation | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 969-974 | ||
| Files: 217.ps, 217.pdf | ||
|
|
||
| Developing a Multilingual Telephone Based Information System in African Languages | ||
| Roux J.C.
(Research Unit for Experimental Phonology, University of Stellenbosch,
Private Bag X1 Stellenbosch 7602, South Africa, jcr@maties.sun.ac.za) Botha E.C. (Department of Electrical and Electronic Engineering, University of Pretoria, Pretoria, South Africa, botha@ee.up.ac.za) du Preez J.A. (Department of Electrical and Electronic Engineering, University of Stellenbosch, Stellenbosch, South Africa, dupreez@dsp.sun.ac.za) |
||
| This paper introduces the first project of its kind within the Southern African language engineering context. It focuses on the role of idiosyncratic linguistic and pragmatic features of the different languages concerned and how these features are to be accommodated within (a) the creation of applicable speech corpora and (b) the design of the system at large. An introduction to the multilingual realities of South Africa and their implications for the development of databases is followed by a description of the system and different options that may be implemented in the system. | ||
| Keywords: African Languages, Language Engineering, Speech Databases | ||
| LREC2000 Proceedings: Session SP3 - Spoken Language Resources' Projects, pages 975-980 | ||
| Files: 329.ps, 329.pdf | ||
|
|
||
| Extraction of Concepts and Multilingual Information Schemes from French and English Economics Documents | ||
| Cadel Peggy (Laboratoire
d'Ingénierie Linguistique et de Linguistique Appliquée,
Université de Nice-Sophia Antipolis, cadel@hermes.unice.fr) Ledouble Hélène (Laboratoire d'Ingénierie Linguistique et de Linguistique Appliquée, Université de Nice-Sophia Antipolis, ledouble@hermes.unice.fr) |
||
| This paper focuses on the linguistic analysis of economic information in French and English documents. Our objective is to establish domain-specific information schemes based on structural and conceptual information. At the structural level, we define linguistic triggers that take into account each language's specificity. At the conceptual level, analysis of concepts and relations between concepts results in a classification, prior to the representation of schemes. The final outcome of this study is a mapping between linguistic and conceptual structures in the field of economics. | ||
| Keywords: Conceptual Representation, Multilingual Information Schemes, Syntactic and Semantic Constraints | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 983-986 | ||
| Files: 202.ps, 202.pdf | ||
|
|
||
| Application of WordNet ILR in Czech Word-formation | ||
| Klímová
Jana (Institute of Czech Language, Academy of Sciences of the Czech
Republic, Letenská 4, 118 51 Praha, Czech Republic, jana.klimova@ff.cuni.cz) Pala Karel (Dept. of Information Technologies, Faculty of Informatics, Masaryk University Brno, Botanická 68a, 600 00 Brno, Czech Republic, Pala@fi.muni.cz) |
||
| The aim of this paper is to describe some typical word formation procedures in Czech and to show how the internal language relations (ILR) as they are introduced in Czech WordNet can be related to the chosen derivational processes. In our exploration we have paid attention to the roles of agent, location, instrument and subevent which yield the most regular and rich ways of suffix derivation in Czech. We also deal with the issues of the translation equivalents and corresponding lexical gaps that had to be solved in the framework of EuroWordNet 2 (confronting Czech with English) since they are basically brought about by verb prefixation (single, double, verb aspect pairs) or noun suffixation (diminutives, move in gender). Finally, we try to demonstrate that the mentioned derivational processes can be employed to extend Czech lexical resources in a semiautomatic way. | ||
| Keywords: Corpus, Derivation, Word Formation, WordNet | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 987-992 | ||
| Files: 223.ps, 223.pdf | ||
|
|
||
| Coping with Lexical Gaps when Building Aligned Multilingual Wordnets | ||
| Bentivogli Luisa
(ITC-irst, via Sommarive 18, I-38050 Povo - Trento, Italy, fbentivo@irst.itc.it) Pianta Emanuele (ITC-irst, via Sommarive 18, I-38050 Povo - Trento, Italy, pianesig@irst.itc.it) Pianesi Fabio (ITC-irst, via Sommarive 18, I-38050 Povo - Trento, Italy, pianta@irst.itc.it) |
||
| In this paper we present a methodology for automatically classifying the translation equivalents of a machine readable bilingual dictionary into three main groups: lexical units, lexical gaps (that is, cases where a lexical concept of one language does not have a correspondent in the other language) and translation equivalents that need to be manually classified as lexical units or lexical gaps. This preventive classification reduces the manual work necessary to cope with lexical gaps in the construction of aligned multilingual wordnets. | ||
| Keywords: Electronic Dictionaries, Lexical Gaps, Multilingual Wordnets | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 993-998 | ||
| Files: 236.ps, 236.pdf | ||
|
|
||
| Extension and Use of GermaNet, a Lexical-Semantic Database | ||
| Kunze Claudia
(Seminar für Sprachwissenschaft, University of Tübingen,
Wilhelmstr. 113, 72074 Tübingen, Germany, kunze@sfs.nphil.uni-tuebingen.de) |
||
| This paper describes GermaNet, a lexical-semantic network and on-line thesaurus for the German language, and outlines its future extension and use. GermaNet is structured along the same lines as the Princeton WordNet (Miller et al., 1990; Fellbaum, 1998), encoding the major semantic relations like synonymy, hyponymy, meronymy, etc. that hold among lexical items. Constructing semantic networks like GermaNet has become very popular in recent approaches to computational lexicography, since wordnets constitute important language resources for word sense disambiguation, which is a prerequisite for various applications in the field of natural language processing, like information retrieval, machine translation and the development of different language-learning tools. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 999-1002 | ||
| Files: 369.ps, 369.pdf | ||
|
|
||
| CDB - A Database of Lexical Collocations | ||
| Krenn Brigitte
(Austrian Research Institute for Artificial Intelligence (ÖFAI)
Freyung 6/3/1a, Vienna, Austria email: brigitte@ai.univie.ac.at) |
||
| CDB is a relational database designed for the particular needs of representing lexical collocations. The relational model is defined such that competence-based descriptions of collocations (the competence base) and actually occurring collocation examples extracted from text corpora (the example base) complement each other. In the paper, the relational model is described and examples of the representation of German PP-verb collocations are given. A number of example queries are presented, and additional facilities which are built on top of the database are discussed. | ||
| Keywords: Computational Lexicon, Language Resource, Lexicon-Cum-Corpus, Linguistic Database, Relational Model and Database for Lexical Collocations | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1003-1008 | ||
| Files: 189.ps, 189.pdf | ||
|
|
||
| Towards a Strategy for a Representation of Collocations - Extending the Danish PAROLE-lexicon | ||
| Braasch Anna
(Center for Sprogteknologi Njalsgade 80, DK-2300, Denmark, e-mail: anna@cst.ku.dk) Olsen Sussi (Center for Sprogteknologi Njalsgade 80, DK-2300, Denmark, e-mail: sussi@cst.ku.dk) |
||
| We describe our attempts to formulate a pragmatic definition and a partial typology of the lexical category of 'collocation', taking both lexicographical and computational aspects into consideration. This provides a suitable basis for encoding collocations in an NLP-lexicon. Further, this paper explains the principles of an operational encoding strategy which is applied to a core section of the typology, namely to subtypes of verbal collocation. This strategy is adapted to a pre-defined lexicon model which has been developed in the PAROLE-project. The work is carried out within the framework of the STO-project, the aim of which is to extend the Danish PAROLE-lexicon. The encoding of collocations, in addition to single-word lemmas, greatly increases the lexical and linguistic coverage and thereby also the usability of the lexicon as a whole. Decisions concerning the selection of the most frequent types of collocation to be encoded are based on empirical data, i.e. corpus-based recognition. We present linguistic descriptions with a focus on some characteristic syntactic features of collocations that are observed in a newspaper corpus. We then give a few prototypical examples provided with formalised descriptions in order to illustrate the restriction features. Finally, we discuss the perspectives of the work done so far. | ||
| Keywords: Collocation, NLP-Lexicon, PAROLE, Word Combinations | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1009-1016 | ||
| Files: 47.ps, 47.pdf | ||
|
|
||
| Improving Lexical Databases with Collocational Information: Data from Portuguese | ||
| Guerreiro Paula
(Centro de Linguística da Universidade de Lisboa: CLG –
Computation of Lexical and Grammatical Knowledge Research Group Av. 5 de
Outubro, 85 – 5º. 1000 Lisboa. Portugal email:Paula.Guerreiro@clul.ul.pt) |
||
| This article focuses on ongoing work done for Portuguese concerning the phenomenon of lexical co-occurrence known as collocation (cf. Cruse, 1986, inter al.). Instances of the syntactic variety formed by noun plus adjective have been especially observed. Collocational instances are not lexical entries, and thus should not be stored in the lexicon as multiword lexical units. Their processing can be conceived through relations linking the lexical components. Mechanisms for dealing with the collocation-hood of the expressions are required to be included in the systems, topographically, in their lexical modules. Lexical databases like wordnets, with a general architecture typically structured on semantic relations, make room for the specification of this phenomenon. This can be handled through the definition of ad-hoc relations expressing the different semantic effects the adjectival modification brings to nominal phrases, collocationally. | ||
| Keywords: Collocation, Lexical Databases, Lexical Semantics, Portuguese Resources, Privileged co-Occurrence | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1017-1020 | ||
| Files: 340.ps, 340.pdf | ||
|
|
||
| A Bilingual Electronic Dictionary for Frame Semantics | ||
| Fontenelle Thierry
(19 Rue du Merschgrund, L-8373 Hobscheid, Luxembourg email:fontenel@pt.lu) |
||
| Frame semantics is a linguistic theory which is currently gaining ground. The creation of lexical entries for a large number of words presupposes the development of complex lexical acquisition techniques in order to identify the vocabulary for describing the elements of a 'frame'. In this paper, we show how a lexical-semantic database compiled on the basis of a bilingual (English-French) dictionary can be used to identify some general frame elements which are relevant in a frame-semantic approach such as the one adopted in the FrameNet project (Fillmore & Atkins 1998, Gahl 1998). The database has been systematically enriched with explicit lexical-semantic relations holding between some elements of the microstructure of the dictionary entries. The manifold relationships have been labelled in terms of lexical functions, based on Mel'cuk's notion of co-occurrence and lexical-semantic relations in Meaning-Text Theory (Mel'cuk et al. 1984). We show how these lexical functions can be used and refined to extract potential realizations of frame elements such as typical instruments or typical locatives, which are believed to be recurrent elements in a large number of frames. We also show how the database organization of the computational lexicon makes it possible to readily access implicit and translationally-relevant combinatorial information. | ||
| Keywords: Bilingual Dictionary, Collocations, Computational Lexicography, Lexical Function, Lexical-Semantic Database | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1021-1028 | ||
| Files: 69.ps, 69.pdf | ||
|
|
||
| A Text->Meaning->Text Dictionary and Process | ||
| Dutoit Dominique
(MEMODATA, 17 Rue Dumont d’Urville, 14000 Caen, France, memodata@wanadoo.fr) |
||
| In this article we deal with various applications of a multilingual semantic network named The Integral Dictionary. We review different commercial applications that use semantic networks and we show the results obtained with the Integral Dictionary. The details of the semantic calculations are not given here, but we show that, contrary to the WordNet semantic net, the Integral Dictionary provides most of the data and relations needed for these calculations. The article presents results and discussion on lexical expansion, lexical reduction, WSD, query expansion, lexical translation extraction, document summarisation, e-mail sorting, catalogue access and information retrieval. We conclude that a resource like the Integral Dictionary can be a good next step for all those who have tried to compute semantics with WordNet, and that the complementarity between the two dictionaries could be seriously studied in a shared project. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1029-1034 | ||
| Files: 132.ps, 132.pdf | ||
|
|
||
| Production of NLP-oriented Bilingual Language Resources from Human-oriented dictionaries | ||
| Fluhr-Semenova Vera
(SCIPER 46 rue du Moulin a Tan, 91150 Etampes, France, fluhrsciper@aol.com) Fluhr Christian (SCIPER 46 rue du Moulin a Tan, 91150 Etampes, France, fluhrsciper@aol.com) Brisson Stéphanie (SCIPER 46 rue du Moulin a Tan, 91150 Etampes, France , email: 101376.156@compuserve.com) |
||
| In this paper, the main features of manually produced bilingual dictionaries, originally designed for human use, are considered. The problem is to find a way to use such dictionaries to produce bilingual language resources that could form a basis for automated text processing, such as machine translation, cross-lingual querying in text retrieval, etc. The transformation technology suggested here is based on XML parsing of the file obtained from the source data by means of a series of special procedures. Automatic procedures suffice to produce a well-formed XML file. But in most cases, there remain semantic problems and inconveniences that can be removed only interactively. However, the volume of this work can be minimized through automatic pre-editing and suitable XML mark-up. The paper presents the results of an R&D project which was carried out in the framework of the ELRA’1999 Call for Proposals on Language Resources Production. The paper is based on the authors’ experience with English-Russian and French-Russian dictionaries, but the technology can be applied to other pairs of languages. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP4 - Lexicon: Semantic and Multilingual Issues, pages 1035-1038 | ||
| Files: 328.ps, 328.pdf | ||
|
|
||
| Terms Specification and Extraction within a Linguistic-based Intranet Service | ||
| Pedrazzini Sandro
(University of Basel Klingelbergstrasse 50 4056 Basel (Switzerland) and
SUPSI/IDSIA Galleria 2 6928 Manno (Switzerland), email: sandro@idsia.ch) Maier Elisabeth (Canoo Engineering AG Kirschgartenstrasse 7 4051 Basel (Switzerland), elisabeth.maier@canoo.com) König Dierk (Canoo Engineering AG Kirschgartenstrasse 7 4051 Basel (Switzerland), dierk.koenig@canoo.com) |
||
| This paper describes the adaptation and extension of an existing morphological system, Word Manager, and its integration into an intranet service of a large international bank. The system includes a tool for the analysis and extraction of simple and complex terms. As a side-effect the procedure for the definition of new terms has been consolidated. The intranet service analyzes HTML pages on the fly, compares the results with the vocabulary of an inhouse terminological database (CLS-TDB) and generates hyperlinks in case matches have been found. Currently, the service handles terms in both German and English. The implementation of the service for Italian, French and Spanish is under way. | ||
| Keywords: Finite-state tools, Morphology, Terminology, Text Analysis | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1041-1046 | ||
| Files: 17.ps, 17.pdf | ||
|
|
||
| With WORLDTREK Family, Create, Update and Browse your Terminological World | ||
| Abbas Yasmina (EDF
– Division Recherche & Développement (Division R&D),
MTI/NTIC/TAIC 01, Avenue du Général de Gaulle 92141
CLAMART Cedex, FRANCE INALCO , email: yasmina.abbas@edf.fr) Picard Marie-Luce (EDF - Division Recherche & Développement (Division R&D), SDC/CLEO/SOAD 01, Avenue du Général de Gaulle 92141 CLAMART Cedex, FRANCE, email: marie-luce.picard@edf.fr) |
||
| Companies need to extract pertinent and coherent information from large collections of documents to be competitive and efficient. Structured terminologies are essential for better drafting, translation or understanding of technical communication. WORLDTREK EDITION is a tool created to help the terminologist elaborate, browse and update structured terminologies in an ergonomic environment without changing his or her working method. This application can be entirely adapted to the "terminological habits" of the expert. Thus, the data loaded in the software is meta-data. Links, status, property names and domains can be customized. Moreover, the validation stage is facilitated by the use of templates, queries and filters. New terms and links can be easily created to enrich the domains and points of view. Properties like definition, context and equivalents in foreign languages are associated with the terms. WORLDTREK EDITION facilitates the comparison and merging of pre-existing networks. All these tasks and the visualization techniques constitute the tool which will help the terminologist to be more effective and productive. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1047-1052 | ||
| Files: 31.ps, 31.pdf | ||
|
|
||
| Extraction of Semantic Clusters for Terminological Information Retrieval from MRDs | ||
| Sierra Gerardo
(Instituto de Ingeniería, UNAM Apdo. Postal 70-472 México
04510, D.F., email: gsm@pumas.iingen.unam.mx) McNaught John (Centre for Computational Linguistics, UMIST P.O.Box 88 Manchester, U.K., M60 1QD email: jock@ccl.umist.ac.uk) |
||
| This paper describes a semantic clustering method for data extracted from machine readable dictionaries (MRDs) in order to build a terminological information retrieval system that finds terms from descriptions of concepts. We first examine approaches based on ontologies and statistics, before introducing our analogy-based approach that lets us extract semantic clusters by aligning definitions from two dictionaries. Evaluation of the final set of clusters for a small set of definitions demonstrates the utility of our approach. | ||
| Keywords: Clustering, Definitions, Dictionaries, Information Retrieval, Lexicography, Natural Language Processing, Ontologies, Semantics, Terminology | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1053-1060 | ||
| Files: 35.ps, 35.pdf | ||
|
|
||
| Reusing the Mikrokosmos Ontology for Concept-based Multilingual Terminology Databases | ||
| Moreno Antonio
(Universidad de Málaga, F. Filosofía y Letras, Campus de
Teatinos, 29071 Málaga, Spain, amo@uma.es) Pérez Chantal (Universidad de Málaga F. Filosofía y Letras, Campus de Teatinos 29071 Málaga, Spain email:amo@uma.es) |
||
| This paper reports work carried out within a multilingual terminology project (OncoTerm) in which the Mikrokosmos (µK) ontology (Mahesh, 1996; Viegas et al., 1999) has been used as a language independent conceptual structure to achieve a truly concept-based terminology database (termbase, for short). The original ontology, containing nearly 4,700 concepts and available in Lisp-like format (January 1997 version), was first converted into a set of tables in a relational database. A specific software tool was developed in order to edit and browse this resource. This tool has now been integrated within a termbase editor and released under the name of OntoTerm™. In this paper we focus on the suitability of the µK ontology for the representation of domain-specific knowledge and its associated lexical items. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1061-1066 | ||
| Files: 74.ps, 74.pdf | ||
|
|
||
| Term-based Identification of Sentences for Text Summarisation | ||
| Georgantopoulos
Byron (Institute for Language and Speech Processing Epidavrou &
Artemidos 6, 151 25 Maroussi, Greece email: byron@ilsp.gr) Piperidis Stelios (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25, Athens, Greece, tel: +301 6875300, fax: +301 6854270, spip@ilsp.gr) |
||
| The present paper describes a methodology for automatic summarisation of Greek texts which combines terminology extraction and sentence spotting. Since generating abstracts has proven a hard NLP task of questionable effectiveness, the paper focuses on the production of a special kind of abstract, called an extract: a set of sentences taken from the original text. These sentences are selected on the basis of the amount of information they carry about the subject content. The proposed corpus-based, statistical approach exploits several heuristics to determine the summary-worthiness of sentences. It uses statistical occurrences of terms (TF·IDF formula) and several cue phrases to calculate sentence weights and then extracts the top-scoring sentences, which form the extract. | ||
| Keywords: Automatic Term Extraction, Sentence Extraction, Statistical NLP, Terminological Resources, Text Summarisation | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1067-1070 | ||
| Files: 106.ps, 106.pdf | ||
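The sentence weighting described in the abstract above (TF·IDF term scores plus cue phrases, then selection of the top-scoring sentences) can be sketched roughly as follows. This is not the authors' system: the whitespace tokenisation, the particular TF·IDF variant (term frequency times log(N/df)), the additive `cue_bonus`, and all function and parameter names are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus_docs):
    """Weight each term as tf * log(N / df).

    corpus_docs is a list of term sets, one per reference document
    (assumed to include the document being summarised, so df >= 1).
    """
    n_docs = len(corpus_docs)
    weights = {}
    for term, freq in Counter(doc_tokens).items():
        df = max(1, sum(1 for doc in corpus_docs if term in doc))
        weights[term] = freq * math.log(n_docs / df)
    return weights


def extract(sentences, corpus_docs, cue_phrases=(), cue_bonus=1.0, top_n=2):
    """Return the top_n highest-weighted sentences in their original order."""
    tokens = [s.lower().split() for s in sentences]
    weights = tf_idf_weights([t for sent in tokens for t in sent], corpus_docs)
    scored = []
    for idx, (sent, toks) in enumerate(zip(sentences, tokens)):
        score = sum(weights.get(t, 0.0) for t in toks)
        # Hypothetical heuristic: a fixed bonus for every cue phrase present.
        score += cue_bonus * sum(1 for cue in cue_phrases if cue in sent.lower())
        scored.append((score, idx))
    best = sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda p: p[1])]
```

Returning the selected sentences in document order, rather than in score order, keeps the extract readable as running text.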
|
|
||
| Terminology Encoding in View of Multifunctional NLP Resources | ||
| Katsoyannou Marianna
(ILSP-Institute for Language and Speech Processing 6, Artemidos Str.
& Epidaurou, 151 25 Paradissos Amaroussiou, Greece,
email:marianna@ilsp.gr) Efthimiou Eleni (ILSP-Institute for Language and Speech Processing 6, Artemidos Str. & Epidaurou, 151 25 Paradissos Amaroussiou, Greece, email: eleni_e@ilsp.gr) |
||
| Given the existing standards for organising terminology resources, the main question raised is how to create a DB or assimilated term list with properties allowing for an efficient NLP treatment of input texts. Here, we have dealt with the output of MT and have attempted to improve terminological annotation of the input text, in order to optimize reusability and efficiency of performance. By organizing terms in DB-like tables, which provide various cross-linked indications about head properties, morpho-syntax, derivational morphology and semantic-pragmatic relations between concepts of terms, we have managed to improve functionality of resources and enable better customisation. Moreover, we have tried to view the proposed term DB organisation as part of a global account of the problem of terminology resolution on-processing via grammar based or user-machine interaction techniques for term recognition and disambiguation, since term boundary definition is generally recognised to be a complex and costly enterprise, directly related to the fact that most problem causing terminology items are multi-word units either characterized as fixed or as ad hoc or not yet fixed terms. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1071-1074 | ||
| Files: 275.ps, 275.pdf | ||
|
|
||
| ARISTA Generative Lexicon for Compound Greek Medical Terms | ||
| Kontos John
(Department of Informatics, Athens University of Economics and Business
76 Patission St., 104 34 Athens, Hellas email:jpk@aueb.gr) Malagardi Ioanna (Department of Informatics Athens University of Economics and Business, 76 Patission St., 104 34 Athens, Hellas email:ioanna@ilsp.gr) Fountoukis Spyros (Department of Informatics Athens University of Economics and Business 76 Patission St., 104 34 Athens, Hellas) |
||
| A Generative Lexicon for Compound Greek Medical Terms based on the ARISTA method is proposed in this paper. Following the ARISTA method, the concept of a representation-independent definition-generating lexicon for compound words is introduced. This concept is used as a basis for developing a generative lexicon of Greek compound medical terminology using the senses of the component words expressed in natural language and not in a formal language. A Prolog program implemented for this task is presented; it is capable of computing implicit relations between the component words in a sublanguage using linguistic and extra-linguistic knowledge. An extra-linguistic knowledge base containing knowledge derived from the domain or microcosm of the sublanguage is used for supporting the computation of the implicit relations. The performance of the system was evaluated by generating possible senses of the compound words automatically and judging the correctness of the results by comparing them with definitions given in a medical lexicon expressed in the language of the lexicographer. | ||
| Keywords: Compound Nouns, Generative Lexicon, Implicit Relations, Medical Terms, Sublanguages | ||
| LREC2000 Proceedings: Session TP1 - Terminology, pages 1075-1078 | ||
| Files: 360.ps, 360.pdf | ||
|
|
||
| Hua Yu: A Word-segmented and Part-Of-Speech Tagged Chinese Corpus | ||
| Maosong Sun (The
State Key Laboratory of Intelligent Technology and Systems Tsinghua
University, Beijing 100084, P. R. China) Honglin Sun (Language Information Processing Institute Beijing Language and Culture University, Beijing 100084, P. R.China) Changning Huang (The State Key Laboratory of Intelligent Technology and Systems Tsinghua University, Beijing 100084, P. R. China) Pu Zhang (Language Information Processing Institute Beijing Language and Culture University, Beijing 100084, P. R.China) Hongbing Xing (Language Information Processing Institute Beijing Language and Culture University, Beijing 100084, P. R.China) Qiang Zhou (The State Key Laboratory of Intelligent Technology and Systems Tsinghua University, Beijing 100084, P. R. China) |
||
| As the outcome of a 3-year joint effort of the Department of Computer Science, Tsinghua University and the Language Information Processing Institute, Beijing Language and Culture University, Beijing, China, a word-segmented and part-of-speech tagged Chinese corpus of 2 million Chinese characters, named HuaYu, has been established. This paper first briefly introduces some basics about HuaYu, such as its genre distribution, the fundamental considerations in designing it, and its word segmentation and part-of-speech tagging standards. Then the complete tag set used in HuaYu is given, along with typical examples for each tag. Finally, several pieces of annotated text in each genre are included for the reader's reference. | ||
| Keywords: Annotated Corpus, Chinese Information Processing, Tag Set for Chinese, Word Segmentation and Part-of-Speech Tagging | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1081-1086 | ||
| Files: 372.ps, 372.pdf | ||
|
|
||
| Morphological Tagging to Resolve Morphological Ambiguities | ||
| Birocheau Gaëlle
(Centre Lucien TESNIERE, Université de FRANCHE-COMTE, BESANCON,
FRANCE 10, rue Léonard de Vinci, 25000 BESANCON, FRANCE
gaelle.birocheau@univ-fcomte.fr) |
||
| The aim of this paper is to present the advantages of a morphological tagging of English in order to resolve morphological ambiguities. Such a way of tagging seems to be more efficient because it allows an intensional description of morphological forms, compared with the extensional collection of usual dictionaries. This method has already been tested on French and has given promising results. It is very relevant since it makes it possible both to bring to light hidden morphological rules, which are very useful especially for foreign learners, and to take lexical creativity into account. Moreover, this morphological tagging was conceived in relation to the subsequent disambiguation, which is mainly based on local grammars. The purpose is to create a morphological analyser that is easily adaptable and modifiable and that avoids the usual errors of ordinary morphological taggers linked to dictionaries. | ||
| Keywords: BNC2, English, Local grammars, Morphological ambiguities, Morphological tagging, NLP, POS1 | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1087-1094 | ||
| Files: 277.ps, 277.pdf | ||
|
|
||
| Morphemic Analysis and Morphological Tagging of Latvian Corpus | ||
| Levāne Kristīne
(Institute of Mathematics and Computer Science of the University of
Latvia, Raina bulvaris 29, LV - 1459, Riga, Latvia,
email:kristine@ailab.miii.lu.lv) Spektors Andrejs (Institute of Mathematics and Computer Science of the University of Latvia Raina bulvaris 29, LV - 1459, Riga, Latvia, email: aspekt@ailab.mii.lu.lv) |
||
| The Latvian Corpus contains approximately 8 million running words, which is its initial size for investigations using a national corpus. The corpus contains different texts: modern written Latvian, different newspapers, Latvian classical literature, the Bible, Latvian Folk Beliefs, Latvian Folk Songs, Latvian Fairy-tales and others. The methodology and the software for SGML tagging were developed by the Artificial Intelligence Laboratory; approximately 3 million running words are marked up in SGML. The first step was to develop morphemic analysis in co-operation with Dr. B. Kangere from Stockholm University. The first morphological analyzer was developed in 1994 at the Artificial Intelligence Laboratory. The analyzer has its own tag system. Later the tags for the morphological analyzer were elaborated according to MULTEXT-EAST recommendations. The Latvian morphological system is rather complicated, and the recognition of words and word forms poses many difficulties, since Latvian has many homonymous forms. The first corpus of morphologically analysed texts has been marked up manually. In total it covers approximately 10 000 words of modern written Latvian. The results of this work will be used in further investigations. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1095-1098 | ||
| Files: 107.ps, 107.pdf | ||
|
|
||
| Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets | ||
| Džeroski Sašo
(Institute Jozef Stefan, Ljubljana, Slovenia) Erjavec Tomaž (Dept. for Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, tomaz.erjavecg@ijs.si) Zavrel Jakub (CNTS / Language Technology Group, University of Antwerp, Universiteitsplein 1, 2610 Wilrijk, Belgium, zavrel@uia.ua.ac.be) |
||
| The paper evaluates tagging techniques on a corpus of Slovene, where we are faced with a large number of possible word-class tags and only a small (hand-tagged) dataset. We report on training and testing of four different taggers on the Slovene MULTEXT-East corpus containing about 100,000 words and 1000 different morphosyntactic tags. Results show, first of all, that training times of the Maximum Entropy Tagger and the Rule Based Tagger are unacceptably long, while they are negligible for the Memory Based Taggers and the TnT tri-gram tagger. Results on a random split show that tagging accuracy varies between 86% and 89% overall, between 92% and 95% on known words and between 54% and 55% on unknown words. Best results are obtained by TnT. The paper also investigates performance in relation to our EAGLES-based morphosyntactic tagset. Here we compare the per-feature accuracy on the full tagset, and accuracies on these features when training on a reduced tagset. Results show that PoS accuracy is quite high, while accuracy on Case is lowest. Tagset reduction helps improve accuracy, but less than might be expected. | ||
| Keywords: Evaluation, Slovene Language, Tagging | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1099-1104 | ||
| Files: 146.ps, 146.pdf | ||
|
|
||
| Using a Large Set of EAGLES-compliant Morpho-syntactic Descriptors as a Tagset for Probabilistic Tagging | ||
| Tufiş Dan
(RACAI-Romanian Academy 13, “13 Septembrie”, Ro-74311, Bucharest 5,
Romania, email:tufis@valhalla.racai.ro) |
||
| The paper presents one way of reconciling data sparseness with the requirement of high accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements and therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1000 morpho-syntactic description codes (MSDs) which, after considering some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require unrealistically large training data (hand annotated/validated). Our solution was to design a hidden reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs available. The tag differences between these variants are processed by a combiner which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags from the large tagset. We describe this processing chain and provide a detailed evaluation of the results. | ||
| Keywords: Combined Classifiers, Evaluation, Standardization, Tagset Design, Tiered Tagging | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1105-1112 | ||
| Files: 11.ps, 11.pdf | ||
|
|
||
| The Context (not only) for Humans | ||
| Hladká
Barbora (Institute of Formal and Applied Linguistics Charles
University, Malostranské náměstí 25 118 00
Prague Czech Republic, e-mail: hladka@ufal.mff.cuni.cz) |
||
| Our context considerations will be practically oriented; we will explore the specification of a context scope in Czech morphological tagging. By morphological tagging/annotation we mean the automatic/manual disambiguation of the output of morphological analysis. The Prague Dependency Treebank (PDT) serves as a source of annotated data. The main aim is to evaluate the influence of the chosen context on tagging accuracy. | ||
| Keywords: Context, Morphological Annotation vs Tagging | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1113-1116 | ||
| Files: 156.ps, 156.pdf | ||
|
|
||
| PoS Disambiguation and Partial Parsing Bidirectional Interaction | ||
| Felipe Montserrat
Marimon (Grup d’Investigació en Lingüística
Computacional Universitat de Barcelona, email: montse@gilcub.es) Porta Zamorano Jordi (Departamento de Lingüística Computacional Real Academia Española, email:porta@rae.es) |
||
| This paper presents Latch, a system for PoS disambiguation and partial parsing that has been developed for Spanish. In this system, chunks can be recognized and can be referred to like ordinary words in the disambiguation process. In this way, sentences are simplified so that the disambiguator can operate by interpreting a chunk as a word and chunk head information as a word analysis. This interaction of PoS disambiguation and partial parsing considerably reduces the effort needed for writing rules. Furthermore, the methodology we propose improves both efficiency and results. | ||
| Keywords: Chunks, Partial Parsing, PoS Disambiguation | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1117-1122 | ||
| Files: 169.ps, 169.pdf | ||
|
|
||
| Rule-based Tagging: Morphological Tagset versus Tagset of Analytical Functions | ||
| Ribarov Kiril
(Institute of Formal and Applied Linguistics Charles University
Malostranské náměstí 25 118 00 Prague Czech
Republic, e-mail: ribarov@ufal.mff.cuni.cz) |
||
| This work presents part of a more global study on the problem of parsing Czech and on the knowledge extraction capabilities of the Rule-based method. It is shown that the success of the Rule-based method for English and its failure for Czech are not only due to the small cardinality of the English tagset (as is usually claimed) but depend mainly on its structure (the “regularity” of the language information). | ||
| Keywords: Analytical Functions, Dependency Syntax, Rule-Based, Tagging, Tagset | ||
| LREC2000 Proceedings: Session WP5 - Corpus Tagging, pages 1123-1126 | ||
| Files: 199.ps, 199.pdf | ||
|
|
||
| The New Edition of the Natural Language Software Registry (an Initiative of ACL hosted at DFKI) | ||
| Declerck Thierry
(DFKI GmbH (German Research Center for Artificial Intelligence) Language
Technology Lab Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany,
email: declerck@dfki.de) Werner Jachmann Alexander (DFKI GmbH (German Research Center for Artificial Intelligence) Language Technology Lab Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, email: heinz@dfki.de) Uszkoreit Hans (DFKI GmbH (German Research Center for Artificial Intelligence) Language Technology Lab Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, email:hansug@dfki.de) |
||
| In this paper we present the new version (4th edition) of the Natural Language Software Registry (NLSR), an initiative of the Association for Computational Linguistics (ACL) hosted at DFKI in Saarbrücken. We give a brief overview of the history of this repository for Natural Language Processing (NLP) software, list some related works and go into the details of the design and the implementation of the new edition. | ||
| Keywords: Classified Listing of NLP Software | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1129-1132 | ||
| Files: 267.ps, 267.pdf | ||
|
|
||
| Open Ended Computerized Overview of Controlled Languages | ||
| Gavieiro-Villatte
Elisa (Aerospatiale Matra Airbus. Human Factors Dept., section 513
316 route de Bayonne, 31060 Toulouse, France
email:elisa.gavieiro@airbus.aeromatra.com) Spaggiari Laurent (FORELL Lab., University of Poitiers 95 avenue Recteur Pineau, 86022 Poitiers Cedex, France email:laurent.spaggiari@libertysurf.fr) |
||
| Controlled languages (CLs) are of undoubted interest to industry (for safety and economic reasons, etc.), and those wishing to create a CL need to be aware of what has already been done. We have therefore built an open-ended computerized overview that gives instant access to this information. To achieve this, we took a close look at what has been written in the field of CLs and tried to get in touch with the people involved in different projects (K. Barthe, E. Johnson, K. Godden, B. Arendse, E. Adolphson, T. Hartley, etc.). | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1133-1134 | ||
| Files: 81.ps, 81.pdf | ||
|
|
||
| Automatic Transliteration and Back-transliteration by Decision Tree Learning | ||
| Kang Byung-Ju
(Department of Computer Science Advanced Information Technology Research
Center (AITrc) Korea Terminology Center for Language and Knowledge
Engineering Korea Advanced Institute of Science and Technology 373-1
Kusong-dong, Yusong-gu, Taejon, 305-701, Korea,
bjkang@world.kaist.ac.kr) Choi Key-Sun (Korea Terminology Research Center for Language and Knowledge Engineering, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong Yusong-gu Taejon 305-701 Korea, kschoi@korterm.kaist.ac.kr) |
||
| Automatic transliteration and back-transliteration across languages with drastically different alphabets and phoneme inventories, such as English/Korean, English/Japanese, English/Arabic and English/Chinese, have practical importance in machine translation, cross-lingual information retrieval, automatic bilingual dictionary compilation, etc. In this paper, a bi-directional and to some extent language-independent methodology for English/Korean transliteration and back-transliteration is described. Our method is composed of character alignment and decision tree learning. We induce transliteration rules for each letter of the English alphabet and back-transliteration rules for each letter of the Korean alphabet. To train the decision trees we need a large set of labeled examples of transliteration and back-transliteration. However, resources of this kind are generally not available. Our character alignment algorithm is capable of aligning an English word and its Korean transliteration in the desired way with high accuracy. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1135-1142 | ||
| Files: 227.ps, 227.pdf | ||
|
|
||
| The Universal XML Organizer: UXO | ||
| Milde Jan-Torsten
(Computational linguistics and text-technology group, Fakultät für
Linguistik und Literaturwissenschaft, Universität Bielefeld,
Bertelsmann Lexikonverlag, email: milde@coli.uni-bielefeld.de) Reinsch Markus (Computational linguistics and text-technology group, Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Bertelsmann Lexikonverlag, email: Markus.Reinsch@bertelsmann.de) |
||
| The integrated editor UXO is the result of ongoing research and development by the text-technology group at Bielefeld. A full-featured XML-based editing system, it also allows the structured annotated data to be combined with information imported from relational databases by integrating a JDBC interface. The mapping processes between different levels of annotation can be programmed either with the integrated Scheme interpreter or by extending the functionality of UXO using the predefined Java API. | ||
| Keywords: Database Access, Linguistic Data Acquisition, Scheme-Scripting, Transcription Tool, XML-Editor | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1143-1146 | ||
| Files: 253.ps, 253.pdf | ||
|
|
||
| LT TTT - A Flexible Tokenisation Tool | ||
| Grover Claire
(Language Technology Group University of Edinburgh, 2 Buccleuch Place
Edinburgh EH8 9LW, Scotland, email: grover@cogsci.ed.ac.uk) Matheson Colin (Language Technology Group University of Edinburgh, 2 Buccleuch Place Edinburgh EH8 9LW, Scotland, email:colin@cogsci.ed.ac.uk) Mikheev Andrei (Language Technology Group University of Edinburgh, 2 Buccleuch Place Edinburgh EH8 9LW, Scotland, email:mikheev@cogsci.ed.ac.uk) Moens Marc (Language Technology Group University of Edinburgh, 2 Buccleuch Place Edinburgh EH8 9LW, Scotland, marcg@cogsci.ed.ac.uk) |
||
| We describe LT TTT, a recently developed software system which provides tools to perform text tokenisation and mark-up. The system includes ready-made components to segment text into paragraphs, sentences, words and other kinds of token but, crucially, it also allows users to tailor rule-sets to produce mark-up appropriate for particular applications. We present three case studies of our use of LT TTT: named-entity recognition (MUC-7), citation recognition and mark-up and the preparation | ||
| Keywords: Corpus Preparation, Information Extraction, Named Entity Recognition, Tokenisation, XML Mark-Up | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1147-1154 | ||
| Files: 93.ps, 93.pdf | ||
|
|
||
| Will Very Large Corpora Play For Semantic Disambiguation The Role That Massive Computing Power Is Playing For Other AI-Hard Problems? | ||
| Cucchiarelli
Alessandro (University of Ancona email:alex@inform.unian.it) Faggioli Enrico (University of Ancona) Velardi Paola (University of Roma 'La Sapienza' email: velardi@dsi.uniroma1.it) |
||
| In this paper we formally analyze the relation between the number of (possibly noisy) examples provided to a word-sense classification algorithm and the performance of the classifier. In the first part of the paper, we show that Computational Learning Theory provides a suitable theoretical framework to establish such a relation. In the second part of the paper, we apply our theoretical results to the case of a semantic disambiguation algorithm based on syntactic similarity. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1155-1162 | ||
| Files: 76.ps, 76.pdf | ||
|
|
||
| Interarbora and Thistle - Delivering Linguistic Structure by the Internet | ||
| Calder Jo
(University of Edinburgh, Division of Informatics, Language Technology
Group) |
||
| I describe an Internet service, “Interarbora”, which facilitates the visualization of tree structures. The service is built on top of a general purpose editor, “Thistle”, which allows the editing of diagrams and the generation of print format representations. | ||
| Keywords: Structure Driven Processing, Visualization | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1163-1166 | ||
| Files: 319.ps, 319.pdf | ||
|
|
||
| A Proposal for the Integration of NLP Tools using SGML-Tagged Documents | ||
| Artola X. (Dept.
of Computer Languages and Systems, University of the Basque Country, 649
P. K., E-20080 Donostia, Basque Country) de Ilarraza A. Díaz (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country)) Ezeiza N. (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country)) Gojenola K. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Maritxalar A. (Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K., E-20080 Donostia, Basque Country) Soroa A. (Faculty of Computer Science University of the Basque Country (UPV/EHU) 649 p.k., 20080 Donostia (The Basque Country)) |
||
| In this paper we present the strategy used for integrating, in a common framework, the NLP tools developed for Basque during the last ten years. The documents used as input and output of the different tools contain TEI-conformant feature structures (FS) coded in SGML. These FSs describe the linguistic information that is exchanged among the integrated analysis tools. The tools integrated so far are a lexical database, a tokenizer, a wide-coverage morphosyntactic analyzer, and a general purpose tagger/lemmatizer. In the future we plan to integrate a shallow syntactic parser. Due to the complexity of the information to be exchanged among the different tools, FSs are used to represent it. Feature structures are coded following the TEI’s DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The use of SGML for encoding the I/O streams flowing between programs forces us to formally describe the mark-up, and provides software to check that this mark-up holds invariantly in an annotated corpus. A library of Abstract Data Types representing the objects needed for the communication between the tools has been designed and implemented. It offers the necessary operations to get the information from an SGML document containing FSs, and to produce the corresponding output according to a well-defined FSD. | ||
| Keywords: Feature Structures, Integration of NLP Tools, SGML, TEI-Conformant Feature Structures | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1167-1174 | ||
| Files: 68.ps, 68.pdf | ||
|
|
||
| Reusability as Easy Adaptability: A Substantial Advance in NL Technology | ||
| Prodanof Irina
(Istituto di Linguistica Computazionale - CNR Via della Faggiola 32,
56100 Pisa, email: irina@ilc.pi.cnr.it) Cappelli Amedeo (Istituto di Linguistica Computazionale - CNR Via della Faggiola 32, 56100 Pisa) Moretti Lorenzo (Istituto di Linguistica Computazionale - CNR Via della Faggiola 32, 56100 Pisa) |
||
| The design and implementation of new NLP applications at low cost depends mostly upon the availability of technologies oriented to the solution of any specific problem. The success of this task, besides the use of widely agreed formats and standards, relies upon at least two families of tools: those for managing and updating, and those for projecting an “application view-point” onto the data in the repository. This approach has different realizations if applied to a dictionary, a corpus, or a grammar. Some examples, taken from European and other industrial projects, show that reusability: a) in the building of industrial prototypes consists in the easy reconfiguration of resources (dictionary and grammar), easy portability and easy recombination of tools, by means of simple APIs, as well as on different implementation platforms; b) in the building of advanced applications still consists in the same features, together with the possibility of opening different view-points on dictionaries and grammars. | ||
| Keywords: Resources, Reusability, Tools | ||
| LREC2000 Proceedings: Session WP6 - Tools in the Written Area, pages 1175-1180 | ||
| Files: 298.ps, 298.pdf | ||
|
|
||
| Grammarless Bracketing in an Aligned Bilingual Corpus | ||
| Kinoshita Jorge
(Escola Politécnica da Universidade de São Paulo - Brazil,
email: jkinoshi@pcs.usp.br) |
||
| We propose a simple grammarless procedure to extract phrasal examples from aligned parallel texts. It is based on the difference in word sequence between the two languages. | ||
| Keywords: Alignment, Bilingual Corpus | ||
| LREC2000 Proceedings: Session WO13 - Multilingual Resources and Applications, pages 1183-1186 | ||
| Files: 164.ps, 164.pdf | ||
|
|
||
| Constructing a Tagged E-J Parallel Corpus for Assisting Japanese Software Engineers in Writing English Abstracts | ||
| Narita Masumi
(Software Research Center Ricoh Co., Ltd. 1-1-17 Koishikawa, Bunkyo-ku,
Tokyo, Japan, email: narita@src.ricoh.co.jp) |
||
| This paper presents how we constructed a tagged E-J parallel corpus of sample abstracts, which is the core language resource for our English abstract writing tool, the “Abstract Helper.” This writing tool is aimed at helping Japanese software engineers be more productive in writing by providing them with good models of English abstracts. We collected 539 English abstracts from technical journals/proceedings and prepared their Japanese translations. After analyzing the rhetorical structure of these sample abstracts, we tagged each sample abstract with both an abstract type and an organizational-scheme type. We also tagged each sample sentence with a sentence role and one or more verb complementation patterns. We also show that our tagged E-J parallel corpus of sample abstracts can be effectively used for providing users with both discourse-level guidance and sentence-level assistance. Finally, we discuss the outlook for further development of the “Abstract Helper.” | ||
| Keywords: CALL, Computer-assisted Writing Tool, E-J Parallel Corpus | ||
| LREC2000 Proceedings: Session WO13 - Multilingual Resources and Applications, pages 1187-1192 | ||
| Files: 78.ps, 78.pdf | ||
|
|
||
| Multilingual Linguistic Resources: From Monolingual Lexicons to Bilingual Interrelated Lexicons | ||
| Villegas Marta
(Institut d’Estudis Catalans, mvillegas@iec.es) Bel Nuria (GILCUB (Grup Investigació Lingüística Computacional Universitat Barcelona), nuria@gilcub.es) Lenci Alessandro (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, lenci@ilc.pi.cnr.it) Calzolari Nicoletta (Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY, glottolo@ilc.pi.cnr.it) Ruimy Nilda (Istituto di Linguistica Computazionale, CNR, nilda@ilc.pi.cnr.it) Zampolli Antonio (Istituto di Linguistica Computazionale, CNR, parole@ilc.pi.cnr.it) Sadurní Teresa (Institut d’Estudis Catalans email:tsadurni@iec.es) Soler i Bou Joan (Institut d’Estudis Catalans, Carme, 47, 08001 Barcelona, SPAIN, jsoler@iec.es) |
||
| This paper describes a procedure to convert the PAROLE-SIMPLE monolingual lexicons into bilingual interrelated lexicons where each word sense of a given language is linked to the pertinent sense of the right words in one or more target lexicons. At present, SIMPLE lexicons are monolingual, although the ultimate goal of these harmonised monolingual lexicons is to build multilingual lexical resources. To achieve this goal it is necessary to automate the linking among the different senses of the different monolingual lexicons, as producing such multilingual relations by hand would be, like all tasks related to the development of linguistic resources, unaffordable in terms of human resources and time spent. The system we describe in this paper takes advantage of the SIMPLE model and the SIMPLE-based lexicons so that, in the best case, it can find fully automatically the relevant sense-to-sense correspondences for determining the translational equivalence of two words in two different languages and, in the worst case, it will be able to narrow the set of admissible links between words and relevant senses. This paper also explores to what extent semantic encoding in already existing computational lexicons such as SIMPLE can help in overcoming the problems that arise when using monolingual meaning descriptions for bilingual links, and aims to set the basis for defining a model for adding a bilingual layer to the SIMPLE model. This bilingual layer, based on a bilingual relation model, will be the basis for defining the multilingual language resource we want the PAROLE-SIMPLE lexicons to become. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO13 - Multilingual Resources and Applications, pages 1193-1200 | ||
| Files: 96.ps, 96.pdf | ||
|
|
||
| TransSearch: A Free Translation Memory on the World Wide Web | ||
| Macklovitch Elliott
(Laboratoire RALI, Université de Montréal, Montreal,
Canada, macklovi@iro.umontreal.ca) Simard Michel (Laboratoire RALI Université de Montréal, Montreal, Canada, email: simardm@iro.umontreal.ca) Langlais Philippe (Laboratoire RALI, Université de Montréal, Montreal, Canada, felipe@iro.umontreal.ca) |
||
| A translation memory is an archive of existing translations, structured in such a way as to promote translation re-use. Under this broad definition, an interactive bilingual concordancing tool like the RALI’s TransSearch system certainly qualifies as a translation memory. This paper describes the Web-based version of TransSearch, which, for the last three years, has given Internet users access to a large English-French translation database made up of Canadian parliamentary debates. Despite the fact that the RALI has done very little to publicize the availability of TransSearch on the Web, the system has been attracting a growing and impressive number of users. We present some basic data on who is using TransSearch and how, data which was collected from the system’s log file and by means of a questionnaire recently added to our Web site. We conclude with a call to the international community to help set up a network of bi-textual databases like TransSearch, which translators around the world could freely access over the Web. | ||
| Keywords: Bilingual Corpora, Bi-textual Database, Translation Memory, Translation Resource, World Wide Web | ||
| LREC2000 Proceedings: Session WO13 - Multilingual Resources and Applications, pages 1201-1208 | ||
| Files: 12.ps, 12.pdf | ||
|
|
||
| Annotating Resources for Information Extraction | ||
| Boisen Sean (BBN
Technologies 87 Fawcett Street, Cambridge MA 02138, email:
Sean.Boisen@bbn.com) Crystal Michael R. (BBN Technologies 87 Fawcett Street, Cambridge MA 02138) Schwartz Richard (BBN Technologies 87 Fawcett Street, Cambridge MA 02138) Stone Rebecca (BBN Technologies 87 Fawcett Street, Cambridge MA 02138) Weischedel Ralph (BBN Technologies 87 Fawcett Street, Cambridge MA 02138) |
||
| Trained systems for NE extraction have shown significant promise because of their robustness to errorful input and rapid adaptability. However, these learning algorithms have transferred the cost of development from skilled computational linguistic expertise to data annotation, putting a new premium on effective ways to produce high-quality annotated resources at minimal cost. The paper reflects on BBN’s four years of experience in the annotation of training data for Named Entity (NE) extraction systems, discussing useful techniques for maximizing data quality and quantity. | ||
| Keywords: Annotation, Information Extraction, Named Entity Extraction, Trained Systems | ||
| LREC2000 Proceedings: Session WO14 - Named Entity Recognition, pages 1211-1214 | ||
| Files: 263.ps, 263.pdf | ||
|
|
||
| Integrating Seed Names and ngrams for a Named Entity List and Classifier | ||
| Buchholz Sabine
(ILK / Computational Linguistics Tilburg University, P.O. Box 90153,
NL-5000 LE Tilburg, The Netherlands , email:fS.Buchholz@kub.nl,
http://ilk.kub.nl) van den Bosch Antal (ILK / Computational Linguistics Tilburg University, P.O. Box 90153, NL-5000 LE Tilburg, The Netherlands , email:vdnBoschg@kub.nl, http://ilk.kub.nl) |
||
| We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled either as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. The latter 8-grams are used by a memory-based machine learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%. On free text, named-entity token labeling accuracy is 71%. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO14 - Named Entity Recognition, pages 1215-1222 | ||
| Files: 141.ps, 141.pdf | ||
|
|
||
| Named Entity Recognition in Greek Texts | ||
| Demiros Iason
(Institute for Language and Speech Processing Artemidos 6 &
Epidavrou, 151 25, Athens, Greece, email: iason@ilsp.gr) Boutsis Sotiris (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25 Maroussi, Greece, sboutsis@ilsp.gr) Giouli Voula (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25, Athens, Greece, tel: +301 6875300, fax: +301 6854270, voula@ilsp.gr) Liakata Maria (Cambridge University, email: ml257@cam.ac.uk) Papageorgiou Harris (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, xaris@ilsp.gr) Piperidis Stelios (Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 151 25, Athens, Greece, tel: +301 6875300, fax: +301 6854270, spip@ilsp.gr) |
||
| In this paper, we describe work in progress on the development of a named entity recognizer for Greek. The system aims at information extraction applications where large scale text processing is needed. Speed of analysis, system robustness, and accuracy of results have been the basic guidelines for the system’s design. Our system is an automated pipeline of linguistic components for Greek text processing based on pattern matching techniques. Non-recursive regular expressions have been implemented on top of it in order to capture different types of named entities. For development and testing purposes, we collected a corpus of financial texts from several web sources and manually annotated part of it. Overall precision and recall are 86% and 81% respectively. | ||
| Keywords: Greek, Information Extraction, Named Entity Recognition | ||
| LREC2000 Proceedings: Session WO14 - Named Entity Recognition, pages 1223-1228 | ||
| Files: 173.ps, 173.pdf | ||
|
|
||
| Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation | ||
| Utsuro Takehito
(Department of Information and Computer Sciences, Toyohashi University
of Technology, Tenpaku-cho, Toyohashi, 441-8580, Japan,
utsuro@ics.tut.ac.jp) Sassano Manabu (Fujitsu Laboratories, Ltd. 4-4-1, Kamikodanaka, Nakahara-ku, Kawasaki 211-8588, Japan, email: sassano@flab.fujitsu.co.jp) |
||
| Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability to new domains as well as their robustness over time. To overcome those limitations, this paper evaluates named entity chunking and classification techniques in Japanese named entity recognition in the context of minimally supervised learning. This experimental evaluation demonstrates that the minimally supervised learning method proposed here improved the performance of the seed knowledge on named entity chunking and classification. We also investigated the correlation between the performance of minimally supervised learning and the sizes of the training resources, such as the seed set and the unlabeled training data. | ||
| Keywords: co-Training, Decision List Learning, Information Extraction, Japanese Named Entity Recognition, Minimally Supervised Approach | ||
| LREC2000 Proceedings: Session WO14 - Named Entity Recognition, pages 1229-1236 | ||
| Files: 258.ps, 258.pdf | ||
|
|
||
| English Senseval: Report and Results | ||
| Kilgarriff Adam
(ITRI, University of Brighton, Brighton, England, adam@itri.bton.ac.uk) Rosenzweig Joseph (University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA, USA, josephr@linc.cis.upenn.edu) |
||
| There are now many computer programs for automatically determining which sense a word is being used in. One would like to be able to say which were better, which worse, and also which words, or varieties of language, presented particular problems to which programs. In 1998 a first evaluation exercise, SENSEVAL, took place. The English component of the exercise is described, and results presented. | ||
| Keywords: Evaluation, SENSEVAL, Word Sense Disambiguation | ||
| LREC2000 Proceedings: Session EO3 - Evaluation and Semantics, pages 1239-1244 | ||
| Files: 8.ps, 8.pdf | ||
|
|
||
| Evaluation of a Generic Lexical Semantic Resource in Information Extraction | ||
| Yue Chai Joyce
(IBM T. J. Watson Research Center 30 Saw Mill River Rd, Hawthorne, NY
10532, USA, email:jchai@us.ibm.com) |
||
| We have created an information extraction system that allows users to train the system on a domain of interest. The system helps to maximize the effect of user training by applying WordNet to rule generation and validation. The results show that, with careful control, WordNet is helpful in generating useful rules to cover more instances and hence improve the overall performance. This is particularly true when the training set is small, where F-measure is increased from 65% to 72%. However, the impact of WordNet diminishes as the size of training data increases. This paper describes our experience in applying WordNet to this system and gives an evaluation of such an effort. | ||
| Keywords: Evaluation, Information Extraction, WordNet | ||
| LREC2000 Proceedings: Session EO3 - Evaluation and Semantics, pages 1245-1250 | ||
| Files: 259.ps, 259.pdf | ||
|
|
||
| Sublanguage Dependent Evaluation: Toward Predicting NLP performances | ||
| Illouz Gabriel
(LIMSI, CNRS / Université Paris Sud, Orsay, France,
illouz@limsi.fr) |
||
| In Natural Language Processing (NLP) evaluations such as MUC (Hirschman, 98), TREC (Harman, 98), GRACE (Adda et al, 97) and SENSEVAL (Kilgarriff, 98), the performance results provided are often averages computed over the complete test set. This gives no clue about the systems' robustness. Knowing which system performs better on average does not help us find which is best for a given subset of a language. In the present article, the existing approaches which take language heterogeneity into account and offer methods to identify sublanguages are presented. We then propose a new metric to assess robustness and study the effect of different sublanguages identified in the Penn Treebank corpus on the performance variations observed for POS tagging. The work we present here is a first step in the development of predictive evaluation methods, intended to provide new tools that help determine in advance the range of performance that can be expected from a system on a given dataset. | ||
| Keywords: Evaluation (predictive), Performance Variations, POS Tagging, Sublanguages, Textual Typology | ||
| LREC2000 Proceedings: Session EO3 - Evaluation and Semantics, pages 1251-1254 | ||
| Files: 252.ps, 252.pdf | ||
|
|
||
| Evaluation of Word Alignment Systems | ||
| Ahrenberg Lars
(Department of Computer and Information Science Linköping
University, Institute of Technology, S-581 83 Linköping, Sweden,
email:lah@ida.liu.se) Merkel Magnus (Department of Computer and Information Science Linköping University, Institute of Technology, S-581 83 Linköping, Sweden, email: magme@ida.liu.se) Sågvall Hein Anna (Department of Linguistics Uppsala University, Box 527, S-751 20 Uppsala, Sweden, email: anna@ling.uu.se) Tiedemann Jörg (Department of Linguistics Uppsala University, Box 527, S-751 20 Uppsala, Sweden, email: joerg@stp.ling.uu.se) |
||
| Recent years have seen a few serious attempts to develop methods and measures for the evaluation of word alignment systems, notably the Blinker project (Melamed, 1998) and the ARCADE project (Véronis and Langlais, forthcoming). In this paper we discuss different approaches to the problem and report on results from a project where two word alignment systems have been evaluated. These results include methods and tools for the generation of reference data and a set of measures for system performance. We note that the selection and sampling of reference data can have a great impact on scoring results. | ||
| Keywords: Automatic Evaluation, Bilingual Lexicon Extraction, Evaluation Metrics, Gold Standard, Parallel Text, Word Alignment | ||
| LREC2000 Proceedings: Session EO3 - Evaluation and Semantics, pages 1255-1262 | ||
| Files: 137.ps, 137.pdf | ||
|
|
||
| Language Resources Development at the Spanish Royal Academy | ||
| Municio Ángel
Martín (Real Academia Española Felipe IV 4, 28014
Madrid, Spain, email: amunicio@rae.es) Rojo Guillermo (Dept. of Spanish Language, University of Santiago de Compostela, Burgo das Nacións, s/n., E-15771 Santiago de Compostela, Spain, fegrojo@usc.es) Sánchez León Fernando (Real Academia Española Felipe IV 4, 28014 Madrid, Spain, email: fsanchez@rae.es) Pinillos Octavio (Real Academia Española Felipe IV 4, 28014 Madrid, Spain, email: pinillos@rae.es) |
||
| This paper explains some of the most relevant issues concerning the development of language resources at the Spanish Royal Academy. Two 125-million-word corpora of Spanish (one synchronic, one diachronic) and three specialized corpora have been developed. Around these corpora, RAE is also developing NLP tools and resources to annotate them morpho-syntactically. Some of the most relevant are the Computational Lexicon, the Morphological analysis tools, the Disambiguation grammars and the Tokenizer generator. The last section describes the lexicographic use of corpus materials and includes a brief description of the Corpus-based lexicographical workbench and its related tools. | ||
| Keywords: Corpus, Grammars, Lexicography, Lexicon, Morphological Analysis, NLP Tools, Spanish, Spoken Corpus | ||
| LREC2000 Proceedings: Session WO15 - Language Resources Projects, pages 1265-1270 | ||
| Files: 297.ps, 297.pdf | ||
|
|
||
| A Self-Expanding Corpus Based on Newspapers on the Web | ||
| Hofland Knut
(HIT Centre, University of Bergen Allegt. 27, N-5007 Bergen, Norway,
email:Knut.Hofland@hit.uib.no) |
||
| A Unix-based system is presented which automatically collects newspaper articles from the web, converts the texts, and includes these texts in a newspaper corpus. This corpus can be searched from a web browser. The corpus currently contains 70 million words and grows by 4 million words each month. | ||
| Keywords: Batch Download, Corpus, Newspapers, Web, Web-Based Concordance | ||
| LREC2000 Proceedings: Session WO15 - Language Resources Projects, pages 1271-1272 | ||
| Files: 362.ps, 362.pdf | ||
|
|
||
| For a Repository of NLP Tools | ||
| Chaudiron Stéphane
(Ministère de la Recherche & Université de Paris 10 -
CRIS 200, avenue de la République 92001 Nanterre cedex, France,
email: stephane.chaudiron@u-paris10.fr) Choukri Khalid (European Language Resources Association (ELRA) & European Language Resources Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris, France, choukri@elda.fr) Mance Audrey (European Language Resources Association (ELRA) & European Language Resources Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris, France, mance@elda.fr) Mapelli Valérie (European Language Resources Association (ELRA) & European Language Resources Distribution Agency (ELDA), 55-57, rue Brillat-Savarin, 75013 Paris, France, mapelli@elda.fr) |
||
| In this paper, we assume that classifying the NLP supply according to its different uses provides a general and efficient framework for understanding the existing technological and industrial offer from a user-oriented perspective. The main feature of this approach is to analyse how a specific technical product is really used by its users, and not only to highlight how the developers expect the product to be used. To achieve this goal with NLP products, we first need a clear and quasi-exhaustive picture of the technical and industrial supply. During the 1998-1999 period, the European Language Resources Association (ELRA) conducted a study funded by the French Ministry of Research and Higher Education to produce a directory of language engineering tools and resources for French. In this paper, we present the main results of the study. The first part gives some information on the methodology adopted to conduct the study, the second part presents the main characteristics of the classification, and the third part gives an overview of the applications which have been identified. | ||
| Keywords: Assessment and Evaluation, French Market, Maturity, NLP Applications, NLP Offer, Surveys, Tool Directory, Tool Typology, Usage | ||
| LREC2000 Proceedings: Session WO15 - Language Resources Projects, pages 1273-1278 | ||
| Files: 316.ps, 316.pdf | ||
|
|
||
| Coreference Annotation: Whither? | ||
| Kibble Rodger
(Information Technology Research Institute, University of Brighton,
Lewes Rd, Brighton, UK, rags@itri.brighton.ac.uk,
http://www.itri.brighton.ac.uk/projects/rags) van Deemter Kees (Information Technology Research Institute, University of Brighton, Brighton BN2 4CJ, U.K., email: Kees.van.Deemter@itri.brighton.ac.uk) |
||
| The terms coreference and anaphora tend to be used inconsistently and interchangeably in much empirically-oriented work in NLP, and this threatens to lead to incoherent analyses of texts and arbitrary loss of information. This paper discusses the role of coreference annotation in Information Extraction, focussing on the coreference scheme defined for the MUC-7 evaluation exercise. We point out deficiencies in that scheme and make some suggestions towards a new annotation philosophy. | ||
| Keywords: Coreference, Information Extraction | ||
| LREC2000 Proceedings: Session WO16 - Corpus Annotation and Information Extraction, pages 1281-1286 | ||
| Files: 100.ps, 100.pdf | ||
|
|
||
| Annotating Events and Temporal Information in Newswire Texts | ||
| Setzer Andrea
(Department of Computer Science University of Sheffield Regent Court 211
Portobello Street Sheffield S1 4DP, U.K., email:
A.Setzer@dcs.shef.ac.uk) Gaizauskas Robert (Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK, R.Gaizauskas@dcs.shef.ac.uk) |
||
| If one is concerned with natural language processing applications such as information extraction (IE), which typically involve extracting information about temporally situated scenarios, the ability to accurately position key events in time is of great importance. To date only minimal work has been done in the IE community concerning the extraction of temporal information from text, and the importance of the task, together with its difficulty, suggests that a concerted effort be made to analyse how temporal information is actually conveyed in real texts. To this end we have devised an annotation scheme for annotating those features and relations in texts which enable us to determine the relative order and, if possible, the absolute time of the events reported in them. Such a scheme could be used to construct an annotated corpus which would yield the benefits normally associated with the construction of such resources: a better understanding of the phenomena of concern, and a resource for the training and evaluation of adaptive algorithms to automatically identify features and relations of interest. We also describe a framework for evaluating the annotation and compute precision and recall for different responses. | ||
| Keywords: Discourse Annotation, Events, Information Extraction, Temporal Information | ||
| LREC2000 Proceedings: Session WO16 - Corpus Annotation and Information Extraction, pages 1287-1294 | ||
| Files: 321.ps, 321.pdf | ||
|
|
||
| A Semi-automatic System for Conceptual Annotation, its Application to Resource Construction and Evaluation | ||
| Black W.J.
(Department of Language Engineering, UMIST, Manchester, UK,
email:bill,jock@ccl.umist.ac.uk) McNaught John (Centre for Computational Linguistics, UMIST P.O.Box 88 Manchester, U.K., M60 1QD email: jock@ccl.umist.ac.uk) Zarri G.P. (Centre National de la Recherche Scientifique, Paris, France) Persidis A. (ASSETT–Biovista, Athens, Greece) Brasher A. (Pira International, Leatherhead, UK) Gilardoni L. (QUINARY SpA, Milano, Italy) Bertino E. (Dipartimento di Scienze dell’Informazione, Università degli Studi, Milano, Italy) Semeraro G. (Dipartimento di Informatica, Universita di Bari, Italy) Leo P. (Java Technology Center, IBM Semea Sud, Bari, Italy, http://concerto.ccl.umist.ac.uk/) |
||
| The CONCERTO project, primarily concerned with the annotation of texts for their conceptual content, combines automatic linguistic analysis with manual annotation to ensure the accuracy of fact extraction, and to encode content in a rich knowledge representation framework. The system provides annotation tools, automatic multi-level linguistic analysis modules, a partial parsing formalism with a more user-friendly language than standard regular expression languages, XML-based document management, and a powerful knowledge representation and query facility. We describe the architecture and functionality of the system, how it can be adapted for a range of resource construction tasks, and how it can be configured to compute statistics on the accuracy of its automatic analysis components. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session WO16 - Corpus Annotation and Information Extraction, pages 1295-1302 | ||
| Files: 165.ps, 165.pdf | ||
|
|
||
| Using a Formal Approach to Evaluate Grammars | ||
| Gargouri Bilel
(LARIS Laboratory - FSEG - Sfax B.P. 1081 - 3018 SFAX - TUNISIA,
Bilel.Gargouri@fsegs.rnu.tn) Jmaiel Mohamed (LARIS Laboratory - FSEG - Sfax B.P. 1081 - 3018 SFAX - TUNISIA, Mohamed.Jmaiel@enis.rnu.tn) Hamadou Abdelmajid Ben (LARIS Laboratory - FSEG - Sfax B.P. 1081 - 3018 SFAX - TUNISIA, Abdelmajid.Benhamadou@fsegs.rnu.tn) |
||
| In this paper, we present a methodological formal approach to evaluate grammars based on a unified representation. This approach uses two kinds of criteria. The first one considers a grammar as a resource enabling the representation of particular aspects of a given language. The second is interested in using grammars in the development of lingware. The evaluation criteria are defined in a formal way. In addition, we indicate for every criterion how it would be applied. | ||
| Keywords: Evaluation Criteria, Formal Approach, Grammars, Lingware Development | ||
| LREC2000 Proceedings: Session EO4 - Grammars and Systems Evaluation, pages 1305-1308 | ||
| Files: 285.ps, 285.pdf | ||
|
|
||
| Towards More Comprehensive Evaluation in Anaphora Resolution | ||
| Mitkov Ruslan
(School of Humanities, Languages and Social Studies, University of Wolverhampton, Stafford Street, Wolverhampton WV1 1SB, United Kingdom, R.Mitkov@wlv.ac.uk) |
||
| The paper presents a package of evaluation tasks for anaphora resolution. We argue that these newly added tasks, which have been carried out on Mitkov's (1998) knowledge-poor, robust approach, provide a better picture of the performance of an anaphora resolution system. The paper also outlines future work on the development of a "consistent" evaluation environment for anaphora resolution. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session EO4 - Grammars and Systems Evaluation, pages 1309-1314 | ||
| Files: 115.ps, 115.pdf | ||
|
|
||
| Coreference Resolution Evaluation Based on Descriptive Specificity | ||
| Trouilleux François
(Xerox Research Centre Europe, 6 Chemin de Maupertuis. 38240 Meylan) Gaussier Eric (Xerox Research Centre Europe, 6 Chemin de Maupertuis. 38240 Meylan) Bès Gabriel G. (Groupe de recherche dans les industries de la langue (GRIL), UFR-LACC, Université Blaise-Pascal, Clermont 2, 34, avenue Carnot. 63037 Clermont-Ferrand - France) Zaenen Annie (Xerox Research Centre Europe, 6 Chemin de Maupertuis. 38240 Meylan) |
||
| This paper introduces a new evaluation method for the coreference resolution task. Considering that coreference resolution is a matter of linking expressions to discourse referents, we set our evaluation criterion in terms of an evaluation of the denotations assigned to the expressions. This criterion requires that the coreference chains identified in one annotation stand in a one-to-one correspondence with the coreference chains in the other. To determine this correspondence, and with a view to staying closer to what a human interpretation of the coreference chains would be, we take into account the fact that, in a coreference chain, some expressions are more specific to their referent than others. With this observation in mind, we measure the similarity between the chains in one annotation and the chains in the other, and then compute the optimal similarity between the two annotations. Evaluation then consists in checking whether the denotations assigned to the expressions are correct or not. New measures to analyse errors are also introduced. A comparison with other methods is given at the end of the paper. | ||
| Keywords: Anaphora, Coreference, Descriptive Specificity, Evaluation, Text Understanding | ||
| LREC2000 Proceedings: Session EO4 - Grammars and Systems Evaluation, pages 1315-1322 | ||
| Files: 131.ps, 131.pdf | ||
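The chain-matching step described in the abstract above (a one-to-one correspondence between the chains of two annotations, chosen to maximise total similarity) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, plain Jaccard overlap stands in for the paper's specificity-based similarity (which the abstract does not define), and the exhaustive search is only practical for a handful of chains.

```python
from itertools import permutations


def jaccard(a, b):
    """Overlap between two chains; a stand-in for the paper's similarity measure."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_alignment(key_chains, response_chains):
    """Return (score, pairs) for the one-to-one chain mapping with highest total similarity."""
    # Pad the shorter side with empty chains so both sides have equal length.
    n = max(len(key_chains), len(response_chains))
    keys = list(key_chains) + [set()] * (n - len(key_chains))
    resp = list(response_chains) + [set()] * (n - len(response_chains))

    best_score, best_pairs = -1.0, []
    # Exhaustive search over all one-to-one mappings (factorial cost).
    for perm in permutations(range(n)):
        score = sum(jaccard(keys[i], resp[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = list(enumerate(perm))
    return best_score, best_pairs


key = [{"m1", "m2", "m3"}, {"m4", "m5"}]
response = [{"m4", "m5"}, {"m1", "m2"}]
score, pairs = best_alignment(key, response)
print(pairs)  # key chain 0 pairs with response chain 1, key chain 1 with response chain 0
```

For larger numbers of chains, an assignment algorithm such as the Hungarian method would replace the exhaustive loop.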
|
|
||
| Methods and Metrics for the Evaluation of Dictation Systems: a Case Study | ||
| Canelli Maria
(TIM/ISSCO, ETI, University of Geneva, 40 blvd du Pont d’Arve, CH-1211
Geneva 4) Grasso Daniele (TIM/ISSCO, ETI, University of Geneva, 40 blvd du Pont d’Arve, CH-1211 Geneva 4) King Margaret (TIM/ISSCO, ETI, University of Geneva, 40 blvd du Pont d’Arve, CH-1211 Geneva 4, Margaret.King@issco.unige.ch) |
||
| This paper describes the practical evaluation of two commercial dictation systems in order to assess the potential usefulness of such technology in the specific context of a translation service translating legal text into Italian. The service suffers at times from heavy workload, lengthy documents and short deadlines. Use of dictation systems accepting continuous speech might improve productivity at these times. Design and execution of the evaluation followed the methodology worked out by the EAGLES Evaluation Working Group. The evaluation therefore also constitutes a test bed application of this methodology. | ||
| Keywords: Dictation Systems, EAGLES, Evaluation, ISO | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1325-1332 | ||
| Files: 56.ps, 56.pdf | ||
|
|
||
| Design Issues in Text-Independent Speaker Recognition Evaluation | ||
| Martin Alvin
(National Institute of Standards and Technology, 100 Bureau Drive, Stop
8940, Gaithersburg, MD, 20899, USA, alvin.martin@nist.gov) Przybocki Mark (National Institute of Standards and Technology, 100 Bureau Drive, Stop 8940, Gaithersburg, MD, 20899, USA, mark.przybocki@nist.gov) |
||
| We discuss various considerations that have been involved in designing the past five annual NIST speaker recognition evaluations. These text-independent evaluations using conversational telephone speech have attracted state-of-the-art automatic systems from research sites around the world. The availability of appropriate data for sufficiently large test sets has been one key design consideration. There have also been variations in the specific task definitions, the amount and type of training data provided, and the durations of the test segments. The microphone types of the handsets used, as well as the match or mismatch of training and test handsets, have been found to be important considerations that greatly affect system performance. | ||
| Keywords: Evaluation, NIST, Speaker Recognition, Text-independence | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1333-1336 | ||
| Files: 286.ps, 286.pdf | ||
|
|
||
| Perceptual Evaluation of a New Subband Low Bit Rate Speech Compression System based on Waveform Vector Quantization and SVD Postfiltering | ||
| Fotinea
Stavroula-Evita (Institute for Language and Speech Processing,
Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, evita@ilsp.gr) Dologlou Ioannis (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, ydol@ilsp.gr) Bakamidis Stylianos (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece) Stainhaouer Gregory (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, stein@ilsp.gr) Carayannis George (Institute for Language and Speech Processing, Epidavrou & Artemidos 6, 151 25 Maroussi, Greece, gcara@ilsp.gr) |
||
| This paper proposes a new low rate speech coding algorithm based on a subband approach. First, a frame of the incoming signal is fed to a low pass filter, yielding the low frequency (LF) part. Subtracting the latter from the incoming signal gives the high frequency (HF), non-smoothed part. The HF part is modeled using waveform vector quantisation (VQ), while the LF part is modeled using a spectral estimation method called CSE, which is based on a Hankel matrix, its shift invariant property and SVD. At the receiver side, adaptive postfiltering based on SVD is performed for the HF part and a simple resynthesis for the LF part, before the two components are added to produce the reconstructed signal. Progressive speech compression (variable degree of analysis/synthesis at transmitter/receiver) is thus possible, resulting in a variable bit rate scheme. The new method is compared to the CELP algorithm at 4800 bps and is shown to be of similar quality in terms of intelligibility and segmental SNR. Moreover, perceptual evaluation tests of the new method were conducted for different bit rates up to 1200 bps, and the majority of the evaluators indicated that the technique provides intelligible reconstruction. | ||
| Keywords: Low Bit Rate Speech Compression, Perceptual Evaluation, Subband Approach, SVD | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1337-1342 | ||
| Files: 16.ps, 16.pdf | ||
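The first step of the scheme in the abstract above (low-pass filter a frame to obtain the LF part, then subtract it from the frame to obtain the HF part) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which low-pass filter is used, so a centred moving average stands in for it, and the function names are hypothetical. The vector quantisation, CSE modelling and SVD postfiltering stages are not shown.

```python
def moving_average(signal, width):
    """Centred moving-average low-pass filter; the window is clipped at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_subbands(frame, width=5):
    """Split a frame into a smoothed LF part and the HF residual (frame minus LF)."""
    lf = moving_average(frame, width)
    hf = [x - low for x, low in zip(frame, lf)]
    return lf, hf


def reconstruct(lf, hf):
    """Add the two components back together, as done at the receiver side."""
    return [low + high for low, high in zip(lf, hf)]


frame = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, 0.0]
lf, hf = split_subbands(frame, width=3)
restored = reconstruct(lf, hf)
```

Because the HF part is defined as the residual, adding the two uncoded components restores the frame up to floating-point rounding; in the actual codec any loss would come from how each part is modeled and quantised.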
|
|
||
| IPA Japanese Dictation Free Software Project | ||
| Shikano Kiyohiro
(NAIST, 1-6-10 Takayama, Ikoma, Nara, 630-0101, Japan,
shikano@is.aist-nara.ac.jp) Kawahara Tatsuya (Kyoto University, Japan, kawahara@kuis.kyoto-u.ac.jp) Takeda Kasuya (Nagoya University, Japan, takeda@nuee.nagoya-u.ac.jp) Yamada Atsushi (ASTEM, Japan) Itou Akinori (Yamagata University, Japan) Itou Katsunobu (ETL, 1-1-4 Umezono, Tsukuba, Ibaraki, 305-8568, Japan, kito@etl.go.jp) Utsuro Takehito (Department of Information and Computer Sciences, Toyohashi University of Technology, Tenpaku-cho, Toyohashi, 441-8580, Japan, utsuro@ics.tut.ac.jp) Kobayashi Tetsunori (Waseda University, Japan) Minematsu Nobuaki (Toyohashi University, Japan) Yamamoto Mikio (Tsukuba University, Japan) Sagayama Shigeki (JAIST, Japan) Lee Akinobu (Kyoto University, Japan) |
||
| Large vocabulary continuous speech recognition (LVCSR) is an important basis for the application development of speech recognition technology. We have constructed a common Japanese LVCSR speech database and have been developing sharable Japanese LVCSR programs and models through volunteer-based efforts. We have been engaged in the following two volunteer-based activities: a) the IPSJ (Information Processing Society of Japan) LVCSR speech database working group, and b) the IPA (Information Technology Promotion Agency) Japanese dictation free software project. The IPA Japanese dictation free software project (April 1997 to March 2000) aims at building free Japanese LVCSR software and models based on the IPSJ LVCSR speech database (JNAS) and the Mainichi newspaper article text corpus. The software repository produced by the IPA project is available to the public. More than 500 CD-ROMs have been distributed. The performance evaluation was carried out for the simple version, the fast version, and the accurate version in February 2000. The evaluation uses 200 sentence utterances from 46 speakers. Gender-independent HMM models and 20k/60k language models are used for evaluation. The accurate version, with 2000 HMM states and 16 Gaussian mixtures, shows a word correct rate of 95.9%. The fast version, with the phonetic tied mixture HMM and the 1/10 reduced language model, shows a word correct rate of 92.2% at realtime speed. The CD-ROM with the IPA Japanese dictation free software and its development workbench will be distributed upon registration at http://www.lang.astem.or.jp/dictation-tk/ or by sending e-mail to dictation-tk-request@astem.or.jp. | ||
| Keywords: Dictation Free Software, Japanese Dictation, LVCSR, LVCSR Workbench | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1343-1350 | ||
| Files: 261.ps, 261.pdf | ||
|
|
||
| The COST 249 SpeechDat Multilingual Reference Recogniser | ||
| Johansen Finn Tore
(Telenor R&D, Kjeller, Norway) Warakagoda Narada (Telenor R&D, Kjeller, Norway) Lindberg Børge (Center for PersonKommunikation (CPK), Aalborg, Denmark) Lehtinen Gunnar (Swiss Federal Institute of Technology (ETH), Zurich, Switzerland) Kačič Zdravko (Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, kacic@uni-mb.si) Žgank Andreh (University of Maribor, Slovenia) Elenius Kjell (University of Maribor, Slovenia) Salvi Gampiero (Kungliga Tekniska Högskolan (KTH), Stockholm, Sweden) |
||
| The COST 249 SpeechDat reference recogniser is a fully automatic, language-independent training procedure for building a phonetic recogniser. It relies on the HTK toolkit and a SpeechDat(II) compatible database. The recogniser is designed to serve as a reference system in multilingual recognition research. This paper documents version 0.95 of the reference recogniser and presents results on small and medium vocabulary recognition for five languages. | ||
| Keywords: Multi-Linguality, Reference Systems, Telephony Speech Databases | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1351-1356 | ||
| Files: 274.ps, 274.pdf | ||
|
|
||
| Automotive Speech-Recognition - Success Conditions Beyond Recognition Rates | ||
| Bengler Klaus
(BMW AG, Knorrstr. 147, 80788 Munich, Germany, klaus.bengler@bmw.de) |
||
| From a car manufacturer’s point of view it is very important to integrate evaluation procedures into the MMI development process. When the usability of speech-input and speech-output systems is evaluated, conditions beyond recognition rates must be fulfilled. Two of these conditions will be discussed based upon user studies conducted in 1999: (1) mental workload and distraction, and (2) learnability. | ||
| Keywords: | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1357-1360 | ||
| Files: 312.ps, 312.pdf | ||
|
|
||
| Evaluating Multi-party Multi-modal Systems | ||
| Damianos Laurie E.
(The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA,
laurie@mitre.org) Drury Jill (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, jldrury@mitre.org) Fanderclai Tari (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, tari@nwe.ufl.edu) Hirschman Lynette (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, lynette@mitre.org) Oshika Beatrice (The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA, bea@mitre.org) |
||
| The MITRE Corporation’s Evaluation Working Group has developed a methodology for evaluating multi-modal groupware systems and capturing data on human-human interactions. The methodology consists of a framework for describing collaborative systems, a scenario-based evaluation approach, and evaluation metrics for the various components of collaborative systems. We designed and ran two sets of experiments to validate the methodology by evaluating collaborative systems. In one experiment, we compared two configurations of a multi-modal collaborative application using a map navigation scenario requiring information sharing and decision making. In the second experiment, we applied the evaluation methodology to a loosely integrated set of collaborative tools, again using a scenario-based approach. In both experiments, multi-modal, multi-user data were collected, visualized, annotated, and analyzed. | ||
| Keywords: Scenario, Collaborative System, Data Visualization, Evaluation Methodology, Human-Computer Interaction, Metrics, Usability | ||
| LREC2000 Proceedings: Session SO6 - Recognition, pages 1361-1368 | ||
| Files: 368.ps, 368.pdf | ||
|
|
||
| What's in a Thesaurus? | ||
| Kilgarriff Adam
(ITRI, University of Brighton, Brighton, England, adam@itri.bton.ac.uk) Yallop Colin (Macquarie University, Sydney, cyallop@ling.mq.edu.au) |
||
| We first describe four varieties of thesaurus: (1) Roget-style, produced to help people find synonyms when they are writing; (2) WordNet and EuroWordNet; (3) thesauruses produced (manually) to support information retrieval systems; and (4) thesauruses produced automatically from corpora. We then contrast thesauruses and dictionaries, and present a small experiment in which we look at polysemy in relation to thesaurus structure. It has sometimes been assumed that different dictionary senses for a word that are close in meaning will be near neighbours in the thesaurus. This hypothesis is explored, using as inputs the hierarchical structure of WordNet 1.5 and a mapping between WordNet senses and the senses of another dictionary. The experiment shows that pairs of ‘lexicographically close’ meanings are frequently found in different parts of the hierarchy. | ||
| Keywords: Polysemy, Semantic Similarity, Thesaurus | ||
| LREC2000 Proceedings: Session WO17 - Semantic Lexicons, pages 1371-1378 | ||
| Files: 180.ps, 180.pdf | ||
|
|
||
| SIMPLE: A General Framework for the Development of Multilingual Lexicons | ||
| Bel Nuria
(GILCUB (Grup Investigació Lingüística Computacional
Universitat Barcelona), nuria@gilcub.es) Busa Federica (Istituto di Linguistica Computazionale, Pisa, Brandeis University) Calzolari Nicoletta (Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. Cataldo, Ghezzano 56010 (PI) – ITALY, glottolo@ilc.pi.cnr.it) Gola Elisabetta (Istituto di Linguistica Computazionale, Pisa) Lenci Alessandro (Istituto di Linguistica Computazionale – CNR, Via Alfieri 1 - Pisa 56010 - ITALY, lenci@ilc.pi.cnr.it) Monachini Monica (Istituto di Linguistica Computazionale, Pisa) Ogonowski Antoine (LexiQuest) Peters Ivonne (Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK, ivonne@dcs.shef.ac.uk) Peters Wim (Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK, w.peters@dcs.shef.ac.uk) Ruimy Nilda (Istituto di Linguistica Computazionale, CNR, nilda@ilc.pi.cnr.it) Villegas Marta (Institut d’Estudis Catalans, mvillegas@iec.es) Zampolli Antonio (Istituto di Linguistica Computazionale, CNR, parole@ilc.pi.cnr.it) |
||
| The LE-SIMPLE project is an innovative attempt to build harmonized syntactic-semantic lexicons for 12 European languages, aimed at use in different Human Language Technology applications. SIMPLE provides a general design model for the encoding of a large amount of semantic information, ranging from ontological typing to argument structure and terminology. SIMPLE thus provides a general framework for resource development, where state-of-the-art results in lexical semantics are coupled with the needs of Language Engineering applications accessing semantic information. | ||
| Keywords: Computational Lexicons, Lexical Semantics, Resources, Syntax-Semantics Linking | ||
| LREC2000 Proceedings: Session WO17 - Semantic Lexicons, pages 1379-1384 | ||
| Files: 61.ps, 61.pdf | ||
|
|
||
| The Treatment of Adjectives in SIMPLE: Theoretical Observations | ||
| Peters Ivonne
(Department of Computer Science, University of Sheffield, Regent Court,
211 Portobello Street, Sheffield S1 4DP, UK, ivonne@dcs.shef.ac.uk) | ||