ORGANIZER: Harald Baayen
Max Planck Institute for Psycholinguistics, P.O. Box 310, 6500 AH, Nijmegen, The Netherlands.
PAPERS:
1. Experimental design: Syntactic annotation as words, by Hans van Halteren
2. Comparison of word-based and syntax-based methods: Vocabulary richness measures and the highest frequency elements, by Fiona Tweedie
3. The discriminatory potential of the lowest frequency rewrite rules, by Harald Baayen
KEYWORDS: authorship attribution, syntactic annotation, principal
components analysis, hapax legomena, function words
AFFILIATION: Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands.
E-MAIL: baayen@mpi.nl
FAX NUMBER: +31-24-3521213
PHONE NUMBER: +31-24-3521323
Abstract
This session describes an experiment in authorship
attribution in which statistical measures and methods that have been widely
applied to words and their frequencies of use are applied to rewrite rules as
they appear in a syntactically annotated corpus. The outcome of this experiment
suggests that the frequencies with which syntactic rewrite rules are put to use
provide at least as good a cue to authorship as word usage. Moreover, one
method, which focuses on the use of the lowest-frequency syntactic rules, has a
higher resolution than traditional word-based analyses, and promises to be a
useful new technique for authorship attribution.
Introduction
A number of recent contributions to authorship
attribution are based on words and their frequencies of occurrence (see, e.g.,
Burrows 1992, 1993; Holmes 1994; Holmes and Forsyth 1995). This comes as no
surprise, as the statistical analysis of word frequencies requires minimal
textual preprocessing. Nevertheless, precisely those words which have proved to
have a high discriminatory resolution in the seminal work by Burrows (1992,
1993), the so-called function words (a, the, that, and, but, etc.), appear
to tap into the use of syntax. This suggests it might be profitable to study the
use of syntax directly by analyzing the use of rewrite rules in texts.
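
To make the idea of treating rewrite rules like words concrete, the following minimal sketch reads one bracketed parse and emits each rewrite rule as a single string that can then be counted like a word. The bracketed input format, the category labels, and the function names `parse_tree` and `rewrite_rules` are illustrative assumptions; they are not the annotation scheme or the software used in the experiment.

```python
from collections import Counter


def parse_tree(bracketed):
    """Parse a bracketed string into nested (label, children) tuples.

    Leaves (the words themselves) are kept as plain strings.
    The bracketed format is an assumption made for this sketch.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def rewrite_rules(tree):
    """Yield one 'LHS -> RHS' string per node that has category children."""
    label, children = tree
    child_nodes = [c for c in children if isinstance(c, tuple)]
    if child_nodes:
        yield label + " -> " + " ".join(c[0] for c in child_nodes)
        for child in child_nodes:
            yield from rewrite_rules(child)


# Hypothetical example sentence; each rule string is counted like a word.
tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
rule_counts = Counter(rewrite_rules(tree))
print(rule_counts)
```

Once every rule is a single string, any frequency-based technique designed for words can be applied to the rule strings without modification.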
We have designed a statistical experiment using syntactically annotated corpus material to investigate the discriminatory potential of syntactic rewrite rules for authorship attribution. The corpus, its syntactic annotation, and the details of the design of our statistical experiment are discussed in section 1 by van Halteren. In section 2, Tweedie discusses the accuracy of methods based on measures of vocabulary richness and of methods based on the highest-frequency elements, applied to both words and rewrite rules. In section 3, Baayen investigates the discriminatory potential of the way in which authors make use of the lowest-frequency rewrite rules.
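
The three kinds of evidence named above (vocabulary richness, the highest-frequency elements, and the lowest-frequency elements) can all be read off a single frequency count. The sketch below is a simplified stand-in, not the measures actually used in sections 2 and 3, which this overview does not specify: it uses the type-token ratio as a crude richness measure and the hapax legomena (items occurring exactly once) as the lowest-frequency elements. The function name `frequency_profile` and the parameter `n_top` are hypothetical.

```python
from collections import Counter


def frequency_profile(tokens, n_top=3):
    """Summarise a text fragment given as a list of tokens.

    The tokens may be words or rewrite-rule strings; the computation
    is the same in both cases. All measures here are simple
    illustrations chosen for this sketch.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        "tokens": total,
        "types": len(counts),
        # Crude vocabulary richness: distinct items per running item.
        "type_token_ratio": len(counts) / total,
        # Relative frequencies of the highest-frequency elements.
        "top": [(item, c / total) for item, c in counts.most_common(n_top)],
        # Lowest-frequency elements: items that occur exactly once.
        "hapaxes": sorted(item for item, c in counts.items() if c == 1),
    }


# Hypothetical word tokens; rule strings could be passed in the same way.
words = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]
print(frequency_profile(words, n_top=1))
```

Profiles of this kind, computed per text fragment, are the sort of input a multivariate method such as principal components analysis would then compare across fragments.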
Before going into further detail, we need to make explicit three crucial aspects of our methodology. First, traditionally, as in the study by Mosteller and Wallace (1964), a text of unknown authorship is compared with texts whose authorship is beyond doubt. In our experiment, the authorship of all texts is known (albeit only to the experiment leader, van Halteren, and not to Tweedie and Baayen, who carried out the analyses). This allows us to evaluate the accuracy of the methods we have used in a straightforward way. Second, a pilot study showed that texts written by one author in different genres can differ more than texts written by different authors in the same genre. We have therefore selected our texts from one particular text type, crime fiction. Third, to ensure that the accuracy of assignment does not depend on our particular split into labeled and unlabeled text fragments, we also required that a successful method should group all text fragments by author into clearly distinguishable clusters.
References
Burrows, J. F. (1992). Computers and the Study of Literature. In: C. S. Butler (Ed.), Computers and Written Texts. Oxford: Blackwell, pp. 167-204.
Burrows, J. F. (1993). Tiptoeing into the infinite: testing for evidence of national differences in the language of English narrative. In: S. Hockey and N. Ide (Eds.), Research in Humanities Computing '92. London: Oxford University Press.
Holmes, D. I. (1994). Authorship Attribution. Computers and the Humanities 28(2):87-106.
Holmes, D. I. and Forsyth, R. S. (1995). The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing 10(2):111-127.
Mosteller, F. and Wallace, D. L. (1964). Applied Bayesian and Classical Inference. The case of the Federalist Papers. New York: Springer.