Computer corpora and language description
Professor Pam Peters
Tuesday 17th July 2007 at 11am
Abstract
This presentation examines ongoing challenges for automatic analysis of (i) standard general language and (ii) specialised sublanguages, where additional layers of meaning (denotative and connotative) are still crucial for sophisticated NLP systems. Computational techniques based on small, purpose-designed corpora have been used by linguists since the 1960s to quantify lexical and grammatical elements of standard English, and to support comparisons between varieties of English. Corpus frequencies can show numerous syntactic divergences between major varieties such as British and American, as reported in the Longman Grammar of Spoken and Written English (Biber et al. 1999). Interesting differences have likewise been found in quantitative studies of new varieties of English such as those of Australia, New Zealand, Singapore and the Philippines (Hundt, 2006), e.g. in syntactic variables such as the patterns of agreement for collective nouns. However, the subtle polysemy of most high-frequency words still requires discretionary analysis to separate common and distinctive senses of words, despite the availability of mutual information tools. This problem affects usage of the function words and phrases of English, e.g. "in case" in British and American English, as well as new usages found in ex-colonial Englishes, e.g. Singaporean use of "until". Conjunctions and prepositions like these define the logical relationships between the content-bearing clauses and phrases of the sentence and are the key to their interpretation. Other regional differences, e.g. the British preference for "about" versus the American preference for "around", are more cosmetic. They nevertheless serve to geolocate the text to some extent, giving it a regional tinge which may or may not matter to its writers and readers, and may or may not be reinforced by other more obvious though less frequent regionalisms of the "sidewalk"/"pavement" kind.
Computer-based techniques for profiling specialised forms of language, aka sublanguages, also go back several decades, to research by information engineers such as Bross, Shapiro and Anderson (1972) on the language of hospital surgeons. The identification of specialised terms and constructions is based on the principle that they occur with much greater frequency in technical texts than in those intended for general reading (e.g. newspapers). A corollary of this is that the corpus needed to profile the terminology of a specialism need not be as large as that needed to support research on the lexis of the standard language (McEnery and Wilson, 1996/2001). Comparative frequency data from general and specialised corpora are effective in identifying the technical terminology of a discipline such as anatomy (Chung, 2003). However, the terminology of different academic disciplines is rather variable in scope, and experimental research has shown that the density and structure of terms are quite different in texts from anatomy and, say, applied linguistics (Chung and Nation 2003). A key conceptual issue is whether to include in the inventory of terms only those which are distinctive to the discipline (the traditional terminological approach), or to embrace also those terms which are special uses of everyday words, e.g. "menu" in computer science (the descriptive terminologist's approach). The latter are essential for comprehensive coverage and professional training, but again they raise problems of polysemy for automatic analysis of corpora. The presentation will demonstrate the combination of computational and discretionary techniques, involving both linguists/lexicographers and disciplinary specialists, which is currently being used at the Dictionary Research Centre to build online termbanks of specialised expressions for academic disciplines in science and social science at Macquarie University (the TermFinder project).
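As an illustration of the comparative-frequency principle described above, the following is a minimal Python sketch and not the method used by the TermFinder project or by any of the cited studies. The function names, the add-one smoothing, and the thresholds min_ratio and min_count are all assumptions chosen for the example. It flags words whose relative frequency in a specialised text is several times higher than in a general text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(specialised_text, general_text, min_ratio=3.0, min_count=2):
    """Return (word, ratio) pairs for words over-represented in the specialised text.

    The ratio compares smoothed relative frequencies in the two texts.
    min_ratio and min_count are illustrative thresholds, not published values.
    """
    spec = Counter(tokenize(specialised_text))
    gen = Counter(tokenize(general_text))
    spec_total = sum(spec.values())
    gen_total = sum(gen.values())
    vocab_size = len(set(spec) | set(gen))

    results = []
    for word, count in spec.items():
        if count < min_count:
            continue
        # Add-one smoothing keeps the ratio finite for words that never
        # occur in the general text.
        spec_rate = (count + 1) / (spec_total + vocab_size)
        gen_rate = (gen[word] + 1) / (gen_total + vocab_size)
        ratio = spec_rate / gen_rate
        if ratio >= min_ratio:
            results.append((word, ratio))
    # Most strongly over-represented words first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

A comparison of this kind only surfaces candidates. An everyday word used in a special sense, such as "menu" in computer science, may not stand out by frequency alone, which is where the discretionary analysis by linguists and disciplinary specialists comes in.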
Short resume
Pam Peters is Professor of Linguistics at Macquarie University and Director of its Dictionary Research Centre. She has led the compilation of several kinds of computer corpora at Macquarie, and authored reference books on regional English usage, including the Cambridge Guide to English Usage (2004) and the Cambridge Guide to Australian English Usage (2007).