Mining Syntactically Annotated Corpora

Dr Gosse Bouma
Information Science
University of Groningen, The Netherlands

(A joint HAIL/SALS-SIG Seminar)

*** NOTE: Additional Seminar in off-week ***

Tuesday 3rd April 2007 at 11am

 

Abstract

With a robust and accurate wide-coverage parser, large syntactically annotated corpora can be constructed easily. In this talk, we will review a number of application areas where such corpora for Dutch have proved useful: studying the distribution of certain syntactic constructions (e.g. word order in indirect object constructions, the distribution of focus particles inside PPs, and (alleged) extraction of PPs from NPs), acquiring lexical and ontological information (ranging from support verb constructions to definition sentences), and relation extraction and question answering.

An issue in all applications is the development of tools for searching, extracting, and combining information from treebanks stored in XML. Recently, we have started to use XQuery, a generic XML query language based on XPath, for such tasks. An interesting feature of the language is its module system, which allows the definition of treebank-specific functions that can be used to support advanced extraction tasks.
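
As a rough illustration of the idea of treebank-specific functions layered over XPath-style queries, here is a minimal sketch in Python rather than XQuery. It is not the speaker's code: the XML layout, the attribute names (rel, cat, pos, word) and the helper names (yield_of, dependents, indirect_objects) are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified treebank format: one <node> per constituent or word.
# The attribute names (rel, cat, pos, word) are assumptions for illustration.
SAMPLE = """
<sentence>
  <node cat="smain" rel="top">
    <node rel="su" pos="noun" word="Jan"/>
    <node rel="hd" pos="verb" word="geeft"/>
    <node rel="obj2" cat="np">
      <node rel="det" pos="det" word="de"/>
      <node rel="hd" pos="noun" word="vrouw"/>
    </node>
    <node rel="obj1" cat="np">
      <node rel="det" pos="det" word="een"/>
      <node rel="hd" pos="noun" word="boek"/>
    </node>
  </node>
</sentence>
"""


def yield_of(node):
    """Return the words dominated by a node, in document order."""
    return [n.get("word") for n in node.iter("node") if n.get("word")]


def dependents(root, rel):
    """Treebank-specific helper: all nodes bearing the given relation label."""
    return root.findall(f".//node[@rel='{rel}']")


def indirect_objects(root):
    """Compose the two helpers: the word string of every indirect object."""
    return [" ".join(yield_of(n)) for n in dependents(root, "obj2")]


root = ET.fromstring(SAMPLE)
print(indirect_objects(root))
```

The point of the sketch is only the layering: small reusable functions hide the details of the annotation scheme, so that an extraction task such as finding indirect objects becomes a short composition of them. An XQuery module would play the same role for queries written directly against the XML.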

Short resume

Gosse Bouma has worked at the University of Stuttgart and the University of Groningen, where he received his PhD in 1993 (thesis title: Nonmonotonicity and Categorial Unification Grammar). Since then he has worked on theoretical linguistics (Categorial Grammar and Head-driven Phrase Structure Grammar), finite-state methods (for grapheme-to-phoneme conversion and hyphenation), corpus linguistics and, recently, information extraction and question answering.
