Sentence Augmentation: A Text-to-Text Generation Component for Summarisation

Stephen Si-En Wan

Supervisors: Dr. Robert Dale, Dr. Mark Dras, Dr. Cécile Paris

This thesis was presented for the degree of
Doctor of Philosophy
at the
Centre for Language Technology
Department of Computing
Faculty of Science
Macquarie University
NSW 2109 Australia
Submitted December 2009; Completed June 2010

Abstract:

An examination of a corpus of manually authored executive summaries suggests a predominant strategy that human writers appear to adopt: key sentences, which form the core of the summary, are fleshed out, or augmented, using information from additional sentences in the document, referred to here as auxiliary sentences. In this thesis, we focus on developing methods that will enable a computational account of this strategy, which we describe as the Sentence Augmentation process.

We model sentence augmentation as a text-to-text generation process in which a novel sentence is produced as a result of re-organising content from a key sentence in conjunction with information from auxiliary sentences. As in related work on text-to-text revision, we characterise sentence augmentation as a Noisy Channel problem. 
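As an illustration only, a generic Noisy Channel formulation of this task might be written as follows. The symbols are assumptions introduced here, not the thesis's notation: K is the key sentence, A the set of auxiliary sentences, and S a candidate augmented sentence.

```latex
\hat{S} = \arg\max_{S} P(S \mid K, A)
        = \arg\max_{S} P(K, A \mid S)\, P(S)
```

In this reading, P(S) acts as a language model favouring grammatical output, while P(K, A | S) rewards candidates that account for the source content.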

In particular, we concentrate on two key facets of the process for which no suitable account yet exists in the literature: first, auxiliary content must be selected to be added into the sentences being generated; and second, the key and auxiliary content must be organised such that the result is a grammatical sentence. Our investigation of these two facets leads to the following three findings:

1. A model of content selection: Information from within auxiliary sentences can be automatically chosen to support key information using schema-like patterns, represented as a statistical model that captures the prototypical juxtaposition of words. We show that the automatically derived schema-based model predicts content selection better than baseline vector space approaches using term frequency weights.

2. A representation for content re-organisation: Summary sentences generated using representations of dependency structures better reflect the content of the input text and are more grammatical, compared to sentences generated using just representations of the Markov context. Dependency structure thus provides a suitable representation for re-organising selected content in language modelling tasks.

3. An account of grammaticality: Spanning tree algorithms can be combined with statistical dependency models to induce an ordering of selected content, allowing issues of grammaticality in English to be handled in a statistical text-to-text generation process. The spanning tree approach, which provides a global sentence-level representation of linguistic validity, is able to generate more grammatical sentences than n-gram models.
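
To make the spanning tree idea in the third finding concrete, the following is a minimal sketch rather than the method developed in the thesis. It assumes a hypothetical table of symmetric pairwise association scores between selected words (invented values) and uses Kruskal's algorithm to find an undirected maximum spanning tree. Dependency-based approaches typically work with directed trees instead (for example via the Chu-Liu/Edmonds algorithm), and the tree alone does not fix a word order.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def maximum_spanning_tree(
    words: List[str], scores: Dict[Edge, float]
) -> List[Tuple[str, str, float]]:
    """Kruskal's algorithm, taking the highest-scoring edges first.

    Assumes every word in `words` is unique and that `scores` holds one
    entry per undirected word pair (hypothetical association scores).
    """
    parent = {w: w for w in words}

    def find(w: str) -> str:
        # Follow parent links to the component representative,
        # compressing the path along the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    tree = []
    for (a, b), s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two separate components, so keep it
            parent[ra] = rb
            tree.append((a, b, s))
    return tree


# Invented scores for illustration; not data from the thesis.
words = ["company", "reported", "profits", "record"]
scores = {
    ("reported", "company"): 0.9,
    ("reported", "profits"): 0.8,
    ("profits", "record"): 0.7,
    ("company", "profits"): 0.3,
    ("reported", "record"): 0.2,
    ("company", "record"): 0.1,
}
tree = maximum_spanning_tree(words, scores)
for head, dependent, score in tree:
    print(head, dependent, score)
```

Because the tree is chosen to maximise the total score over the whole sentence, it gives a global view of which word pairs should be linked, in contrast to an n-gram model that only scores local windows.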

The thesis describes our corpus-based investigation of the sentence augmentation process. Our development and evaluation of models for the core facets of sentence augmentation provide the missing pieces for the automation of the process as a whole. In doing so, we contribute to our general understanding of text-to-text generation approaches.

Thesis-related Published Papers:

Primary Funding:
Awarded an APA but accepted the Research Award For Areas and Centres of Excellence (RAACE) at Macquarie University
CSIRO CMIS Top-Up Scholarship