Wrapping Web Pages into XML Documents with Norfolk 

Anne-Marie Vercoustre
CSIRO-MIS Technologies for Electronic Documents Group

Tuesday 16 October at 11am

Abstract

The notion of wrapping a web server into XML documents is driven by the need for structured data that can be used by a variety of applications. The web contains vast amounts of information that is of little use to most applications because it is aimed mainly at a human audience. One solution is to automate the browsing process and then convert the extracted information into a more suitable format, such as XML. This is called wrapping. We have used two different tools to wrap several tourist sites into XML: Norfolk, a system developed since 1997 by the TED group, and W4F, initially developed at the University of Pennsylvania and now a commercial product.

This presentation will introduce the general tasks of wrappers and will present Norfolk, a system initially developed for creating virtual documents from heterogeneous sources. It has recently been extended to create XML documents for the purpose of wrapping.

It will also discuss the limitations of current approaches and suggest some future research directions.

Short resume

Dr. Anne-Marie Vercoustre is a senior researcher in the Technologies for Electronic Documents (TED) research group at the CSIRO Mathematical and Information Sciences Division, based in Melbourne. Her main research interests are structured documents (SGML-like), document workflow, corporate memory, and the reuse of information from heterogeneous and distributed sources. Before joining CSIRO in September 2000, Anne-Marie was a researcher for more than twenty years at INRIA, France, where she was involved in research on syntax-directed programming environments and structured document tools. She has participated in several European projects on software factories and digital libraries. She is currently on the chair board of SIGWEB and on many conference program committees.
