Scamseek - Identifying Financial Scams on the Internet

Professor Jon Patrick
Language Technology Research Group
School of Information Technologies
University of Sydney

Tuesday 30th September 2003 at 11am

Abstract

The Scamseek Project aims to automate the identification of financial scams promoted on the Internet. It has been commissioned by the Australian Securities and Investments Commission (ASIC). The project contractor is the Capital Markets Cooperative Research Centre (CMCRC), a consortium of Australian universities and businesses funded by the partners and the Australian Government. The research for the Scamseek Project has been sub-contracted to the University of Sydney.

The project is limited to identifying scams applicable to the finance industry. It does not deal with other scams, such as those generally applicable to a range of other consumer interests, or with offshore activities not targeted at Australian citizens.

The problem of identifying financial scams is being approached with a combined methodology of classical text classification and semantic analysis of the texts by linguists.

The experimental problems are dominated by the following issues:

  1. scams make up only 1.8% of the roughly 7500 documents in the training corpus,
  2. the meaning representation in the texts is being developed under the model of Systemic Functional Linguistics, which is not well developed for formal data representation or for feature identification by computational methods,
  3. there is no established method of integrating the linguistic analyses with the text classification methods,
  4. the client's requirements for separating the documents into various classes are driven as much by administrative needs as by differentiable linguistic styles, so the classification system has to identify linguistic styles within an administrative classification. This includes separating out a variety of offshore and consumer scams not encompassed by ASIC's legislative obligations.

The project has a total budget of $AU1 million and a planned 6-month lifetime, and is scheduled to be completed on 30th September 2003. The project team consists of 2 linguists, 3 computational linguists, 3 software engineers, a project director, and a variety of external advisors.

Short resume

Jon Patrick first built language technology systems in the 1980s to record descriptions of team sports in real time. His early systems were adopted by television for Australian Rules football, by NRL clubs for rugby league, by the WACB for cricket, and by the Australian Institute of Sport for water polo. In the early 90s he trained in psychotherapy and developed a computational approach to analysing therapeutic language to assess its effectiveness. Since moving to the University of Sydney in 1998 he has published a grammar reference book of the Basque language and developed an active learning system for converting multi-lingual dictionaries into XML knowledge bases. He was the inaugural director of the Capital Markets CRC Language Technology research program and is currently director of the Scamseek Project.