Using Support Vector Machines for Text Categorization

Susan Dumais, sdumais@microsoft.com
Decision Theory and Adaptive Systems Group
Microsoft Research
Redmond, USA
Wednesday 2 September, 11 AM

Abstract

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization - the assignment of natural language texts to one or more predefined categories based on their content - is an important component in many information organization and management tasks. Machine learning methods, including Support Vector Machines (SVMs), have tremendous potential for helping people more effectively organize electronic resources.

Today, most text categorization is done by people. We all save hundreds of files, email messages, and URLs in folders every day. We are often asked to choose keywords from an approved set of indexing terms for describing our technical publications. On a much larger scale, trained specialists assign new items to categories in large taxonomies like the Dewey Decimal or Library of Congress subject headings, Medical Subject Headings (MeSH), or Yahoo!'s internet directory. In between these two extremes, objects are organized into categories to support a wide variety of information management tasks, including information routing, filtering, and push; identification of objectionable materials or junk mail; structured search and browsing; and topic identification for topic-specific processing operations.

Human categorization is very time-consuming and costly, which limits its applicability, especially for large or rapidly changing collections. Consequently, there is growing interest in developing technologies for (semi-)automatic text categorization. Rule-based approaches similar to those used in expert systems have been used, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive learning techniques to automatically construct classifiers from labeled training data. A growing number of learning techniques have been applied to text categorization, including multivariate regression, nearest neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks. Overviews of this text classification work can be found in Lewis and Hayes (1994) and Yang (1998). Recently, Joachims (1998) and Dumais et al. (1998) have used Support Vector Machines (SVMs) for text categorization with very promising results. In this paper we briefly describe the results of experiments in which we use SVMs to classify newswire stories from Reuters. Additional details can be found in Dumais et al. (1998).
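To make the idea of inducing a classifier from labeled training data concrete, here is a minimal sketch of a linear SVM for one binary category. It is an illustration only, not the procedure used in Dumais et al. (1998). It assumes whitespace tokenization and raw term counts as features, and it trains by plain sub-gradient descent on the L2-regularized hinge loss, a simple stand-in for a real SVM solver. The six headlines are invented toy data, not Reuters stories, and all function names and parameter values are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_vocab(docs):
    """Map each distinct training token to a feature index."""
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(doc, vocab):
    """Sparse term-count vector {index: count}; unseen words are dropped."""
    counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}


def score(w, b, x):
    """Signed distance-like score w.x + b for a sparse vector x."""
    return sum(w[i] * v for i, v in x.items()) + b


def train_linear_svm(X, y, dim, lam=0.001, eta=0.1, epochs=20):
    """Sub-gradient descent on (lam/2)*||w||^2 + max(0, 1 - y*(w.x + b)).

    lam, eta and epochs are assumed values chosen for this toy example.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * score(w, b, x)
            # Regularization step: shrink every weight slightly.
            w = [wi * (1.0 - eta * lam) for wi in w]
            # Hinge-loss step: only examples inside the margin push on w and b.
            if margin < 1.0:
                for i, v in x.items():
                    w[i] += eta * label * v
                b += eta * label
    return w, b


def predict(w, b, x):
    """Return +1 (in the category) or -1 (not in the category)."""
    return 1 if score(w, b, x) >= 0.0 else -1


# Invented toy data: +1 = "grain" category, -1 = anything else.
TRAIN_DOCS = [
    "wheat harvest grain exports rise",
    "grain prices fall on wheat surplus",
    "corn and wheat crop forecast",
    "bank raises interest rates",
    "central bank cuts interest rates",
    "stock market rates rally",
]
TRAIN_LABELS = [1, 1, 1, -1, -1, -1]

vocab = build_vocab(TRAIN_DOCS)
X_train = [vectorize(d, vocab) for d in TRAIN_DOCS]
w, b = train_linear_svm(X_train, TRAIN_LABELS, len(vocab))

for headline in ["wheat grain shipment", "bank interest decision"]:
    print(headline, "->", predict(w, b, vectorize(headline, vocab)))
```

Handling several categories would, under the same assumptions, mean training one such binary classifier per category, so that a story can be assigned to more than one category.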
