Using Support Vector Machines for Text Categorization
Susan Dumais, sdumais@microsoft.com
Decision Theory and Adaptive Systems Group
Microsoft Research
Redmond, USA
Wednesday 2 September, 11 AM.
Abstract
As the volume of electronic information increases, there is growing
interest in developing tools to help people better find, filter, and
manage these resources. Text categorization - the assignment of natural
language texts to one or more predefined categories based on their
content - is an important component in many information organization and
management tasks. Machine learning methods, including Support Vector
Machines (SVMs), have tremendous potential for helping people more
effectively organize electronic resources.
Today, most text categorization is done by people. We all save
hundreds of files, email messages, and URLs in folders every day. We are
often asked to choose keywords from an approved set of indexing terms
for describing our technical publications. On a much larger scale,
trained specialists assign new items to categories in large taxonomies
like the Dewey Decimal or Library of Congress subject headings, Medical
Subject Headings (MeSH), or Yahoo!'s internet directory. In between
these two extremes, objects are organized into categories to support a
wide variety of information management tasks, including information
routing, filtering, and push; identification of objectionable materials
or junk mail; structured search and browsing; and topic identification
for topic-specific processing operations.
Human categorization is time-consuming and costly, which limits its
applicability, especially for large or rapidly changing collections.
Consequently, there is growing interest in developing technologies for
(semi-)automatic text categorization. Rule-based approaches similar to
those used in expert systems have been used, but they generally require
manual construction of the rules, make rigid binary decisions about
category membership, and are typically difficult to modify. Another
strategy is to use inductive learning techniques to automatically
construct classifiers using labeled training data. A growing number of
learning techniques have been applied to text categorization, including
multivariate regression, nearest neighbor classifiers, probabilistic
Bayesian models, decision trees, and neural networks. Overviews of this
text classification work can be found in Lewis and Hayes (1994) and Yang
(1998). Recently, Joachims (1998) and Dumais et al. (1998) have used
Support Vector Machines (SVMs) for text categorization with very
promising results. In this paper we briefly describe the results of
experiments in which we use SVMs to classify newswire stories from
Reuters. Additional details can be found in Dumais et al. (1998).
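
As a rough illustration of the inductive learning approach described
above, the following Python sketch learns a linear SVM from a handful of
labeled texts and applies it to new ones. It is a minimal sketch under
stated assumptions, not the experimental setup of Dumais et al. (1998):
the example documents and their labels are invented (they are not
Reuters data), the helper names (tokenize, featurize, train_linear_svm,
predict) are hypothetical, the features are simple binary word
indicators, the bias term is omitted for brevity, and training uses
plain subgradient descent on the regularized hinge loss rather than a
dedicated SVM solver.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def featurize(text, vocab):
    """Binary bag-of-words vector: 1.0 if the vocabulary term occurs in the text."""
    words = set(tokenize(text))
    return [1.0 if term in words else 0.0 for term in vocab]


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Subgradient descent on (lam/2)*||w||^2 + hinge loss, without a bias term.

    lam, eta, and epochs are illustrative values, not tuned settings.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # The regularizer shrinks every weight on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # The hinge loss contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(text, w, vocab):
    """Return +1 (in the category) or -1 (not in the category)."""
    score = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab)))
    return 1 if score > 0 else -1


# Invented toy training set: +1 = earnings story, -1 = any other story.
train_docs = [
    ("quarterly profit rose net income up", 1),
    ("company reports higher earnings per share", 1),
    ("net profit and dividend increased", 1),
    ("wheat harvest delayed by heavy rain", -1),
    ("crude oil prices fall on supply", -1),
    ("farmers expect record corn crop", -1),
]

vocab = sorted({term for text, _ in train_docs for term in tokenize(text)})
X = [featurize(text, vocab) for text, _ in train_docs]
y = [label for _, label in train_docs]
w = train_linear_svm(X, y)

print(predict("profit rose sharply", w, vocab))           # prints 1
print(predict("heavy rain hits corn harvest", w, vocab))  # prints -1
```

Words that do not occur in the training texts (such as "sharply" or
"hits") are simply ignored at prediction time. Unlike a hand-built rule
set, the classifier can be updated by retraining on new labeled
examples.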