Measuring Word Frequencies for an Evolving Lexicon
Elizabeth Leeds Hohman (Naval Surface Warfare Center), firstname.lastname@example.org
This work is part of a larger project to analyze streaming documents such as news articles or web logs. The project uses a graph representation of the documents and provides dynamic methods for clustering and viewing them. As part of that project, a vector space model represents the documents: each document is a vector, with each dimension corresponding to a different word in the lexicon and each entry depending on the number of times the corresponding word occurs in the document. The entries are usually scaled by the frequency of the word in the corpus, which decreases the effect of common words that occur in many documents and increases the effect of rare words that signify the content of the document.
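The fixed-corpus weighting described above can be sketched as a small tf-idf-style computation. This is an illustrative sketch, not the project's implementation; the function name `tfidf_vectors` and the use of a logarithmic inverse-document-frequency scaling are assumptions for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build sparse tf-idf vectors for a fixed corpus.

    Each document is a list of tokens. The returned vectors map
    word -> term count scaled by log(N / document frequency), so words
    occurring in many documents are down-weighted and rare words are
    up-weighted.
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors
```

Note that a word appearing in every document receives weight zero under this scaling, which is exactly the suppression of common words the text describes.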
The data in this project arrive in a streaming fashion, such as news articles or newsgroup entries collected over time. Without a fixed corpus, we cannot use a fixed-dimensional vector space model: new words will appear in documents, while words seen in the past might never be seen again. Since the lexicon cannot grow without limit, approximations to the representation must be made, and since the lexicon is constantly changing, we cannot pre-assign dimensions of the vector space to specific words. Moreover, because vector entries depend not only on the frequency of a word in the document but also on its frequency in the corpus, the corpus frequency must itself be approximated in the streaming setting.
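One common way to avoid pre-assigning dimensions is to grow the lexicon lazily, giving each new word the next free index as it is first encountered. The class below is a minimal sketch of that idea under assumed names (`DynamicLexicon`, `vectorize`); it is not the project's method, and it omits any mechanism for retiring words that stop appearing.

```python
class DynamicLexicon:
    """Sparse document vectors over a lexicon that grows with the stream.

    Rather than fixing the vector dimension in advance, each word is
    assigned a dimension the first time it is seen; unseen words simply
    add a new dimension. (Illustrative sketch; pruning of stale words
    and bounding the lexicon size are omitted.)
    """

    def __init__(self):
        self.index = {}  # word -> assigned dimension

    def vectorize(self, tokens):
        """Return a sparse count vector {dimension: count} for one document."""
        vec = {}
        for w in tokens:
            dim = self.index.setdefault(w, len(self.index))
            vec[dim] = vec.get(dim, 0) + 1
        return vec
```

A real streaming system would also need to drop rarely seen words so the lexicon cannot grow without limit, as the text notes.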
One approach to approximating the corpus frequency is to use a time window and calculate the frequency within that window. In this work, we instead use an exponentially weighted moving average, whose parameter controls how much history influences the value. Because the parameter determines the emphasis placed on past documents, we expect rapidly changing text streams to require a different value than sources that change more slowly. We examine simple classifiers and simple datasets in order to monitor classification performance as a function of this parameter. Although the focus is on calculating word frequencies for the evolving lexicon, other details of text processing for streaming documents will also be presented.
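An exponentially weighted moving average of per-word document frequency can be sketched as follows. The parameter name `lam` and the choice to track each word's presence (0 or 1) per document are assumptions for this example, not details from the work itself; the general update rule, however, is the standard EWMA form new = lam * x + (1 - lam) * old.

```python
class EwmaFrequency:
    """Exponentially weighted moving average of document frequency.

    lam in (0, 1] controls how much history influences the estimate:
    values near 1 forget the past quickly (suited to rapidly changing
    streams), while values near 0 average over a long history.
    """

    def __init__(self, lam=0.05):
        self.lam = lam
        self.freq = {}  # word -> smoothed fraction of documents containing it

    def update(self, tokens):
        """Fold one document into the frequency estimates."""
        present = set(tokens)
        # Update every tracked word: presence counts as 1, absence as 0.
        for w in set(self.freq) | present:
            x = 1.0 if w in present else 0.0
            prev = self.freq.get(w, 0.0)
            self.freq[w] = self.lam * x + (1.0 - self.lam) * prev
```

With lam = 0.5, a word seen once then absent decays by half on each subsequent document, so the estimate quickly reflects the recent stream.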