Can noisy mailing lists be filtered by user-specific methods based on natural language processing (NLP)? Existing software systems are built to filter a single email stream and make a simple in-or-out decision. However, dealing with unwanted messages in the context of community mailing lists is not as simple. For example, messages considered irrelevant on one list may be very relevant on another. The problem then becomes deciding which mailing list a message belongs to, rather than making a binary in/out decision. In these situations, users also have to determine which messages are appropriate for any given mailing list, and conventions have to be communicated to the community and consistently enforced to be effective.
We will attempt to build a natural-language-based mailing list classifier. The classifier will search for specific features in a message and decide which mailing list is most appropriate for it. The specific features that our proposed classifier extracts will be determined by research into community conventions. In essence, the classifier implicitly learns the rules and conventions of the iSchool mailing lists and enforces these conventions with little user intervention. Our prototype will be implemented as a script that takes in proposed message text and suggests which mailing list the message should be sent to.
At the U.C. Berkeley School of Information there is a proliferation of mailing lists. Three mailing lists of note are students@sims.berkeley.edu (hereafter referred to as students), fun@sims.berkeley.edu (hereafter referred to as fun) and noise@sims.berkeley.edu (hereafter referred to as noise). Many students (as well as some alumni and even faculty) freely post links, news, fun stories, events, etc. to fun, which is a high-traffic list; some subscribers, however, feel that certain material is not "fun-worthy" and should be demoted to noise, which is currently a low-traffic list. Students rarely post to students (and the infrequency is ironic, given the name of the list).
Currently, a small subset of students attempts to enforce the boundaries by emailing users who are perceived to have violated mailing list conventions. Not all warnings are graciously accepted, however, and this often leads to conflict over mailing list usage. This raises the question of how these conventions are set in the first place and why only a subset of the community actively enforces them.
We extracted three months of mailing list archives via majordomo. Messages from November 2004, November 2005, and November 2006 sent to students, fun, and noise were parsed from mbox (an idiosyncratic "standard" format for email) into individual messages. We retained only the "thread-starters" under the assumptions that: 1) replies may include noisy data (e.g., admonishment for using the wrong mailing list); and 2) only thread-starters are relevant for training an application intended for routing new, "thread-starting" messages. The resulting dataset contains 550 messages.
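The thread-starter filtering described above can be sketched with Python's standard mailbox module. This is a minimal illustration and not the script used in the project: the rule that a thread-starter is a message without In-Reply-To or References headers is an assumption, and the helper names and demo messages are hypothetical.

```python
import mailbox
import os
import tempfile


def is_thread_starter(message):
    """Assumption: a message with no In-Reply-To or References header starts a thread."""
    return message.get("In-Reply-To") is None and message.get("References") is None


def extract_thread_starters(mbox_path):
    """Return (sender, subject, body) tuples for thread-starting messages."""
    starters = []
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            # Assumption: archived mail is plain text; multipart messages are skipped.
            if message.is_multipart() or not is_thread_starter(message):
                continue
            starters.append(
                (message.get("From", ""), message.get("Subject", ""), message.get_payload())
            )
    finally:
        box.close()
    return starters


def build_demo_mbox(path):
    """Write a two-message mbox (one thread-starter, one reply) for demonstration."""
    box = mailbox.mbox(path)
    first = mailbox.mboxMessage()
    first["From"] = "alice@example.org"
    first["Subject"] = "Free pizza in the lounge"
    first["Message-ID"] = "<1@example.org>"
    first.set_payload("Leftover pizza, come and get it.\n")
    reply = mailbox.mboxMessage()
    reply["From"] = "bob@example.org"
    reply["Subject"] = "Re: Free pizza in the lounge"
    reply["In-Reply-To"] = "<1@example.org>"
    reply.set_payload("This belongs on noise.\n")
    box.add(first)
    box.add(reply)
    box.flush()
    box.close()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
build_demo_mbox(demo_path)
starters = extract_thread_starters(demo_path)
print(starters)
```

Only the first demo message survives the filter, since the reply carries an In-Reply-To header.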
After performing some preliminary cross-validation (testing and training on distinct portions of the data), we realized that there was a problem with our dataset. The implicit classification provided by the name of the mailbox from which each message was extracted was not good enough for training the classifier. As a result, we created a subset of training data from the initial data consisting of 27 messages from each list, which we hand-tagged based on community standards.
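The report does not state how the data was partitioned for cross-validation. As one possible reading, the sketch below uses a k-fold split in which every message is tested exactly once; the function name, the fold count, and the placeholder messages are all assumptions made for illustration.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# 81 placeholder items, mirroring 27 hand-tagged messages per list.
messages = [
    (label, "mail%d" % n)
    for label in ("fun", "noise", "students")
    for n in range(27)
]
splits = list(k_fold_splits(messages, k=3))
for train, test in splits:
    print(len(train), len(test))
```

Training and testing never share a message within a fold, which is the property the parenthetical above describes.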
Datasets are organized in hierarchical filesystem directory structures. The top-level directory is named after the dataset. Under this directory there are three sub-directories named "fun", "noise", and "students". Each of these sub-directories contains a set of email messages from the corresponding mailing list. Each file contains exactly one email message with its "From:", "Subject:", and "Body:" fields. A typical layout of a dataset resembles the following:
cleandata
|-- fun
|   |-- mail10
|   |-- mail31
|   |-- mail32
|   `-- mail57
|-- noise
|   |-- mail24
|   |-- mail51
|   |-- mail52
|   `-- mail54
`-- students
    |-- mail9
    |-- mail97
    |-- mail98
    `-- mail99
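A directory layout like this one can be read into labeled examples with a few lines of Python. The sketch below is illustrative only: load_dataset is a hypothetical helper, not a function from the project's scripts, and the demo tree it reads is created on the fly with made-up content.

```python
import os
import tempfile


def load_dataset(root):
    """Return a list of (label, filename, text) tuples, one per message file."""
    examples = []
    for label in sorted(os.listdir(root)):
        label_dir = os.path.join(root, label)
        if not os.path.isdir(label_dir):
            continue  # ignore stray files at the top level
        for filename in sorted(os.listdir(label_dir)):
            path = os.path.join(label_dir, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                examples.append((label, filename, handle.read()))
    return examples


# Build a tiny demo tree with the same shape as the layout above.
demo_root = os.path.join(tempfile.mkdtemp(), "cleandata")
for label, name in [("fun", "mail10"), ("noise", "mail24"), ("students", "mail9")]:
    os.makedirs(os.path.join(demo_root, label), exist_ok=True)
    with open(os.path.join(demo_root, label, name), "w", encoding="utf-8") as handle:
        handle.write("From: someone@example.org\nSubject: demo\nBody: hello\n")

dataset = load_dataset(demo_root)
for label, filename, _text in dataset:
    print(label, filename)
```

The sub-directory name serves as the class label, so no separate label file is needed.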
The classification process begins with the training script loading a file from the dataset directory. The file is then passed to a chain of "feature extractors," each of which examines the message and reports back a relevant feature. Each feature is encoded as a string combining the feature name and the level (binary or qualitative) at which the feature appears in the file, e.g., "Political-false" or "Length-medium." After the message has passed through the chain, all the features are collected and passed to the Bayes classifiers for training. In an earlier version of the system, there were three classifiers, each classifying between two classes. The three classifiers were as follows:
However, due to dismal performance, the current version reverts to a single classifier that distinguishes between exactly two classes. Each training-testing cycle then classifies between one of the following pairs:
In this arrangement, classifying the training dataset itself reached a success rate of up to 92.3%, while classifying the test dataset reached up to 63.6%. The downside to this arrangement is that each pair-wise comparison requires some amount of work to determine which set of feature extractors is best suited to distinguishing between the two classes. For example, distinguishing between Fun and Noise messages may be best achieved by testing for the presence of time and date features, since these two features are abundant in the former class and almost non-existent in the latter.
In testing and classification, each message is loaded from its file and run through the feature extractor chain, and the extracted features are gathered. These features are then passed to the Bayes classifier, which returns a set of confidences about which class the message might belong to. The script then sorts this list of classes by confidence and returns the class with the highest confidence as the guessed class.
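The train, guess, and sort steps can be illustrated with a small self-contained naive Bayes over feature strings. TinyBayes and best_guess are hypothetical stand-ins written for this sketch; they are not the classifier package used in the project, and the add-one smoothing and the training examples are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Hypothetical stand-in for the Bayes classifier; not the project's package."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # class -> feature -> count
        self.class_counts = Counter()               # class -> training messages seen

    def train(self, label, features):
        self.class_counts[label] += 1
        self.feature_counts[label].update(features)

    def guess(self, features):
        """Return [(class, confidence)], with confidences normalized to sum to 1."""
        vocabulary = {f for counts in self.feature_counts.values() for f in counts}
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(vocabulary)
            score = math.log(n / total)  # class prior
            for f in features:
                score += math.log((counts[f] + 1) / denom)  # add-one smoothing
            log_scores[label] = score
        peak = max(log_scores.values())
        weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return [(label, w / norm) for label, w in weights.items()]


def best_guess(guesses):
    """Sort on confidence and return the (class, confidence) pair ranked first."""
    return sorted(guesses, key=lambda pair: pair[1], reverse=True)[0]


classifier = TinyBayes()
classifier.train("fun", ["Date-true", "Time-true", "Length-medium"])
classifier.train("fun", ["Date-true", "Time-false", "Length-short"])
classifier.train("noise", ["Date-false", "Time-false", "Length-long"])
classifier.train("noise", ["Date-false", "Time-false", "Length-medium"])

guessed_class, confidence = best_guess(
    classifier.guess(["Date-true", "Time-true", "Length-short"])
)
print(guessed_class, round(confidence, 3))
```

Because each feature is a single string such as "Date-true", the classifier treats feature strings as tokens, which is how a text-oriented Bayes package can be reused for extracted features.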
The naive Bayes classifier used in this project was provided by reverend.thomas.bayes, a Python package maintained by divmod.org. The package must be installed for this project to run. It is available at http://divmod.org/trac/wiki/DivmodReverend.
Each feature extractor is responsible for pulling one feature out of each message. Chained together in series, a set of feature extractors returns a set of features that gives an abstract description of a message's content. The following feature extractors were created for the purposes of this project.
This extractor counts up the number of tokens in each message body.
LineCount returns the length of the message body in lines.
Returns the number of tokens in the subject line.
Returns the number of URLs in the message body.
Returns whether or not a date was listed in the message body.
Returns whether or not a weekday was listed in the message body.
Returns whether or not a time was listed in the message body.
Returns whether the message was forwarded to the mailing list.
Returns whether the examined message contains vocabulary mentioning "Thirsty Thursday."
Returns whether the examined message contains vocabulary about political discussion.
Returns whether the message seems to be about food or going out for food.
Returns the Gunning-Fog index of the message.
Returns whether the message contains words that are commonly associated with messages from the administration.
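A chain of extractors like the ones listed above might look as follows. This is a sketch under assumptions: the function names, the regular expressions, the length thresholds, and the "Urls-", "Weekday-", and "Time-" feature names are all invented for illustration; only the "Name-level" string encoding follows the description given earlier.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WEEKDAY_PATTERN = re.compile(
    r"\b(mon|tues|wednes|thurs|fri|satur|sun)day\b", re.IGNORECASE
)
TIME_PATTERN = re.compile(r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b", re.IGNORECASE)


def token_count(message):
    """Bin the body token count; the thresholds are illustrative guesses."""
    n = len(message["body"].split())
    level = "short" if n < 50 else "medium" if n < 200 else "long"
    return "Length-" + level


def url_count(message):
    """Report the number of URLs found in the body."""
    return "Urls-%d" % len(URL_PATTERN.findall(message["body"]))


def has_weekday(message):
    """Report whether a weekday name appears in the body."""
    return "Weekday-" + str(bool(WEEKDAY_PATTERN.search(message["body"]))).lower()


def has_time(message):
    """Report whether a clock time such as 4pm or 10:30 am appears in the body."""
    return "Time-" + str(bool(TIME_PATTERN.search(message["body"]))).lower()


# Each extractor contributes exactly one feature string.
EXTRACTOR_CHAIN = [token_count, url_count, has_weekday, has_time]


def extract_features(message, chain=EXTRACTOR_CHAIN):
    """Run the message through every extractor and collect the feature strings."""
    return [extractor(message) for extractor in chain]


message = {
    "from": "alice@example.org",
    "subject": "Talk on Friday",
    "body": "Join us Friday at 4pm in South Hall. Details: http://example.org/talk",
}
features = extract_features(message)
print(features)
```

Keeping each extractor as a plain function makes it easy to swap a different chain in for each pair of classes being compared.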
When tested on the same dataset that the classifiers were trained on, performance reached a 92.3% classification match rate. When the classifiers were run against a manually tagged dataset, performance hovered at around 63.6% for classifying fun vs. noise messages. At worst, the classifiers would guess all messages as one class.
The code and accompanying datasets are available under the following link. Please read the instructions that follow to run the scripts.
To run the script:
1) Untar all the datasets (the *.tgz files) with the following command:
    tar zxvf file.tgz
2) Train the classifier:
    python trainClassifier.py cleandata-fn
   This trains the classifier to distinguish between Fun and Noise (-fn).
   A file (*.bayes) will be generated at this point to hold the training data.
   This file will be loaded at the next step to classify new incoming messages.
3) Test the classifier against its own training set:
    python testClassifier.py cleandata-fn
   This loads the trained classifier and classifies the data it was trained on.
4) Test the classifier against a new dataset:
    python testClassifier.py testset-fn
   This loads the trained classifier, classifies the new dataset, and returns the results.
   The results are returned in four columns:
   1) the guessed class
   2) the classifier's confidence in the guessed class
   3) the actual class as denoted by the dataset
   4) the filename of the message
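A match rate like the ones reported earlier can be computed from these result columns. The sketch below assumes the columns are separated by whitespace and appear in the order listed above; the exact output format of the test script is not shown in this document, and the sample rows are made up for illustration.

```python
def match_rate(result_lines):
    """Fraction of rows whose guessed class equals the actual class."""
    matches = 0
    total = 0
    for line in result_lines:
        columns = line.split()
        if len(columns) != 4:
            continue  # skip blank or malformed rows
        guessed, _confidence, actual, _filename = columns
        total += 1
        matches += guessed == actual
    return matches / total if total else 0.0


# Made-up rows for illustration; not output from the actual scripts.
sample = [
    "fun 0.91 fun mail10",
    "noise 0.77 fun mail31",
    "noise 0.88 noise mail24",
    "fun 0.64 noise mail51",
]
rate = match_rate(sample)
print("match rate: %.1f%%" % (100 * rate))
```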