Hall Monitor: iSchool Mailing List Classifier


Ivan Tam and Joanne Wong
{ivan,joanne}@sims.berkeley.edu


UC Berkeley School of Information
Applied Natural Language Processing, Fall 2006
Prof. Marti Hearst



In this document: Introduction | Background | System | Code | Credits



Introduction (back to top)

Can noisy mailing lists be filtered by user-specific natural language processing (NLP)-based methods? Existing software systems are built for filtering a single email stream and make a simple in-or-out decision. Dealing with unwanted messages in the context of community mailing lists, however, is not as simple. For example, messages considered irrelevant on one list may be very relevant to another, so the problem becomes deciding which mailing list a message belongs to rather than making a binary in/out decision. In these situations, users also have to determine which messages are appropriate for any given mailing list, and conventions have to be communicated to the community and consistently enforced to be effective.

We will attempt to build a natural-language-based mailing list classifier. The classifier will search for specific features in messages and decide which mailing list each message seems most appropriate for. The specific features that our proposed classifier extracts will be determined by research into community conventions. In essence, the classifier implicitly learns the conventions of the iSchool mailing lists and enforces them with little user intervention. Our prototype will be implemented as a script that takes in proposed message text and suggests the mailing list to which the message should be sent.

Background (back to top)

At the U.C. Berkeley School of Information there is a proliferation of mailing lists. Three mailing lists of note are students@sims.berkeley.edu (hereafter referred to as students), fun@sims.berkeley.edu (hereafter referred to as fun), and noise@sims.berkeley.edu (hereafter referred to as noise). Many students (as well as some alumni and even faculty) freely post links, news, fun stories, events, etc. to fun, which is a high-traffic list; some subscribers, however, feel that certain material is not "fun-worthy" and should be demoted to noise, which is currently a low-traffic list. Students rarely post to students (an infrequency that is ironic, given the name of the list).

Currently, a small subset of students attempts to enforce the boundaries by emailing users who are perceived to have violated mailing list conventions. Not all warnings are graciously accepted, however, and this often leads to conflict over mailing list usage. The question then arises of how these conventions are set in the first place and why only a subset of the community actively enforces them.

System (back to top)

Dataset creation

We extracted three months of mailing list archives via majordomo. Messages from November 2004, November 2005, and November 2006 sent to students, fun, and noise were parsed from mbox (an idiosyncratic "standard" format for email) into individual messages. We retained only the "thread-starters" under the assumptions that: 1) replies may include noisy data (e.g., admonishment for using the wrong mailing list); and 2) only thread-starters are relevant for training an application intended for routing new, "thread-starting" messages. The size of this dataset is 550 messages.
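The parsing step can be sketched roughly as follows using Python's standard mailbox module. The file names, output layout, and thread-starter heuristic (skipping anything with an In-Reply-To or References header) are our own simplifications for illustration, not the project's actual extraction script.

    import mailbox
    import os

    def extract_thread_starters(mbox_path, out_dir):
        """Split an mbox archive into one file per thread-starting message."""
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        count = 0
        for msg in mailbox.mbox(mbox_path):
            # Replies usually carry In-Reply-To or References headers; skip them.
            if msg.get('In-Reply-To') or msg.get('References'):
                continue
            # Keep only the fields retained in the dataset files.
            if msg.is_multipart():
                body = msg.get_payload(0).get_payload()
            else:
                body = msg.get_payload()
            count += 1
            out = open(os.path.join(out_dir, 'mail%d' % count), 'w')
            out.write('From: %s\n' % msg.get('From', ''))
            out.write('Subject: %s\n' % msg.get('Subject', ''))
            out.write('Body: %s\n' % body)
            out.close()

    # e.g. extract_thread_starters('fun-200611.mbox', 'cleandata/fun')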

After performing some preliminary cross-validation (testing and training on distinct portions of the data), we realized that there was a problem with our dataset: the implicit label provided by the name of the mailbox from which each message was extracted was not reliable enough for training the classifier. As a result, we created a subset of training data from the initial data, consisting of 27 messages from each list, which we hand-tagged based on community standards.

Dataset organization

Datasets are organized in hierarchical filesystem directory structures. The top-level directory is the name of the dataset. Under this directory there are three sub-directories named "fun", "noise", and "students". Each of these sub-directories contains a set of email messages from its respective mailing list. Each file contains a single email message with its "From:", "Subject:", and "Body:" fields. A typical layout of a dataset resembles the following:

cleandata
|-- fun
|   |-- mail10
|   |-- mail31
|   |-- mail32
|   `-- mail57
|-- noise
|   |-- mail24
|   |-- mail51
|   |-- mail52
|   `-- mail54
`-- students
    |-- mail9
    |-- mail97
    |-- mail98
    `-- mail99
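
Given this layout, loading a dataset amounts to walking the three class sub-directories and pairing each file's text with the name of the directory it came from. A minimal sketch (the function and variable names here are ours, not those of the actual training script):

    import os

    CLASSES = ['fun', 'noise', 'students']

    def load_dataset(dataset_dir):
        """Return a list of (class_name, message_text) pairs from a dataset directory."""
        examples = []
        for class_name in CLASSES:
            class_dir = os.path.join(dataset_dir, class_name)
            for filename in sorted(os.listdir(class_dir)):
                f = open(os.path.join(class_dir, filename))
                examples.append((class_name, f.read()))
                f.close()
        return examples

    # e.g. load_dataset('cleandata') -> [('fun', 'From: ...'), ..., ('students', ...)]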

Architecture

The classification process begins with the training script loading a file from the dataset directory. The file is then passed on to a chain of "feature extractors," each of which examines the message and reports back a relevant feature. Each feature is encoded as a string containing the feature name and the level (binary or qualitative) at which the feature appears in the file, e.g., "Political-false" or "Length-medium." After the message is passed through the chain, all the features are collected and passed to the Bayes classifiers for training. In an earlier version of the system, there were three classifiers, each classifying between two classes.

However, due to dismal performance, the current version reverts to using a single classifier that distinguishes between exactly two classes. Each training-testing cycle therefore classifies between one of the following pairs: fun vs. noise, fun vs. students, or noise vs. students.

In this arrangement, test classifying on the training dataset reached up to a 92.3% success rate. Testing on the test dataset reached up to 63.6%. The downside to this arrangement is that each pair-wise comparison requires some amount of work to determine which set of feature extractors is best suited to distinguishing between the two classes. For example, distinguishing between fun and noise messages may be best achieved by testing for the presence of time and date features, since these two features are abundant in the former class and almost non-existent in the latter.
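One way to organize this per-pair tuning is to keep an explicit mapping from each class pair to the extractors chosen for it. The sketch below is purely illustrative; apart from the date and time extractors for the fun/noise pair mentioned above, the extractor assignments are hypothetical.

    # Hypothetical configuration: which feature extractors to run for each class pair.
    PAIR_EXTRACTORS = {
        ('fun', 'noise'):      ['date', 'time', 'length'],
        ('fun', 'students'):   ['length', 'political'],
        ('noise', 'students'): ['length', 'political'],
    }

    def extractors_for(class_a, class_b):
        """Look up the extractor set for a pair, regardless of argument order."""
        return PAIR_EXTRACTORS.get((class_a, class_b),
                                   PAIR_EXTRACTORS.get((class_b, class_a), []))

    print(extractors_for('noise', 'fun'))   # ['date', 'time', 'length']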

In testing and classification, each message is loaded from its file, run through the feature extractor chain, and the extracted features are gathered. These features are then passed on to the Bayes classifier, which returns a set of confidences about which class the message might belong to. The script sorts this list of classes on the confidences and returns the class with the highest confidence as the guessed class.

The naive Bayes classifier used in this project is reverend.thomas.Bayes, from the Reverend Python package maintained by divmod.org. The package must be installed for this project to run; it is available at http://divmod.org/trac/wiki/DivmodReverend.
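
The following is a rough sketch of how the training and guessing steps fit together with this package. The extract_features helper and the toy messages are ours; only the Bayes class and its train() and guess() calls come from Reverend, and the persistence of the trained state to a *.bayes file is omitted here.

    from reverend.thomas import Bayes

    def extract_features(text):
        """Hypothetical extractor chain: returns a space-separated string of
        feature labels like those described above (e.g. "Length-short Date-true")."""
        words = text.lower().split()
        features = ['Length-' + ('short' if len(words) < 20 else 'long')]
        days = set(['monday', 'tuesday', 'wednesday', 'thursday', 'friday'])
        features.append('Date-' + ('true' if days & set(words) else 'false'))
        return ' '.join(features)

    guesser = Bayes()

    # Training: one call per message, keyed by the class it belongs to.
    guesser.train('fun', extract_features('Free cake in the lounge this friday at 3pm'))
    guesser.train('noise', extract_features('Yet another long rant about campus parking policies'))

    # Classification: guess() returns (class, confidence) pairs; the script
    # sorts on confidence and reports the top class as its guess.
    scores = guesser.guess(extract_features('Board games and pizza this thursday evening'))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    if scores:
        print(scores[0])   # e.g. ('fun', <confidence>)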

Feature Extractors

Each feature extractor is responsible for pulling one feature out of each message. Chained together in serial, a set of feature extractors returns a set of features that gives an abstract description of a message's content. Several feature extractors were created for the purposes of this project, including ones for message length, political content, and time and date mentions.
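
As a concrete illustration of this interface, here is a minimal sketch of two such extractors and the chain that runs them. The function names and thresholds are ours, chosen for illustration, not the project's actual extractor implementations.

    import re

    def length_feature(text):
        """Report the message length as a qualitative level."""
        words = len(text.split())
        if words < 50:
            level = 'short'
        elif words < 300:
            level = 'medium'
        else:
            level = 'long'
        return 'Length-' + level

    def date_feature(text):
        """Report whether the message mentions a day of the week."""
        pattern = r'\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b'
        return 'Date-' + ('true' if re.search(pattern, text, re.IGNORECASE) else 'false')

    def run_chain(extractors, text):
        """Run a message through a chain of extractors and collect the features."""
        return [extract(text) for extract in extractors]

    print(run_chain([length_feature, date_feature],
                    'Trivia night this thursday at the beer garden'))
    # ['Length-short', 'Date-true']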

Results

In testing on the same dataset that the classifiers were trained on, performance reached a 92.3% classification match rate. When running the classifiers against a manually tagged dataset, performance hovered at around 63.6% when classifying fun vs. noise messages. At worst, the classifiers would guess all messages as one class.
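
For reference, the match rate reported above is simply the fraction of messages whose guessed class equals the hand-tagged class, along these lines (a sketch; the evaluation script's actual bookkeeping may differ):

    def match_rate(results):
        """results: a list of (guessed_class, actual_class) pairs."""
        correct = sum(1 for guessed, actual in results if guessed == actual)
        return float(correct) / len(results)

    print(match_rate([('fun', 'fun'), ('noise', 'fun'), ('noise', 'noise')]))  # 0.666...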

Code (back to top)

The code and accompanying datasets are available at the following link; please read the instructions below to run the scripts.

    To run the script:

    1) untar all the datasets (the *.tgz files) with the following command
       tar zxvf file.tgz

    2) train the classifier:
       python trainClassifier.py cleandata-fn

       this trains the classifier to distinguish between fun and noise (-fn).
       A file (*.bayes) will be generated at this point to hold the training data;
       this file will be loaded in the next steps to classify new messages.

    3) test the classifier against its own training set
       python testClassifier.py cleandata-fn

       this will load the trained classifier and classify the same dataset it was
       trained on.

    4) test the classifier against a new dataset
       python testClassifier.py testset-fn

       this will load the trained classifier, classify the new dataset,
       and return the results.

       this returns four columns of results:
       1) the guessed class
       2) the classifier's confidence in the guessed class
       3) the actual class as denoted by the dataset
       4) the filename of the message
    

Credits (back to top)

Project Team: Ivan Tam and Joanne Wong
We would also like to thank