The Bayesian text classifier now classifies sections of articles into one of four classes (introduction, methods, results, discussion). At the moment I've trained it on 1000 open access articles (from BMC and PMC). It works well when I use it to reclassify the same articles, but the quality of its classification really drops when I give it sections that it has not been trained on. This is not good.
The reason I created a statistical classifier was that I liked the ease with which you can get a handle on the accuracy of your results. However, as it has evolved it has become necessary to include all kinds of tweaks and alterations that have no real statistical basis and will probably not be suitable for different types of text. This is why I decided to work out a proper statistical way to improve the classifier's performance.
The biggest problem it has is that often the words used to calculate the overall probability of a section belonging to a given class do not vary much in their probability of occurrence between classes. A good example is the word "the". People usually use stop word lists to ignore all these common words. I don't like stop word lists, because I often get the feeling that common words can be useful. A good example is the word "were": it occurs 20,727 times in my current 1000 article training set. 56% of these occurrences are in methods sections, 32% in results, 3% in introductions and 9% in discussions. So clearly it is a good discriminatory word, but it is almost always included in standard stop word lists (see here).
So I decided to get rid of the stop word list and replace it with a simple calculation that picks out words that vary in their occurrence between classes. I use the sample standard deviation of a word's per-class occurrence and normalise it by the word's mean occurrence (the coefficient of variation). Sure, this can give you words that are used at exactly the same frequency in 3 classes and not at all in the fourth, but either way it seems to work quite well. Here are the top 10:
'usa'
'instructions'
'inc'
'washed'
'conclusion'
'committee'
'santa'
'discussion'
'germany'
'kit'
A lot of these are obviously related to methods text. For example, 'usa' is used a lot in methods sections when people are declaring where they bought their 'kit's, which 'inc's sold them the goods, and how they might have 'washed' their blots.
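To make the scoring concrete, here is a minimal Python sketch of the idea: sample standard deviation divided by the mean, computed over a word's per-class figures. It is not the classifier's actual code. The function names are made up for illustration, the "were" figures are the percentages quoted above (the ratio is unaffected by scaling, so percentages rank the same as raw counts), and the "the" figures are invented purely as a contrast.

```python
from statistics import mean, stdev


def variability_score(per_class):
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    m = mean(per_class)
    if m == 0:
        return 0.0  # word never occurs, so there is nothing to rank
    return stdev(per_class) / m


def rank_words(word_figures, top_n=10):
    """Return the top_n words, most variable between classes first."""
    scores = {word: variability_score(f) for word, f in word_figures.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Order of figures: introduction, methods, results, discussion.
word_figures = {
    "were": [3, 56, 32, 9],    # percentages quoted in the post
    "the": [25, 25, 25, 25],   # hypothetical: spread evenly over all classes
}

print(rank_words(word_figures))
```

A word spread evenly over the four classes scores zero, so it drops to the bottom of the ranking without needing a hand-made stop word list.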
When I ran the classifier over 1000 articles (takes 3.5 mins) I got this:
P = precision
R = recall
F-m = F-measure
INTRODUCTION P: 0.35353535353535354 R: 0.14583333333333334 F-m: 0.2064896755162242
METHODS P: 0.732824427480916 R: 0.9014084507042254 F-m: 0.8084210526315789
RESULTS P: 0.9375 R: 0.7638888888888888 F-m: 0.8418367346938775
DISCUSSION P: 0.6450381679389313 R: 0.8622448979591837 F-m: 0.7379912663755459
Correct: 730 proportion correct: 0.6880301602262017 percentage correct: 68.80301602262017
Incorrect: 331 proportion incorrect: 0.31196983977379833 percentage incorrect: 31.196983977379833
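For anyone wanting to reproduce figures like these, here is a rough Python sketch of how per-class precision, recall and F-measure can be computed from the actual and predicted labels. The function names are illustrative, not the ones in my code. The counts in the last line (35 true positives, 64 false positives, 205 false negatives) are worked backwards from the INTRODUCTION row above, so treat them as an inference that happens to be consistent with it.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts for one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def per_class_metrics(actual, predicted):
    """Map each class label to its (precision, recall, F-measure)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[a] += 1  # actual class gains a false negative
    labels = set(actual) | set(predicted)
    return {label: prf(tp[label], fp[label], fn[label]) for label in labels}


# Counts inferred from the INTRODUCTION row, not taken from the classifier.
print(prf(35, 64, 205))
```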
However, when I ran it over 10,000 articles (took 27 mins) I got this:
INTRODUCTION P: 0.3986175115207373 R: 0.27970897332255457 F-m: 0.3287410926365795
METHODS P: 0.5770573566084788 R: 0.5235294117647059 F-m: 0.5489916963226572
RESULTS P: 0.709049255441008 R: 0.37221888153938665 F-m: 0.4881703470031546
DISCUSSION P: 0.5684830633284241 R: 0.8348202216815356 F-m: 0.6763771766509692
Correct: 5556 proportion correct: 0.5530559426637468 percentage correct: 55.30559426637468
Incorrect: 4490 proportion incorrect: 0.44694405733625325 percentage incorrect: 44.694405733625324
Not that great, but perhaps it could do with training on the whole open access corpus.
I now need to optimise it a bit, because it currently does a separate SELECT query to retrieve the class probabilities for each token in the text. This is a lot of overhead, which I could remove by using an IN("token1","token2","token3") type of query, with the results sorted into the same order as the list of tokens in the IN() section.
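As a sketch of what the batched IN() lookup could look like, here is a small Python example using the standard sqlite3 module. Everything about the schema is an assumption for illustration: the table name token_probs, its columns (token, section_class, prob) and the function name are invented, and the real database may well not be SQLite. Rather than asking the database to sort the rows, this version reads them into a dictionary and walks the token list in order, which gives the same result without relying on database-specific ordering tricks.

```python
import sqlite3


def fetch_token_probs(conn, tokens, batch_size=500):
    """Return one {class: probability} dict per token, in token order.

    Assumes a hypothetical table token_probs(token, section_class, prob).
    """
    unique = list(dict.fromkeys(tokens))  # drop repeats, keep first-seen order
    probs = {}
    # Query in batches so the number of bound parameters stays modest.
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT token, section_class, prob FROM token_probs "
            f"WHERE token IN ({placeholders})",
            batch,
        )
        for token, section_class, prob in rows:
            probs.setdefault(token, {})[section_class] = prob
    # Tokens missing from the table get an empty dict.
    return [probs.get(token, {}) for token in tokens]
```

With this shape, a section of a few thousand tokens needs a handful of queries instead of one per token, and repeated tokens are only looked up once.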
Either way, it's a nice tool that hopefully can have some good applications.
I'll be making a nice visual web frontend for it soon using the Google Web Toolkit, which I like a lot because it lets you program in Java but then compiles it into cross-browser compatible JavaScript. It also lets you make your pages asynchronous, which is nice. No more pointless page reloads every time you change something.