Friday, April 13, 2007

Final mistakes and their corrections

I found out I was applying the class prior (the probability of finding that class given any document) once for every token probability, rather than once per document. This is not correct. Working with log probabilities (so products of probabilities become sums), it should be:

Let ci = class prior (log) probability for class i
Let tj = (log) probability for token j
P = ci + t1 + t2 + t3 + ... + tn

Instead, I was doing:

P = (ci + t1) + (ci + t2) + (ci + t3) + ... + (ci + tn)

which counts the class prior n times instead of once.
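In code, the fix looks something like this. This is a minimal Java sketch, assuming log probabilities so that the sums above are valid; the class, method and variable names are mine for illustration, not the classifier's actual code:

```java
public class NaiveBayesScore {
    // Correct: add the class's log prior once, then the log probability of each token.
    static double score(double classLogPrior, double[] tokenLogProbs) {
        double p = classLogPrior;
        for (double t : tokenLogProbs) {
            p += t;
        }
        return p;
    }

    // The bug described above: the class prior is added once per token,
    // so it contributes n times to the score instead of once.
    static double buggyScore(double classLogPrior, double[] tokenLogProbs) {
        double p = 0.0;
        for (double t : tokenLogProbs) {
            p += classLogPrior + t;
        }
        return p;
    }

    public static void main(String[] args) {
        double prior = Math.log(0.25); // four equally likely classes
        double[] tokens = { Math.log(0.1), Math.log(0.2), Math.log(0.05) };
        System.out.println("correct: " + score(prior, tokens));
        System.out.println("buggy:   " + buggyScore(prior, tokens));
    }
}
```

The two scores differ by (n - 1) times the prior, so the bug penalises (or favours) longer sections depending on the class's prior.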

I've also rejigged the number of tokens the classifier uses to classify a document. I wanted to allow all tokens to be used, but the classifier tends to get biased by words that occur in all of the classes, or in two or three of them. So even though I've eliminated the stoplist in favour of a cleaner statistical calculation, I still keep a static limit of the 50 most differentially used tokens between classes. Never mind.
The latest results are really good, the best yet (from 10,000 sections):

Sect    Correct  Incorrect  Precision  Recall  F-Measure
INTROD     1910        824     0.6986  0.7755     0.7350
METHOD     2018        837     0.7068  0.9177     0.7986
RESULT     1469        783     0.6523  0.8871     0.7518
DISCUS     2079         80     0.9629  0.5646     0.7119

Correct:   7476  proportion correct:   0.7476  (74.76%)
Incorrect: 2524  proportion incorrect: 0.2524  (25.24%)
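For reference, the F-measure column is just the harmonic mean of precision and recall. A one-liner (in Java; names are mine) reproduces the INTROD row from the table above:

```java
public class FMeasure {
    // F-measure: the harmonic mean of precision and recall.
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // INTROD row: P = 0.6986, R = 0.7755 -> F ~ 0.7350
        System.out.printf("%.4f%n", fMeasure(0.6986, 0.7755));
    }
}
```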

However, you do have to remember that the testing is based on the documents the classifier was trained on, which has obvious implications. Because it's a statistical classifier I feel OK about using the same documents for training and testing for now, but I will prepare a proper independent test data set when I'm fully happy with it.

Thursday, April 12, 2007

Training

Finished training the classifier on the open access corpus.
I think

BUILD SUCCESSFUL (total time: 756 minutes 37 seconds)

says it all. It took a while.
The table has 1,080,554 unique words in it.

Here is some output from 1,000 sections:

Sect    Correct  Incorrect  Precision  Recall  F-Measure
INTROD      125         47     0.7267  0.5507     0.6266
METHOD      195        121     0.6171  0.9701     0.7544
RESULT      182         96     0.6547  0.8922     0.7552
DISCUS      206         28     0.8803  0.5598     0.6844

Correct:   708  proportion correct:   0.708  (70.8%)
Incorrect: 292  proportion incorrect: 0.292  (29.2%)

The Bayesian Sectioniser

The Bayesian text classifier now classifies sections of articles into one of four classes (introduction, methods, results, discussion). At the moment I've trained it on 1,000 open access articles (from BMC and PMC). It works well when I use it to reclassify the same articles, but the quality of its classification really drops when I give it sections that it has not been trained on. This is not good.

The reason I created a statistical classifier was that I liked the ease with which you can get a handle on the accuracy of your results. However, as it has evolved it has become necessary to include all kinds of tweaks and alterations that have no real statistical basis and will probably not be suitable for different types of text. This is why I decided to work out a proper statistical way to improve the classifier's performance.
The biggest problem it has is that the words used to calculate the overall probability of a given section belonging to a given class often do not vary in their probabilities of occurrence between classes. A good example is the word "the"; people usually use stop word lists to ignore all these common words. I don't like stop word lists, because I often get the feeling that common words can be useful. Take the word "were": it occurs 20,727 times in my current 1,000-article training set, and 56% of those occurrences are in methods sections, 32% in results, 3% in introductions and 9% in discussions. So clearly it is a good discriminatory word, yet it is almost always included in standard stop word lists (see here).

So I decided to get rid of the stop word list and replace it with a simple calculation that gives you the words that vary most in their occurrence between classes: I take the sample standard deviation of a word's per-class counts and normalise it by the word's mean occurrence. Sure, this can give you words that are used at exactly the same frequency in three classes and not at all in the fourth, but either way it seems to work quite well. Here is the top 10:

'usa'
'instructions'
'inc'
'washed'
'conclusion'
'committee'
'santa'
'discussion'
'germany'
'kit'

A lot of these are obviously related to methods text. For example, 'usa' appears a lot in methods sections, where people declare where they bought their 'kit's, which 'inc's sold them the goods, and how they 'washed' their blots.
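The ranking calculation can be sketched like this. It's a Java sketch under my own naming; the counts for 'were' are reconstructed approximately from the percentages above, and the counts for 'the' are invented as a flatly distributed contrast:

```java
import java.util.*;

public class DiscriminativeWords {
    // Normalised spread of a word's per-class counts: the sample standard
    // deviation divided by the mean count (the coefficient of variation).
    static double normalisedSpread(int[] classCounts) {
        double mean = 0;
        for (int c : classCounts) mean += c;
        mean /= classCounts.length;
        if (mean == 0) return 0;
        double ss = 0;
        for (int c : classCounts) ss += (c - mean) * (c - mean);
        double sampleSd = Math.sqrt(ss / (classCounts.length - 1));
        return sampleSd / mean;
    }

    public static void main(String[] args) {
        // Counts per class: INTRO, METHODS, RESULTS, DISCUSSION.
        Map<String, int[]> counts = new HashMap<>();
        counts.put("were", new int[] { 622, 11607, 6633, 1865 }); // skewed towards methods
        counts.put("the",  new int[] { 9000, 9500, 9200, 9300 }); // flat across classes
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort(Comparator.comparingDouble(
                (String w) -> -normalisedSpread(counts.get(w))));
        System.out.println(ranked); // "were" should outrank "the"
    }
}
```

A flat word like "the" gets a spread near zero regardless of how common it is, so it falls to the bottom of the ranking without needing a stop word list.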

When I ran the classifier over 1,000 articles (takes 3.5 minutes) I got this:
P = precision
R = recall
F-m = F-measure


INTRODUCTION  P: 0.3535  R: 0.1458  F-m: 0.2065
METHODS       P: 0.7328  R: 0.9014  F-m: 0.8084
RESULTS       P: 0.9375  R: 0.7639  F-m: 0.8418
DISCUSSION    P: 0.6450  R: 0.8622  F-m: 0.7380
Correct:   730  proportion correct:   0.6880  (68.80%)
Incorrect: 331  proportion incorrect: 0.3120  (31.20%)


However, when I ran it over 10,000 articles (took 27 minutes) I got this:

INTRODUCTION  P: 0.3986  R: 0.2797  F-m: 0.3287
METHODS       P: 0.5771  R: 0.5235  F-m: 0.5490
RESULTS       P: 0.7090  R: 0.3722  F-m: 0.4882
DISCUSSION    P: 0.5685  R: 0.8348  F-m: 0.6764
Correct:   5556  proportion correct:   0.5531  (55.31%)
Incorrect: 4490  proportion incorrect: 0.4469  (44.69%)


I now need to optimise it a bit, because it currently does a separate SELECT query to retrieve the class probabilities for each token in the text, which is a lot of overhead. I could remove that by using a single IN("token1","token2","token3") style query and matching the results back up with the list of tokens in the IN() clause.
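A sketch of that batching with JDBC (the table and column names here are invented for illustration). One caveat: SQL doesn't guarantee that IN() results come back in the order of the list, so rather than sorting, this version keys the rows by token:

```java
import java.sql.*;
import java.util.*;

public class TokenProbLookup {
    // Build one parameterised IN(...) query instead of one SELECT per token.
    static String buildQuery(int tokenCount) {
        StringBuilder sql = new StringBuilder(
            "SELECT token, intro_p, methods_p, results_p, discus_p"
            + " FROM token_probs WHERE token IN (");
        for (int i = 0; i < tokenCount; i++) {
            sql.append(i == 0 ? "?" : ",?");
        }
        return sql.append(")").toString();
    }

    // Fetch all tokens' class probabilities in one round trip. Rows come back
    // in no guaranteed order, so key the results by token.
    static Map<String, double[]> fetch(Connection conn, List<String> tokens)
            throws SQLException {
        Map<String, double[]> probs = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(buildQuery(tokens.size()))) {
            for (int i = 0; i < tokens.size(); i++) {
                ps.setString(i + 1, tokens.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    probs.put(rs.getString(1), new double[] {
                        rs.getDouble(2), rs.getDouble(3),
                        rs.getDouble(4), rs.getDouble(5) });
                }
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(3));
    }
}
```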

Not that great, but perhaps it could do with training on the whole open access corpus. Either way it's a nice tool that hopefully can have some good applications.

I'll be making a nice visual web frontend for it soon, using the Google Web Toolkit, which I like a lot because it lets you program in Java and then compiles it into cross-browser-compatible JavaScript. It also lets you make your pages asynchronous, which is nice: no more pointless page reloads every time you change something.

Wednesday, April 11, 2007

Parsing XML into a DOM

When you have a DOM of an XML document, life becomes easier: you can use XPaths to extract single nodes or lists of nodes that you can then operate on. The problem with DOM, however, is that the whole document needs to be parsed into a tree of nodes, which is often very inefficient.

I mostly use DOM for XML parsing; the DOM is created using the Xerces parser. I'm sure there are faster parsers out there, but at the moment I can't be bothered finding and downloading one. However, I am now processing thousands of BMC and PMC open access articles and the DOM parsing time is beginning to become a problem. Because these documents tend to be highly structured (see here for BMC and here for PMC DTD and XML markup info), you really do need XPaths, and therefore DOMs, of the documents. I could write a custom SAX parser, but I really do feel that would be a waste of time. As an example, my computer, which has a 2.2 GHz Athlon, takes roughly 3 hours to parse and extract the full text, section text and article metadata from the roughly 50,000 XML documents available from BMC and PMC.
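For completeness, here is the shape of that DOM-plus-XPath pattern using the standard JAXP APIs. The tiny inline document is a stand-in of my own, not real BMC/PMC markup, which follows the publishers' DTDs:

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomXPathExample {
    // Parse an XML string into a DOM, then pull out section titles with an XPath.
    static List<String> sectionTitles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList titles = (NodeList) xpath.evaluate(
                "/article/body/sec/title", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < titles.getLength(); i++) {
            result.add(titles.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A toy stand-in for an article document.
        String xml = "<article><body>"
                + "<sec><title>Methods</title><p>We washed the blots.</p></sec>"
                + "<sec><title>Results</title><p>Bands were visible.</p></sec>"
                + "</body></article>";
        System.out.println(sectionTitles(xml)); // [Methods, Results]
    }
}
```

The convenience is exactly what makes it slow: the entire tree is materialised in memory before the first XPath runs, which is why parse time dominates when processing tens of thousands of articles.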