Friday, April 13, 2007

Final mistakes and their corrections

I found out I was multiplying the class prior (the probability of finding that class given any document) by every token probability. This is not correct; it should be:

Let ci = class prior probability for class i
Let ti = token probability for token i
P = ci + t1 + t2 + t3 + ... + tn

Instead I was doing

P = (ci + t1) + (ci + t2) + (ci + t3) + ... + (ci + tn)
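Assuming the ci and ti terms are log-probabilities (so that addition corresponds to multiplying the underlying probabilities), the fix can be sketched like this. The function names are hypothetical, not the actual classifier code:

```python
import math

def score(class_log_prior, token_log_probs):
    """Corrected score: add the class prior once, then every token term.

    Assumes all inputs are log-probabilities, so the sum corresponds
    to multiplying the prior by each token probability exactly once.
    """
    return class_log_prior + sum(token_log_probs)

def buggy_score(class_log_prior, token_log_probs):
    """The mistake described above: the class prior gets added once
    per token, i.e. the prior ends up multiplied in n times."""
    return sum(class_log_prior + t for t in token_log_probs)
```

With n tokens the buggy version effectively raises the class prior to the nth power, which swamps the token evidence.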

I've also rejigged the number of tokens the classifier uses to classify a document. I wanted to allow all tokens to be used, but the classifier tends to get biased by words that occur in all of the classes, or in two or three of them. So even though I've eliminated the stoplist for a cleaner statistical calculation, I still keep a static limit of the 50 most differentially used tokens between classes. Never mind.
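The post doesn't say how "most differentially used" is measured; one plausible sketch is to rank tokens by the spread of their relative frequencies across classes. Both the metric and the `top_differential_tokens` helper below are assumptions:

```python
from collections import Counter

def top_differential_tokens(class_token_counts, n=50):
    """Rank tokens by how unevenly they are used across classes.

    class_token_counts: dict mapping class name -> Counter of token counts.
    The spread metric (max minus min relative frequency) is an assumed
    stand-in for whatever the real classifier uses.
    """
    totals = {c: sum(counts.values()) or 1
              for c, counts in class_token_counts.items()}
    vocab = set()
    for counts in class_token_counts.values():
        vocab.update(counts)

    def spread(tok):
        # Relative frequency of the token in each class; Counter returns
        # 0 for classes where the token never appears.
        freqs = [class_token_counts[c][tok] / totals[c]
                 for c in class_token_counts]
        return max(freqs) - min(freqs)

    return sorted(vocab, key=spread, reverse=True)[:n]
```

A word used equally often everywhere (like "the") gets a spread near zero and drops out, which is why this can replace an explicit stoplist.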
Latest results are really good — the best yet (from 10,000 sections):

Sect    Corr  Incor  Precis  Recall  F-Measure
INTROD  1910   824   0.6986  0.7755  0.735
METHOD  2018   837   0.7068  0.9177  0.7986
RESULT  1469   783   0.6523  0.8871  0.7518
DISCUS  2079    80   0.9629  0.5646  0.7119

Correct: 7476, proportion correct: 0.7476, percentage correct: 74.76
Incorrect: 2524, proportion incorrect: 0.2524, percentage incorrect: 25.24
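For reference, the per-section figures follow the usual definitions. The helper below is a hypothetical sketch, not the evaluation code itself:

```python
def prf(correct, incorrect, true_total):
    """Precision, recall, and F-measure for one section class.

    correct    - documents correctly assigned to the class
    incorrect  - documents wrongly assigned to the class
    true_total - documents that actually belong to the class
    """
    precision = correct / (correct + incorrect)
    recall = correct / true_total
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Note that recall needs the true number of sections in each class, which the table above doesn't list directly.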

However, you do have to remember that the testing is based on the documents the classifier was trained on, which has obvious implications. Because it's a statistical classifier, I feel OK about using the same documents for training and testing for now, but I will prepare a proper independent test data set when I'm fully happy with it.