Monday, February 05, 2007

The Bayesian classifier

Obviously this post has taken some considerable time to get round to.

No, I haven't just been working on the classifier.

This post is an update on the classifier:

I've incorporated it into the existing textminer project that I've been working on for the last year.

Training:

I extracted 1,000 methods sections from the BMC data mining set of articles, which is fantastically useful for stuff like this: it consists of roughly 28,000 articles, all in XML.

It's a pity the articles don't really use all the features in the DTD. For example, the Methods section is not marked as type:methods or type:1 (where 1 = methods section in the DTD); instead you tend to get this: <sec><title>Materials and Methods</title>...</sec>. This is a shame really. Why can't BMC, a single publisher that is very involved with all its content (unlike PMC), ramp up the markup a bit?

So in order to get 1,000 reliable methods sections, the code had to parse through ~6,000 articles.
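The filtering itself is simple enough: read each article's XML and only accept a <sec> whose <title> clearly looks like a methods heading. A minimal sketch of that kind of filter using the standard Java DOM API (the title patterns here are illustrative, not an exact list):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.File;

public class MethodsFilter {

    // Return the text of the first <sec> whose <title> looks like a
    // methods heading, or null if the article has no such section.
    public static String extractMethods(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        NodeList secs = doc.getElementsByTagName("sec");
        for (int i = 0; i < secs.getLength(); i++) {
            Element sec = (Element) secs.item(i);
            NodeList titles = sec.getElementsByTagName("title");
            if (titles.getLength() == 0) continue;
            String title = titles.item(0).getTextContent().toLowerCase();
            // "reliable" here means the title clearly says methods;
            // anything vaguer gets skipped
            if (title.contains("methods") || title.contains("methodology")) {
                return sec.getTextContent();
            }
        }
        return null; // no recognisable methods section in this article
    }
}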

You then chunk the article according to double newline characters (described below),
remove the confounding characters,
tokenise the chunk,
then check the database for each word's presence in the table:
if present, increase its count,
otherwise insert a new row (there's a sketch of this database step just after this list).
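The check/increment/insert can be collapsed into a single statement in MySQL. A minimal sketch over JDBC, assuming a table along the lines of methods_tokens(token VARCHAR PRIMARY KEY, count INT) — the table and column names are just placeholders:

import java.sql.*;

public class TokenCounter {

    // Insert the token with count 1, or bump the count if it's already there.
    // Relies on MySQL's INSERT ... ON DUPLICATE KEY UPDATE, so the token
    // column needs to be a primary (or unique) key.
    public static void countToken(Connection conn, String token) throws SQLException {
        String sql = "INSERT INTO methods_tokens (token, count) VALUES (?, 1) "
                   + "ON DUPLICATE KEY UPDATE count = count + 1";
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            ps.setString(1, token);
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}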

At the end of all this I calculate a posterior probability for each token;
these are then used to calculate a combined probability for a chunk.
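For the curious, a Graham-style version of that per-token posterior might look like the sketch below. It assumes counts have also been collected from non-methods chunks; the unseen-token default of 0.5 and the clamping to [0.01, 0.99] are purely illustrative choices, not necessarily what the classifier actually uses.

public class Posterior {

    // Per-token posterior: roughly P(methods | token).
    // methodsCount / otherCount are how often the token appeared in the
    // methods and non-methods training chunks; methodsTotal / otherTotal
    // are the number of chunks in each class.
    public static double tokenPosterior(int methodsCount, int otherCount,
                                        int methodsTotal, int otherTotal) {
        if (methodsCount + otherCount == 0) {
            return 0.5; // never seen in training: no evidence either way
        }
        double pMethods = (double) methodsCount / methodsTotal;
        double pOther   = (double) otherCount / otherTotal;
        double posterior = pMethods / (pMethods + pOther);
        return Math.min(0.99, Math.max(0.01, posterior));
    }
}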

I used word counts instead of presence/absence because the number of tokens involved in each methods section is quite large and I felt it would help the calculation; however, this is unproven, because I haven't implemented a presence/absence version to compare against.

The database is in MySQL. Having since learnt a bit more about the Collections framework in Java, it may have been easier to use sets and maps, but automating queries is so simple that it may not have been.
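For comparison, the in-memory version of the counting step is only a few lines with a HashMap (a sketch only; whether it would actually beat the MySQL table is an open question):

import java.util.HashMap;
import java.util.Map;

public class InMemoryCounter {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Equivalent of the "check, then update or insert" database step.
    public void countToken(String token) {
        Integer current = counts.get(token);
        counts.put(token, current == null ? 1 : current + 1);
    }

    public int getCount(String token) {
        Integer c = counts.get(token);
        return c == null ? 0 : c;
    }
}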

Classification:

Get a chunk of text, the chunk being the product of splitting the document on double newlines: two \n characters, or on Windows two \r\n pairs, the regex therefore being \r\n\r\n (Windows) or \n\n (pure Java, Linux, Mac etc.).

Then replace all the symbols, punctuation and \r\b\t characters with the empty string, leaving only words and spaces in the string.

Then tokenise using " ".
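Strung together, those three steps come out as something like this sketch (the character-stripping regex here is an illustrative choice, not the exact one in the project):

public class Chunker {

    // Split a document into chunks on blank lines (Windows or Unix line ends).
    public static String[] chunk(String document) {
        return document.split("\r\n\r\n|\n\n");
    }

    // Strip everything that isn't a letter, digit or space, then tokenise
    // on spaces.
    public static String[] tokenise(String chunk) {
        String cleaned = chunk.replaceAll("[^a-zA-Z0-9 ]", "");
        return cleaned.trim().split(" +");
    }
}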

You then look up the posterior probability in the database for each word in the chunk.
You then calculate the interestingness of each token ( |0.5 - posterior_prob|, note the absolute value).
I then used the top 20 most interesting tokens to calculate the chunk probability (sketched below).
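The combination step isn't spelt out above, so here's a sketch of the usual Graham-style way of doing it: sort by interestingness, keep the top 20 posteriors p1..p20, and combine them as (p1*...*p20) / (p1*...*p20 + (1-p1)*...*(1-p20)). This is the standard formula from "A Plan for Spam"; treat the details as illustrative rather than a record of exactly what the classifier does.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ChunkScorer {

    // Combine the 20 most interesting token posteriors into one chunk
    // probability. Posteriors of exactly 0.5 carry no information and
    // simply sort to the bottom.
    public static double chunkProbability(List<Double> posteriors) {
        Double[] ps = posteriors.toArray(new Double[0]);
        Arrays.sort(ps, new Comparator<Double>() {
            public int compare(Double a, Double b) {
                // most interesting (furthest from 0.5) first
                return Double.compare(Math.abs(b - 0.5), Math.abs(a - 0.5));
            }
        });
        int n = Math.min(20, ps.length);
        double prod = 1.0, invProd = 1.0;
        for (int i = 0; i < n; i++) {
            prod *= ps[i];
            invProd *= 1.0 - ps[i];
        }
        return prod / (prod + invProd);
    }
}

Whether a chunk then counts as a methods section just depends on where you set the threshold on that combined probability.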

I also required a certain minimum number of tokens per chunk, to avoid short sections being misclassified.