Monday, February 05, 2007

ISMB submission

Last week I finished a submission to ISMB, in Vienna in June.
I managed to get it done in a week, which was good for me.

We titled it "A methodology for identifying methodologies from full text from literature"

We wanted to keep it to a description of how we did what we did, some of the many(!) technical problems we had to solve, and how successful this kind of thing can be. Clearly many elements could be improved, but that's all the more reason to get it out there. Surely there must be some proper text miners who are sick of PPIs and abstracts and want to solve some new problems. Well, I've got plenty.

The best bit about the whole article is the final figure.
This was drawn using my Java3D software, using my data and my ideas about how to present the data.
It's nice to do something all by yourself that is actually quite novel and reasonably successful in its aims.
The image describes the usage of different workflows through time (z-axis). Each workflow is represented by a shape, the diameter of which is determined by its usage in each year. Colour is determined by the first use of the workflow. The x,y clustering of the workflows is specified by a radial tree, inferred using the NJ algorithm and a distance matrix of f-measure term-similarity values.

Results of classifier

Some very cool results from the classifier.

It doesn't actually fare any better, in terms of the number of articles it finds methods text for, than simple regexes based on journal formatting structure.

However the text is so much better.

e.g.

Regex:
start match: "Material and Methods
We did this with then this with this. 5.0 for 6c of 2ul of 1 aliquot.
Results
It came out at 1.5, this is less than exp 4. Expt 1 failed.
Discussion
this is very important of yes. If we look at the results" :end match

This happens because if you use start and end regexes you get matches all over the place, and devising a good way to decide which portion actually contains the methods is very hard. So I tended to say: if it's bigger, it's more likely to contain the methods, so use that; after all, it still excludes the refs, which are the worst source of false-positive term matches.
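A minimal sketch of that bigger-is-better heuristic. The heading patterns here are hypothetical stand-ins (real journal formats vary wildly, which is the whole problem); the idea is just: for every start-heading match, find the next end-heading, and keep the longest candidate span.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MethodsRegexExtractor {
    // Hypothetical heading patterns; actual journals use many variants.
    private static final Pattern START =
            Pattern.compile("Materials? and Methods", Pattern.CASE_INSENSITIVE);
    private static final Pattern END =
            Pattern.compile("Results|Discussion|References", Pattern.CASE_INSENSITIVE);

    // For each start-heading match, take the text up to the next
    // end-heading, and return the longest such span -- on the assumption
    // that the biggest candidate most likely holds the methods.
    public static String longestMatch(String article) {
        String best = "";
        Matcher s = START.matcher(article);
        while (s.find()) {
            Matcher e = END.matcher(article);
            if (e.find(s.end())) {
                String span = article.substring(s.start(), e.start());
                if (span.length() > best.length()) best = span;
            }
        }
        return best;
    }
}
```

As the example above shows, even the "best" span can still drag in results and discussion text when the headings repeat or match loosely.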

With the Bayesian classifier you get this:
"We did this using this"
"We also used this working in conjunction with this."
and then sometimes
"This figure was drawn using this method and this software."

This is very cool: not only do you get more accurate method identification, you also get other bits of text that can be said to be "methodological in nature".
This is a very nice result.

I'm going to extend this into a full-scale article markup classifier,
with each chunk given the classification of its most probable section type.
This shouldn't be too hard with the help of the BMC articles.
Thanks BMC.

The Bayesian classifier

Obviously this post has taken some considerable time to get round to.

No, I haven't just been working on the classifier.

This post will give an update on the classifier:

I've incorporated it into the existing textminer project that I've been working on for the last year.

Training:

I extracted 1,000 methods sections from the BMC data-mining set of articles, which is fantastically useful for stuff like this: it consists of roughly 28,000 articles, all in XML.

Pity the articles don't really use all the features in the DTD. For example, the Methods section is not marked as type:methods or type:1 (where 1 = methods section in the DTD); instead you tend to get this: (sec)(title)Materials and Methods(/title)....(/sec). This is a shame really. Why can't BMC, a single publisher that is very involved with all its content (unlike PMC), ramp up the markup a bit?

So in order to get 1,000 reliable methods sections, the code had to parse through ~6,000 articles.
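Since the sections aren't typed, finding them comes down to matching on the (title) text inside each (sec). A minimal DOM-based sketch of that lookup (the tag names match the BMC example above; the "methods" substring test is my simplification of whatever heading matching the real code did):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BmcMethodsFinder {
    // Walk every <sec> element and return the text of the first one whose
    // <title> looks like a methods heading; null if the article has none.
    public static String findMethods(String xml) {
        try {
            DocumentBuilder db =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList secs = doc.getElementsByTagName("sec");
            for (int i = 0; i < secs.getLength(); i++) {
                Element sec = (Element) secs.item(i);
                NodeList titles = sec.getElementsByTagName("title");
                if (titles.getLength() > 0) {
                    String title = titles.item(0).getTextContent().toLowerCase();
                    if (title.contains("methods")) return sec.getTextContent();
                }
            }
            return null; // no recognisable methods section in this article
        } catch (Exception e) {
            return null; // malformed XML: skip the article
        }
    }
}
```

Articles returning null here are exactly why ~6,000 articles had to be parsed to get 1,000 reliable sections.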

You then chunk the article on double newline characters (described below),
remove the confounding characters,
tokenise the chunk,
check the database for the word's presence in the table,
if present, increase its count,
otherwise insert a new row.
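The training loop above can be sketched in-memory, with a HashMap standing in for the MySQL word-count table (the character-stripping regex is my guess at "remove the confounding chars"):

```java
import java.util.HashMap;
import java.util.Map;

public class MethodsWordCounter {
    // word -> count; a HashMap standing in for the MySQL table.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // One training chunk: strip confounding characters, tokenise on
    // whitespace, then increment (or insert) the count for each token.
    public void train(String chunk) {
        String cleaned = chunk.replaceAll("[^A-Za-z0-9]+", " ").toLowerCase();
        for (String token : cleaned.trim().split("\\s+")) {
            if (token.length() == 0) continue;
            Integer c = counts.get(token);
            counts.put(token, c == null ? 1 : c + 1); // insert or increment
        }
    }

    public int count(String word) {
        Integer c = counts.get(word);
        return c == null ? 0 : c;
    }
}
```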

At the end of all this I calculate a posterior probability for each token.
These are then used to calculate a combined probability.
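The post doesn't spell out the combination formula, so this is an assumption: the standard naive-Bayes combination used by spam-filter-style classifiers, P = (p1·p2·…·pn) / (p1·…·pn + (1−p1)·…·(1−pn)).

```java
public class BayesCombiner {
    // Combine per-token posterior probabilities into one chunk-level
    // probability, assuming token independence (naive Bayes):
    //   P = prod(p_i) / (prod(p_i) + prod(1 - p_i))
    public static double combine(double[] posteriors) {
        double prod = 1.0;
        double invProd = 1.0;
        for (double p : posteriors) {
            prod *= p;
            invProd *= (1.0 - p);
        }
        return prod / (prod + invProd);
    }
}
```

Note that multiplying many small probabilities like this can underflow for long token lists; summing log-odds is the usual fix.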

I used word counts instead of presence/absence because of the (quite large) number of tokens involved in each methods section, and I felt it would help the calculation; however this is unproven, because I haven't implemented a presence/absence version.

The database is in MySQL. After learning a bit more about the Collections framework in Java, it may have been easier to use sets and maps. But automating queries is so simple it may not have been.

Classification:

Get a chunk of text, the chunk being the product of a tokeniser splitting on double newline characters, or on Windows two sets of \r\n; the regex is therefore \r\n\r\n (Windows) or \n\n (pure Java, Linux, Mac, etc.).

Then replace all the symbols, punctuation and \r\b\t with null, leaving only spaces in the string.

Then tokenise using " ".

You then look up the posterior probability in the database for each word in the chunk.
You then calculate the interestingness of all tokens ( |0.5 - posterior_prob| , note the absolute value).
I then used the top 20 most interesting tokens to calculate the chunk probability.

I also required a certain number of tokens per chunk, to avoid short sections being misclassified.
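Putting the classification steps together as one sketch. TOP_N = 20 is from the post; MIN_TOKENS is an assumed value (the post only says "a certain number"), the posterior lookup is a plain Map standing in for the database, and the combination formula is again the assumed naive-Bayes one.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ChunkClassifier {
    private static final int TOP_N = 20;      // from the post
    private static final int MIN_TOKENS = 5;  // assumed minimum chunk length

    // posteriors: word -> posterior probability, as looked up from the DB.
    // Returns the combined methods-ness probability for the chunk.
    public static double classify(String chunk, Map<String, Double> posteriors) {
        // Strip symbols and punctuation, leaving only letters, digits, spaces.
        String cleaned = chunk.replaceAll("[^A-Za-z0-9]+", " ").toLowerCase();
        String[] tokens = cleaned.trim().split("\\s+");
        if (tokens.length < MIN_TOKENS) return 0.5; // too short to call

        List<Double> probs = new ArrayList<Double>();
        for (String t : tokens) {
            Double p = posteriors.get(t);
            if (p != null) probs.add(p);
        }
        // Sort by interestingness |0.5 - p|, most interesting first.
        Collections.sort(probs, new Comparator<Double>() {
            public int compare(Double a, Double b) {
                return Double.compare(Math.abs(0.5 - b), Math.abs(0.5 - a));
            }
        });
        // Combine the TOP_N most interesting posteriors (naive Bayes).
        double prod = 1.0;
        double inv = 1.0;
        for (int i = 0; i < Math.min(TOP_N, probs.size()); i++) {
            prod *= probs.get(i);
            inv *= 1.0 - probs.get(i);
        }
        return prod / (prod + inv);
    }
}
```

The MIN_TOKENS guard returns a neutral 0.5, which is one way of making short chunks unclassifiable rather than misclassified.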