Monday, February 05, 2007

Results of classifier

Some very cool results from the classifier.

It doesn't actually fare any better than simple regexes based on journal formatting structure when it comes to the number of articles it finds methods text for.

However, the text it pulls out is so much better.

e.g.

Regex:
start match"Material and Methods
We did this with then this with this. 5.0 for 6c of 2ul of 1 aliquot.
Results
It came out at 1.5, this is less than exp 4. Expt 1 failed.
Discussion
this is very important of yes. If we look at the results"end match

This happens because if you use start and end regexes you get matches all over the place, and devising a good way to decide which portion actually contains the methods is very hard. So I tended to say: if it's bigger, it's more likely to contain the methods, so use that. After all, it's still excluding the refs, which are the worst source of false positive term matches.
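A rough sketch of that "biggest match wins" heuristic, in Python. The two patterns and the sample text are made up for illustration (they are not the regexes I actually used), but they show why the match swallows Results and Discussion: the loose end pattern fires several times, and the longest span is the one that runs right up to the refs.

```python
import re

# Hypothetical, deliberately loose patterns. Real journal headings vary a lot.
START = re.compile(r"materials? and methods|methods", re.I)
END = re.compile(r"results|discussion|references", re.I)

# Invented sample article text, shaped like the example above.
SAMPLE = (
    "Material and Methods\n"
    "We did this with then this with this.\n"
    "Results\n"
    "It came out at 1.5.\n"
    "Discussion\n"
    "If we look at the results we see more.\n"
    "References\n"
    "1. Someone et al.\n"
)


def largest_span(text):
    """Pair every start match with every later end match and keep the longest span."""
    starts = [m.start() for m in START.finditer(text)]
    ends = [m.start() for m in END.finditer(text)]
    spans = [(s, e) for s in starts for e in ends if e > s]
    if not spans:
        return None
    s, e = max(spans, key=lambda span: span[1] - span[0])
    return text[s:e]


if __name__ == "__main__":
    print(largest_span(SAMPLE))
```

On the sample, the chosen span starts at the methods heading and stops just before the references, so the refs are excluded but the results and discussion come along for the ride.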

With the Bayesian classifier you get this:
"We did this using this"
"We also used this working in conjunction with this."
and then sometimes
"This figure was drawn using this method and this software."

This is very cool: not only do you get more accurate method identification, you also get other bits of text that can be said to be "methodological in nature".
This is a very nice result.
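To give a feel for sentence-level classification, here is a minimal sketch assuming a multinomial naive Bayes with add-one smoothing. That choice, the `NaiveBayes` class, the `tokens` helper and the six training sentences are all assumptions made up for this example, not the actual classifier or its training data.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lowercase the sentence and keep alphabetic words only."""
    return re.findall(r"[a-z]+", sentence.lower())


class NaiveBayes:
    """Toy multinomial naive Bayes over words (hypothetical helper class)."""

    def __init__(self):
        self.word_counts = {}          # label -> Counter of word frequencies
        self.doc_counts = Counter()    # label -> number of training sentences
        self.vocab = set()

    def train(self, label, sentence):
        words = tokens(sentence)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, sentence):
        total_docs = sum(self.doc_counts.values())
        words = tokens(sentence)
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior for the label.
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)


# Invented training sentences, two per section type.
TRAINING = [
    ("methods", "Cells were lysed using buffer and centrifuged at 4 degrees."),
    ("methods", "Samples were amplified using PCR with the following primers."),
    ("results", "Expression increased significantly compared with the control."),
    ("results", "The mutant showed a lower growth rate than wild type."),
    ("discussion", "These findings suggest a role for the gene in development."),
    ("discussion", "Our data are consistent with previous reports."),
]

classifier = NaiveBayes()
for label, sentence in TRAINING:
    classifier.train(label, sentence)

if __name__ == "__main__":
    print(classifier.classify("This figure was drawn using this software."))
```

Because each sentence is scored on its own, a methodological sentence gets picked up wherever it sits in the article, figure legends included, which is something heading-based regexes can't do.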

I'm going to extend this into a full-scale article markup classifier, with each chunk given the classification of its most probable section type.
This shouldn't be too hard with the help of the BMC articles.
Thanks BMC.