Thursday, June 07, 2007

A google web toolkit webapp to test out my web services (TextServices)



I have created a web application and a set of three WSDL/SOAP web services. The webapp allows you to try out each of the services.

The three services currently available are
  • PdfToText service: Uses an HTTP POST form to upload a PDF and returns the plain text extracted from it. It uses Base64 to encode and decode both the binary PDF file and the returned string, because the extracted text will commonly contain characters that are invalid in XML.
  • TextClassification service: You give it some text and it tells you which of the four standard article sections it is most likely to come from. You can also just type in whatever text you want. There are some amusing examples below.
  • TextRetrieval service: This service gives you a random section of text from our Open Access full-text database, based on articles from PMC and BMC. This can then be pasted into the classifier box so you can test it out.
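The Base64 round trip in the PdfToText service can be sketched in a few lines. This is just an illustration of the encode/decode idea, not the actual service code (which is SOAP-based); the byte string here is a stand-in for a real PDF:

```python
import base64

# Client side: Base64-encode the raw PDF bytes so they can be embedded
# safely in an XML/SOAP message (raw binary would break the XML).
pdf_bytes = b"%PDF-1.4 ...binary content..."  # stand-in for a real PDF file
encoded = base64.b64encode(pdf_bytes).decode("ascii")

# Server side: decode back to the original bytes before text extraction.
decoded = base64.b64decode(encoded)
assert decoded == pdf_bytes

# The same trick is applied in reverse to the extracted text, since it can
# contain characters that are invalid in XML.
```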
I tried these out on the classifier:

"This paper is not only destined to become seminal but is also highly original"
Classified as a section of type: INTRODUCTION

"We used complicated pieces of software and overly detailed protocols."
Classified as a section of type: METHODS

"As one variable went up the other one came down."
Classified as a section of type: RESULTS

"Our findings are wide-ranging and at least 10 times better than anyone else's."
Classified as a section of type: DISCUSSION

This is real output from the classifier.
If and when I'm able to make the website publicly accessible, I'm definitely going to provide these as examples.



Finishing off

To provide a good-quality error assessment of the classifier's abilities, I retrained it on 10,000 BMC sections and used this to classify 10,000 PMC sections.
These two section sets should be mutually exclusive. However, I'm not sure they are, because PMC distributes BMC content. I'll have to look into this overlap.

Anyway, here is the result:

Section       Correct  Incorrect  Precision  Recall  F-measure
INTRODUCTION     2308        839     0.7334  0.8809     0.8004
METHODS          1927        503     0.7930  0.9432     0.8616
RESULTS          1311        218     0.8574  0.8023     0.8290
DISCUSSION       2672        222     0.9233  0.7216     0.8101

Correct:   8218  (82.18%)
Incorrect: 1782  (17.82%)
Time: 724,482 ms
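As a quick sanity check on the numbers above: precision for a section is correct / (correct + incorrect), and the F-measure is the harmonic mean of precision and recall. (Recall needs the true per-section counts, which the table doesn't show, so I take the recall values as given.) Checking the INTRODUCTION row:

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall (the standard F1 score).
    return 2 * precision * recall / (precision + recall)

# INTRODUCTION row: 2308 correct, 839 incorrect
precision = 2308 / (2308 + 839)
print(round(precision, 4))                   # 0.7334, matching the table
print(round(f_measure(0.7334, 0.8809), 4))   # 0.8004, matching the table
```

The same arithmetic reproduces the other rows in both tables.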


Not bad, considering that when you train on all sections (over 300,000) you get this for 10,000 classifications:

Section       Correct  Incorrect  Precision  Recall  F-measure
INTRODUCTION     2166        804     0.7293  0.8794     0.7973
METHODS          2041        536     0.7920  0.9281     0.8547
RESULTS          1409        443     0.7608  0.8508     0.8033
DISCUSSION       2525         76     0.9708  0.6858     0.8038

Correct:   8141  (81.41%)
Incorrect: 1859  (18.59%)
Time: 585,402 ms

nice