Thursday, December 11, 2008

Google Docs Presentation

I have created a Google Docs presentation here

It's titled 'Biological basics for non-biologists'.

It's aimed at people who work with biological information a lot of the time
but who don't have any formal training in biology and want to be able to
understand what they're working with a little better.



Thursday, May 01, 2008

Using names of methods and techniques as a way to make your work 'fit in'

Consider these two recently published article titles.

"GAPscreener: An Automatic Tool for Screening Human Genetic Association Literature in PubMed Using the Support Vector Machine Technique"

"Extraction of semantic biomedical relations from text using conditional random fields"


I have no problem with the work, the conclusions they draw, or the methods they use.
But I do have a problem with the way these titles trade on the names of well-known techniques. Consider the alternative:

"Extraction of semantic biomedical relations from text, using a method which we chose because it was the most appropriate for task, but it doesn't really have a name that you will know"

Sounds rubbish, even after ignoring its extreme verbosity.

How can you present work that uses methods or techniques that are novel and well chosen, yet previously unpublished, without a specific name, and not well known in the field?

Also it does make you wonder if people use 'known' methods for tasks, even if they are not the most appropriate choice, just to ease the process of peer review and publication.

Personally, I'm not keen on fitting in just for the sake of simplicity.

Thursday, April 10, 2008

Searching code

Google code search is truly brilliant.

Useful options:
lang:java
Limits the search to Java code only; the same works for many other languages (see here).

Recently I wanted to look at some good examples of SwingWorker implementations.
Rather than doing the normal plain Google searches for "SwingWorker", "SwingWorker example", or "SwingWorker tutorial", I thought I'd try code search. It worked really well, and the best thing is you get to see the code straight away instead of having to wade through download pages. It also shows a very nice package hierarchy at the top left of any class, so you can follow the usage of classes.

I searched for "extends SwingWorker lang:java".
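For anyone who hasn't used it, a minimal SwingWorker implementation looks something like this (a made-up word-counting task of my own, not code from the search results):

```java
import javax.swing.SwingWorker;

// A minimal SwingWorker: does its work on a background thread so the
// Swing event dispatch thread (EDT) stays responsive.
public class WordCountWorker extends SwingWorker<Integer, Void> {
    private final String text;

    public WordCountWorker(String text) {
        this.text = text;
    }

    @Override
    protected Integer doInBackground() {
        // Runs on a background worker thread, not the EDT.
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    @Override
    protected void done() {
        // Runs on the EDT once doInBackground() finishes; this is
        // where you would safely update Swing components.
        try {
            System.out.println("Word count: " + get());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        WordCountWorker worker = new WordCountWorker("counting some words here");
        worker.execute();                        // starts doInBackground on a worker thread
        System.out.println(worker.get());        // get() blocks until the result is ready
    }
}
```

You call execute() from the EDT; get() blocks until the result is ready, which is why it is normally only called from done().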

Language as a complex phenotype

Just read this, an essay by Mark Pagel in Nature.
Other things I have read recently should be on the sidebar of this blog, or you can look at them here.

I thought it was an extremely well thought out essay, with some very original and creative ideas that brought to my mind other ideas I'd had in the past.

On the nature of intergenic DNA (normally called 'junk' DNA, although this name significantly underrepresents its importance in the genome):
  • I agree that it is important for phenotypic regulation, and that it must be important in developing the complexity seen in phenotypically complex organisms (humans, trees, beetles).
  • RNA may play an important role, and may be what most of this DNA is doing there (see work by John Mattick).
  • I also believe intergenic DNA has physical importance in regulation, i.e. intergenic regions create novel promoter/enhancer elements, modifying polymerase assembly/transcription factor recruitment. Imagine shapes within shapes, of promoters enhancing TATA boxes to regulate transcription factor bound enhancement of ncRNA structures that catalyse RNA cleavage.
  • Even though the previous point is entirely unproven (and mostly rubbish), you can't deny it all adds up to a whole lot of complexity.

I have to say though I didn't totally agree with some of his statements.

  • He suggests analogue measurements are less precise, when it seems to me they can be more precise: as long as you don't need to store analogue data (i.e. reduce its precision for storage, e.g. rounding errors), analogue will always be more precise.

In my opinion the genome is not encoded in an analogue way, but it can be read in one. If you use an inexact system to read the digital genome, you get an analogue result. E.g. transcription does not produce exactly 10 RNA copies of a gene; it just transcribes until the transcriptional machinery is no longer available or moves away.

Having said all of that, though, I do think that in regulatory systems the numbers/counts of molecules are important, and that concentrations often 'miss the point'. If an enzyme is at a very low concentration, there will be very few molecules of it around, and therefore the chances of it bumping into its necessary reagents/cofactors etc. are not necessarily concentration dependent.

I think language is 'the voice of our genes', just as our brains are an adaptive strategy. The brain is a truly brilliant product of evolution. It allows us to evolve our behaviour, and to some extent our bodies, within our own lifetime; this is something that 'hard-coded' behaviour/instinct and non-plastic development cannot do. The only trouble is we can't pass it on to our offspring directly; we have to use language to tell our children about our experiences, so they can improve/modify them.

Thursday, August 09, 2007

Making it easier to use TextServices programmatically

I have produced a collection of resources which make using TextServices much easier.
These things are included:
  1. A Perl client for the text classification service (only 6 lines), which can easily be extended to the other services ("TextClassificationClient.pl").
  2. A Ruby client for the text classification service, which, again, can be extended to use the other services ("TextClassificationClient.rb").
  3. A Java graphical client for all the services (class "Main"), which can be run on Mac and Windows by double-clicking the jar file "TextServicesClients.jar".
  4. Example Java clients for each of the services (in the package "textservicesclients").
  5. Java binding stubs that the clients (from point 4) make use of. These were created by WSDL2Java from Apache Axis.
  6. All the jars you need to include in your classpath to use the Java clients (in the folder "lib").
I use these clients myself and thought they might be of use to anyone who uses these services. Accessing web services through Perl and Ruby is easy; Java is much harder, but also much more powerful.

The archive file is linked to on the text services main page, or you can use this direct link.

Thursday, August 02, 2007

clippings on the subject of best practice

From: Commission for Rural Communities
URL: http://www.ruralcommunities.gov.uk/files/CRC18-DefiningBestPractice.pdf

What is best practice?
The identification, collection and dissemination of best
practice is a commonly used approach to improving practice.
However in order to identify best practice we first need to
identify and agree the key components that make it up.
The criteria listed here are the result of consultation and
discussion both with CRC staff and a wide range of external
organisations. They can be used in national, regional and
local contexts and apply equally to the development of both
policy and practice.


What defining best practice is not
The identification of the key elements of best practice is not
an attempt to standardise the development of policy and
practice. In rural areas the one size fits all response does not
work. Approaches need to be tailored to fit local needs and
circumstances. Diversity is necessary both to meet current
needs and changing future conditions.
We do believe however that all best practice – whether it is a
process such as the development of national or regional
policy frameworks, or practice such as the setting up of a
community based project - shares some common features. It
is these features that we have tried to identify.


Definition of best practice
• Delivers effective, identifiable outcomes, meeting
identified needs or filling gaps in provision.
• Makes good use of scarce resources such as finance,
property, skills.
• Reflects local circumstances and conditions.
• Self evaluates, considers and learns from previous
examples and experience.
• Is flexible and can adapt to changing needs and
circumstances.
• Is creative in its approach to problem solving.
• Provides transferable models/blueprints for others to
follow, without over-reliance on exceptional individuals.
• Shows long term sustainability and viability.
• Demonstrates cross sector and partnership working.
• Is inclusive and consultative.

Sunday, July 01, 2007

Services Available to test

The three text web services and the testing webpage are now available at

http://130.88.90.134:8080/TextServices/

This page has a link to the extraordinarily simple user guide, links to the 3 WSDL documents for the web services, and links to 3 Taverna workflows that you can use to invoke the services.

If you want to submit a PDF to the SOAP interface of the text extraction service, it must be submitted as a String of the binary data encoded as Base64.
I use the commons-codec Java class Base64 for this. It is very simple to use.
Or you can use the Taverna base64 encoders and decoders; these work fine as well and don't require any programming.
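As a sketch of what that encoding step looks like (using the JDK's built-in java.util.Base64 from Java 8+ here rather than commons-codec; both produce standard Base64, and the stand-in byte array is mine, not a real PDF):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        // In practice these bytes would come from reading the PDF file;
        // a short stand-in byte array is used here.
        byte[] pdfBytes = "%PDF-1.4 example content".getBytes(StandardCharsets.UTF_8);

        // Encode the binary data to a plain ASCII string that is safe
        // to carry inside a SOAP/XML message.
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        System.out.println(encoded);

        // The service side decodes it back to the original bytes.
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(new String(decoded, StandardCharsets.UTF_8)); // prints the original text back
    }
}
```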

Thursday, June 07, 2007

A Google Web Toolkit webapp to test out my web services (TextServices)



I have created a web application and set of 3 WSDL/SOAP web services. The webapp allows you to try out each of the services.

The three services currently available are:
  • PdfToText service: uses an HTTP POST form to upload a PDF and returns the plain text extracted from it. It uses Base64 to encode and decode both the binary PDF file and the returned string, because the string will commonly carry characters that are invalid in XML.
  • TextClassification service: you give it some text and it tells you which of the four standard article sections it is most likely to come from. You can also just type in whatever text you want. There are some amusing examples below.
  • TextRetrieval service: gives you a random section of text from our Open Access full-text database, based on articles from PMC and BMC. This can then be pasted into the classifier box so you can test it out.
I tried these out on the classifier:

"This paper is not only destined to become seminal but is also highly original"
Classified as a section of type: INTRODUCTION

"We used complicated pieces of software and overly detailed protocols."
Classified as a section of type: METHODS

"As one variable went up the other one came down."
Classified as a section of type: RESULTS

"Our findings are wide-ranging and at least 10 times better the anyone elses."
Classified as a section of type: DISCUSSION

This is real output from the classifier.
When and if I'm able to get the website publicly accessible, I'm definitely going to provide these as examples.



Finishing off

In order to provide a good-quality error assessment of the classifier's abilities, I've retrained it on 10,000 BMC sections and used this to classify 10,000 PMC sections.
These two section sets should be mutually exclusive. However, I'm not sure they are, because PMC distributes BMC content. I'll have to look into this overlap.

Anyway, here is the result:

Section       Correct  Incorrect  Precision  Recall  F-measure
INTRODUCTION     2308        839     0.7334  0.8809     0.8004
METHODS          1927        503     0.7930  0.9432     0.8616
RESULTS          1311        218     0.8574  0.8023     0.8290
DISCUSSION       2672        222     0.9233  0.7216     0.8101

Correct:   8218 (82.18%)
Incorrect: 1782 (17.82%)
Time taken: 724482 ms
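The F-measure column is just the harmonic mean of precision and recall; here is a quick sketch checking the INTRODUCTION row above (the class and method names are mine, not from the classifier code):

```java
public class FMeasureCheck {
    // F-measure (F1) is the harmonic mean of precision and recall.
    public static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // INTRODUCTION row from the table above.
        System.out.printf("%.4f%n", f1(0.7334, 0.8809)); // prints 0.8004
    }
}
```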


Not bad, considering that when you train on all sections (over 300,000) you get this for 10,000 classifications:

Section       Correct  Incorrect  Precision  Recall  F-measure
INTRODUCTION     2166        804     0.7293  0.8794     0.7973
METHODS          2041        536     0.7920  0.9281     0.8547
RESULTS          1409        443     0.7608  0.8508     0.8033
DISCUSSION       2525         76     0.9708  0.6858     0.8038

Correct:   8141 (81.41%)
Incorrect: 1859 (18.59%)
Time taken: 585402 ms

nice