Tuesday, September 19, 2006

Naive Bayesian Classifier

More on this soon

Hoping to classify paragraphs from scientific articles into their sections.

http://www.ddj.com/184406064

Good explanation of the basic calculations, especially of how to combine the probabilities of individual words into a single document probability.
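
A minimal sketch of that combination step in log space (my own toy Java, not the DDJ article's code; the smoothing value and the per-class word probabilities are assumptions):

    import java.util.Map;

    // Naive Bayes: combine per-word probabilities into a single document score.
    // Multiplying many small probabilities underflows, so sum logs instead.
    public class NaiveBayesScore {
        // wordProb holds p(word|class) estimates, e.g. smoothed counts
        // from training paragraphs.
        static double logScore(String[] docWords, double classPrior,
                               Map<String, Double> wordProb) {
            double score = Math.log(classPrior);
            for (String w : docWords) {
                // unseen words fall back to a small smoothing probability
                score += Math.log(wordProb.getOrDefault(w, 1e-6));
            }
            return score;
        }
    }

Classifying a paragraph would then mean computing this for each candidate section (introduction, methods, results, ...) and picking the class with the highest score.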

Look at: Survey of Text Mining: Clustering, Classification, and Retrieval, Michael W. Berry (editor).


Wednesday, August 16, 2006

Information on JCR metrics

Total Cites

The total number of times that a journal has been cited by all journals included in the database in the JCR year.

Citations to journals listed in JCR are compiled annually from the JCR year’s combined database, regardless of which JCR edition lists the journal and regardless of what kind of article was cited or when the cited article was published. Each unique article-to-article link is counted as a citation.

Citations from a journal to an article previously published in the same journal are compiled in the total cites. However, some journals listed in JCR may be cited-only journals, in which case self-cites are not included.

The journal impact factor is the average number of times articles from the journal published in the past two years have been cited in the JCR year.

The impact factor is calculated by dividing the number of citations that articles published in the two previous years received in the JCR year by the total number of articles published in those two years. An impact factor of 1.0 means that, on average, the articles published one or two years ago have been cited one time. An impact factor of 2.5 means that, on average, the articles published one or two years ago have been cited two and a half times. Citing articles may be from the same journal; most citing articles are from different journals.
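
A worked example with made-up numbers: suppose a journal published 40 articles in 2004 and 60 in 2005, and those 100 articles were cited 250 times during 2006. Its 2006 impact factor is then

    250 / (40 + 60) = 2.5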

The aggregate impact factor for a subject category is calculated the same way as the impact factor for a journal, but it takes into account the number of citations to all journals in the category and the number of articles from all journals in the category. An aggregate impact factor of 1.0 means that, on average, the articles in the subject category published one or two years ago have been cited one time. The median impact factor is the median value of all journal impact factors in the subject category.

The impact factor mitigates the importance of absolute citation frequencies. It tends to discount the advantage of large journals over small journals because large journals produce a larger body of citable literature. For the same reason, it tends to discount the advantage of frequently issued journals over less frequently issued ones and of older journals over newer ones. Because the journal impact factor offsets the advantages of size and age, it is a valuable tool for journal evaluation.

The impact factor trend graph shows the impact factor for a five-year period. To view the graph, click the Impact Factor Trend button at the top of the journal page.

The immediacy index is the average number of times an article is cited in the year it is published. The journal immediacy index indicates how quickly articles in a journal are cited. The aggregate immediacy index indicates how quickly articles in a subject category are cited.

The immediacy index is calculated by dividing the number of citations to articles published in a given year by the number of articles published in that year.
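
Again with made-up numbers: a journal that published 50 articles in 2006, and whose 2006 articles were cited 30 times during 2006 itself, has a 2006 immediacy index of 30 / 50 = 0.6.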

Because it is a per-article average, the immediacy index tends to discount the advantage of large journals over small ones. However, frequently issued journals may have an advantage because an article published early in the year has a better chance of being cited than one published later in the year. Many publications that publish infrequently or late in the year have low immediacy indexes.

For comparing journals specializing in cutting-edge research, the immediacy index can provide a useful perspective.

Journal Citing Half-Life

The citing half-life is the median age of articles cited by the journal in the JCR year. For example, in JCR 2003, the journal Food Biotechnology has a citing half-life of 9.0. That means that 50% of all articles cited by articles in Food Biotechnology in 2003 were published between 1995 and 2003 (inclusive).

Only journals that cite 100 or more references have a citing half-life; journals that cite fewer do not.


The aggregate citing half-life is calculated the same way as the journal citing half-life, and its significance is comparable. For a subject category, the citing half-life is the median age of articles cited by journals in the category in the JCR year.

For example, in JCR 2003, the subject category Geochemistry & Geophysics has a citing half-life of 9.9. That means that 50% of all articles cited by articles in Geochemistry & Geophysics journals in 2003 were published between 1994 and 2003 (inclusive).



Friday, June 16, 2006

Some data from a preliminary test data set

This graph is the product of some queries against a test version of the text-mining database.
The test database has about 1000 articles in it, spread across several years.

[graph: normalised per-year counts of phylogenetic-method mentions]
The data has been normalised to account for the number of articles in the database per year.
There seems to be a downward trend in most of the traditional methods of phylogenetic inference,
e.g. neighbor-joining (neighbour-joining), maximum parsimony, maximum likelihood,
and yet a quite obvious upturn in the use of Bayesian inference of phylogeny.
This would certainly fit with recent trends in phylogenetics.
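
(A made-up example of the normalisation: if "likelihood" is matched in 30 of the 150 articles from 2004, it gets the value 30 / 150 = 0.2 for that year.)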

Wednesday, May 10, 2006

An ontology for phylogenetics

Here is a very weird ontology that I am using for the entity recognition and markup process in my text-mining system.

An example is below.

[example image not preserved]
The full ontology can be downloaded and viewed from here.


That is all.

Monday, April 10, 2006

Things to investigate from Jena

SVMs, particularly hyperplanes through space
Look at morpho-guessing
Ontology to NER, classes/groups
java method 3Dto2D
http://tinyurl.com/p5sr7
http://tinyurl.com/pkf8o
http://tinyurl.com/m43t2

Recommendations from David

Still to fill in

Tuesday, March 28, 2006

Here is also an interesting bit of info

This is something I wasn't really aware of.
In a scan through ~200 Journal of Virology papers with the search "phylogenetic*" in abstract or title, maximum likelihood was the most popular tree-inference method, closely followed by neighbor-joining, and the Jukes-Cantor model of nucleotide substitution was the most popular model from the list screened (see below). Also, the command-line-only ClustalW was much more popular than the GUI-based, user-friendly ClustalX.

Search pattern             Count
--------------------------------
neighbor$joining             137
parsimony                     94
likelihood                   156
bayesian                      26
upgma                          9
p-distance                     5
jukes-cantor                  21
kimura$2-parameter             8
kimura$3-parameter             0
tamura-nei                    20
f81                            3
hky                           13
general$time-reversible       14
dayhoff                       13
jtt                            6
wag                            2
modeltest                     17
model$of$nucleotide           17
model$of$protein               3
clustal$x                     36
clustal$w                     84
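
The counting itself is simple enough. A rough Java sketch (the papers/ directory layout is hypothetical, and I am assuming "$" in the patterns above stands for a space or hyphen between words):

    import java.nio.file.*;
    import java.util.*;
    import java.util.regex.Pattern;

    // Count how many papers mention each search pattern at least once,
    // assuming one plain-text paper per file in a papers/ directory.
    public class TermCounter {
        public static void main(String[] args) throws Exception {
            String[] patterns = { "neighbor$joining", "parsimony", "likelihood", "bayesian" };
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String p : patterns) counts.put(p, 0);

            for (Path paper : Files.newDirectoryStream(Paths.get("papers"))) {
                String text = Files.readString(paper).toLowerCase();
                for (String p : patterns) {
                    // "$" in a pattern matches a space or a hyphen
                    String regex = String.join("[ -]", p.split("\\$"));
                    if (Pattern.compile(regex).matcher(text).find())
                        counts.merge(p, 1, Integer::sum);
                }
            }
            counts.forEach((p, n) -> System.out.println(p + "\t" + n));
        }
    }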

An update on what is happening

I have moved very strongly into the world of text mining.
Essentially we are looking to get an idea of how people do their phylogenetics
by extracting it from papers. I'm doing pretty well with the extracting side of things, but how to visualise the data isn't so straightforward. Should it be a network and ontology, or just a chart? And what is the best way to visualise these?



Thursday, January 12, 2006

Another post from Flock

http://www.nag.co.uk/welcome_iec.asp



IRIS Explorer location

Consensus trees and XML models

Read a lot of Felsenstein's work on consensus trees.

In M_l consensus tree methods, "l" varies from 0% to 100% and is the threshold used to decide whether a group should be collapsed. E.g. with l = 50%, all groups that occur in more than 50% of the trees will be drawn in the final consensus tree. 50% is the value used in the Majority Rule Consensus Tree (MRCT) method; obviously, because anything over 50% has to be in the majority.
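
A toy sketch of the counting step (my own code; it assumes the clades have already been extracted from each tree as sets of taxon names):

    import java.util.*;

    // Majority-rule filter at the heart of an M_l consensus: keep the
    // clades that occur in more than a fraction l of the input trees.
    public class MajorityRule {
        static Set<Set<String>> consensusClades(List<Set<Set<String>>> trees, double l) {
            Map<Set<String>, Integer> tally = new HashMap<>();
            for (Set<Set<String>> cladesOfOneTree : trees)
                for (Set<String> clade : cladesOfOneTree)
                    tally.merge(clade, 1, Integer::sum);

            Set<Set<String>> kept = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : tally.entrySet())
                if (e.getValue() > l * trees.size()) // strict majority when l = 0.5
                    kept.add(e.getKey());
            return kept;
        }
    }

With l = 0.5 this keeps exactly the majority-rule groups.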

Still working on the literature review. Not sure how much I've written, but there are definitely some very core sections still to write.

Wanted to try out tree-like data structures in Java today. The thought occurred to me to create Java data structures representing all the objects from phylogenetics. But that would tie everything to Java, which makes it a bad idea. A better idea was to create XML schemas or models (or whatever the word is) for each entity within the great phylogenetic ontology.

I should try this out. I might try it for a DNA sequence.
I need to think about what elements a sequence needs: what are the elements, what are they called, what type are they, and what data do they contain.

draft DNA sequence XML schema

ID      TYPE     DATA
---------------------------------------------
NAME    STRING   GI345642
LENGTH  INTEGER  20
DNA     STRING   CATCGTCGATCTCGATCGAT
---------------------------------------------
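
As a sketch, an instance document for that schema might look like this (the element names are just my guesses from the table above):

    <dnaSequence>
      <name>GI345642</name>
      <length>20</length>
      <dna>CATCGTCGATCTCGATCGAT</dna>
    </dnaSequence>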

Transferring these between programs would make things a lot easier.

You would still need a parser for each program, to take its output data and convert it into this format (and back again).

If we just surround existing software with a transparent data-transfer service, we need a million interfaces to interconvert data: with N different formats, that is on the order of N^2 pairwise converters. LOTS of boring text parsing. A waste of time.

We need a core set of components that all talk to each other using a standard defined ontology.

People would only use a system of this type if it had all the features they already use, so rather than trying to provide everything for them, tools should be made available that can plug custom programs into the system. Or, at a stretch, tell people how to rewrite their software to work with the system, but that probably wouldn't work.

I should provide custom versions of the most popular software, e.g. BLAST, ClustalW, PHYLIP.

Big job really, but perhaps a toy system could work. I could use it to link units together, each unit having inputs and outputs. A bit like IRIS Explorer.