Wednesday, April 11, 2007

Parsing XML into a DOM

When you have a DOM of an XML document, life becomes easier. You can use XPath expressions to extract single nodes or lists of nodes that you can then operate on. The problem with the DOM, however, is that the whole document needs to be parsed into a tree of nodes before you can query it, and this is often very inefficient.

I mostly use the DOM for XML parsing, and I create it with the Xerces parser. I'm sure there are faster parsers out there, but at the moment I can't be bothered finding and downloading one. Now that I am processing thousands of BMC and PMC open access articles, though, the DOM parsing time is beginning to become a problem. Because these documents tend to be highly structured (see here for BMC and here for PMC DTD and XML markup info), you really do need XPath expressions, and therefore a DOM of each document. I could write a custom SAX parser, but I really do feel this would be a waste of time.

As an example, my computer, which has a 2.2 GHz Athlon, takes roughly 3 hours to parse the roughly 50,000 XML documents available from BMC and PMC and extract the full text, section text and article metadata from them. That works out to roughly 0.2 seconds per document.
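To make the parse-then-query pattern concrete, here is a minimal sketch. It is not my actual code: it uses Python's standard library ElementTree, which supports only a subset of XPath, rather than Xerces, and the element names (article, front, body, sec, title, p) are hypothetical stand-ins rather than the real BMC or PMC markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; the real BMC and PMC DTDs differ.
SAMPLE = """<article>
  <front><title>Example article</title></front>
  <body>
    <sec><title>Background</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p><p>Third paragraph.</p></sec>
  </body>
</article>"""


def extract(xml_text):
    """Return (article title, {section heading: section text})."""
    # The whole document is parsed into an in-memory tree before any
    # query can run; this is the up-front cost of the tree approach.
    root = ET.fromstring(xml_text)

    # A path expression that selects a single node's text.
    title = root.findtext("./front/title")

    # A path expression that selects a list of nodes to operate on.
    sections = {}
    for sec in root.findall("./body/sec"):
        heading = sec.findtext("title")
        paragraphs = ["".join(p.itertext()) for p in sec.findall("p")]
        sections[heading] = " ".join(paragraphs)
    return title, sections


title, sections = extract(SAMPLE)
print(title)
print(sections)
```

The queries themselves are one-liners, which is why the tree approach is so convenient for highly structured articles. The price is that every document is fully parsed first, even when only a few nodes are needed.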