Jez Higgins

Freelance software generalist
software created
extended or repaired


Older posts are available in the archive or through tags.

Feed

Follow me on Twitter
My code on GitHub

Contact
About

Friday 11 November 2005 Arabica: Dual Streaming/DOM XML processing

In one corner you have an XML instance. What you'd like in the other corner is a nice object model of your own devising, built and populated from the XML. Easy, right?

Write a SAX ContentHandler, or set of ContentHandlers, to build the objects as you go. That would work.
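To make the SAX option concrete, here is a minimal sketch of a handler that builds objects as the events arrive. The ContentHandler interface below is a cut-down stand-in written for this example, not Arabica's real one (which also deals with namespaces and attributes), and Person and PersonBuilder are invented names for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified SAX-style interface. A real ContentHandler
// also receives namespace URIs, qualified names and attributes.
struct ContentHandler {
    virtual ~ContentHandler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Illustrative domain object, assumed for this sketch.
struct Person {
    std::string name;
    std::string email;  // optional in the XML, so it may stay empty
};

// Builds Person objects as the events stream past, keeping only the
// object under construction and the current text run in memory.
class PersonBuilder : public ContentHandler {
public:
    void startElement(const std::string& name) override {
        if (name == "person") current_ = Person{};
        text_.clear();
    }

    void endElement(const std::string& name) override {
        if (name == "name") current_.name = text_;
        else if (name == "email") current_.email = text_;
        else if (name == "person") people_.push_back(current_);
        text_.clear();
    }

    // Text can arrive in several chunks, so accumulate it.
    void characters(const std::string& text) override { text_ += text; }

    const std::vector<Person>& people() const { return people_; }

private:
    Person current_;
    std::string text_;
    std::vector<Person> people_;
};
```

The handler has to carry its own state between events, which is manageable here but gets awkward quickly once optional elements start deciding which objects to build.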

Build a DOM from the XML and work it over with some XPath queries. The nodes you get back tell what objects to build. That would work, too.

Now consider an XML instance several megabytes long. An instance with lots of optional elements. An instance where the presence or absence of an element, or set of elements, might determine what objects you need to create. The size of the instance tends towards a SAX approach, while its relative complexity tends towards DOM+XPath. In fact, something in between might be more useful - dual streaming and DOM mode.

Of the existing XML processing libraries that I'm aware of, the one that provides the best support for this kind of thing is Elliotte Harold's XOM Java library. This is hardly surprising, as Harold is one of the few guys writing XML libraries who really seems to both care and think. His books are good too. I'd recommend both his XML In A Nutshell and Effective XML as indispensable for the jobbing XML slinger.

Dual mode XML processing allows you to operate on individual nodes as the DOM is being built. They can be modified or queried in-flight, as it were. XOM uses a node factory class, and implements dual mode by allowing you to substitute an alternative factory. We could do this in Arabica, but my uberuser Terris has an alternative approach using Boost::Function callbacks which looks much more C++ish.
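As a rough illustration of the callback idea, the sketch below builds a tree from streaming events and hands each element to a callback the moment it is complete. It is not Terris's code or anything in Arabica: Element and CallbackTreeBuilder are hypothetical names, and std::function stands in for Boost::Function. The assumption here is that the callback returns false to have the finished element dropped rather than attached to the tree.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal element node, not one of Arabica's DOM classes.
struct Element {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Element>> children;
};

// Builds a tree from streaming events and reports every element to a
// callback as soon as its end tag is seen.
class CallbackTreeBuilder {
public:
    // Return true to keep the element in the tree, false to drop it.
    using ElementCallback = std::function<bool(const Element&)>;

    explicit CallbackTreeBuilder(ElementCallback onComplete)
        : onComplete_(std::move(onComplete)) {}

    void startElement(const std::string& name) {
        std::unique_ptr<Element> node(new Element());
        node->name = name;
        open_.push_back(std::move(node));
    }

    void characters(const std::string& text) {
        if (!open_.empty()) open_.back()->text += text;
    }

    void endElement() {
        if (open_.empty()) return;
        std::unique_ptr<Element> done = std::move(open_.back());
        open_.pop_back();

        // The element and its whole subtree are complete at this point.
        const bool keep = !onComplete_ || onComplete_(*done);
        if (!keep) return;  // 'done' is destroyed here, freeing the subtree

        if (open_.empty()) root_ = std::move(done);
        else open_.back()->children.push_back(std::move(done));
    }

    const Element* root() const { return root_.get(); }

private:
    ElementCallback onComplete_;
    std::vector<std::unique_ptr<Element>> open_;  // elements still being built
    std::unique_ptr<Element> root_;
};
```

A callback that builds your own object from each completed record element and then returns false keeps memory use roughly proportional to one record rather than the whole instance. That only works because this toy tree lets a node live and die independently of any document.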

Callbacks are one thing, but there's still the memory issue. XOM has its own tree model, which allows nodes to exist separately from their document. This allows nodes to be trimmed out and discarded. The DOM, which Arabica implements, doesn't. Any node is intimately tied to the document. Even nodes which have been removed (through replaceChild or removeChild) cannot be deleted, and must be held in memory until the entire document is discarded. (Basically, without adding a reference count to every node, there's no way to know if a removed node is still needed or not. The memory overhead of the DOM is large enough as it is ...) To overcome this and allow removed nodes to be *really* removed, the DOM API needs to be extended with something like NodeT::discard() or DocumentT::prune(NodeT node). Something like that, anyway.
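To show the difference between removing a node and really removing it, here is a toy model where the document owns every node it creates. Nothing in it is Arabica's implementation: Node and Document are simplified stand-ins, and prune() is just one guess at what the proposed DocumentT::prune might do.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified node. The owning Document controls its lifetime.
struct Node {
    std::string name;
    Node* parent = nullptr;
    std::vector<Node*> children;
};

class Document {
public:
    Node* createElement(const std::string& name) {
        owned_.push_back(std::unique_ptr<Node>(new Node()));
        owned_.back()->name = name;
        return owned_.back().get();
    }

    void appendChild(Node* parent, Node* child) {
        child->parent = parent;
        parent->children.push_back(child);
    }

    // DOM-style removal: the node leaves the tree, but the document still
    // owns it, so its memory is held until the document itself goes away.
    void removeChild(Node* parent, Node* child) {
        std::vector<Node*>& kids = parent->children;
        kids.erase(std::remove(kids.begin(), kids.end(), child), kids.end());
        child->parent = nullptr;
    }

    // Assumed extension: detach the node if necessary, then free it and
    // everything beneath it.
    void prune(Node* node) {
        if (node->parent) removeChild(node->parent, node);
        std::vector<Node*> doomed;
        collect(node, doomed);
        owned_.erase(
            std::remove_if(owned_.begin(), owned_.end(),
                           [&doomed](const std::unique_ptr<Node>& n) {
                               return std::find(doomed.begin(), doomed.end(),
                                                n.get()) != doomed.end();
                           }),
            owned_.end());
    }

    std::size_t liveNodes() const { return owned_.size(); }

private:
    static void collect(Node* node, std::vector<Node*>& out) {
        out.push_back(node);
        for (Node* child : node->children) collect(child, out);
    }

    std::vector<std::unique_ptr<Node>> owned_;
};
```

After removeChild the node count is unchanged, and only prune brings it down. The catch is the one noted above: with no reference count, any pointer still held to a pruned node is left dangling, so the caller has to be sure the subtree is finished with.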

So that's where we're going in the next few days, I hope. Watch CVS for code excitement.


Tagged code, arabica, xml, and c++

