Recently, I stumbled upon a paper from the early days of XML:
"REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron
It describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression.
We've all written short regexes to find some particular feature in an XML document, and we've all seen them fail on oddities of whitespace, quoting, line breaks, and so on that are perfectly legal XML but hard to account for in a short, line-by-line regular expression.
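To give a flavor of the technique, here's a toy shallow-parsing regex in Python. This is *not* Cameron's grammar, which is far more careful (it handles quoted attribute values containing `>`, among other things); it's a minimal sketch of the core idea that a single alternation can classify every byte of a document as exactly one token.

```python
import re

# A deliberately simplified shallow-parsing regex, NOT Cameron's full REX
# grammar. Each alternative matches one complete token; order matters
# (comments must come before the generic "<!...>" markup alternative so
# that a "--" or ">" inside a comment doesn't end the token early).
TOKEN = re.compile(
    r"<!--.*?-->"            # comment
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<[!?][^>]*>"          # DTD declaration or processing instruction
    r"|</[^>]*>"             # end tag
    r"|<[^>]*>"              # start tag or empty-element tag
    r"|[^<]+",               # run of plain text
    re.DOTALL,
)

def shallow_tokens(xml):
    """Split a document into markup and text tokens, losing nothing."""
    return TOKEN.findall(xml)
```

For example, `shallow_tokens('<a href="x">hi</a>')` returns `['<a href="x">', 'hi', '</a>']`. Note the limitation of this toy version: an attribute value containing a literal `>` would truncate the tag token, which is exactly the kind of corner case Cameron's full expression exists to handle.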
Standard XML parsers, like expat, are fabulous, well maintained, and efficient. However, they have a common achilles heel: the XML standard's insistence that XML processors "MUST" report a fatal error if a document contains unbalanced tags. For working with HTML or SGML based documents, this is disastrous!
In contrast, Cameron's regex-based parser is extremely fault-tolerant--it extracts as much structure from the document as it can, and reports the rest as plain text. Further, it supports "round-tripping": the ability to regenerate a document exactly from parser output, which standard parsers typically lack. As a corollary of this property, it becomes possible to report absolute byte offsets, which is a "killer feature" for the purposes of indexing.
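Why round-tripping falls out for free: since every byte of the input lands in exactly one token, concatenating the tokens reproduces the document verbatim, and each match's start position is an absolute offset. A small sketch, again with a simplified token regex rather than the real grammar:

```python
import re

# Simplified token regex for illustration (see Cameron's paper for the
# full, robust grammar): comments, then any other markup, then text.
TOKEN = re.compile(r"<!--.*?-->|<[!?/]?[^>]*>|[^<]+", re.DOTALL)

def tokens_with_offsets(xml):
    """Yield (absolute_offset, token) pairs covering the whole input."""
    for m in TOKEN.finditer(xml):
        yield m.start(), m.group()

doc = '<p>hello <!-- note --> world</p>'
pairs = list(tokens_with_offsets(doc))
# Round-tripping: the concatenated tokens are exactly the input.
assert ''.join(tok for _, tok in pairs) == doc
```

A DOM or SAX parser normalizes away whitespace, entity spellings, and attribute order, so it can't make this guarantee; a shallow tokenizer never interprets the bytes, only partitions them.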
Because of all these benefits, I've opted to translate his source code from JavaScript to Python. I call my modified implementation "shlax" [pronounced like "shellacs", sort of], a shallow, lazy XML parser. "Shallow" meaning that it doesn't check for well-formedness, and simply reports tokens, offsets, and attributes as best it can. "Lazy" meaning that it iterates over the input and yields one object at a time--so you don't have to write 8 asynchronous event handlers to use it, as in a typical SAX-style parser. This is often called a "pull" parser, but "shpux" doesn't sound as good, does it?
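The lazy, buffered style can be sketched as a generator over a file handle. The function name `pull_tokens` and the buffering strategy here are illustrative, not shlax's actual API: the idea is to read the input in chunks and hold back any token that touches the end of the buffer, since it might be the prefix of a longer token whose remainder hasn't arrived yet.

```python
import io
import re

# Simplified token regex for illustration, not the full REX grammar.
TOKEN = re.compile(r"<!--.*?-->|<[!?/]?[^>]*>|[^<]+", re.DOTALL)

def pull_tokens(fh, bufsize=65536):
    """Lazily yield shallow tokens from a file handle, reading in chunks.

    A token ending flush against the buffer's end is withheld until more
    input arrives, so no token is ever split across a chunk boundary.
    (A dangling "<" with no closing ">" at EOF is dropped by this toy
    version; the real parser would report it as text.)
    """
    buf = ''
    while True:
        chunk = fh.read(bufsize)
        buf += chunk
        last = 0
        for m in TOKEN.finditer(buf):
            if m.end() == len(buf) and chunk:
                break  # possibly incomplete: wait for more input
            yield m.group()
            last = m.end()
        buf = buf[last:]  # keep only the unconsumed tail
        if not chunk:
            break  # EOF reached and everything emitted

doc = '<root><p>hello</p><!-- c --></root>'
toks = list(pull_tokens(io.StringIO(doc), bufsize=4))
assert ''.join(toks) == doc  # nothing lost across chunk boundaries
```

Because the generator holds only one buffer's worth of text at a time, memory use stays flat no matter how large the file is--which is the point of iterating over a file handle rather than slurping the whole document into a string.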
If you're interested, you can look at the source in the libphilo GitHub repo. The regular expression itself is built up over the course of about 30 sub-expressions, for the sake of maintainability and readability. I've made some further modifications to Cameron's code to fit our typical workflow. I've buffered the text input, which lets us iterate over a file handle rather than a string--this saves vast amounts of memory when processing large XML files in particular. And instead of plain strings, I return "node" objects that carry several useful pieces of information:
- the original text content of the node
- the "type" of the node: text, StartTag, EndTag, or Markup [for DTDs, comments, etc.]
- any attributes the node has
- the absolute byte offset in the string or file
You don't need anything more than that to power PhiloLogic. If you'd like to see an example of how to use it, take a look at my DirtyParser class, which takes as input a set of XPaths to recognize for containers and metadata, and outputs a set of objects suitable for the index builder I wrote about last time.
Oh, and about performance: shlax is noticeably slower than Mark's Perl loader. I've tried to mitigate that in a variety of ways, but in general Python's regex engine is not as fast as Perl's. On the other hand, I've recently had a lot of success running a load in parallel on an 8-core machine, which I'll write about when the code settles. That said, if efficiency is a concern, the best option remains well-formed XML with a standard parser.
So, my major development push now is to refactor the loader into a framework that can handle multiple parser backends, flexible metadata recognizers, and multiple simultaneous parser processes. I'll be posting about that as soon as it's ready.