ARTFL Project Research Blog

Richard, Wednesday, March 30, 2011

I've just pushed a few commits to the central philo4 repository: mostly small bugfixes to the makefile and the parser, but I also added a convenience method to the shlax XML parser.

As you may know, Python has a really nice XML library called ElementTree, but it has a few quirks:

1) it uses standard, "fussy" XML parsers that choke on the slightest flaw, and

2) it has a formally correct but incomprehensible approach to namespaces that is exceedingly impractical for day-to-day TEI hacking.
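
To make both quirks concrete, here is a small sketch that uses only the standard library's ElementTree, not shlax. The two sample documents are made up for illustration; the namespace URI is the usual TEI one.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Quirk 1: one mismatched tag is enough to make the strict parser give up.
broken = "<TEI><teiHeader><title>Candide</teiHeader></TEI>"
try:
    ET.fromstring(broken)
    strict_parse_ok = True
except ET.ParseError:
    strict_parse_ok = False

# Quirk 2: once a default namespace is declared, bare tag names stop matching
# and every lookup has to spell out the full "{uri}name" form.
namespaced = (
    '<TEI xmlns="%s"><teiHeader><title>Candide</title></teiHeader></TEI>' % TEI_NS
)
root = ET.fromstring(namespaced)
plain_lookup = root.find("teiHeader")                   # finds nothing
qualified_lookup = root.find("{%s}teiHeader" % TEI_NS)  # finds the header

print(strict_parse_ok)
print(plain_lookup is None)
print(qualified_lookup.tag)
```

Multiply the second case by every path in a real extraction script and the appeal of a namespace-agnostic parser becomes obvious.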

In this update, I've added a shlaxtree module to the philo4 distribution that hooks our fault-tolerant, namespace-agnostic XML parser up to ElementTree's XPath evaluator and serialization facilities. It generally prefers the 1.3 version of ElementTree, which is standard in Python 2.7 but is a simple install in 2.6 and 2.5.

Basically, the method philologic.shlaxtree.parse() takes a file object and returns the root node of the XML document in the file, assuming it found one. You can use this to make a simple bibliographic extractor like so:


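#!/usr/bin/env python
# Print the TEI header, title, and author of every file named on the command line.
import philologic.shlaxtree as st
import sys
import codecs

for filename in sys.argv[1:]:
    # Named tei_file rather than "file" so the built-in is not shadowed.
    tei_file = codecs.open(filename, "r", "utf-8")
    root = st.parse(tei_file)
    header = root.find("teiHeader")
    # Single-argument print() behaves the same as the print statement in Python 2.
    print(st.et.tostring(header))
    print(header.findtext(".//titleStmt/title"))
    print(header.findtext(".//titleStmt/author"))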

Not bad for 10 lines, no? What's really cool is that you can modify trees, nodes, and fragments before writing them out, with neat recursive functions and whatnot. I've been using it for converting old SGML dictionaries to TEI. Once you get the hang of it, it's much easier than regular expressions, and much easier to maintain and modify as well.
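
As a rough illustration of the recursive approach, here is a minimal sketch that renames legacy tags throughout a tree. It uses the standard library's ElementTree parser so that it runs on its own, on the assumption that nodes returned by shlaxtree can be handled the same way. The tag mapping and the sample entry are hypothetical, not taken from any real dictionary conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from legacy dictionary tags to TEI element names.
TAG_MAP = {"ENTRY": "entry", "HW": "form", "DEF": "def"}


def retag(node):
    """Rename a node and all of its descendants in place, depth first."""
    # Unmapped tags are simply lowercased.
    node.tag = TAG_MAP.get(node.tag, node.tag.lower())
    for child in node:
        retag(child)
    return node


legacy = ET.fromstring(
    "<ENTRY><HW>abeille</HW><DEF>insect that makes honey</DEF></ENTRY>"
)
# tostring() returns bytes on Python 3; the sample is plain ASCII, so decode it.
converted = ET.tostring(retag(legacy)).decode("ascii")
print(converted)
```

Adding a new rule is a one-line change to the mapping, which is a large part of why this is easier to maintain than a pile of regular expressions.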

shlax and ElementTree