ARTFL Project Research Blog

shlax: a shallow, lazy XML parser in python

Richard Wednesday, August 18, 2010 Leave a Comment

Recently, I stumbled upon a paper from the dawn age of XML:

"REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron

http://www.cs.sfu.ca/~cameron/REX.html

It describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression.

We've all written short regexes to find some particular feature in an XML document, but we've also all seen those fail because of oddities of whitespace, quoting, linebreaks, etc., that are perfectly legal, but hard to account for in a short, line-by-line regular expression.

Standard XML parsers, like expat, are fabulous, well maintained, and efficient. However, they have a common achilles heel: the XML standard's insistence that XML processors "MUST" report a fatal error if a document contains unbalanced tags. For working with HTML or SGML based documents, this is disastrous!

In contrast, Cameron's regex-based parser is extremely fault-tolerant--it extracts as much structure from the document as possible, and reports the rest as plain text. Further, it supports "round-tripping": the ability to exactly re-generate a document from parser output, which standard parser typically lack. As a corollary of this property, it becomes possible to report absolute byte offsets, which is a "killer feature" for the purposes of indexing.

Because of all these benefits, I've opted to translate his source code from javascript to python. I call my modified implementation "shlax" [pronounced like "shellacs", sort of], a shallow, lazy XML parser. "Shallow" meaning that it doesn't check for well-formedness, and simply reports tokens, offsets, and attributes as best it can. "Lazy" meaning that it iterates over the input, and yields one object at a time--so you don't have to write 8 asynchronous event handlers to use it, as in a typical SAX-style parser. This is often called a "pull" parser, but "shpux" doesn't sound as good, does it?

If you're interested, you can look at the source at the libphilo github repo. The regular expression itself is built up over the course of about 30 expressions, to allow for maintainability and readability. I've made some further modifications to Cameron's code to fit our typical workflow. I've buffered the text input, which allows us to iterate over a file-handle, rather than a string--this saves vast amounts of memory for processing large XML files, in particular. And I return "node" objects, rather than strings, that contain several useful items of information:

the original text content of the node
the "type" of the node: text, StartTag,EndTag, or Markup[for DTD's, comments, etc.]
any attributes the node has
the absolute byte offset in the string or file

You don't need anything more than that to power PhiloLogic. If you'd like to see an example of how to use it, take a look at my DirtyParser class, which takes as input a set of xpaths to recognize for containers and metadata, and outputs a set of objects suitable for the index builder I wrote about last time.

Oh, and about performance: shlax is noticeably slower than Mark's perl loader. I've tried to mitigate for that in a variety of ways, but in general, python's regex engine is not as fast as perl's. On the other hand, I've recently had a lot of success with running a load in parallel on an 8-core machine, which I'll write about when the code settles. That said, if efficiency is a concern, our best option would be to use well-formed XML with a standard parser.

So, my major development push now is to refactor the loader into a framework that can handle multiple parser backends, flexible metadata recognizers, and multiple simultaneous parser processes. I'll be posting about that as soon as it's ready.

A Unified Index Construction Library

Richard Wednesday, May 19, 2010 1 comment

I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.

The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files.

Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.

Instead, I worked backwards from the decompression routines I wrote last month, to write a new index construction library from scratch.

Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:

1)simple, efficient operation

2)flexible enough for various index formats

3)easy to bind to other languages.

4)fully compatible with 3-series PhiloLogic

The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.

while(1) {
 // as long as we read lines from standard input.
 if (fgets(line,511,stdin) == NULL) {
   hitbuffer_finish(hb);
   break;
 }
 // scan for hits in standard Philo3 format.
 state = sscanf(line,
           "%s %d %d %d %d %d %d %d %d %s\n",
           word, &hit[0],...);

 if (state == 10) {
   // if we read a valid hit
   if ((strcmp(word,hb->word))) {
     //if we have a new word...
     hitbuffer_finish(hb); // write out the current buffer.
     hitbuffer_init(hb, word); // and reinitialize
     uniq_words += 1LLU; //LLU for a 64-bit unsigned int.
   }
   hitbuffer_inc(hb, hit); //add the hit to whichever word you're on.
   totalhits += 1LLU;
 }
 else {
   fprintf(stderr, "Couldn't understand hit.\n");
 }
}

The code is publicly available on github, but I'm having some problems with their web interface. I'll post a link once it's sorted out.

Vector Processing for OHCO

Richard Thursday, May 06, 2010 Leave a Comment

I've posted an expanded version of my CI Days talk on Google docs. I'd recommend looking at the speaker notes (click "actions" on the bottom left) since I won't be narrating it in person.

The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does. This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can. It's also the intellectual framework that I'm using to plan new features, like search on line and meter position, metadata, joins, etc. Hopefully, I can get someone who's better at math than I am to help me tighten up the formalities.

Basically, I refer to the infamous OHCO thesis as a useful axiom for translating the features of a text into a set of numerical objects, and then compare the characteristics of this representation to XML or Relational approaches. I'd love to know how interesting/useful/comprehensible others find the presentation, or the concept. What needs more explanation? What gets tedious?

If you look at the speaker notes, you can see me derive a claim that PhiloLogic runs 866 times faster than a relational database for word search. Math is fun!

PhiloLogic proto-binding for Python

Richard Tuesday, April 13, 2010 Leave a Comment

In an earlier post, I mentioned that I'd try to to call the philologic C routines via ctypes, a Python Foreign Function Interface library. I did, and it worked awesomely well! Ctypes lets you call C functions from python without writing any glue at all in some cases, giving you access to high-performance C routines in a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.

First, I had to compile the search engine as a shared library, rather than an executable:

gcc -dynamiclib -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o -lgdbm -o libphilo.dylib

All that refactoring certainly paid off. The search4 executable will now happily link against the shared library with no modification, and so can any other program that wants high-speed text object search:

#!/usr/bin/python

import sys,os
from ctypes import *

# First, we need to get the C standard library loaded in
# so that we can pass python's input on to the search engine.
stdlib=cdll.LoadLibrary("libc.dylib")
stdin = stdlib.fdopen(sys.stdin.fileno(),"r")
# Honestly, that's an architectural error.
# I'd prefer to pass in strings, not a file handle

# Now load in philologic from a shared library
libphilo = cdll.LoadLibrary("./libphilo.dylib")

# Give it a path to the database. The C routines parse the db definitions.
db = libphilo.init_dbh_folder("/var/lib/philologic/databases/mvotest5/")

# now initialize a new search object, with some reasonable defaults.
s = libphilo.new_search(db,"phrase",None,1,100000,0,None)

# Read words from standard input.
libphilo.process_input(s,stdin)

# Then dump the results to standard output.
libphilo.search_pass(s,0)
# Done.

That was pretty easy, right? Notice that there weren't any boilerplate classes. I could hold pointers to arbitrary data in regular variables, and pass them directly into the C subroutines as void pointers. Not safe, but very, very convenient.

Of course, this opens us up for quite a bit more work: the C library really needs a lot more ways to get data in and out than a pair of input/output file descriptors, I would say. In all likelihood, after some more experiments, we'll eventually settle on a set of standard interfaces, and generate lower-level bindings with SWIG, which would alow us to call philo natively from Perl or PHP or Ruby or Java or LISP or Lua or...anything, really.

Ctypes still has some advantages over automatically-generated wrappers, however. In particular, it lets you pass python functions back into C, allowing us to write search operators in python, rather than C--for example, a metadata join, or a custom optimizer for part-of-speech searching. Neat!

Unix Daemon Recipes

Richard Thursday, April 08, 2010 Leave a Comment

I was digging through some older UNIX folkways when I stumbled upon an answer to a long-standing PhiloLogic design question:

How do I create a long-running worker process that will neither:

1) terminate when it's parent terminates, such as a terminal session or a CGI script, or
2) create the dreaded "zombie" processes that clog process tables and eventually crash the system.

as it turns out, this is the same basic problem as any UNIX daemon program; this just happens to be one designed to, eventually, terminate. PhiloLogic needs processes of this nature at various places: most prominently, to allow the CGI interface to return preliminary results.

Currently, we use a lightweight Perl daemon process, called nserver.pl, to accept search requests from the CGI scripts, invoke the search engine, and then clean up the process after it terminates. Effective, but there's a simpler way, with a tricky UNIX idiom.

First, fork(). This allows you to return control to the terminal or CGI script. If you aren't going to exit immediately you should SIGCHLD as well, so that you don't get interrupted later.

Second, have the child process call setsid() to gain a new session, and thus detach from the parent. This prevents terminal hangups from killing the child process.

Third, call fork() again, then immediately exit the (original) child. The new "grandchild" process is now an "orphan", and detached from a terminal, so it will run to completion, and then be reaped by the system, so you can do whatever long-term analytics you like.

A command line example could go like this:


#!/usr/bin/perl
use POSIX qw(setsid);

my $word = $ARGV[0] or die "Usage:searchwork.pl word outfile\n";
my $outfile = $ARGV[1] or die "Usage:searchwork.pl word outfile\n";

print STDERR "starting worker process.\n";
&daemonize;

open(SEARCH, "| search4 --ascii --limit 1000000 /var/lib/philologic/somedb);

print SEARCH "$word\n";
close(SEARCH);

exit;

sub daemonize {
    open STDIN, '/dev/null'   or die "Can't read /dev/null: $!";
    open STDOUT, '>>/dev/null' or die "Can't write to /dev/null: $!";
    open STDERR, '>>/dev/null' or die "Can't write to /dev/null: $!";
    defined(my $childpid = fork)   or die "Can't fork: $!";
    if ($childpid) {
        print STDERR "[parent process exiting]\n";
        exit;
    }
    setsid                    or die "Can't start a new session: $!";
    print STDERR "Child detached from terminal\n";
    defined(my $grandchildpid = fork) or die "Can't fork: $!";
    if ($grandchildpid) {
        print STDERR "[child process exiting]\n";
        exit;
    }
    umask 0;
}

The benefit is that a similar &daemonize subroutine could entirely replace nserver, and thus vastly simplify the installation process. There's clearly a lot more that could be done with routing and control, of course, but this is an exciting proof of concept, particularly for UNIX geeks like myself.

The Joy of Refactoring Legacy Code

Richard Wednesday, March 31, 2010 Leave a Comment

I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.

PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. The flow of control in a typical Philo install looks something like this:

--CGI script search3t accepts user requests, and parses them.
--CGI passes requests off to a long-running Perl daemon process, called nserver.
--nserver spawns a long-running worker process search3 to evaluate the request
--the worker process loads in a compiled decompression module, at runtime, specific to the database.
--search3t watches the results of the worker process
--when the worker is finished, or outputs more than 50 results, search3t passes them off to a report generator.

This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay. The command line arguments to search3, in particular, are arcane and undocumented. A typical example:

export SYSTEM_DIR=/path/to/db

export LD_LIBRARY_PATH=/path/to/db/specific/decompression/lib/
search3 -P:binary -E:L=1000000 -S:phrase -E:L=1000000 -C:1 /tmp/corpus.10866

The internals are quite a bit scarier. Arguments are processed haphazardly in bizarre corners of the code, and many paths and filenames are hard-coded in. And terrifying unsafe type-casts abound. Casting a structure containing an array of ints into an array of ints? Oh my.

I've long been advocating a much, much simpler interface to the search engine. The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code. There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.

How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library? Slowly and carefully. Modern debugging tools like Valgrind help, as does the collective C wisdom preserved by Google. A particular issue is imperative vs. object-oriented or functional style. Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a "singleton" object, but in practice a real headache.

For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment. But what if you want to search two databases at once? What if you don't have a UNIX system? An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.

Or maybe I'm just a neat freak--regardless, the [simplified] top-level architecture of 'search3.999' {an asymptotic approach to an as-yet unannounced product} should show the point of it all:

{

static struct option long_options[] = {

{"ascii", no_argument, 0, 'a'},

{"corpussize", required_argument, 0, 'c'},

{"corpusfile", required_argument, 0, 'f'},

{"debug", required_argument, 0, 'd'},

{"limit", required_argument, 0, 'l'},

{0,0,0,0}

};

//...process options with GNU getopt_long...

db = init_dbh_folder(dbname);

if (!method_set) {

strncpy(method,"phrase",256);

}

s = new_search(db,

method,

ascii_set,

corpussize,

corpusfile,

debug,

limit);

status = process_input ( s, stdin );

//...print output...

//...free memory...

return 0

}

An equivalent command-line call would be:

search3999 --ascii --limit 1000000 --corpussize 1 --corpusfile /tmp/corpus.10866 dbname search_method

which is definitely an improvement. It can also print a help message.

Beyond organizational issues, I also ended up rewriting large portions of the decompression routines. The database can now fully configure itself at runtime, which adds about 4 ms to each request, but with the benefit that database builds no longer require compilation. TODO: The overhead can be eliminated if we store that database parameters as integers, rather than as formatted text files.

I think at this point the codebase is clean enough to try hooking up to python, via ctypes, and then experiment with other scripting language bindings. Once I clean up the makefiles I'll put it up on our repository.

Reclassifying the Encyclopédie

Russ Monday, March 29, 2010 Leave a Comment

Diderot and D'Alembert's Encyclopédie might almost have been designed as a document classification exercise. For starters, it comes complete with a branching, hierarchical ontology of classes of knowledge. Out of the 77,000+ articles contained therein, 60,000 are classified according to this system, while 17,000 were left unclassified, providing a ready-made training set and evaluation set, respectively. To make it challenging, within the classified articles, the editors have chosen to apply some classifications with obfuscatory intent or, at least, result, rendering topic boundaries somewhat fuzzy. The categories span the entire breadth of human knowledge, and the articles range from brief renvois ("see XXX") to protracted philosophical treatises. In short, it has everything to make a machine learner happy, and miserable, in one package.

At ARTFL we've been mining this rich vein for some time now. We presented Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie at Digital Humanities 2007, detailing our initial attempts at classification and the critical interpretation of machine learning results. We followed up at DH 2008 with Twisted Roads and Hidden Paths, in which we expanded our toolkit to include k-nearest-neighbor vector space classifications, and a meta-classifying decision tree. Where we had previously achieved around 72% accuracy in categorizing articles medium-length and long articles using Naive Bayes alone, using multiple classifiers combined in this way we were able to get similar rates of accuracy over the entire encyclopedia, including the very short articles, which are quite difficult to classify due to their dearth of distinctive content. This post describes an effort to productionize the results of that latter paper, in order to insert our new, machine-generated classifications into our public edition of the Encyclopédie.

For the impatient, jump ahead to the ARTFL Encyclopédie search form and start digging. The new machine generated classifications can be searched just as any other item of Philologic metadata, allowing very sophisticated queries to be constructed.

For instance, we can ask questions like "Are there any articles originally classified under Géographie that are reclassified as Philsophie?" In fact there are several, and it's interesting to peruse them and deduce why their original classifications and generated classifications fall as they do. The editors followed a policy of not including biographies in the Encyclopédie, but evidently could not restrain themselves in many cases. Instead of creating a biography class, however, they categorized such entries under the headword corresponding to the region of the notable person's birth, and assigned it the class Géographie. Thus the article JOPOLI contains a discussion of the philosopher Augustin Nyphus, born there in 1472, and hence is classified by our machine learner under Philosophie.

Our goals in re-classifying the Encyclopédie are several: to provide better access for our users by adding class descriptions to previously unclassified articles; to identify articles that are re-classified differently from their original classes, allowing users to find them by their generated classes which are often more indicative of the overall content of an article; and to identify interesting patterns in the authors' uses of their classification system, again primarily by seeing what classes tend to be re-classified differently.

We initially undertook to examine a wide range of classifiers including Naive Bayesian, SVM and KNN vector space, with a range of parameters for word count normalization and other settings. After examining hundreds of such runs, we found two that, combined, provided the greatest accuracy in correctly re-classifying articles to their previous classifications: Naive Bayes, using simple word counts, and KNN, using 50 neighbors and tf-idf values for the feature vectors.

Each classifier alone was right about 64% of the time -- but together, at least one of them was right 77% of the time. If we could only decide which one to trust when they differed on a given classification decision, we could reap a substantial gain in accuracy on the previously classified articles, and presumably get more useful classifications of the unclassified articles. We must note that the class labels for each article, which appear at the beginning of the text, we retained for these runs, giving our classifiers an unfair advantage in re-classifying the articles that had such labels present. The class labels get no more weight, however, than any other words in the article. We retained them because our primary objective is to accurately classify the unclassified articles, which do not contain these labels, but may well contain words from these labels in other contexts.

It turned out that KNN was most accurate on smaller articles and smaller classes, whereas Naive Bayes worked best on longer articles that belonged to bigger classes, which gave us something to go on when deciding which classifier got to make the call when they were at odds with each other. By feeding the article and class meta-data into a simple decision tree classifier, along with the results of each classifier, we were able to learn some rules for deciding which classifier to prefer for a given decision where they disagreed on the class assignment. See the decision tree in the DH 2008 with DH 2008 paper for the details.

Of course, we couldn't make the perfect decision every time, but we were close enough to increase our accuracy on previously classified articles to 73%, 9% higher than the average of the individual classifiers. By using a meta-classifier to learn the relative strengths and weaknesses of the sub-classifiers, we were able to better exploit them to get more interesting data for our users, and peel back another layer of the great Encyclopédie. Additionally, we learned characteristics of the classifiers themselves that will enable us to target their applications more precisely in the future.

P.S.: For all you Diderot-philes, here are the stats on the original and machine-learned classes of the articles he authored:

If any of this piques your interest, please do get in touch with us (artfl dawt project at gee-male dawt com should work). We'd love to discuss our work and possible future directions. Or come check us out in Virginia next week!

using the JSON perl mod

Clovis Monday, March 08, 2010 2 comments

I just thought I'd make a quick blog post on how to use the JSON perl mod. Why use JSON when we have XML, I'll leave that to Russ or Richard, but to make a long story short, easier object handling for the projected javascript driven DVLF.
So, the perl JSON module is actually very easy and nice to use, it will convert your perl data structure into JSON without a sweat.
Here's a quick example which I hope will be useful :

#!/usr/bin/perl

use strict;

use warnings;

use JSON;

my %hash;

foreach my $file (@list_of_files) {

open(FILE,"$file");

my @list;

while (<FILE>) {

push(@array,$_);

}

%hash{$file} = [@array]; # store array reference in hash

}

my $obj = \%results;

my $json = new JSON;

my $js = $json->encode($obj, {pretty => 1, indent => 2}); # convert Perl data structure to JSON representation

$output .= "$js\n\n";

print $output;

And done!

Lemma Collocations on Perseus

Kristin Monday, February 01, 2010 1 comment

See UPDATE at the end.

This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skim to the end, the rest describes how I got there.

It is not so difficult to look up the lemma for each word surrounding a certain search hit, as for these texts the structure to do so is already in place and the information is stored in SQL tables. Efficiency in gathering this information is the main difficulty. Looking up the lemma in the tables now in place can take a couple different SQL queries, which each take up a small chunk of time. For a few lookups, or even a few hundred, this is not too big of a problem. However, for a collocation table spanning five words on either side, we need at least 10 lookups in the databases per hit. The time it takes to do that adds up quite quickly.

So, following a suggestion from Mark, I wrote a script and generated a file with a line for every word in the Perseus database. Basically, it takes each word id starting with 1 on up to 5.5 million something and looks up its lemma. This generated a 5.5 million line file with lines likes this:

2334550 δέ
2334551 ὅς
2334552 ἕκαστος
2334553 ἵππος
2334554 nolemma
2334555 ὅς
2334556 δέ
2334557 πέτομαι
2334558 κονίω
2334559 πεδίον

Now, looking up the words on either side of a hit is much simpler - all you need to know is the "word id" of the hit and you can look at those around it. The "nolemma" entries are primarily punctuation and such which were given word tags at some point.

The size of this massive file however was another hindrance to efficient generation of collocation tables. Although we now know exactly what line we need to look at for the information we need, getting that line is still costly as it could require reading in a couple million lines of a file to get the one we need. After playing around with it a bit, command line grep searching seemed to be the fastest way to go, but even an grep search 10 times per hit adds up fast. So, I tried combining the searches into one massive egrep command to be read into a perl array. My searches looked something like:

egrep "23345[456][0-9]" lemmafile

Giving a window of 30 lines in the file starting at 2334540 and ending at 2334569. This limited the searches to one per hit instead of 10, but it still wasn't fast enough. So, I combine all of the searches like so:

egrep "(2342[234]|33329[678]|...|829[567])[0-9]"

(A bit of accounting for numbers ending in 0 is needed so that a window around 400 doesn't include things in the 490's, but this is not too difficult.)

This looked nice and seemed to work until I tried running it on more hits. It was then coming up with such massive regular expressions to grep for that grep was complaining that they were too big. So, I broke them up into chunks of roughly 350 at a time. Fewer, and the time would go up due to the added number of grep searches; too many more, and grep would overflow again. I may not have hit on the exact time minimizing value, but it is close at least.

Finally, here are some example searches:

Search for logoi.
Search for lemma:logos. (The time estimate is higher than actual load time).

Or, here is the search form:

Search Form.

Make sure that you choose Collocation Table and check the lemma button under the "Refined Search Results" Tab at the bottom.

It can handle most searches, except for very high frequency words. If anyone has ideas on how to make it faster, it could perhaps enable us to get results for all searches. Though perhaps this is not possible without somehow creating and saving those results somewhere.

UPDATE:

After talking to Mark, I altered the way the data is read from the file and now things should be running faster. The reason that all this discussion and speed streamlining for lemmatized collocation tables is necessary is the fact that the texts on Perseus do not have the lemmas embedded in the text. As Mark noted, many of the other databases would allow for much simpler and faster generation of the same data due to the fact that they do have lemmas in the text. However, for the purposes of Perseus, lemmas needed to be separated from the texts to allow them to be more dynamically updated, changed and maintained.

As for the speed, it should now be faster thanks to a handy function in Perl. I had investigated methods for reading a certain line of a file, since I happened to know exactly what lines I needed. However, finding none that did not read the whole contents of the file up to that line, I instead implemented the process described above. I overlooked SEEK. I dismissed it because it starts from a certain byte offset and not a certain line. Nevertheless, we can harness its power by simply padding each line with spaces to ensure every line in our file is the same byte length. With this pointer from Mark and some padding on the lines, knowing the line number and the number of bytes per line is enough to start reading from the exact location in the file that we desire.

MONK Data Under PhiloLogic: 1

Mark Wednesday, January 20, 2010 Leave a Comment

Introduction

As I'm sure you all know, the MONK Project (http://monkproject.org/), directed by Martin Mueller and John Unsworth, has generated a large collection of tagged data some of which has been made public and some of which is limited to CIC or other institutions (http://monkproject.org/downloads/). Each word in this group of different collections is tagged for part of speech, lemma, and normalization. Martin has documented the encoding scheme in great detail at http://panini.northwestern.edu/mmueller/nupos.pdf.

The following is a long post describing in some detail one approach to integrating this kind of information. Some of this will be deeply geeky and you can feel free to skip over sections. There is, towards the bottom of this post, a link to a standard PhiloLogic search form, so you can play with this proof-of-concept build yourself.

Richard and Helma have developed a mechanism to search for part of speech and lemma searching under PhiloLogic for their Greek and Latin databases (link). This is based on some truly inspired hacking by Richard and forms one model of how to handle this kind of functionality. My understanding of this, and Richard please correct me if I am wrong, is that it uses an undocumented feature in the index/search3 subsystem that allows us to have multiple index entries for each word position in the main index. This works and is certainly an approach to be considered as we think about a new series of PhiloLogic.

Build Notes

I have been experimenting with a somewhat different mechanism to handle this kind of problem, which is based on previous examples of mapping multiple word attributes to an index entry, using multiple field "crapser" entries. You may recall that this is the mechanism by which we merged Martin's virtual normalization data to very large collections of early modern English data and is currently running at Northwestern (link). My approach is to index not words, but pairs of surface forms and part of speech tags and to link these to an expanded (5 field) word database (called by crapser) containing the index form, surface form, lemma, part of speech and normalized forms. Here are some index entry forms (and frequencies):

24 conquer:vvb
445 conquer:vvi
143 conquered:vvd
414 conquered:vvn

These map to the word vector database which looks like:

idx               surf      pos        lem        normal
conquered:j     conquered    j       conquer     conquered
conquered:j-vvn conquered    j-vvn   conquer     conquered
conquered:n-vvn conquered    n-vvn   conquer     conquered
conquered:vvd   conquered    vvd     conquer     conquered
conquered:vvn   conquered    vvn     conquer     conquered

To build this I first reduced for fully verbose form of the data in which each token is tagged:

<w eos="0" lem="country" pos="n1" reg="COUNTRY" spe="COUNTRY">COUNTRY</w>

I eliminated all encoding that is redundant, just to make things easier to work with since the files are huge:

<w pos="n1">COUNTRY</w>

Where there is some additional information, I keep it in the encoded document:

<w lem="illustration" pos="n2">ILLUSTRATIONS</w>

I then loaded this data into a very slightly modified PhiloLogic textloader. This simply builds an index representation of the surface form of the word and the part of speech, by getting the PoS from encoding:

if ($thetag =~ /<w/) {
$thepartofspeech = "";
$thetag =~ m/pos="([^"]*)"/i;
$thepartofspeech = $1;
}

and adding this to the index entry:
$theword = $theword . ":" . $thepartofspeech;

When loaded to this point, you have modified index entries. The next step is simply to build a multi-field word vector database (crapser). I did this by reading the input data and adding entries for lemmas or normalizations. This is simply an extension of what is already documented in the "virtual-normalize" directory in the "goodies" in the PhiloLogic release.

The next step was to slightly modify a version of Leonid's original "gimme". The "sequential" version of this function (in the standard Philologic distribution), maps a multi-field (tab delimited) query using regexp patterns in egrep. This is fast and simple. It allows naming of fields, so you can simply specify "lem=justice" and it will generate a regular expression pattern (where TAB = the tab character):

^[^TAB]*TAB[^TAB]*TAB[^TAB]*TABjusticeTAB[^TAB]*$

And you get, of course, full regular expressions. (Note, this renders with some odd spacing, there are no spaces). Swap in this version of crapser and it all appears to run without further modification.

So, to summarize, the implementation does not require any modifications to core system components. It requires only slight modifications to a textloader, which we do all the time for specific databases, and a slightly modified "crapser" with a suitably build word vector database.

The Database

The database has 567 documents containing 38.5 million words (types) and 273,600 index entries. Recall that these are surface form words and part of speech tags and not normal types. The dataset has selections from various sources, including Documenting the American South as well as some British Early Modern texts. It should have full PhiloLogic search and reporting capabilities. You can query the words in the database as usual, simply by typing in words. To force searches on lemmas, normalizations, and parts of speech by specifying (with examples):

lem=conquer
nrm=conquer
pos=pns32

and finally, if you want to get one surface form and part of speech you can search the index entry directly, such as "conquered:vvd". Note that the Part of Speech is specified after a colon and you don't need to specify anything else. This is obviously not a real query interface, but it suggests how we can think about interfaces further along (eg, pull down menus, etc). You can also use regular expressions, such as lem=conque.* Finally, you can combine these, such as "pos=po.* lem=enemy", which means find forms of enemy followed by possessive pronous within three words, such as : "their most mortall enemies". You will need to consult Martin's discussion of the encoding to see all of the parts of speech. It is an extensive and well reasoned scheme.

After all of that, here is the search form. (Reloaded 7/28/11)

Now, before running of to play with this, there are some important notes following which describes how to use this in more detail.

Discussion

This is a proof-of-concept build. In a full implementation, I would need to add some search syntax to allow the user to indicate a set of combined criteria for a single word. I was having some problems coming up with a use case, but I guess one could want to say search for a particular lemma AND part of speech. It would all work with a little massaging. Aside from that, this simple model should support all of the standard PhiloLogic searching and reporting
features. Do let me know if you find something that does not work.

This model supports disambiguating searches, such as to find dog when it is used as a verb. Try "dog:vvi" for hits like "we can dog them" (thanks Russ for this example). It also appears to work properly form most other searches, such as lemmas, normalizations, etc. Part of speech for single entries looks reasonable in terms of performance.

My primary interest, however, in this experiment is to test performance on sequences of parts of speech searching. For example: "pos=po3. pos=j pos=n1" will find sequences like: "their strange confusion" and "his Princely wisedome". Chains of four also seem to work reasonably. Eg: "pos=vvn pos=po3. pos=j pos=n1" returns phrases like "neglected their even elevation", "stimulated their adventurous courage", and "aroused his little troop". You can always find a part of speech after a particular word (lemma): "after pos=po3.".

Now, this is all fine and dandy. Except that doing conjoined searches on parts of speech reveals a significant conceptual difficulty, which I believe also applies to Richard's implementation. Each part of speech generates thousands of surface form index entries. For example:
"pos=vvn pos=po3. pos=j pos=n1"
generates 81,000 unique terms (index entries) in 4 vectors. The evaluation then does a massive join at the level of index entry evaluation. So, it is SLOW and subject to possible memory buffer overflow or other problems. In fact, the system will begin to generate results of this type fairly quickly, due to PhiloLogic's lazy evaluation (start returning results as soon as you have a few). But it can take several minutes to complete the task. We would certainly not want to put this on a high traffic machine, since if you have many similar queries, it would bog it down. Obviously, we could simply test to make sure that users search criteria would not drag the whole system down or simply lock this database to one user at a time, or some other work around. If we got reasonable French NLP, this could be implemented quickly.

However, I believe we have bumped up upon a conceptual problem. To find POS1 and POS2 and POS3 either in a row or within N number of words requires an evaluation of word and/or part of speech positions in documents.

There are a couple of possible solutions, all of which would require consideration of distinct indexing structures. The first is simply to build another kind of NGRAM POS index which would have sequences of parts of speech indexed and mapped to document regions. The second would be a another kind of index which would look like a standard PhiloLogic index entry, except that it would be ONLY part of speech. This would reduce the size of the word vectors, but would not in itself improve the index evaluation to find those sequences that fit the query in the actual documents, since we still have to return to word positions in documents.

We might call this "The Conjoined Part of Speech Problem (CPSP)". It is, in my opinion, a highly specialized type of search and it is not clear just what the use cases might look like in relatively simple languages (English, French) as opposed to Greek, for which Helma makes a convincing case. So, there is a question of just how important this might be. In email communication, Martin makes the case that it would be and that researchers who want this kind of query would be willing to wait a few minutes.

It would be a trivial and useful experiment to run a load where I would index ONLY part of speech information. This would be a good test to see if evaluation speed for conjoined part of speech searches would be reasonable. In fact, Richard and I did a few quick experiments that suggest this would work. The idea would be to distinguish between simple queries -- and run them as usual -- and multiple PoS queries, which would be run on a dedicated index build. So, build parallel indicies. Oddly enuff, in the current architecture, I suspect that one could simply have a switch to say WHICH database to consult dynamically, simply by evaluating the query and then setting the database argument. That would be another one of my famous, award-winning, hall of shame hacks. But it could be made to work.

Martin has also pointed out another issue, which is searching, sorting, and counting of PoS, lemma, and other data. Now, that makes a lot of sense. I want to search for "country" and find distributions of particular parts of speech. Or, I want to do a collocation table searching on a lemma and counting the lemmas around the word. I think all of this is certainly doable -- the latter is something I wrote about some 15 years ago -- with hacks to the various reporting subsystems (not in 3, which is just too much of a mess). In an SOA model of PhiloLogic, this would be quite reasonably handled, ideally by other teams using PhiloLogic if not here at Chicago.

I think these are important issues to raise, but not necessarily resolve at this time, if (when?) we consider the architecture of any future Philologic4 development effort. For example, the current models of report generators would have to know about lemmas, etc. And we would need to at least leave hooks in any future model to support different indexing schemes for things like.

Finally, watch this space. I believe Richard is doing a build of this data using his model as well.

Please do play around with all of this and let me know what you think. One consideration would be implementing this for selected French collections. We would obviously need real virtual normalizers, lemmatizers and PoS identifiers for a broader range of French than we have now.

KWIC Modifications

Kristin Thursday, January 14, 2010 1 comment

I have been working on getting a cleaner output format from KWIC for the Greek texts on Perseus. Helma was desirous of the KWIC output leaving in the word tags which occur in the Perseus texts in order that the word lookup function be usable directly from the KWIC results page. Since KWIC leaves as little formatting as possible, it strips out all tags, including the word tags, from the text. While I worked on that, I also added a few other modifications to KWIC to give a better look to the results page for the Greek texts.

The problem with KWIC for Greek texts is that Greek fonts to not support single-width fonts, which KWIC uses to align the results more cleanly. In addition, the title lines, which give the bibliographic information and link, can be different lengths and this also causes problems aligning the search terms. See for instance, this search.

To solve the word tag problem I just made a few modifications to the KwicFormat subroutine in philosubs. The main edit there was changing the line that stripped all tags into this:

$bf=~ s#<(?!(w |/w))[^>]*># #gi; #keep only word tags

For the alignment issue, things were more complicated. Keeping track of the length of the left side of the line doesn't allow for a consistent place on the page due to the differing widths of the letters. In the end, I modified artfl_kwic to chop the left side of the hit to a size as close to a certain length as possible without breaking any words. Previously, both the right and the left were chopped to a certain length regardless of breaking words and length including tags, often resulting in very little content. Now, only the length of the display string is accounted for and in addition the length is adjusted for the length of the bibliographic title.

I also added a span around the left and right sides of the hit to allow for positioning and alignment using Javascript (and CSS). Then, adding the following lines to the Results Header, the search terms are all lined up in a neat line:

span.left { right:46%; position:absolute; }
span.right { left:54.5%; position:absolute;
         height:18px; overflow:hidden;}

The numbers may look a little messy, but they give nice results. I found that without the decimal, the two sides were a bit too far apart, but there may be another way around that.

The extra bits for formatting the right span are in place of trimming the content of the right side in perl as I did for the left side. I found that the overflow:hidden attribute is quite handy if you can get it to work (it is a bit tempermental). As long as it is found in an absolutely positioned object with restricted size, AND it is contained within an element with overflow set to auto, it should work. It simply hides any content that does not fit within the given boundaries. It gives a very clean look to the right side of the page and even adjusts to different window sides so that the content never leaks to the next line.

Take a look at this page and play with the window size to see what I mean. Unfortunately, there is no such nice property for trimming the overflow off of the left side instead of the right side. That is why I did it in perl instead of Javascript. There is a function called clip in javascript which is designed to clip an image, but again the way it works makes it much easier to trim from the right side than the left side. One could probably twist the clip function enough to make something similar happen for the left side (and make everything nicely adjustable and lovely), but for now, it is happening in perl. (I tried for a while, but my concoctions just seemed to slow things down and not add anything exciting results).

UPDATE: I couldn't resist playing a bit more with the javascript, and now it works like I wanted it to! Now, if you click on the link above, it won't illustrate what I said it would, because the javascript has been improved. I added this function:

function trimKwicLines(){
var contentwidth = $(".content").width();
$(".left").each(function (i) {
var width = $(this).width();
var leftoffset = contentwidth*.4 - width*1 - 2;
$(this).css("left", leftoffset);
});
}

And changed things here and there in the CSS.