shlax: a shallow, lazy XML parser in python

Leave a Comment
Recently, I stumbled upon a paper from the dawn age of XML:"REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameronhttp://www.cs.sfu.ca/~cameron/REX.htmlIt describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression. We've all written short regexes to find some particular feature in an XML document, but we've also all seen those fail because of oddities of whitespace, quoting, linebreaks, etc., that are perfectly legal, but hard to account...
Read More

A Unified Index Construction Library

1 comment
I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files. Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.Instead, I worked backwards from the decompression...
Read More

Vector Processing for OHCO

Leave a Comment
I've posted an expanded version of my CI Days talk on Google docs. I'd recommend looking at the speaker notes (click "actions" on the bottom left) since I won't be narrating it in person.The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does. This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can. It's also the intellectual framework that I'm using to plan new features, like search...
Read More

PhiloLogic proto-binding for Python

Leave a Comment
In an earlier post, I mentioned that I'd try to to call the philologic C routines via ctypes, a Python Foreign Function Interface library. I did, and it worked awesomely well! Ctypes lets you call C functions from python without writing any glue at all in some cases, giving you access to high-performance C routines in a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.First, I had to compile the search engine as a shared...
Read More

Unix Daemon Recipes

Leave a Comment
I was digging through some older UNIX folkways when I stumbled upon an answer to a long-standing PhiloLogic design question:How do I create a long-running worker process that will neither:1) terminate when it's parent terminates, such as a terminal session or a CGI script, or2) create the dreaded "zombie" processes that clog process tables and eventually crash the system.as it turns out, this is the same basic problem as any UNIX daemon program; this just happens to be one designed to, eventually, terminate. PhiloLogic needs...
Read More

The Joy of Refactoring Legacy Code

Leave a Comment
I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. The flow of control in a typical Philo install looks something like this:--CGI script search3t accepts user requests, and parses them.--CGI passes requests off to a long-running Perl daemon process, called nserver.--nserver...
Read More

Reclassifying the Encyclopédie

Leave a Comment
Diderot and D'Alembert's Encyclopédie might almost have been designed as a document classification exercise. For starters, it comes complete with a branching, hierarchical ontology of classes of knowledge. Out of the 77,000+ articles contained therein, 60,000 are classified according to this system, while 17,000 were left unclassified, providing a ready-made training set and evaluation set, respectively. To make it challenging, within the classified articles, the editors have chosen to apply some classifications with obfuscatory...
Read More

using the JSON perl mod

2 comments
I just thought I'd make a quick blog post on how to use the JSON perl mod. Why use JSON when we have XML, I'll leave that to Russ or Richard, but to make a long story short, easier object handling for the projected javascript driven DVLF. So, the perl JSON module is actually very easy and nice to use, it will convert your perl data structure into JSON without a sweat. Here's a quick example which I hope will be useful : #!/usr/bin/perluse strict;use warnings;use JSON; my %hash;foreach my $file (@list_of_files)  {  ...
Read More

Lemma Collocations on Perseus

1 comment
See UPDATE at the end.This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skim to the end, the rest describes how I got there.It is not so difficult to look up the lemma for each word surrounding a certain search hit, as for these texts the structure to do so is already in place and the information is stored in SQL tables. Efficiency in gathering this information is the main difficulty. Looking...
Read More

MONK Data Under PhiloLogic: 1

Leave a Comment
Introduction As I'm sure you all know, the MONK Project (http://monkproject.org/), directed by Martin Mueller and John Unsworth, has generated a large collection of tagged data some of which has been made public and some of which is limited to CIC or other institutions (http://monkproject.org/downloads/). Each word in this group of different collections is tagged for part of speech, lemma, and normalization. Martin has documented the encoding scheme in great detail at http://panini.northwestern.edu/mmueller/nupos.pdf. The...
Read More

KWIC Modifications

1 comment
I have been working on getting a cleaner output format from KWIC for the Greek texts on Perseus. Helma was desirous of the KWIC output leaving in the word tags which occur in the Perseus texts in order that the word lookup function be usable directly from the KWIC results page. Since KWIC leaves as little formatting as possible, it strips out all tags, including the word tags, from the text. While I worked on that, I also added a few other modifications to KWIC to give a better look to the results page for the Greek texts.The...
Read More
Next PostNewer Posts Previous PostOlder Posts Home