the PhiloLogic Data Architecture

For the last year or so, I've been arguing that it's time for a round of maintenance work on PhiloLogic's various retrieval sub-systems. In a later post, I'll examine some of the newer data store components out there in the open-source world. First, however, I'd like to enumerate what PhiloLogic's main storage components are, where they live, and how they work, for clarity and economy of reference.

The Main Word Index:

PhiloLogic's central data store is a GDBM hashtable called index that functions, basically, the same way as a Perl hash, but on disk, rather than in memory. It has a set of keys, in this case each unique word in the database. Each key corresponds to a short-ish byte-string value, which can come in two different formats:

For low-frequency words, each key word corresponds to a packed binary data object that contains three components:

A short header that says, "I'm a low-frequency word!"
Total frequency for this word. This is used by the query optimizer.
A compressed binary hitlist for the word, containing the byte offset and object address of every occurrence of the word.

For high frequency words, the structure is similar. A type header is followed by the total frequency, which is followed by an address into the raw block index, called index.1. If you've ever looked at a database directory, you may have noticed that this index.1 file is typically two or three times the size of the main index. That's because it contains the binary hitlists for all the high-frequency words in the database, divided into 2-kilobyte chunks. That's important, because, as Zipf's law will tell us, the most frequent words in a language can be very, very frequent, and thus the hits for a single word could go on for tens or or hundreds of megabytes. By dividing large hitlists into chunks, we can put a limit on memory usage. In a modern system, we could set a higher ceiling- 64k might be reasonable. But architecturally, the chunking algorithms are vital for frequent words or large databases.

The upside of this admittedly complex architecture is PhiloLogic's raison d'etre: its blindingly fast search performance. The downside is that GDBM doesn't support some of the features that we expect, particularly ones that involve more complicated searches than the simple keywords.

Thus, we added a plain-text token table, words.R, that we can grep through quite quickly to get a list of all valid keys that match a specified pattern, and a tab delimited token table, words.R.wom, that we can grep through for various secondary attributes and normalizations of the indexed tokens.

Both of these functions are very fast, due to the high throughput of GNU grep. The only downside is the opacity of the index construction process, which can make modifications to this structure very difficult. That said, it's capable of handling unexpectedly rich data if you understand where everything goes. I'll go into this in more depth in my Perseus whitepaper.

Document Metadata:

Traditional 3-series PhiloLogic keeps the most important information about it's XML document store in a file called docinfo, which contains the filename, size, date, and a few other book-keeping tidbits. The reporting systems use this file for the basic tasks of opening an XML file and reading a section out of it, whether for search results context, or for browsing with getobject.

All other data go in a file called bibliography, which has about 20 fields for author, title, publisher, language, etc., and which the gimme utilities search for bibliographic queries. Traditionally, this is done with GNU egrep, but more recent releases have preferred MySQL for its more sophisticated query language. The result of any query, regardless, is a list of binary document id's to pass through to the search engine as a corpus file.

Text Object Metadata:

PhiloLogic tracks all objects below the document level as either division or paragraph objects, and stores them in two different tables: divindex.raw and subdivindex.raw respectively, and uses the subdocgimme utilities to query the metadata. As before, query results, via SQL or egrep, are pushed off to search3 as a packed binary corpus file. And again, the reporting and retrieval subsystems have their own data structure, in this case the toms, for contextualizing hits, or for retrieval with getobject. Finally, the loader builds several derived data structures called navigation, pagemarks, references, and dividxchild.tab for various internal functions.

As the reader may have noticed, the document and text objects are not as clean or as optimized as the main word index, and even harder to hack coherently. I'll detail my first attempt at a more dynamic object structure for the Perseus corpus in a later post. For now, though, I'll pose a question:

Is it possible to devise a single data structure that can handle all of the functions that our current gang of tables and packed binaries does? In short, these are:

Query objects for arbitrary combinations of properties
Efficiently retrieve file paths and byte offsets for retrieval
Maintain the logical relationships of all these objects
Resolve internal and external references to objects at any depth

My current prototypes fulfill about half of these requirements, #2 & 3. But a new object architecture would ideally be a single data structure that does it all. Can anyone think of needed features that I've missed? What about the kind of metadata that ASP or IWW use? Can anyone think of a component that's totally unnecessary and redundant, or notoriously buggy? Please, let me know.

ARTFL Project Research Blog

the PhiloLogic Data Architecture

0 comments:

Post a Comment

Labels

Popular Posts

Blog Archive

Developed by ARTFL