The Main Word Index:
PhiloLogic's central data store is a GDBM hashtable called index that functions, basically, the same way as a Perl hash, but on disk, rather than in memory. It has a set of keys, in this case each unique word in the database. Each key corresponds to a short-ish byte-string value, which can come in two different formats:
For low-frequency words, each key word corresponds to a packed binary data object that contains three components:
- A short header that says, "I'm a low-frequency word!"
- Total frequency for this word. This is used by the query optimizer.
- A compressed binary hitlist for the word, containing the byte offset and object address of every occurrence of the word.
The upside of this admittedly complex architecture is PhiloLogic's raison d'etre: its blindingly fast search performance. The downside is that GDBM doesn't support some of the features that we expect, particularly ones that involve more complicated searches than the simple keywords.
Thus, we added a plain-text token table, words.R, that we can grep through quite quickly to get a list of all valid keys that match a specified pattern, and a tab delimited token table, words.R.wom, that we can grep through for various secondary attributes and normalizations of the indexed tokens.
Both of these functions are very fast, due to the high throughput of GNU grep. The only downside is the opacity of the index construction process, which can make modifications to this structure very difficult. That said, it's capable of handling unexpectedly rich data if you understand where everything goes. I'll go into this in more depth in my Perseus whitepaper.
Document Metadata:
Traditional 3-series PhiloLogic keeps the most important information about it's XML document store in a file called docinfo, which contains the filename, size, date, and a few other book-keeping tidbits. The reporting systems use this file for the basic tasks of opening an XML file and reading a section out of it, whether for search results context, or for browsing with getobject.
All other data go in a file called bibliography, which has about 20 fields for author, title, publisher, language, etc., and which the gimme utilities search for bibliographic queries. Traditionally, this is done with GNU egrep, but more recent releases have preferred MySQL for its more sophisticated query language. The result of any query, regardless, is a list of binary document id's to pass through to the search engine as a corpus file.
Text Object Metadata:
PhiloLogic tracks all objects below the document level as either division or paragraph objects, and stores them in two different tables: divindex.raw and subdivindex.raw respectively, and uses the subdocgimme utilities to query the metadata. As before, query results, via SQL or egrep, are pushed off to search3 as a packed binary corpus file. And again, the reporting and retrieval subsystems have their own data structure, in this case the toms, for contextualizing hits, or for retrieval with getobject. Finally, the loader builds several derived data structures called navigation, pagemarks, references, and dividxchild.tab for various internal functions.
As the reader may have noticed, the document and text objects are not as clean or as optimized as the main word index, and even harder to hack coherently. I'll detail my first attempt at a more dynamic object structure for the Perseus corpus in a later post. For now, though, I'll pose a question:
Is it possible to devise a single data structure that can handle all of the functions that our current gang of tables and packed binaries does? In short, these are:
- Query objects for arbitrary combinations of properties
- Efficiently retrieve file paths and byte offsets for retrieval
- Maintain the logical relationships of all these objects
- Resolve internal and external references to objects at any depth
0 comments:
Post a Comment