Let's say we can represent the position of some occurrence of the word "cat" as
0 1 2 -1 1 12 7 135556 56
which could be interpreted as
page 56, for example.
A structured, positional index allows us to evaluate phrase queries, positional queries, or metadata queries very efficiently. Unfortunately, storing each of these 9 numbers as 32-bit integers would take 36 bytes of disk space, for every occurence of the word. In contrast, it's actually possible to encode all 9 of the above numbers in just 39 bits, if we store them efficiently--that's a 93% saving. The document field has the value 0, which we can store in a single bit, whereas byte position, our most expensive, can be stored in just 18 bits. The difficulty being that the simple array of integers becomes a single long bit string stored in a hash. First we encode each number in binary, like so
0 1 01 11 1 0011 111 001000011000100001 000111
but this is only 18 bits, so we have to pad it off with 6 extra bits to get an even byte alignment, and then we can store it in our hash table under "cat".
Now, suppose that we use somthing like this format to index a set of small documents with 10,000 words total. We can expect, among other things, a handful of occurrences of "cat", and probably somewhere around a few hundred occurrences of the word "the". In a GDBM table, duplicate keywords aren't permitted--there can be exactly one record of "cat". For a database this size, it would be feasible to append every occurrence into a single long bit string Let's say our text structures require 50 bits to encode, and that we have 5 occurrences of cat. We look up "cat" in GDBM, and get a packed bit string 32 bytes, or 256 bits long. we can divide that by the size of a single occurrence, so we know that we have 5 occurrences and 6 bits of padding.
"The", on the other hand, would be at least on the order of few kilobytes, maybe more. 1 or 2 K of memory is quite cheap on a modern machine, but as your database scales into the millions of words, you could have hundreds of thousands, even millions of occurrences of the most frequent words. At some point, you will certainly not want to have to load megabytes of data into memory at once for each key-word lookup. Indeed, in a search for "the cat", you'd prefer not to read every occurrence of "the" in the first place.
Since PhiloLogic currently doesn't support updating a live database, and all word occurrences are kept in sorted order, it's relatively easy for us to devise an on-disk, cache-friendly data structure that can meet our requirements. Let's divide up the word occurences into 2-kilobyte blocks, and keep track of the first position in each block. Then, we can rapidly skip hundreds of occurrences of a frequent word, like "the", when we know that the next occurence of "cat" isn't in the same document!
Of course, to perform this optimization, we would need to know the frequency of all terms in a query before we scan through them, so we'll have to add that information to the main hash table. Finally, we'd prefer not to pay the overhead of an additional disk seek for low-frequency words, so we'll need a flag in each key-word entry to signal whether we have:
1) a low frequency word, with all occurences stored inline
2) a high frequency word, stored in the block tree.
Just like the actual positional parameters, the frequencies and tree headers can also be compressed to an optimal size on a per-database level. In philologic, this is stored in databasedir/src/dbspecs.h, a c header file that is generated at the same time as the index, then compiled into a custom compression/decompression module for each loaded database, which the search engine can dynamically load and unload at run time.
In a later post, I'll provide some perl code to unpack the indices, and try to think about what a clean search API would look like.