I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.
The legacy index-packing code dates back to sometime before PhiloLogic 2 and is spread over three executable programs, linked together by a Makefile and some obscure binary state files.
Unfortunately, the three programs each link to a different version of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.
Instead, I worked backwards from the decompression routines I wrote last month and wrote a new index-construction library from scratch.
Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:
1) simple, efficient operation
2) flexible enough for various index formats
3) easy to bind to other languages
4) fully compatible with 3-series PhiloLogic
The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.
while (1) {
    // as long as we read lines from standard input.
    if (fgets(line, 511, stdin) == NULL) {
        hitbuffer_finish(hb);
        break;
    }
    // scan for hits in standard Philo3 format.
    state = sscanf(line,
                   "%s %d %d %d %d %d %d %d %d %s\n",
                   word, &hit[0],...);
    if (state == 10) {
        // if we read a valid hit
        if ((strcmp(word, hb->word))) {
            // if we have a new word...
            hitbuffer_finish(hb);      // write out the current buffer.
            hitbuffer_init(hb, word);  // and reinitialize
            uniq_words += 1LLU;        // LLU: unsigned long long, at least 64 bits.
        }
        hitbuffer_inc(hb, hit);  // add the hit to whichever word you're on.
        totalhits += 1LLU;
    }
    else {
        fprintf(stderr, "Couldn't understand hit.\n");
    }
}
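To give a sense of what sits behind those three calls, here is a minimal sketch of a hit buffer with the same shape. It is not the real library: it assumes eight integer fields per hit (matching the eight %d conversions in the format string above), keeps everything in memory, and skips compression and database writes entirely. The struct layout, the return types, the HB_ constants, and the hitbuffer_new and hitbuffer_free helpers are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HB_HIT_FIELDS 8   /* assumption: eight integers per hit */
#define HB_WORD_MAX 512   /* assumption: longest word we keep, with terminator */

/* Hypothetical stand-in for the real hit buffer. It only stores hits in
   memory and counts what it has "written"; no compression, no database. */
typedef struct {
    char word[HB_WORD_MAX]; /* word currently being accumulated */
    int *hits;              /* flat array, HB_HIT_FIELDS ints per hit */
    size_t count;           /* hits buffered for the current word */
    size_t capacity;        /* hits the array can hold before growing */
    size_t flushed_words;   /* words flushed by hitbuffer_finish so far */
    size_t flushed_hits;    /* hits flushed by hitbuffer_finish so far */
} hitbuffer;

/* Allocate an empty buffer; calloc leaves word as "" and hits as NULL. */
hitbuffer *hitbuffer_new(void) {
    return calloc(1, sizeof(hitbuffer));
}

/* Start accumulating hits for a new word. */
void hitbuffer_init(hitbuffer *hb, const char *word) {
    assert(hb != NULL && word != NULL);
    strncpy(hb->word, word, HB_WORD_MAX - 1);
    hb->word[HB_WORD_MAX - 1] = '\0';
    hb->count = 0;
}

/* Append one hit, doubling the array when it is full.
   Returns 0 on success, -1 if memory could not be allocated. */
int hitbuffer_inc(hitbuffer *hb, const int *hit) {
    assert(hb != NULL && hit != NULL);
    if (hb->count == hb->capacity) {
        size_t new_cap = hb->capacity ? hb->capacity * 2 : 16;
        int *grown = realloc(hb->hits, new_cap * HB_HIT_FIELDS * sizeof *grown);
        if (grown == NULL) {
            return -1;
        }
        hb->hits = grown;
        hb->capacity = new_cap;
    }
    memcpy(hb->hits + hb->count * HB_HIT_FIELDS, hit,
           HB_HIT_FIELDS * sizeof *hit);
    hb->count += 1;
    return 0;
}

/* "Write out" the current word. The real library would compress and
   store the hits here; this sketch only updates the counters.
   Calling it on an empty buffer does nothing. */
void hitbuffer_finish(hitbuffer *hb) {
    assert(hb != NULL);
    if (hb->count > 0) {
        hb->flushed_words += 1;
        hb->flushed_hits += hb->count;
        hb->count = 0;
    }
}

/* Release the buffer and its hit array. */
void hitbuffer_free(hitbuffer *hb) {
    if (hb != NULL) {
        free(hb->hits);
        free(hb);
    }
}
```

With a buffer obtained from hitbuffer_new(), the main loop above should drive this stand-in in the same way: the very first hitbuffer_finish call is a no-op because nothing has been buffered yet, and every later word change flushes exactly one word.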
The code is publicly available on GitHub, but I'm having some problems with their web interface. I'll post a link once it's sorted out.