A Unified Index Construction Library

Richard, Wednesday, May 19, 2010

I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.

The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over three executable programs tied together by a Makefile and some obscure binary state files.

Unfortunately, the three programs each link against a different version of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.

Instead, I worked backwards from the decompression routines I wrote last month and wrote a new index construction library from scratch.

Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:

1) simple, efficient operation

2) flexible enough for various index formats

3) easy to bind to other languages

4) fully compatible with 3-series PhiloLogic

The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.

while (1) {
  // Keep going as long as we can read lines from standard input.
  if (fgets(line, 511, stdin) == NULL) {
    hitbuffer_finish(hb); // flush the last word at end of input
    break;
  }
  // Scan for a hit in standard Philo3 format.
  state = sscanf(line,
            "%s %d %d %d %d %d %d %d %d %s\n",
            word, &hit[0],...);

  if (state == 10) {
    // We read a valid hit.
    if (strcmp(word, hb->word) != 0) {
      // We have a new word...
      hitbuffer_finish(hb);     // write out the current buffer
      hitbuffer_init(hb, word); // and reinitialize
      uniq_words += 1LLU;       // LLU for a 64-bit unsigned int
    }
    hitbuffer_inc(hb, hit); // add the hit to whichever word we're on
    totalhits += 1LLU;
  }
  else {
    fprintf(stderr, "Couldn't understand hit.\n");
  }
}
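To make the control flow of that loop easier to experiment with, here is a minimal, self-contained sketch of the same init / inc / finish pattern. Everything in it is a stand-in of my own for illustration: the toy_hitbuffer struct, the toy_ function names, the WORD_MAX and HIT_FIELDS constants, and the toy_feed_line helper are assumptions, not the library's actual API. It only counts words and hits, where the real hit buffer does compression, memory management, and database writes. The sscanf format is the one from the loop above, with field widths added so an overlong token can't overrun the buffers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define WORD_MAX 256   /* assumed word-length limit; keep in sync with %255s */
#define HIT_FIELDS 8   /* the eight %d fields in the Philo3 hit line */

/* Hypothetical stand-in for the real hit buffer: it only keeps counters. */
typedef struct {
    char word[WORD_MAX];               /* word currently being buffered */
    unsigned long long count;          /* hits buffered for that word */
    unsigned long long flushed_words;  /* words "written out" so far */
    unsigned long long flushed_hits;   /* hits "written out" so far */
} toy_hitbuffer;

/* Start buffering hits for a new word. */
void toy_hitbuffer_init(toy_hitbuffer *hb, const char *word) {
    assert(hb != NULL && word != NULL);
    strncpy(hb->word, word, WORD_MAX - 1);
    hb->word[WORD_MAX - 1] = '\0';
    hb->count = 0;
}

/* Add one hit to the current word. A real buffer would pack the fields. */
void toy_hitbuffer_inc(toy_hitbuffer *hb, const int *hit) {
    (void)hit; /* unused in this sketch */
    hb->count += 1;
}

/* "Write out" the current word; a no-op if nothing is buffered. */
void toy_hitbuffer_finish(toy_hitbuffer *hb) {
    if (hb->word[0] == '\0') {
        return;
    }
    hb->flushed_words += 1;
    hb->flushed_hits += hb->count;
    hb->count = 0;
    hb->word[0] = '\0';
}

/* Parse one line and feed it to the buffer.
   Returns 1 if the line was a valid hit, 0 otherwise. */
int toy_feed_line(toy_hitbuffer *hb, const char *line) {
    char word[WORD_MAX];
    char tail[WORD_MAX];
    int hit[HIT_FIELDS];
    int state = sscanf(line, "%255s %d %d %d %d %d %d %d %d %255s",
                       word, &hit[0], &hit[1], &hit[2], &hit[3],
                       &hit[4], &hit[5], &hit[6], &hit[7], tail);
    if (state != 10) {
        return 0; /* not a complete hit line */
    }
    if (strcmp(word, hb->word) != 0) {
        toy_hitbuffer_finish(hb);     /* flush the previous word, if any */
        toy_hitbuffer_init(hb, word); /* start the new one */
    }
    toy_hitbuffer_inc(hb, hit);
    return 1;
}
```

A caller would zero-initialize a toy_hitbuffer, pass each input line to toy_feed_line, and call toy_hitbuffer_finish once at end of input to flush the last word. Like the real loop, this relies on the input arriving sorted by word, since a word that reappears later would be flushed a second time.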

The code is publicly available on GitHub, but I'm having some problems with their web interface. I'll post a link once it's sorted out.

ARTFL Project Research Blog
