ARTFL Project Research Blog

A Unified Index Construction Library

Richard Wednesday, May 19, 2010

I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.

The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files.

Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.

Instead, I worked backwards from the decompression routines I wrote last month and wrote a new index-construction library from scratch.

Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:

1) simple, efficient operation

2) flexible enough for various index formats

3) easy to bind to other languages

4) fully compatible with 3-series PhiloLogic

The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.

while (1) {
  // Read lines from standard input until EOF.
  if (fgets(line, 511, stdin) == NULL) {
    hitbuffer_finish(hb); // flush the last word's hits
    break;
  }
  // Scan for a hit in standard Philo3 format.
  state = sscanf(line,
            "%s %d %d %d %d %d %d %d %d %s\n",
            word, &hit[0],...);

  if (state == 10) {
    // We read a valid hit.
    if (strcmp(word, hb->word) != 0) {
      // New word: write out the current buffer, then reinitialize.
      hitbuffer_finish(hb);
      hitbuffer_init(hb, word);
      uniq_words += 1LLU; // LLU: unsigned long long, at least 64 bits
    }
    hitbuffer_inc(hb, hit); // add the hit to the current word
    totalhits += 1LLU;
  }
  else {
    fprintf(stderr, "Couldn't understand hit.\n");
  }
}
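
To make the shape of that interface more concrete, here is a minimal sketch of what a hit buffer with the same three calls might look like. It is not the library's code: the struct layout, HIT_WIDTH, WORD_MAX, the flushed_words counter, and the return values are all assumptions for illustration, and the "write" step only prints a summary line where the real buffer would compress the hits and talk to the database.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HIT_WIDTH 8    /* assumption: integers stored per hit */
#define WORD_MAX  512  /* assumption: longest word plus terminator */

/* Hypothetical stand-in for the hit buffer; not PhiloLogic's struct. */
typedef struct {
    char   word[WORD_MAX];  /* word currently being collected */
    int   *hits;            /* flat array, HIT_WIDTH ints per hit */
    size_t count;           /* hits stored for the current word */
    size_t capacity;        /* hits the array can hold before growing */
    size_t flushed_words;   /* words written out so far */
} hitbuffer;

/* Start collecting hits for a new word; the hit storage is reused. */
void hitbuffer_init(hitbuffer *hb, const char *word) {
    assert(hb != NULL && word != NULL);
    strncpy(hb->word, word, WORD_MAX - 1);
    hb->word[WORD_MAX - 1] = '\0';
    hb->count = 0;
}

/* Append one hit, doubling the array when it is full.
   Returns 0 on success, -1 if memory could not be allocated. */
int hitbuffer_inc(hitbuffer *hb, const int *hit) {
    if (hb->count == hb->capacity) {
        size_t cap = hb->capacity ? hb->capacity * 2 : 16;
        int *grown = realloc(hb->hits, cap * HIT_WIDTH * sizeof(int));
        if (grown == NULL) return -1;
        hb->hits = grown;
        hb->capacity = cap;
    }
    memcpy(hb->hits + hb->count * HIT_WIDTH, hit, HIT_WIDTH * sizeof(int));
    hb->count += 1;
    return 0;
}

/* "Write out" the current word. This sketch only prints the word and its
   hit count; compression and database writes are left out.
   Returns the number of hits that were flushed. */
size_t hitbuffer_finish(hitbuffer *hb) {
    size_t n = hb->count;
    if (n > 0) {
        printf("%s\t%zu\n", hb->word, n);
        hb->flushed_words += 1;
    }
    hb->count = 0;
    return n;
}
```

A caller would start from a zeroed struct (hitbuffer hb = {0};), pass its address to the three functions, and free hb.hits when done. Making hitbuffer_finish a no-op on an empty buffer is one way to let the main loop call it before the first word has been seen.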

The code is publicly available on GitHub, but I'm having some problems with their web interface. I'll post a link once it's sorted out.

Vector Processing for OHCO

Richard Thursday, May 06, 2010

I've posted an expanded version of my CI Days talk on Google Docs. I'd recommend looking at the speaker notes (click "actions" on the bottom left), since I won't be narrating it in person.

The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does. This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can. It's also the intellectual framework that I'm using to plan new features, like search on line and meter position, metadata, joins, etc. Hopefully, I can get someone who's better at math than I am to help me tighten up the formalities.

Basically, I refer to the infamous OHCO thesis as a useful axiom for translating the features of a text into a set of numerical objects, and then compare the characteristics of this representation to XML or relational approaches. I'd love to know how interesting, useful, or comprehensible others find the presentation, or the concept. What needs more explanation? What gets tedious?
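
As a rough illustration of what "numerical objects" can mean here, the sketch below treats each word occurrence as a fixed-length vector of integers, one per level of the hierarchy. This is not PhiloLogic's actual data layout: the five levels (document, division, paragraph, sentence, word) and the function names shared_depth and address_cmp are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 5  /* assumption: document, division, paragraph, sentence, word */

/* Count how many leading levels two addresses have in common.
   With the assumed levels, a result of 4 or more means the two
   words sit in the same sentence. */
int shared_depth(const int *a, const int *b) {
    assert(a != NULL && b != NULL);
    int d = 0;
    while (d < LEVELS && a[d] == b[d]) {
        d++;
    }
    return d;
}

/* Lexicographic comparison of two addresses, which matches document
   order in a strict hierarchy: negative if a comes before b,
   positive if after, zero if they are the same position. */
int address_cmp(const int *a, const int *b) {
    for (int i = 0; i < LEVELS; i++) {
        if (a[i] != b[i]) {
            return a[i] < b[i] ? -1 : 1;
        }
    }
    return 0;
}
```

Under these assumptions, questions like "are these two words in the same paragraph?" or "which occurrence comes first?" reduce to a handful of integer comparisons, with no tree traversal or table join involved.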

If you look at the speaker notes, you can see me derive a claim that PhiloLogic runs 866 times faster than a relational database for word search. Math is fun!
