I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.
PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. The flow of control in a typical Philo install looks something like this:
--CGI script search3t accepts user requests, and parses them.
--CGI passes requests off to a long-running Perl daemon process, called nserver.
--nserver spawns a long-running worker process search3 to evaluate the request
--the worker process loads in a compiled decompression module, at runtime, specific to the database.
--search3t watches the results of the worker process
--when the worker is finished, or outputs more than 50 results, search3t passes them off to a report generator.
This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay. The command line arguments to search3, in particular, are arcane and undocumented. A typical example:
I've long been advocating a much, much simpler interface to the search engine. The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code. There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.
How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library? Slowly and carefully. Modern debugging tools like Valgrind help, as does the collective C wisdom preserved by Google. A particular issue is imperative vs. object-oriented or functional style. Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a "singleton" object, but in practice a real headache.
For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment. But what if you want to search two databases at once? What if you don't have a UNIX system? An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.
Read More
PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. The flow of control in a typical Philo install looks something like this:
--CGI script search3t accepts user requests, and parses them.
--CGI passes requests off to a long-running Perl daemon process, called nserver.
--nserver spawns a long-running worker process search3 to evaluate the request
--the worker process loads in a compiled decompression module, at runtime, specific to the database.
--search3t watches the results of the worker process
--when the worker is finished, or outputs more than 50 results, search3t passes them off to a report generator.
This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay. The command line arguments to search3, in particular, are arcane and undocumented. A typical example:
export SYSTEM_DIR=/path/to/db
export LD_LIBRARY_PATH=/path/to/db/specific/decompression/lib/
search3 -P:binary -E:L=1000000 -S:phrase -E:L=1000000 -C:1 /tmp/corpus.10866
search3 -P:binary -E:L=1000000 -S:phrase -E:L=1000000 -C:1 /tmp/corpus.10866
The internals are quite a bit scarier. Arguments are processed haphazardly in bizarre corners of the code, and many paths and filenames are hard-coded in. And terrifying unsafe type-casts abound. Casting a structure containing an array of ints into an array of ints? Oh my.
I've long been advocating a much, much simpler interface to the search engine. The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code. There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.
How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library? Slowly and carefully. Modern debugging tools like Valgrind help, as does the collective C wisdom preserved by Google. A particular issue is imperative vs. object-oriented or functional style. Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a "singleton" object, but in practice a real headache.
For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment. But what if you want to search two databases at once? What if you don't have a UNIX system? An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.
Or maybe I'm just a neat freak--regardless, the [simplified] top-level architecture of 'search3.999' {an asymptotic approach to an as-yet unannounced product} should show the point of it all:
{
static struct option long_options[] = {
{"ascii", no_argument, 0, 'a'},
{"corpussize", required_argument, 0, 'c'},
{"corpusfile", required_argument, 0, 'f'},
{"debug", required_argument, 0, 'd'},
{"limit", required_argument, 0, 'l'},
{0,0,0,0}
};
//...process options with GNU getopt_long...
db = init_dbh_folder(dbname);
if (!method_set) {
strncpy(method,"phrase",256);
}
s = new_search(db,
method,
ascii_set,
corpussize,
corpusfile,
debug,
limit);
status = process_input ( s, stdin );
//...print output...
//...free memory...
return 0
}
An equivalent command-line call would be:
search3999 --ascii --limit 1000000 --corpussize 1 --corpusfile /tmp/corpus.10866 dbname search_method
which is definitely an improvement. It can also print a help message.
Beyond organizational issues, I also ended up rewriting large portions of the decompression routines. The database can now fully configure itself at runtime, which adds about 4 ms to each request, but with the benefit that database builds no longer require compilation. TODO: The overhead can be eliminated if we store that database parameters as integers, rather than as formatted text files.
I think at this point the codebase is clean enough to try hooking up to python, via ctypes, and then experiment with other scripting language bindings. Once I clean up the makefiles I'll put it up on our repository.