The Joy of Refactoring Legacy Code

I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.

PhiloLogic is commonly known as a rather large Perl/CGI project, but all of the actual database interactions are handled by our custom search engine, written in highly optimized C. The flow of control in a typical Philo install looks something like this:

--The CGI script search3t accepts user requests and parses them.
--The CGI script passes each request off to a long-running Perl daemon process called nserver.
--nserver spawns a long-running worker process, search3, to evaluate the request.
--The worker process loads a compiled decompression module at runtime, specific to the database.
--search3t watches the results of the worker process.
--When the worker is finished, or has output more than 50 results, search3t passes them off to a report generator.

This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay. The command line arguments to search3, in particular, are arcane and undocumented. A typical example:

export SYSTEM_DIR=/path/to/db
export LD_LIBRARY_PATH=/path/to/db/specific/decompression/lib/
search3 -P:binary -E:L=1000000 -S:phrase -E:L=1000000 -C:1 /tmp/corpus.10866 

The internals are quite a bit scarier.  Arguments are processed haphazardly in bizarre corners of the code, many paths and filenames are hard-coded, and terrifying unsafe type-casts abound.  Casting a structure containing an array of ints into an array of ints?  Oh my.

I've long been advocating a much, much simpler interface to the search engine. The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code. There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.
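
To make the goal concrete, here is a sketch of what that single point of entry might look like. The function names are borrowed from the main() listing further down; the opaque types and the cleanup functions are just placeholders, not the actual API:

/* search3.h -- a sketch of the single-point-of-entry interface.
   Opaque handles and cleanup functions are hypothetical. */
#ifndef SEARCH3_H
#define SEARCH3_H

#include <stdio.h>

typedef struct dbh dbh;          /* opaque database handle       */
typedef struct search search;    /* opaque search-session handle */

/* Open a database by path; no environment variables required. */
dbh *init_dbh_folder(const char *dbname);

/* Build a search against an open database. */
search *new_search(dbh *db, const char *method, int ascii_set,
                   int corpussize, const char *corpusfile,
                   int debug, int limit);

/* Read query terms from a stream and evaluate them. */
int process_input(search *s, FILE *input);

/* Release resources (hypothetical). */
void free_search(search *s);
void free_dbh(dbh *db);

#endif /* SEARCH3_H */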

How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library? Slowly and carefully. Modern debugging tools like Valgrind help, as does the collective C wisdom preserved by Google. A particular issue is imperative vs. object-oriented or functional style. Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a "singleton" object, but in practice a real headache.

For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment. But what if you want to search two databases at once? What if you don't have a UNIX system? An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.
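
For instance, the difference is roughly the following; the paths and parameter values are placeholders, and the header and cleanup calls are the hypothetical ones sketched above:

#include <stdio.h>
#include "search3.h"   /* the hypothetical interface sketched above */

int main(void)
{
    /* The database is an argument, not ambient state pulled from the
       environment, so two can be open in the same process. */
    dbh *french  = init_dbh_folder("/path/to/french/db");
    dbh *english = init_dbh_folder("/path/to/english/db");

    search *s1 = new_search(french,  "phrase", 0, 1, "/tmp/corpus.1", 0, 1000000);
    search *s2 = new_search(english, "phrase", 0, 1, "/tmp/corpus.2", 0, 1000000);

    process_input(s1, stdin);   /* s2 could be fed from another stream the same way */

    free_search(s1); free_search(s2);
    free_dbh(french); free_dbh(english);
    return 0;
}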

Or maybe I'm just a neat freak--regardless, the [simplified] top-level architecture of 'search3.999' (an asymptotic approach to an as-yet-unannounced product) should show the point of it all:

int main(int argc, char *argv[])
{
    static struct option long_options[] = {
        {"ascii",      no_argument,       0, 'a'},
        {"corpussize", required_argument, 0, 'c'},
        {"corpusfile", required_argument, 0, 'f'},
        {"debug",      required_argument, 0, 'd'},
        {"limit",      required_argument, 0, 'l'},
        {0, 0, 0, 0}
    };

    //...process options with GNU getopt_long...

    db = init_dbh_folder(dbname);
    if (!method_set) {
        strncpy(method, "phrase", 256);   // default search method
    }
    s = new_search(db,
                   method,
                   ascii_set,
                   corpussize,
                   corpusfile,
                   debug,
                   limit);
    status = process_input(s, stdin);

    //...print output...
    //...free memory...
    return 0;
}

An equivalent command-line call would be:
search3999 --ascii --limit 1000000 --corpussize 1 --corpusfile /tmp/corpus.10866 dbname search_method
which is definitely an improvement.  It can also print a help message.

Beyond organizational issues, I also ended up rewriting large portions of the decompression routines.  The database can now fully configure itself at runtime, which adds about 4 ms to each request, but with the benefit that database builds no longer require compilation.  TODO: the overhead can be eliminated if we store the database parameters as integers, rather than as formatted text files.
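
To make that TODO concrete, the two approaches look roughly like this; the file names, field names, and parameter struct are invented for illustration:

#include <stdio.h>

/* A made-up parameter block; the real database has many more fields. */
struct db_params { long word_count; int block_size; };

/* Current approach, roughly: parse a formatted text file on every
   request.  Flexible -- no compilation at build time -- but the
   parsing is where the extra milliseconds go. */
static int load_params_text(const char *path, struct db_params *p)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int ok = fscanf(f, "word_count: %ld block_size: %d",
                    &p->word_count, &p->block_size) == 2;
    fclose(f);
    return ok ? 0 : -1;
}

/* The TODO: store the same values as raw integers, so that loading
   is a single read with no parsing. */
static int load_params_binary(const char *path, struct db_params *p)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(p, sizeof *p, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}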

I think at this point the codebase is clean enough to try hooking it up to Python via ctypes, and then experiment with other scripting-language bindings.  Once I clean up the makefiles I'll put it up on our repository.


Reclassifying the Encyclopédie

Diderot and D'Alembert's Encyclopédie might almost have been designed as a document-classification exercise. For starters, it comes complete with a branching, hierarchical ontology of classes of knowledge. Of the 77,000+ articles it contains, 60,000 are classified according to this system, while 17,000 were left unclassified, providing a ready-made training set and evaluation set, respectively. To make it challenging, the editors applied some of the classifications with obfuscatory intent, or at least obfuscatory effect, rendering topic boundaries somewhat fuzzy. The categories span the entire breadth of human knowledge, and the articles range from brief renvois ("see XXX") to protracted philosophical treatises. In short, it has everything to make a machine learner happy, and miserable, in one package.


At ARTFL we've been mining this rich vein for some time now. We presented Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie at Digital Humanities 2007, detailing our initial attempts at classification and the critical interpretation of machine learning results. We followed up at DH 2008 with Twisted Roads and Hidden Paths, in which we expanded our toolkit to include k-nearest-neighbor vector-space classifications and a meta-classifying decision tree. Where we had previously achieved around 72% accuracy in categorizing medium-length and long articles using Naive Bayes alone, with multiple classifiers combined in this way we were able to get similar rates of accuracy over the entire encyclopedia, including the very short articles, which are quite difficult to classify due to their dearth of distinctive content. This post describes an effort to productionize the results of that latter paper, in order to insert our new, machine-generated classifications into our public edition of the Encyclopédie.

For the impatient, jump ahead to the ARTFL Encyclopédie search form and start digging. The new machine-generated classifications can be searched just like any other item of PhiloLogic metadata, allowing very sophisticated queries to be constructed.

For instance, we can ask questions like "Are there any articles originally classified under Géographie that are reclassified as Philosophie?" In fact there are several, and it's interesting to peruse them and deduce why their original and generated classifications fall as they do. The editors followed a policy of not including biographies in the Encyclopédie, but evidently could not restrain themselves in many cases. Instead of creating a biography class, however, they categorized such entries under the headword corresponding to the region of the notable person's birth, and assigned them the class Géographie. Thus the article JOPOLI contains a discussion of the philosopher Augustin Nyphus, born there in 1472, and hence is classified by our machine learner under Philosophie.

Our goals in re-classifying the Encyclopédie are several: to provide better access for our users by adding class descriptions to previously unclassified articles; to identify articles that are re-classified differently from their original classes, allowing users to find them by their generated classes which are often more indicative of the overall content of an article; and to identify interesting patterns in the authors' uses of their classification system, again primarily by seeing what classes tend to be re-classified differently.


We initially undertook to examine a wide range of classifiers including Naive Bayesian, SVM and KNN vector space, with a range of parameters for word count normalization and other settings. After examining hundreds of such runs, we found two that, combined, provided the greatest accuracy in correctly re-classifying articles to their previous classifications: Naive Bayes, using simple word counts, and KNN, using 50 neighbors and tf-idf values for the feature vectors.


Each classifier alone was right about 64% of the time -- but together, at least one of them was right 77% of the time. If we could only decide which one to trust when they differed on a given classification decision, we could reap a substantial gain in accuracy on the previously classified articles, and presumably get more useful classifications of the unclassified articles. We should note that we retained the class labels for each article, which appear at the beginning of the text, for these runs, giving our classifiers an unfair advantage in re-classifying the articles that had such labels present. The class labels get no more weight, however, than any other words in the article. We retained them because our primary objective is to accurately classify the unclassified articles, which do not contain these labels, but may well contain words from these labels in other contexts.


It turned out that KNN was most accurate on smaller articles and smaller classes, whereas Naive Bayes worked best on longer articles that belonged to bigger classes, which gave us something to go on when deciding which classifier got to make the call when they were at odds with each other. By feeding the article and class metadata into a simple decision tree classifier, along with the results of each classifier, we were able to learn some rules for deciding which classifier to prefer for a given decision where they disagreed on the class assignment. See the decision tree in the DH 2008 paper for the details.
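
In code, the learned preference amounts to something like the following caricature; the real rules come from the decision tree in the paper, and the thresholds here are placeholders:

#include <string.h>

/* When the two base classifiers disagree, prefer the one whose
   strengths fit the article.  Threshold values are invented. */
const char *meta_classify(const char *nb_class, const char *knn_class,
                          int article_length, int class_size)
{
    if (strcmp(nb_class, knn_class) == 0)
        return nb_class;      /* the easy case: they agree               */
    if (article_length < 100 || class_size < 50)
        return knn_class;     /* KNN: short articles, small classes      */
    return nb_class;          /* Naive Bayes: long articles, big classes */
}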


Of course, we couldn't make the perfect decision every time, but we were close enough to increase our accuracy on previously classified articles to 73%, 9% higher than the average of the individual classifiers. By using a meta-classifier to learn the relative strengths and weaknesses of the sub-classifiers, we were able to better exploit them to get more interesting data for our users, and peel back another layer of the great Encyclopédie. Additionally, we learned characteristics of the classifiers themselves that will enable us to target their applications more precisely in the future.




P.S.: For all you Diderot-philes, here are the stats on the original and machine-learned classes of the articles he authored:


If any of this piques your interest, please do get in touch with us (artfl dawt project at gee-male dawt com should work). We'd love to discuss our work and possible future directions. Or come check us out in Virginia next week!

Using the JSON Perl module

I just thought I'd make a quick blog post on how to use the JSON Perl module. Why use JSON when we have XML? I'll leave that to Russ or Richard, but to make a long story short: easier object handling for the projected JavaScript-driven DVLF.
The Perl JSON module is actually very easy and pleasant to use; it will convert your Perl data structure into JSON without breaking a sweat.
Here's a quick example which I hope will be useful:

#!/usr/bin/perl
use strict;
use warnings;
use JSON;

my @list_of_files = @ARGV;    # files to read, given on the command line

my %hash;
foreach my $file (@list_of_files) {
    open( my $fh, '<', $file ) or die "Can't open $file: $!";
    my @lines;
    while (<$fh>) {
        push( @lines, $_ );
    }
    close($fh);
    $hash{$file} = \@lines;    # store an array reference in the hash
}

my $obj  = \%hash;
my $json = JSON->new->pretty;    # pretty-printed, indented output
my $js   = $json->encode($obj);  # convert the Perl data structure to JSON
print "$js\n";

And done!