PhiloLogic proto-binding for Python


In an earlier post, I mentioned that I'd try to call the PhiloLogic C routines via ctypes, a Python foreign function interface library. I did, and it worked awesomely well! Ctypes lets you call C functions from Python without writing any glue code at all in some cases, giving you access to high-performance C routines from a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.
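To see how little glue is involved, here's a tiny warm-up against plain libc rather than anything PhiloLogic-specific (the library name matches my macOS setup; on Linux it would be "libc.so.6"):

#!/usr/bin/python
from ctypes import CDLL, c_char_p, c_size_t

libc = CDLL("libc.dylib")          # macOS; on Linux this would be "libc.so.6"
libc.strlen.argtypes = [c_char_p]  # declare the C prototype
libc.strlen.restype = c_size_t

print(libc.strlen(b"philologic"))  # 10 -- no wrapper code needed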

First, I had to compile the search engine as a shared library, rather than an executable:

gcc -dynamiclib -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o -lgdbm -o libphilo.dylib
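(The -dynamiclib flag and the .dylib extension are macOS-isms; on Linux, assuming the object files were compiled with -fPIC, the equivalent invocation would be something like:

gcc -shared -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o -lgdbm -o libphilo.so

but I haven't tested that build here.)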

All that refactoring certainly paid off. The search4 executable will now happily link against the shared library with no modification, and so can any other program that wants high-speed text object search:

#!/usr/bin/python

import sys,os
from ctypes import *

# First, we need to get the C standard library loaded in
# so that we can pass python's input on to the search engine.
stdlib=cdll.LoadLibrary("libc.dylib")
stdin = stdlib.fdopen(sys.stdin.fileno(),"r")
# Honestly, that's an architectural error.
# I'd prefer to pass in strings, not a file handle

# Now load in philologic from a shared library
libphilo = cdll.LoadLibrary("./libphilo.dylib")

# Give it a path to the database. The C routines parse the db definitions.
db = libphilo.init_dbh_folder("/var/lib/philologic/databases/mvotest5/")

# now initialize a new search object, with some reasonable defaults.
s = libphilo.new_search(db,"phrase",None,1,100000,0,None)

# Read words from standard input.
libphilo.process_input(s,stdin)

# Then dump the results to standard output.
libphilo.search_pass(s,0)
# Done.


That was pretty easy, right? Notice that there weren't any boilerplate classes. I could hold pointers to arbitrary data in regular variables, and pass them directly into the C subroutines as void pointers. Not safe, but very, very convenient.
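If we kept going with ctypes, one cheap safety improvement would be to declare the return and argument types we expect: by default ctypes assumes every function returns a C int, which can truncate pointers on 64-bit builds. The signatures below are guesses based on the calls in the script above, not the library's actual prototypes:

from ctypes import c_void_p, c_char_p

# Hypothetical prototypes -- adjust to match the real headers.
libphilo.init_dbh_folder.argtypes = [c_char_p]
libphilo.init_dbh_folder.restype = c_void_p   # db handle comes back as an opaque pointer
libphilo.new_search.restype = c_void_p        # likewise for the search object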

Of course, this opens us up to quite a bit more work: the C library really needs more ways to get data in and out than a single pair of input/output file descriptors. In all likelihood, after some more experiments, we'll settle on a set of standard interfaces and generate lower-level bindings with SWIG, which would allow us to call philo natively from Perl or PHP or Ruby or Java or LISP or Lua or...anything, really.

Ctypes still has some advantages over automatically-generated wrappers, however. In particular, it lets you pass Python functions back into C, allowing us to write search operators in Python rather than C--for example, a metadata join, or a custom optimizer for part-of-speech searching. Neat!
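As a sketch of the mechanism (using libc's qsort rather than libphilo, since the PhiloLogic callback interface doesn't exist yet), CFUNCTYPE wraps a Python function in a pointer that C code can call back into:

#!/usr/bin/python
from ctypes import CDLL, CFUNCTYPE, POINTER, c_int, sizeof

libc = CDLL("libc.dylib")  # "libc.so.6" on Linux

# Wrap a Python comparison function so C code can call back into it.
CMPFUNC = CFUNCTYPE(c_int, POINTER(c_int), POINTER(c_int))
def py_cmp(a, b):
    return a[0] - b[0]

values = (c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(values, len(values), sizeof(c_int), CMPFUNC(py_cmp))
print(list(values))  # [1, 5, 7, 33, 99]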


Unix Daemon Recipes

I was digging through some older UNIX folkways when I stumbled upon an answer to a long-standing PhiloLogic design question:

How do I create a long-running worker process that will neither:

1) terminate when its parent terminates, such as a terminal session or a CGI script, nor
2) create the dreaded "zombie" processes that clog process tables and eventually crash the system?

As it turns out, this is the same basic problem that any UNIX daemon program faces; this one just happens to be designed to terminate eventually. PhiloLogic needs processes of this nature in various places: most prominently, to allow the CGI interface to return preliminary results.

Currently, we use a lightweight Perl daemon process, called nserver.pl, to accept search requests from the CGI scripts, invoke the search engine, and then clean up the process after it terminates. Effective, but there's a simpler way, with a tricky UNIX idiom.

First, fork(). This allows you to return control to the terminal or CGI script. If the parent isn't going to exit immediately, it should also ignore (or handle) SIGCHLD, so that it doesn't get interrupted later.

Second, have the child process call setsid() to gain a new session, and thus detach from the parent. This prevents terminal hangups from killing the child process.

Third, call fork() again, then immediately exit the (original) child. The new "grandchild" process is now an "orphan", and detached from a terminal, so it will run to completion, and then be reaped by the system, so you can do whatever long-term analytics you like.

A command line example could go like this:


#!/usr/bin/perl
use POSIX qw(setsid);

my $word = $ARGV[0] or die "Usage: searchwork.pl word outfile\n";
my $outfile = $ARGV[1] or die "Usage: searchwork.pl word outfile\n";

print STDERR "starting worker process.\n";
&daemonize;

open(SEARCH, "| search4 --ascii --limit 1000000 /var/lib/philologic/somedb") or die "Can't run search4: $!";

print SEARCH "$word\n";
close(SEARCH);

exit;

sub daemonize {
    open STDIN, '/dev/null' or die "Can't read /dev/null: $!";
    open STDOUT, '>>/dev/null' or die "Can't write to /dev/null: $!";
    open STDERR, '>>/dev/null' or die "Can't write to /dev/null: $!";
    defined(my $childpid = fork) or die "Can't fork: $!";
    if ($childpid) {
        print STDERR "[parent process exiting]\n";
        exit;
    }
    setsid or die "Can't start a new session: $!";
    print STDERR "Child detached from terminal\n";
    defined(my $grandchildpid = fork) or die "Can't fork: $!";
    if ($grandchildpid) {
        print STDERR "[child process exiting]\n";
        exit;
    }
    umask 0;
}


The benefit is that a similar &daemonize subroutine could entirely replace nserver, and thus vastly simplify the installation process. There's clearly a lot more that could be done with routing and control, of course, but this is an exciting proof of concept, particularly for UNIX geeks like myself.
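For what it's worth, the same double-fork idiom translates almost line for line into Python. Here's a rough, minimally error-checked sketch; the search4 invocation is just copied from the Perl example above:

#!/usr/bin/python
import os, sys

def daemonize():
    if os.fork():          # first fork: the parent returns control to the shell or CGI script
        os._exit(0)
    os.setsid()            # new session: detach from the controlling terminal
    if os.fork():          # second fork: the original child exits...
        os._exit(0)        # ...and the grandchild runs on as an orphan
    os.umask(0)
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):   # point stdin/stdout/stderr at /dev/null
        os.dup2(devnull, fd)

if __name__ == "__main__":
    word = sys.argv[1]
    daemonize()
    search = os.popen("search4 --ascii --limit 1000000 /var/lib/philologic/somedb", "w")
    search.write(word + "\n")
    search.close()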