shlax and ElementTree

I've just pushed a few commits to the central philo4 repository;
mostly small bugfixes to the makefile and the parser, but I added a convenience method to the shlax XML parser.

As you may know, Python has a really nice XML library called ElementTree, but it has a few quirks:
1) it uses standard, "fussy" XML parsers that choke on the slightest flaw, and
2) it has a formally correct but incomprehensible approach to namespaces that is exceedingly impractical for day-to-day TEI hacking.

In this update, I've added a shlaxtree module to the philo4 distribution that hooks our fault-tolerant, namespace-agnostic XML parser up to ElementTree's XPath evaluator and serialization facilities. It generally prefers the 1.3 version of ElementTree, which is standard in Python 2.7 and a simple install in 2.6 and 2.5.

Basically, the method philologic.shlaxtree.parse() will take in a file object, and return the root node of the xml document in the file, assuming it found one. You can use this to make a simple bibliographic extractor like so:


#!/usr/bin/env python
import philologic.shlaxtree as st
import sys
import codecs

for filename in sys.argv[1:]:
    file = codecs.open(filename, "r", "utf-8")
    root = st.parse(file)
    header = root.find("teiHeader")
    print st.et.tostring(header)
    print header.findtext(".//titleStmt/title")
    print header.findtext(".//titleStmt/author")



Not bad for 10 lines, no? What's really cool is that you can modify trees, nodes, and fragments before writing them out, with neat recursive functions and what not. I've been using it for converting old SGML dictionaries to TEI--once you get the hang of it, it's much easier than regular expressions, and much easier to maintain and modify as well.
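For a flavor of that recursive style, here's a minimal sketch of a tree transformation; the tag mapping is hypothetical, not taken from a real conversion:

```python
import xml.etree.ElementTree as et

# Hypothetical SGML-to-TEI tag mapping, purely for illustration.
RENAME = {"DIV1": "div", "HEAD": "head", "P": "p"}

def transform(node):
    """Rename this node's tag, then recurse over its children in place."""
    node.tag = RENAME.get(node.tag, node.tag)
    for child in node:
        transform(child)
    return node

root = et.fromstring("<DIV1><HEAD>Title</HEAD><P>Some text.</P></DIV1>")
transform(root)
print(et.tostring(root))  # b'<div><head>Title</head><p>Some text.</p></div>'
```

Because every node is just a mutable Python object, renaming, reattaching, or deleting elements before serializing is a one-liner each.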

shlax: a shallow, lazy XML parser in python

Recently, I stumbled upon a paper from the dawn age of XML:

"REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron

It describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression.

We've all written short regexes to find some particular feature in an XML document, but we've also all seen those fail because of oddities of whitespace, quoting, linebreaks, etc., that are perfectly legal, but hard to account for in a short, line-by-line regular expression.

Standard XML parsers, like expat, are fabulous, well maintained, and efficient. However, they have a common Achilles heel: the XML standard's insistence that XML processors "MUST" report a fatal error if a document contains unbalanced tags. For working with HTML- or SGML-based documents, this is disastrous!

In contrast, Cameron's regex-based parser is extremely fault-tolerant--it extracts as much structure from the document as possible, and reports the rest as plain text. Further, it supports "round-tripping": the ability to exactly re-generate a document from parser output, which standard parsers typically lack. As a corollary of this property, it becomes possible to report absolute byte offsets, which is a "killer feature" for the purposes of indexing.

Because of all these benefits, I've opted to translate his source code from javascript to python. I call my modified implementation "shlax" [pronounced like "shellacs", sort of], a shallow, lazy XML parser. "Shallow" meaning that it doesn't check for well-formedness, and simply reports tokens, offsets, and attributes as best it can. "Lazy" meaning that it iterates over the input, and yields one object at a time--so you don't have to write 8 asynchronous event handlers to use it, as in a typical SAX-style parser. This is often called a "pull" parser, but "shpux" doesn't sound as good, does it?

If you're interested, you can look at the source at the libphilo github repo. The regular expression itself is built up over the course of about 30 expressions, to allow for maintainability and readability. I've made some further modifications to Cameron's code to fit our typical workflow. I've buffered the text input, which allows us to iterate over a file-handle, rather than a string--this saves vast amounts of memory for processing large XML files, in particular. And I return "node" objects, rather than strings, that contain several useful items of information:
  1. the original text content of the node
  2. the "type" of the node: text, StartTag, EndTag, or Markup [for DTDs, comments, etc.]
  3. any attributes the node has
  4. the absolute byte offset in the string or file
You don't need anything more than that to power PhiloLogic. If you'd like to see an example of how to use it, take a look at my DirtyParser class, which takes as input a set of xpaths to recognize for containers and metadata, and outputs a set of objects suitable for the index builder I wrote about last time.
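To illustrate the pull style, here is a drastically simplified, REX-flavored shallow tokenizer of my own; the real shlax builds its regular expression from about 30 sub-expressions, while this toy only splits tags from text, but the one-object-at-a-time interface and the byte offsets are the same idea:

```python
import re

# Toy shallow tokenizer: match either a tag-like run or a text run.
# No well-formedness checking -- unbalanced tags are reported as-is.
TOKEN = re.compile(r'<[^>]*>|[^<]+')

def shallow_parse(data):
    """Yield (type, offset, content) tuples lazily, one token at a time."""
    for m in TOKEN.finditer(data):
        tok = m.group()
        if not tok.startswith('<'):
            yield ('text', m.start(), tok)
        elif tok.startswith('</'):
            yield ('EndTag', m.start(), tok)
        elif tok.startswith('<!') or tok.startswith('<?'):
            yield ('Markup', m.start(), tok)
        else:
            yield ('StartTag', m.start(), tok)

# Note the unclosed <b>: a conforming parser would abort with a fatal
# error here; a shallow parser just keeps reporting tokens.
for token in shallow_parse('<div>hello <b>world</div>'):
    print(token)
```

The caller iterates in an ordinary for loop--no event handlers, no callbacks.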

Oh, and about performance: shlax is noticeably slower than Mark's perl loader. I've tried to mitigate that in a variety of ways, but in general, python's regex engine is not as fast as perl's. On the other hand, I've recently had a lot of success with running a load in parallel on an 8-core machine, which I'll write about when the code settles. That said, if efficiency is a concern, our best option would be to use well-formed XML with a standard parser.

So, my major development push now is to refactor the loader into a framework that can handle multiple parser backends, flexible metadata recognizers, and multiple simultaneous parser processes. I'll be posting about that as soon as it's ready.

A Unified Index Construction Library

I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.

The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files.

Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.

Instead, I worked backwards from the decompression routines I wrote last month, to write a new index construction library from scratch.

Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:

1) simple, efficient operation
2) flexible enough for various index formats
3) easy to bind to other languages
4) fully compatible with 3-series PhiloLogic

The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.
while (1) {
    // as long as we read lines from standard input.
    if (fgets(line, 511, stdin) == NULL) {
        hitbuffer_finish(hb);
        break;
    }
    // scan for hits in standard Philo3 format.
    state = sscanf(line,
                   "%s %d %d %d %d %d %d %d %d %s\n",
                   word, &hit[0], ...);

    if (state == 10) {
        // if we read a valid hit
        if (strcmp(word, hb->word)) {
            // if we have a new word...
            hitbuffer_finish(hb);     // write out the current buffer.
            hitbuffer_init(hb, word); // and reinitialize.
            uniq_words += 1LLU;       // LLU for a 64-bit unsigned int.
        }
        hitbuffer_inc(hb, hit); // add the hit to whichever word we're on.
        totalhits += 1LLU;
    }
    else {
        fprintf(stderr, "Couldn't understand hit.\n");
    }
}
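For illustration only, the same accumulate-and-flush logic can be sketched in a few lines of Python (minus the compression and database layers that hitbuffer actually handles); as in the C loop, the input is assumed sorted by word, and each word-change boundary corresponds to a hitbuffer_finish()/hitbuffer_init() pair:

```python
from itertools import groupby

def pack_index(lines):
    """Group Philo3-format hit lines (word + 9 fields) by word."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) != 10:  # state != 10: "Couldn't understand hit."
            continue
        hits.append((fields[0], tuple(int(f) for f in fields[1:9])))
    # One dictionary entry per run of identical words == one buffer flush.
    return {word: [hit for _, hit in group]
            for word, group in groupby(hits, key=lambda h: h[0])}

index = pack_index(["the 1 1 1 1 1 1 1 1 9",
                    "the 1 1 1 1 2 2 2 2 17"])
print(len(index["the"]))  # 2
```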


The code is publicly available on github, but I'm having some problems with their web interface. I'll post a link once it's sorted out.

Vector Processing for OHCO

I've posted an expanded version of my CI Days talk on Google docs. I'd recommend looking at the speaker notes (click "actions" on the bottom left) since I won't be narrating it in person.

The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does. This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can. It's also the intellectual framework that I'm using to plan new features, like search on line and meter position, metadata, joins, etc. Hopefully, I can get someone who's better at math than I am to help me tighten up the formalities.

Basically, I refer to the infamous OHCO thesis as a useful axiom for translating the features of a text into a set of numerical objects, and then compare the characteristics of this representation to XML or Relational approaches. I'd love to know how interesting/useful/comprehensible others find the presentation, or the concept. What needs more explanation? What gets tedious?

If you look at the speaker notes, you can see me derive a claim that PhiloLogic runs 866 times faster than a relational database for word search. Math is fun!
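To give a flavor of the representation half of the argument, here is a toy of my own devising (not PhiloLogic's actual index layout): every word occurrence becomes a vector of hierarchical coordinates, and word search reduces to a dictionary lookup returning sorted tuples.

```python
# Toy OHCO-style index: each occurrence is addressed by a vector of
# hierarchical coordinates, here (document, section, paragraph, word),
# with the upper levels held fixed for brevity.
corpus = "the cat sat . the dog ran .".split()
index = {}
for offset, word in enumerate(corpus):
    index.setdefault(word, []).append((1, 1, 1, offset))

print(index["the"])  # [(1, 1, 1, 0), (1, 1, 1, 4)]
```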

PhiloLogic proto-binding for Python


In an earlier post, I mentioned that I'd try to to call the philologic C routines via ctypes, a Python Foreign Function Interface library. I did, and it worked awesomely well! Ctypes lets you call C functions from python without writing any glue at all in some cases, giving you access to high-performance C routines in a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.

First, I had to compile the search engine as a shared library, rather than an executable:

gcc -dynamiclib -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o -lgdbm -o libphilo.dylib

All that refactoring certainly paid off. The search4 executable will now happily link against the shared library with no modification, and so can any other program that wants high-speed text object search:

#!/usr/bin/python

import sys,os
from ctypes import *

# First, we need to get the C standard library loaded in
# so that we can pass python's input on to the search engine.
stdlib=cdll.LoadLibrary("libc.dylib")
stdin = stdlib.fdopen(sys.stdin.fileno(),"r")
# Honestly, that's an architectural error.
# I'd prefer to pass in strings, not a file handle

# Now load in philologic from a shared library
libphilo = cdll.LoadLibrary("./libphilo.dylib")

# Give it a path to the database. The C routines parse the db definitions.
db = libphilo.init_dbh_folder("/var/lib/philologic/databases/mvotest5/")

# now initialize a new search object, with some reasonable defaults.
s = libphilo.new_search(db,"phrase",None,1,100000,0,None)

# Read words from standard input.
libphilo.process_input(s,stdin)

# Then dump the results to standard output.
libphilo.search_pass(s,0)
# Done.


That was pretty easy, right? Notice that there weren't any boilerplate classes. I could hold pointers to arbitrary data in regular variables, and pass them directly into the C subroutines as void pointers. Not safe, but very, very convenient.

Of course, this opens us up for quite a bit more work: the C library really needs a lot more ways to get data in and out than a pair of input/output file descriptors, I would say. In all likelihood, after some more experiments, we'll eventually settle on a set of standard interfaces, and generate lower-level bindings with SWIG, which would allow us to call philo natively from Perl or PHP or Ruby or Java or LISP or Lua or...anything, really.

Ctypes still has some advantages over automatically-generated wrappers, however. In particular, it lets you pass python functions back into C, allowing us to write search operators in python, rather than C--for example, a metadata join, or a custom optimizer for part-of-speech searching. Neat!
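For a taste of the mechanism, here is a self-contained sketch that passes a Python comparator back into C--using libc's qsort as the stand-in, since philo's own callback signatures aren't settled yet. A python-side search operator would use the same CFUNCTYPE machinery with a different signature:

```python
from ctypes import CDLL, CFUNCTYPE, POINTER, c_int, sizeof

libc = CDLL(None)  # load the C library from the current process (POSIX)

# Describe the C-side callback type, then wrap a Python function in it.
CMPFUNC = CFUNCTYPE(c_int, POINTER(c_int), POINTER(c_int))

def py_cmp(a, b):
    return a[0] - b[0]  # dereference the int pointers and compare

values = (c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(values, len(values), sizeof(c_int), CMPFUNC(py_cmp))
print(list(values))  # [1, 2, 3, 4, 5]
```

C calls the Python function once per comparison--slow, but it means the inner logic can live in Python while the heavy lifting stays in C.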


Unix Daemon Recipes

I was digging through some older UNIX folkways when I stumbled upon an answer to a long-standing PhiloLogic design question:

How do I create a long-running worker process that will neither:

1) terminate when its parent terminates, such as a terminal session or a CGI script, or
2) create the dreaded "zombie" processes that clog process tables and eventually crash the system.

As it turns out, this is the same basic problem as any UNIX daemon program; this one just happens to be designed to, eventually, terminate. PhiloLogic needs processes of this nature in various places: most prominently, to allow the CGI interface to return preliminary results.

Currently, we use a lightweight Perl daemon process, called nserver.pl, to accept search requests from the CGI scripts, invoke the search engine, and then clean up the process after it terminates. Effective, but there's a simpler way, with a tricky UNIX idiom.

First, fork(). This allows you to return control to the terminal or CGI script. If you aren't going to exit immediately, you should ignore SIGCHLD as well, so that you don't get interrupted later.

Second, have the child process call setsid() to gain a new session, and thus detach from the parent. This prevents terminal hangups from killing the child process.

Third, call fork() again, then immediately exit the (original) child. The new "grandchild" process is now an "orphan", and detached from a terminal, so it will run to completion, and then be reaped by the system, so you can do whatever long-term analytics you like.

A command line example could go like this:


#!/usr/bin/perl
use POSIX qw(setsid);

my $word = $ARGV[0] or die "Usage:searchwork.pl word outfile\n";
my $outfile = $ARGV[1] or die "Usage:searchwork.pl word outfile\n";

print STDERR "starting worker process.\n";
&daemonize;

open(SEARCH, "| search4 --ascii --limit 1000000 /var/lib/philologic/somedb");

print SEARCH "$word\n";
close(SEARCH);

exit;

sub daemonize {
    open STDIN, '/dev/null' or die "Can't read /dev/null: $!";
    open STDOUT, '>>/dev/null' or die "Can't write to /dev/null: $!";
    open STDERR, '>>/dev/null' or die "Can't write to /dev/null: $!";
    defined(my $childpid = fork) or die "Can't fork: $!";
    if ($childpid) {
        print STDERR "[parent process exiting]\n";
        exit;
    }
    setsid or die "Can't start a new session: $!";
    print STDERR "Child detached from terminal\n";
    defined(my $grandchildpid = fork) or die "Can't fork: $!";
    if ($grandchildpid) {
        print STDERR "[child process exiting]\n";
        exit;
    }
    umask 0;
}


The benefit is that a similar &daemonize subroutine could entirely replace nserver, and thus vastly simplify the installation process. There's clearly a lot more that could be done with routing and control, of course, but this is an exciting proof of concept, particularly for UNIX geeks like myself.

The Joy of Refactoring Legacy Code

I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.

PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. The flow of control in a typical Philo install looks something like this:

--CGI script search3t accepts user requests, and parses them.
--CGI passes requests off to a long-running Perl daemon process, called nserver.
--nserver spawns a long-running worker process search3 to evaluate the request
--the worker process loads in a compiled decompression module, at runtime, specific to the database.
--search3t watches the results of the worker process
--when the worker is finished, or outputs more than 50 results, search3t passes them off to a report generator.

This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay. The command line arguments to search3, in particular, are arcane and undocumented. A typical example:

export SYSTEM_DIR=/path/to/db
export LD_LIBRARY_PATH=/path/to/db/specific/decompression/lib/
search3 -P:binary -E:L=1000000 -S:phrase -E:L=1000000 -C:1 /tmp/corpus.10866 

The internals are quite a bit scarier.  Arguments are processed haphazardly in bizarre corners of the code, and many paths and filenames are hard-coded in.  And terrifying unsafe type-casts abound.  Casting a structure containing an array of ints into an array of ints?  Oh my.

I've long been advocating a much, much simpler interface to the search engine. The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code. There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.

How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library? Slowly and carefully. Modern debugging tools like Valgrind help, as does the collective C wisdom preserved by Google. A particular issue is imperative vs. object-oriented or functional style. Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a "singleton" object, but in practice a real headache.

For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment. But what if you want to search two databases at once? What if you don't have a UNIX system? An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.

Or maybe I'm just a neat freak--regardless, the [simplified] top-level architecture of 'search3.999' {an asymptotic approach to an as-yet unannounced product} should show the point of it all:

{
    static struct option long_options[] = {
        {"ascii", no_argument, 0, 'a'},
        {"corpussize", required_argument, 0, 'c'},
        {"corpusfile", required_argument, 0, 'f'},
        {"debug", required_argument, 0, 'd'},
        {"limit", required_argument, 0, 'l'},
        {0, 0, 0, 0}
    };

    //...process options with GNU getopt_long...

    db = init_dbh_folder(dbname);
    if (!method_set) {
        strncpy(method, "phrase", 256);
    }
    s = new_search(db,
                   method,
                   ascii_set,
                   corpussize,
                   corpusfile,
                   debug,
                   limit);
    status = process_input(s, stdin);

    //...print output...
    //...free memory...
    return 0;
}

An equivalent command-line call would be:
search3999 --ascii --limit 1000000 --corpussize 1 --corpusfile /tmp/corpus.10866 dbname search_method
which is definitely an improvement.  It can also print a help message.

Beyond organizational issues, I also ended up rewriting large portions of the decompression routines.  The database can now fully configure itself at runtime, which adds about 4 ms to each request, but with the benefit that database builds no longer require compilation.  TODO: The overhead can be eliminated if we store the database parameters as integers, rather than as formatted text files.

I think at this point the codebase is clean enough to try hooking up to python, via ctypes, and then experiment with other scripting language bindings.  Once I clean up the makefiles I'll put it up on our repository.


Reclassifying the Encyclopédie

Diderot and D'Alembert's Encyclopédie might almost have been designed as a document classification exercise. For starters, it comes complete with a branching, hierarchical ontology of classes of knowledge. Out of the 77,000+ articles contained therein, 60,000 are classified according to this system, while 17,000 were left unclassified, providing a ready-made training set and evaluation set, respectively. To make it challenging, within the classified articles, the editors have chosen to apply some classifications with obfuscatory intent or, at least, result, rendering topic boundaries somewhat fuzzy. The categories span the entire breadth of human knowledge, and the articles range from brief renvois ("see XXX") to protracted philosophical treatises. In short, it has everything to make a machine learner happy, and miserable, in one package.


At ARTFL we've been mining this rich vein for some time now. We presented Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie at Digital Humanities 2007, detailing our initial attempts at classification and the critical interpretation of machine learning results. We followed up at DH 2008 with Twisted Roads and Hidden Paths, in which we expanded our toolkit to include k-nearest-neighbor vector space classifications, and a meta-classifying decision tree. Where we had previously achieved around 72% accuracy in categorizing medium-length and long articles using Naive Bayes alone, using multiple classifiers combined in this way we were able to get similar rates of accuracy over the entire encyclopedia, including the very short articles, which are quite difficult to classify due to their dearth of distinctive content. This post describes an effort to productionize the results of that latter paper, in order to insert our new, machine-generated classifications into our public edition of the Encyclopédie.

For the impatient, jump ahead to the ARTFL Encyclopédie search form and start digging. The new machine generated classifications can be searched just as any other item of Philologic metadata, allowing very sophisticated queries to be constructed.

For instance, we can ask questions like "Are there any articles originally classified under Géographie that are reclassified as Philosophie?" In fact there are several, and it's interesting to peruse them and deduce why their original classifications and generated classifications fall as they do. The editors followed a policy of not including biographies in the Encyclopédie, but evidently could not restrain themselves in many cases. Instead of creating a biography class, however, they categorized such entries under the headword corresponding to the region of the notable person's birth, and assigned it the class Géographie. Thus the article JOPOLI contains a discussion of the philosopher Augustin Nyphus, born there in 1472, and hence is classified by our machine learner under Philosophie.

Our goals in re-classifying the Encyclopédie are several: to provide better access for our users by adding class descriptions to previously unclassified articles; to identify articles that are re-classified differently from their original classes, allowing users to find them by their generated classes which are often more indicative of the overall content of an article; and to identify interesting patterns in the authors' uses of their classification system, again primarily by seeing what classes tend to be re-classified differently.


We initially undertook to examine a wide range of classifiers including Naive Bayesian, SVM and KNN vector space, with a range of parameters for word count normalization and other settings. After examining hundreds of such runs, we found two that, combined, provided the greatest accuracy in correctly re-classifying articles to their previous classifications: Naive Bayes, using simple word counts, and KNN, using 50 neighbors and tf-idf values for the feature vectors.


Each classifier alone was right about 64% of the time -- but together, at least one of them was right 77% of the time. If we could only decide which one to trust when they differed on a given classification decision, we could reap a substantial gain in accuracy on the previously classified articles, and presumably get more useful classifications of the unclassified articles. We must note that the class labels for each article, which appear at the beginning of the text, were retained for these runs, giving our classifiers an unfair advantage in re-classifying the articles that had such labels present. The class labels get no more weight, however, than any other words in the article. We retained them because our primary objective is to accurately classify the unclassified articles, which do not contain these labels, but may well contain words from these labels in other contexts.


It turned out that KNN was most accurate on smaller articles and smaller classes, whereas Naive Bayes worked best on longer articles that belonged to bigger classes, which gave us something to go on when deciding which classifier got to make the call when they were at odds with each other. By feeding the article and class meta-data into a simple decision tree classifier, along with the results of each classifier, we were able to learn some rules for deciding which classifier to prefer for a given decision where they disagreed on the class assignment. See the decision tree in the DH 2008 paper for the details.
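As a caricature of that arbitration logic, the learned rule might look something like the sketch below; the 100-word threshold and the function shape are invented for illustration, since the real rule was learned by a decision tree from article and class metadata:

```python
# Toy meta-classifier: agree when the base classifiers agree; otherwise
# prefer KNN on short articles and Naive Bayes on long ones. The 100-word
# threshold is hypothetical -- the real cutoffs came from a decision tree.
def meta_classify(nb_label, knn_label, article_length):
    if nb_label == knn_label:
        return nb_label
    return knn_label if article_length < 100 else nb_label

print(meta_classify("Geographie", "Philosophie", 45))  # Philosophie
```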


Of course, we couldn't make the perfect decision every time, but we were close enough to increase our accuracy on previously classified articles to 73%, 9% higher than the average of the individual classifiers. By using a meta-classifier to learn the relative strengths and weaknesses of the sub-classifiers, we were able to better exploit them to get more interesting data for our users, and peel back another layer of the great Encyclopédie. Additionally, we learned characteristics of the classifiers themselves that will enable us to target their applications more precisely in the future.




P.S.: For all you Diderot-philes, here are the stats on the original and machine-learned classes of the articles he authored:


If any of this piques your interest, please do get in touch with us (artfl dawt project at gee-male dawt com should work). We'd love to discuss our work and possible future directions. Or come check us out in Virginia next week!

using the JSON perl mod

I just thought I'd make a quick blog post on how to use the JSON perl mod. Why use JSON when we have XML? I'll leave that to Russ or Richard, but to make a long story short: easier object handling for the projected javascript-driven DVLF.
So, the perl JSON module is actually very easy and nice to use; it will convert your perl data structure into JSON without breaking a sweat.
Here's a quick example which I hope will be useful :

#!/usr/bin/perl
use strict;
use warnings;
use JSON;

my @list_of_files = @ARGV;    # the files to convert

my %hash;
foreach my $file (@list_of_files)  {
        open(FILE, "<", $file) or die "Can't open $file: $!";
        my @array;
        while (<FILE>) {
                push(@array, $_);
        }
        close(FILE);
        $hash{$file} = [@array];    # store array reference in hash
}

my $obj = \%hash;
my $json = new JSON;
my $js = $json->encode($obj, {pretty => 1, indent => 2}); # convert Perl data structure to JSON representation
print "$js\n";

And done!

Lemma Collocations on Perseus

See UPDATE at the end.

This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skim to the end, the rest describes how I got there.

It is not so difficult to look up the lemma for each word surrounding a certain search hit, as for these texts the structure to do so is already in place and the information is stored in SQL tables. Efficiency in gathering this information is the main difficulty. Looking up the lemma in the tables now in place can take a couple different SQL queries, which each take up a small chunk of time. For a few lookups, or even a few hundred, this is not too big of a problem. However, for a collocation table spanning five words on either side, we need at least 10 lookups in the databases per hit. The time it takes to do that adds up quite quickly.

So, following a suggestion from Mark, I wrote a script and generated a file with a line for every word in the Perseus database. Basically, it takes each word id starting with 1 on up to 5.5 million something and looks up its lemma. This generated a 5.5 million line file with lines like this:

2334550 δέ
2334551 ὅς
2334552 ἕκαστος
2334553 ἵππος
2334554 nolemma
2334555 ὅς
2334556 δέ
2334557 πέτομαι
2334558 κονίω
2334559 πεδίον

Now, looking up the words on either side of a hit is much simpler - all you need to know is the "word id" of the hit and you can look at those around it. The "nolemma" entries are primarily punctuation and such which were given word tags at some point.

The size of this massive file, however, was another hindrance to efficient generation of collocation tables. Although we now know exactly what line we need to look at for the information we need, getting that line is still costly, as it could require reading in a couple million lines of the file to get to the one we need. After playing around with it a bit, command line grep searching seemed to be the fastest way to go, but even a grep search 10 times per hit adds up fast. So, I tried combining the searches into one massive egrep command to be read into a perl array. My searches looked something like:

egrep "23345[456][0-9]" lemmafile

Giving a window of 30 lines in the file starting at 2334540 and ending at 2334569. This limited the searches to one per hit instead of 10, but it still wasn't fast enough. So, I combine all of the searches like so:

egrep "(2342[234]|33329[678]|...|829[567])[0-9]"

(A bit of accounting for numbers ending in 0 is needed so that a window around 400 doesn't include things in the 490's, but this is not too difficult.)

This looked nice and seemed to work until I tried running it on more hits. It was then coming up with such massive regular expressions to grep for that grep was complaining that they were too big. So, I broke them up into chunks of roughly 350 at a time. Fewer, and the time would go up due to the added number of grep searches; too many more, and grep would overflow again. I may not have hit on the exact time minimizing value, but it is close at least.

Finally, here are some example searches:

Search for logoi.
Search for lemma:logos. (The time estimate is higher than actual load time).

Or, here is the search form:

Search Form.

Make sure that you choose Collocation Table and check the lemma button under the "Refined Search Results" Tab at the bottom.

It can handle most searches, except for very high frequency words. If anyone has ideas on how to make it faster, it could perhaps enable us to get results for all searches. Though perhaps this is not possible without somehow creating and saving those results somewhere.


UPDATE:

After talking to Mark, I altered the way the data is read from the file and now things should be running faster. The reason that all this discussion and speed streamlining for lemmatized collocation tables is necessary is the fact that the texts on Perseus do not have the lemmas embedded in the text. As Mark noted, many of the other databases would allow for much simpler and faster generation of the same data due to the fact that they do have lemmas in the text. However, for the purposes of Perseus, lemmas needed to be separated from the texts to allow them to be more dynamically updated, changed and maintained.


As for the speed, it should now be faster thanks to a handy function in Perl. I had investigated methods for reading a certain line of a file, since I happened to know exactly what lines I needed. However, finding none that did not read the whole contents of the file up to that line, I instead implemented the process described above. I overlooked SEEK. I dismissed it because it starts from a certain byte offset and not a certain line. Nevertheless, we can harness its power by simply padding each line with spaces to ensure every line in our file is the same byte length. With this pointer from Mark and some padding on the lines, knowing the line number and the number of bytes per line is enough to start reading from the exact location in the file that we desire.
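In Python terms, the padding trick looks like the sketch below; the 32-byte width, the transliterated lemmas, and the in-memory file are all stand-ins for illustration. Note that since the real file holds UTF-8 Greek, the padding has to be counted in bytes, not characters:

```python
import io

LINE_WIDTH = 32  # bytes per line, newline included (hypothetical width)

# Build a toy fixed-width lemma file: every line is space-padded to
# exactly LINE_WIDTH bytes, so line n starts at byte n * LINE_WIDTH.
entries = ["2334550 de", "2334551 hos", "2334552 hekastos"]
buf = io.BytesIO()
for entry in entries:
    buf.write(entry.encode("utf-8").ljust(LINE_WIDTH - 1) + b"\n")

def lemma_line(f, lineno):
    """Fetch one line by number with a single seek, no scanning."""
    f.seek(lineno * LINE_WIDTH)
    return f.read(LINE_WIDTH).decode("utf-8").rstrip()

print(lemma_line(buf, 2))  # 2334552 hekastos
```

The lookup cost is now constant per line, regardless of where in the 5.5-million-line file it falls.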