Monday, February 1, 2010

Lemma Collocations on Perseus

See UPDATE at the end.

This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skim to the end, the rest describes how I got there.

It is not so difficult to look up the lemma for each word surrounding a certain search hit, as for these texts the structure to do so is already in place and the information is stored in SQL tables. Efficiency in gathering this information is the main difficulty. Looking up the lemma in the tables now in place can take a couple different SQL queries, which each take up a small chunk of time. For a few lookups, or even a few hundred, this is not too big of a problem. However, for a collocation table spanning five words on either side, we need at least 10 lookups in the databases per hit. The time it takes to do that adds up quite quickly.

So, following a suggestion from Mark, I wrote a script and generated a file with a line for every word in the Perseus database. Basically, it takes each word id starting with 1 on up to 5.5 million something and looks up its lemma. This generated a 5.5 million line file with lines likes this:

2334550 δέ
2334551 ὅς
2334552 ἕκαστος
2334553 ἵππος
2334554 nolemma
2334555 ὅς
2334556 δέ
2334557 πέτομαι
2334558 κονίω
2334559 πεδίον

Now, looking up the words on either side of a hit is much simpler - all you need to know is the "word id" of the hit and you can look at those around it. The "nolemma" entries are primarily punctuation and such which were given word tags at some point.

The size of this massive file however was another hindrance to efficient generation of collocation tables. Although we now know exactly what line we need to look at for the information we need, getting that line is still costly as it could require reading in a couple million lines of a file to get the one we need. After playing around with it a bit, command line grep searching seemed to be the fastest way to go, but even an grep search 10 times per hit adds up fast. So, I tried combining the searches into one massive egrep command to be read into a perl array. My searches looked something like:

egrep "23345[456][0-9]" lemmafile

Giving a window of 30 lines in the file starting at 2334540 and ending at 2334569. This limited the searches to one per hit instead of 10, but it still wasn't fast enough. So, I combine all of the searches like so:

egrep "(2342[234]|33329[678]|...|829[567])[0-9]"

(A bit of accounting for numbers ending in 0 is needed so that a window around 400 doesn't include things in the 490's, but this is not too difficult.)

This looked nice and seemed to work until I tried running it on more hits. It was then coming up with such massive regular expressions to grep for that grep was complaining that they were too big. So, I broke them up into chunks of roughly 350 at a time. Fewer, and the time would go up due to the added number of grep searches; too many more, and grep would overflow again. I may not have hit on the exact time minimizing value, but it is close at least.

Finally, here are some example searches:

Search for logoi.
Search for lemma:logos. (The time estimate is higher than actual load time).

Or, here is the search form:

Search Form.

Make sure that you choose Collocation Table and check the lemma button under the "Refined Search Results" Tab at the bottom.

It can handle most searches, except for very high frequency words. If anyone has ideas on how to make it faster, it could perhaps enable us to get results for all searches. Though perhaps this is not possible without somehow creating and saving those results somewhere.


UPDATE:

After talking to Mark, I altered the way the data is read from the file and now things should be running faster. The reason that all this discussion and speed streamlining for lemmatized collocation tables is necessary is the fact that the texts on Perseus do not have the lemmas embedded in the text. As Mark noted, many of the other databases would allow for much simpler and faster generation of the same data due to the fact that they do have lemmas in the text. However, for the purposes of Perseus, lemmas needed to be separated from the texts to allow them to be more dynamically updated, changed and maintained.


As for the speed, it should now be faster thanks to a handy function in Perl. I had investigated methods for reading a certain line of a file, since I happened to know exactly what lines I needed. However, finding none that did not read the whole contents of the file up to that line, I instead implemented the process described above. I overlooked SEEK. I dismissed it because it starts from a certain byte offset and not a certain line. Nevertheless, we can harness its power by simply padding each line with spaces to ensure every line in our file is the same byte length. With this pointer from Mark and some padding on the lines, knowing the line number and the number of bytes per line is enough to start reading from the exact location in the file that we desire.

Wednesday, January 20, 2010

MONK Data Under PhiloLogic: 1

Introduction

As I'm sure you all know, the MONK Project (http://monkproject.org/), directed by Martin Mueller and John Unsworth, has generated a large collection of tagged data some of which has been made public and some of which is limited to CIC or other institutions (http://monkproject.org/downloads/). Each word in this group of different collections is tagged for part of speech, lemma, and normalization. Martin has documented the encoding scheme in great detail at http://panini.northwestern.edu/mmueller/nupos.pdf.

The following is a long post describing in some detail one approach to integrating this kind of information. Some of this will be deeply geeky and you can feel free to skip over sections. There is, towards the bottom of this post, a link to a standard PhiloLogic search form, so you can play with this proof-of-concept build yourself.

Richard and Helma have developed a mechanism to search for part of speech and lemma searching under PhiloLogic for their Greek and Latin databases (link). This is based on some truly inspired hacking by Richard and forms one model of how to handle this kind of functionality. My understanding of this, and Richard please correct me if I am wrong, is that it uses an undocumented feature in the index/search3 subsystem that allows us to have multiple index entries for each word position in the main index. This works and is certainly an approach to be considered as we think about a new series of PhiloLogic.

Build Notes

I have been experimenting with a somewhat different mechanism to handle this kind of problem, which is based on previous examples of mapping multiple word attributes to an index entry, using multiple field "crapser" entries. You may recall that this is the mechanism by which we merged Martin's virtual normalization data to very large collections of early modern English data and is currently running at Northwestern (link). My approach is to index not words, but pairs of surface forms and part of speech tags and to link these to an expanded (5 field) word database (called by crapser) containing the index form, surface form, lemma, part of speech and normalized forms. Here are some index entry forms (and frequencies):
 24 conquer:vvb
445 conquer:vvi
143 conquered:vvd
414 conquered:vvn

These map to the word vector database which looks like:
idx               surf      pos        lem        normal
conquered:j conquered j conquer conquered
conquered:j-vvn conquered j-vvn conquer conquered
conquered:n-vvn conquered n-vvn conquer conquered
conquered:vvd conquered vvd conquer conquered
conquered:vvn conquered vvn conquer conquered

To build this I first reduced for fully verbose form of the data in which each token is tagged:

<w eos="0" lem="country" pos="n1" reg="COUNTRY" spe="COUNTRY">COUNTRY</w>

I eliminated all encoding that is redundant, just to make things easier to work with since the files are huge:

<w pos="n1">COUNTRY</w>

Where there is some additional information, I keep it in the encoded document:

<w lem="illustration" pos="n2">ILLUSTRATIONS</w>

I then loaded this data into a very slightly modified PhiloLogic textloader. This simply builds an index representation of the surface form of the word and the part of speech, by getting the PoS from encoding:

if ($thetag =~ /<w/) {
$thepartofspeech = "";
$thetag =~ m/pos="([^"]*)"/i;
$thepartofspeech = $1;
}

and adding this to the index entry:
$theword = $theword . ":" . $thepartofspeech;

When loaded to this point, you have modified index entries. The next step is simply to build a multi-field word vector database (crapser). I did this by reading the input data and adding entries for lemmas or normalizations. This is simply an extension of what is already documented in the "virtual-normalize" directory in the "goodies" in the PhiloLogic release.

The next step was to slightly modify a version of Leonid's original "gimme". The "sequential" version of this function (in the standard Philologic distribution), maps a multi-field (tab delimited) query using regexp patterns in egrep. This is fast and simple. It allows naming of fields, so you can simply specify "lem=justice" and it will generate a regular expression pattern (where TAB = the tab character):

^[^TAB]*TAB[^TAB]*TAB[^TAB]*TABjusticeTAB[^TAB]*$

And you get, of course, full regular expressions. (Note, this renders with some odd spacing, there are no spaces). Swap in this version of crapser and it all appears to run without further modification.

So, to summarize, the implementation does not require any modifications to core system components. It requires only slight modifications to a textloader, which we do all the time for specific databases, and a slightly modified "crapser" with a suitably build word vector database.

The Database

The database has 567 documents containing 38.5 million words (types) and 273,600 index entries. Recall that these are surface form words and part of speech tags and not normal types. The dataset has selections from various sources, including Documenting the American South as well as some British Early Modern texts. It should have full PhiloLogic search and reporting capabilities. You can query the words in the database as usual, simply by typing in words. To force searches on lemmas, normalizations, and parts of speech by specifying (with examples):

lem=conquer
nrm=conquer
pos=pns32

and finally, if you want to get one surface form and part of speech you can search the index entry directly, such as "conquered:vvd". Note that the Part of Speech is specified after a colon and you don't need to specify anything else. This is obviously not a real query interface, but it suggests how we can think about interfaces further along (eg, pull down menus, etc). You can also use regular expressions, such as lem=conque.* Finally, you can combine these, such as "pos=po.* lem=enemy", which means find forms of enemy followed by possessive pronous within three words, such as : "their most mortall enemies". You will need to consult Martin's discussion of the encoding to see all of the parts of speech. It is an extensive and well reasoned scheme.

After all of that, here is the search form.

Now, before running of to play with this, there are some important notes following which describes how to use this in more detail.

Discussion

This is a proof-of-concept build. In a full implementation, I would need to add some search syntax to allow the user to indicate a set of combined criteria for a single word. I was having some problems coming up with a use case, but I guess one could want to say search for a particular lemma AND part of speech. It would all work with a little massaging. Aside from that, this simple model should support all of the standard PhiloLogic searching and reporting
features. Do let me know if you find something that does not work.

This model supports disambiguating searches, such as to find dog when it is used as a verb. Try "dog:vvi" for hits like "we can dog them" (thanks Russ for this example). It also appears to work properly form most other searches, such as lemmas, normalizations, etc. Part of speech for single entries looks reasonable in terms of performance.

My primary interest, however, in this experiment is to test performance on sequences of parts of speech searching. For example: "pos=po3. pos=j pos=n1" will find sequences like: "their strange confusion" and "his Princely wisedome". Chains of four also seem to work reasonably. Eg: "pos=vvn pos=po3. pos=j pos=n1" returns phrases like "neglected their even elevation", "stimulated their adventurous courage", and "aroused his little troop". You can always find a part of speech after a particular word (lemma): "after pos=po3.".

Now, this is all fine and dandy. Except that doing conjoined searches on parts of speech reveals a significant conceptual difficulty, which I believe also applies to Richard's implementation. Each part of speech generates thousands of surface form index entries. For example:
"pos=vvn pos=po3. pos=j pos=n1"
generates 81,000 unique terms (index entries) in 4 vectors. The evaluation then does a massive join at the level of index entry evaluation. So, it is SLOW and subject to possible memory buffer overflow or other problems. In fact, the system will begin to generate results of this type fairly quickly, due to PhiloLogic's lazy evaluation (start returning results as soon as you have a few). But it can take several minutes to complete the task. We would certainly not want to put this on a high traffic machine, since if you have many similar queries, it would bog it down. Obviously, we could simply test to make sure that users search criteria would not drag the whole system down or simply lock this database to one user at a time, or some other work around. If we got reasonable French NLP, this could be implemented quickly.

However, I believe we have bumped up upon a conceptual problem. To find POS1 and POS2 and POS3 either in a row or within N number of words requires an evaluation of word and/or part of speech positions in documents.

There are a couple of possible solutions, all of which would require consideration of distinct indexing structures. The first is simply to build another kind of NGRAM POS index which would have sequences of parts of speech indexed and mapped to document regions. The second would be a another kind of index which would look like a standard PhiloLogic index entry, except that it would be ONLY part of speech. This would reduce the size of the word vectors, but would not in itself improve the index evaluation to find those sequences that fit the query in the actual documents, since we still have to return to word positions in documents.

We might call this "The Conjoined Part of Speech Problem (CPSP)". It is, in my opinion, a highly specialized type of search and it is not clear just what the use cases might look like in relatively simple languages (English, French) as opposed to Greek, for which Helma makes a convincing case. So, there is a question of just how important this might be. In email communication, Martin makes the case that it would be and that researchers who want this kind of query would be willing to wait a few minutes.

It would be a trivial and useful experiment to run a load where I would index ONLY part of speech information. This would be a good test to see if evaluation speed for conjoined part of speech searches would be reasonable. In fact, Richard and I did a few quick experiments that suggest this would work. The idea would be to distinguish between simple queries -- and run them as usual -- and multiple PoS queries, which would be run on a dedicated index build. So, build parallel indicies. Oddly enuff, in the current architecture, I suspect that one could simply have a switch to say WHICH database to consult dynamically, simply by evaluating the query and then setting the database argument. That would be another one of my famous, award-winning, hall of shame hacks. But it could be made to work.

Martin has also pointed out another issue, which is searching, sorting, and counting of PoS, lemma, and other data. Now, that makes a lot of sense. I want to search for "country" and find distributions of particular parts of speech. Or, I want to do a collocation table searching on a lemma and counting the lemmas around the word. I think all of this is certainly doable -- the latter is something I wrote about some 15 years ago -- with hacks to the various reporting subsystems (not in 3, which is just too much of a mess). In an SOA model of PhiloLogic, this would be quite reasonably handled, ideally by other teams using PhiloLogic if not here at Chicago.

I think these are important issues to raise, but not necessarily resolve at this time, if (when?) we consider the architecture of any future Philologic4 development effort. For example, the current models of report generators would have to know about lemmas, etc. And we would need to at least leave hooks in any future model to support different indexing schemes for things like.

Finally, watch this space. I believe Richard is doing a build of this data using his model as well.

Please do play around with all of this and let me know what you think. One consideration would be implementing this for selected French collections. We would obviously need real virtual normalizers, lemmatizers and PoS identifiers for a broader range of French than we have now.


Thursday, January 14, 2010

KWIC Modifications

I have been working on getting a cleaner output format from KWIC for the Greek texts on Perseus. Helma was desirous of the KWIC output leaving in the word tags which occur in the Perseus texts in order that the word lookup function be usable directly from the KWIC results page. Since KWIC leaves as little formatting as possible, it strips out all tags, including the word tags, from the text. While I worked on that, I also added a few other modifications to KWIC to give a better look to the results page for the Greek texts.

The problem with KWIC for Greek texts is that Greek fonts to not support single-width fonts, which KWIC uses to align the results more cleanly. In addition, the title lines, which give the bibliographic information and link, can be different lengths and this also causes problems aligning the search terms. See for instance, this search.

To solve the word tag problem I just made a few modifications to the KwicFormat subroutine in philosubs. The main edit there was changing the line that stripped all tags into this:

$bf=~ s#<(?!(w |/w))[^>]*># #gi; #keep only word tags

For the alignment issue, things were more complicated. Keeping track of the length of the left side of the line doesn't allow for a consistent place on the page due to the differing widths of the letters. In the end, I modified artfl_kwic to chop the left side of the hit to a size as close to a certain length as possible without breaking any words. Previously, both the right and the left were chopped to a certain length regardless of breaking words and length including tags, often resulting in very little content. Now, only the length of the display string is accounted for and in addition the length is adjusted for the length of the bibliographic title.

I also added a span around the left and right sides of the hit to allow for positioning and alignment using Javascript (and CSS). Then, adding the following lines to the Results Header, the search terms are all lined up in a neat line:
span.left { right:46%; position:absolute; }
span.right { left:54.5%; position:absolute;
height:18px; overflow:hidden;}
The numbers may look a little messy, but they give nice results. I found that without the decimal, the two sides were a bit too far apart, but there may be another way around that.

The extra bits for formatting the right span are in place of trimming the content of the right side in perl as I did for the left side. I found that the overflow:hidden attribute is quite handy if you can get it to work (it is a bit tempermental). As long as it is found in an absolutely positioned object with restricted size, AND it is contained within an element with overflow set to auto, it should work. It simply hides any content that does not fit within the given boundaries. It gives a very clean look to the right side of the page and even adjusts to different window sides so that the content never leaks to the next line.

Take a look at this page and play with the window size to see what I mean. Unfortunately, there is no such nice property for trimming the overflow off of the left side instead of the right side. That is why I did it in perl instead of Javascript. There is a function called clip in javascript which is designed to clip an image, but again the way it works makes it much easier to trim from the right side than the left side. One could probably twist the clip function enough to make something similar happen for the left side (and make everything nicely adjustable and lovely), but for now, it is happening in perl. (I tried for a while, but my concoctions just seemed to slow things down and not add anything exciting results).

UPDATE: I couldn't resist playing a bit more with the javascript, and now it works like I wanted it to! Now, if you click on the link above, it won't illustrate what I said it would, because the javascript has been improved. I added this function:

function trimKwicLines(){
var contentwidth = $(".content").width();
$(".left").each(function (i) {
var width = $(this).width();
var leftoffset = contentwidth*.4 - width*1 - 2;
$(this).css("left", leftoffset);
});
}


And changed things here and there in the CSS.

Monday, December 14, 2009

Natural Language Morphology Queries in Perseus

Natural language queries are now possible on Perseus under Philologic. Previously, Richard had implemented searching for various parts of speech in various forms. For instance, as noted in the About page for Perseus, a search for 'pos:v*roa*' will return all the instances of perfect active aorist verbs in the selected corpus. Now, a search for 'form:could-I-please-have-some-perfect-active-optatives?' will return the same results. In fact, searching for 'form:perf-act-opt', 'form:perfect-active-optative', 'form:perfection-of-action-optimizations', or 'form:perfact-actovy-opts-pretty-please' will all accomplish this same task. Note that the dashes are necessary between the words, otherwise a search for plural nouns written as 'form:plural nouns' will actually be searching for any plural word followed by the word "nouns", which will fail. I carefully chose shorter forms of all the keywords, such as "impf" and "ind" for "imperfect" and "indicative" so that a search including any word starting with "ind" will match indicatives regardless of what follows the 'd'. Hopefully, there are no overlapping matches (such as using "im" to abbreviate "imperfect" which would also match "imperative"). If you do encounter any, please let me know. Potentially, we could put a list of acceptable abbreviations somewhere, although they are fairly straightforward and typing the full term out is always a fail-safe method.

Basically, the modified crapser script simply translates searches beginning with "form:" into the corresponding "pos:" search. Using a hash of regular expressions and string searching, it simply returns the corresponding code. In the previous example, the search is actually looking for "pos:....roa..". Notice that it fills in the empty space of the code with dots, allowing them to be anything. I implemented an alternative filler, the dash, so that when you search for something like "form:perf-act-opt-exact", you will actually be searching for "pos:----roa--" (and your search will fail because there are no terms that are only and exactly perfect active optative without other specifications).

One limitation that this method of natural language querying has is that it cannot match the versatility of the "pos:" searches. That is, because it selects either dots or dashes as fillers, you cannot get a mixture of them in your search. You cannot run a search such as "pos:v-.sroa---". However, this limitation will likely have little effect for the average user and the user needing such a search can still obtain it using the "pos:" method. An alternative method involving drop down input boxes for each slot of the code would enable the full power of the pos searches, but it would also be potentially more tedious to implement and potentially tedious to use as well. Such a input form would require the user to know more about the encoding than the "form:" searching I implemented does. For example, a user would need to know that "verb" is required in the first slot, even if "aorist optative" makes that the only possibility. Whereas searching for 'form:aorist-optative' works without the user ever needing to know that a 'v' is required in the first slot.

Encyclopédie: Similar Article Identification II

After doing a series of revisions as part of my last post this subject (link), I thought it might be helpful to provide an update posting. We have been interested in teasing out how the VSM handles small vs large articles and to get some sense of why various similar articles are selected. Over the weekend, I reran the vector space similarity function on 39,218 articles, taking some 29 hours. I excluded some 150 surface forms of words in a stopword list, all sequences of numbers (and roman numerals), as well as features (in this case word stems) found in more than 1568 and less than 35 articles. This last step removed features like blanch, entend, mort, and so on. Thus, I removed some 600 features, leaving 10,157 features used for the calculation. Here is the search form:

Headword: (e.g. tradition)
Author: (e.g. Holbach)
Classification: (e.g. Horlogerie)
English Class: (e.g. Clockmaking)
Size (words): (e.g. 250- or 250-1000)
Show Top: articles (e.g. 10 or 50)

The number of matching terms for small articles can be, of course, very small. For example, article "Tout-Bec" (62 words) is left with four stems [amer 1|oiseau 2|ornith 1|bec 3]. The first most of the most similar articles is Rhinoceros (Hist. nat. Ornith.) -- remember, only the main article here -- matches on three stems:
word               frq1     frq2
bec 3 5
oiseau 2 2
ornith 1 1
Are these similar? Well, both very small articles refer to kinds of rare birds that are notable by their beaks, one with a very large beak and one that looks like it has two or more beaks. It is also important to note that "ornith" (the class of knowledge) in both is picked up by this example. The next article down (Pipeliene) matches on:
amer                1        1
bec 3 1
oiseau 2 2
The third most similar in this example is "Connoissance des Oiseaux par le bec & par les pattes.", a plate legend, with as you expect, lots of beaks. This matches on two stems, bec and oiseau.

It seems that the size of the query article, now that I have eliminated many function words and other extraneous data, carries a significant impact. The larger the article, the more possible matches you will get (Zipf's Law applies). Longer articles will tend to be most similar to other longer articles, and shorter will match better to shorter. So, similarity would appear to be a function of relative frequencies of common features and the length of the articles. We saw this in our original examination of the Encyclopédie and the Dictionnaire de Trévoux, and had built in some restrictions in terms of size as well as comparing articles with the same first letter rather than all to all. As far as I can tell, the kind of more of feature pruning shown here does not have a significant impact on larger articles.

User feedback might be significant in determining just how many features and what kinds of features are required to get more interesting matches. For any pair, we could store the VSM score, the sizes, and the matching features along with the user rating of the match. That might generate some actionable data for future applications.

[Aside: In some cases, similar passages lead to possibly related plates and legends. Cadrature, for example, links to numerous plate legends dealing with clockmaking.]

Sunday, December 13, 2009

Mapping Encyclopédie classes of knowledge to LDA generated topics

As was described in my previous blog entry, I've been working on comparing the results given by LDA generated topics with the classes of knowledge identified by the philosophes in the Encyclopédie. My initial experiment was to try to see if out of 5000 articles belonging to 100 classes of knowledge, with 50 articles per class, I would find those 100 topics using an LDA topic modeler. My conclusion was that it didn't find all of them, but still found quite a few. Since then, I have played a bit more with this dataset and have come up with better results.
Since a topic modeler will give you the topic proportion per article (I just use the top three), what I tried to do this time was to draw up a table with each class of knowledge, and what the topic modeler identified in terms of topics for each class of knowledge. Before looking at this, it's important to keep in mind that in the sample of articles I used, there are 50 articles per class of knowledge. Therefore, the closer the number of the dominant topic in a class of knowledge gets to 50, the better the topic modeler will have done in identifying the class of knowledge and in reproducing the human classification.
Of course, the classification of articles in the Encyclopédie can be at times a little puzzling. The articles were written by a large number of people and therefore the classification is not always consistent. With that in mind, one should not expect to get perfect matches using a topic modeler. Moreover, since the topic modeler will assume that each article is about N number of topics, the calculation might be further off.
For my experiment, I settled on 107 topics, of which I eliminated 7, which were identified as stopwords lists. When looking at the results of this experiment, there are 41 classes of knowledge in which we find 40 or more articles grouped within the same LDA topic. This means that 41% of the classes of knowledge were identified with a great level of accuracy. If we look at topics that have more than 25 articles matching the same class of knowledge we get up to 83 classes (or 83%).
If we look at those results, there are strange flaws, such as physique and divination that don't seem to be identified. This might be due to a miscalculation, but I have yet to figure out what it could be. Highly specialized classes, such as corroyerie, poésie, or astronomie get excellent matches, which is to be expected.
This experiment also gave us an idea of what the percentage of LDA topics are to be considered as stopwords lists. Between 5 and 10% of the topics should be discarded when using an LDA classifier.
Finally, we should consider that LDA generated topics do not systematically match human identified topics. An unsupervised model is bound to give different results, it would be interesting to see how well supervised LDA (sLDA) would do in our particular test case.

Monday, December 7, 2009

Index Design Notes 1: PhiloLogic Index Overview

I've been playing around with some perl code in response to several questions about the structure of PhiloLogic's main word index--I'll post it soon, but in the meantime, I thought I'd try to give a conceptual overview of how the index works. As you may know, PhiloLogic's main index data structure is a hash table supporting O(1) lookup of any given keyword. You may also know that PhiloLogic only stores integers in the index: all text objects are represented as hierarchical addresses, something like a normalized, fixed-width Xpointer.

Let's say we can represent the position of some occurrence of the word "cat" as
0 1 2 -1 1 12 7 135556 56
which could be interpreted as
document 0,
book 1,
chapter 2,
section ,
paragraph 1,
sentence 12,
word 7,
byte 135556,
page 56, for example.

A structured, positional index allows us to evaluate phrase queries, positional queries, or metadata queries very efficiently. Unfortunately, storing each of these 9 numbers as 32-bit integers would take 36 bytes of disk space, for every occurence of the word. In contrast, it's actually possible to encode all 9 of the above numbers in just 39 bits, if we store them efficiently--that's a 93% saving. The document field has the value 0, which we can store in a single bit, whereas byte position, our most expensive, can be stored in just 18 bits. The difficulty being that the simple array of integers becomes a single long bit string stored in a hash. First we encode each number in binary, like so
0 1 01 11 1 0011 111 001000011000100001 000111

but this is only 18 bits, so we have to pad it off with 6 extra bits to get an even byte alignment, and then we can store it in our hash table under "cat".

Now, suppose that we use somthing like this format to index a set of small documents with 10,000 words total. We can expect, among other things, a handful of occurrences of "cat", and probably somewhere around a few hundred occurrences of the word "the". In a GDBM table, duplicate keywords aren't permitted--there can be exactly one record of "cat". For a database this size, it would be feasible to append every occurrence into a single long bit string Let's say our text structures require 50 bits to encode, and that we have 5 occurrences of cat. We look up "cat" in GDBM, and get a packed bit string 32 bytes, or 256 bits long. we can divide that by the size of a single occurrence, so we know that we have 5 occurrences and 6 bits of padding.

"The", on the other hand, would be at least on the order of few kilobytes, maybe more. 1 or 2 K of memory is quite cheap on a modern machine, but as your database scales into the millions of words, you could have hundreds of thousands, even millions of occurrences of the most frequent words. At some point, you will certainly not want to have to load megabytes of data into memory at once for each key-word lookup. Indeed, in a search for "the cat", you'd prefer not to read every occurrence of "the" in the first place.

Since PhiloLogic currently doesn't support updating a live database, and all word occurrences are kept in sorted order, it's relatively easy for us to devise an on-disk, cache-friendly data structure that can meet our requirements. Let's divide up the word occurences into 2-kilobyte blocks, and keep track of the first position in each block. Then, we can rapidly skip hundreds of occurrences of a frequent word, like "the", when we know that the next occurence of "cat" isn't in the same document!

Of course, to perform this optimization, we would need to know the frequency of all terms in a query before we scan through them, so we'll have to add that information to the main hash table. Finally, we'd prefer not to pay the overhead of an additional disk seek for low-frequency words, so we'll need a flag in each key-word entry to signal whether we have:
1) a low frequency word, with all occurences stored inline
or
2) a high frequency word, stored in the block tree.

Just like the actual positional parameters, the frequencies and tree headers can also be compressed to an optimal size on a per-database level. In philologic, this is stored in databasedir/src/dbspecs.h, a c header file that is generated at the same time as the index, then compiled into a custom compression/decompression module for each loaded database, which the search engine can dynamically load and unload at run time.

In a later post, I'll provide some perl code to unpack the indices, and try to think about what a clean search API would look like.

Thursday, December 3, 2009

Encyclopédie: Similar Article Identification

The Vector Space Model (VSM) is a classic approach to information retrieval. We integrated this as a standard function in PhiloMine and have used it for a number of specific research projects, such as identifying borrowings from the Dictionnaire de Trévoux in the Encyclopédie, which is described in our forthcoming paper "Plundering Philosophers" and related talks[1]. While originally developed by Gerard Salton[2] in 1975 as a model for classic information retrieval, where a user submits a query and gets results in an ranked relevancy list, the algorithm is also very useful to identify similar blocks of text, such as encyclopedia articles or other delimited objects. Indeed, this kind of use of the VSM was proposed by Salton and Singhal[3] in a paper presented months before Salton's death. They demonstrated the use of VSM to produce links between parts of documents, forming a type of automatic hypertext:
The capability of generating weighted vectors for arbitrary texts also makes it possible to decompose individual documents into pieces and explore the relationships between these text pieces. [...] Such insights can be used for picking only the "good" parts of the document to be presented to the reader.
Salton and Singhal further argued that manual link creation would be impractical for huge amounts of text, but these conclusions may have had limited influence given the general interest at that time in human generated hypertext links on the WWW.

Based on earlier work using PhiloMine, we have seen a number of "interesting" -- and at times unexpected -- connections between articles in the Encyclopédie, often drawing connections between previously unrelated articles, if by unrelated we mean having different authors, classes of knowledge and few cross-references (renvois) between them. One might consider this kind of similarity measure between articles as a kind of intertextual discovery tool, where the system would propose articles possibly related to a specific article.

The Vector Space Model functions by comparing a query vector to all of the vectors in a corpus, making it an expensive calculation, not always suitable to real time use. In this experiment, I have recast the VSM implementation in PhiloMine to function as a batch job to generate a database of 27,753 Encyclopédie articles (those with 100 or more words) with the 20 most similar articles for each article. To do this, I pruned features (word stems) which more than 8,325 and less than 41 articles, resulting in a vector size of 10,431 features. I used a standard French word stemmer to reduce lexical variation and a Log Normalization function to handle variations in article sizes. The task took about 17 hours to run.

Update (December 7): I have replaced the VSM build above with the same on 39,200 articles -- all articles with 60 or more words -- which took about 29 hours to run. I pruned features found in more than 11,200 documents and less than 50, leaving 9,710 features. This may change some results by adding more small articles. Note, this is about as large a VSM task as can be performed in memory using perl hashes, since anything large runs out of memory. If we want to go larger, probably store vectors on disk and TIE them to perl hashes.

The results for a query shows the 20 most similar articles, ranked by the similarity score, where an exact match is equal to 1. For example, the article OUESSANT (Modern Geography) -- based on 27,000 articles -- is related to the articles VERTU [0.274], Luxe [0.267], ECONOMIE ou OECONOMIE [0.265], POPULATION [0.263], CHRISTIANISME [0.261], SOCIÉTÉ [0.256], AVERTISSEMENT DES ÉDITEURS (suite) [0.255], MANICHÉISME [0.254], CYNIQUE, secte de philosophes anciens [0.254], Gout [0.250], EDUCATION [0.248] and so on. This reflects the discussion of the moral conditions of the inhabitants of the small island off the coast of Brittany.

You can give it a try using this form (again now for 39,200 articles):

Headword: (e.g. tradition)
Author: (e.g. Holbach)
Classification: (e.g. Horlogerie)
English Class: (e.g. Clockmaking)
Size (words): (e.g. 250- or 250-1000)
Show Top: articles (e.g. 10 or 50)

[Dec 9: I added word count info for each article. You can restrict searches to articles in ranges of size. Also, now storing 50 top matches, which you can limit. Showing matching articles which are smaller than source article. Dec 10: added function to display matching stems for any pairwise comparison for inspection.]

There are a number of other options that I might add to the VSM calculations, including using TF-IDF as an alternative normalization weighting scheme and use of virtual normalization to again reduce lexical variations and improve the performance of the stemming algorithm. I have also thought of using Latent Semantic Analysis as another way to handle similarity weighting, but given that we have many query terms, it is not clear that LSA would help all that much.

In a real production environment, I think we will add a "similar article link" from articles in the Encyclopédie. We have talked about having users rank the quality of the similarity performance. The scores assigned are somewhat helpful in ranking, but not in assessing an absolute number, since they can vary by the size of the input article. VSM is an unsupervised learning model. It is not clear to me that we could integrate user evaluations in any systematic fashion, but this is certainly an interesting subject of further consideration.

As always, please let me know what you think. I have a couple of general queries. I have used main and sub articles (as well plate legends, etc.) as units of similarity calculation. Should I use main entries only? I also limited this to articles with more than 100 words. At 50 words, we have some 43,000 articles. Should I do this for a full implementation?

References

[1] See Timothy Allen, Stéphane Douard, Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing (forthcoming 2009). Also, see Ceglowski, Maxiej. 2003: "Building a Vector Space Search Engine in Perl", Perl.com [http://www.perl.com/pub/a/2003/02/19/engine.html].

[2] Salton, G., A. Wong, and C. S. Yang. 1975: "A Vector Space Model for Automatic Indexing," Communications of the ACM 18/11: 613-620.

[3] Singhal, A. and Salton, G. 1995: "Automatic Text Broswing Using Vector Space Model" in Proceedings of the Dual-Use Technologies and Applications Conference 318-324.

Friday, November 20, 2009

Frequencies in the Greek and Latin texts

Earlier this year Mark built a frequency query for the French texts (affectionately named wordcount.pl)
Kristin has now implemented this for our Greek and Latin texts. If you wonder what's new about this: Word count for individual documents has always been there in PhiloLogic loads, but the difference here is that you can see frequencies over the entire corpus, or a subset of works/authors.

You can find the forms here:
http://perseus.uchicago.edu/LatinFrequency.html
http://perseus.uchicago.edu/GreekFrequency.html

Update: Forms moved to the 'production site', perseus.uchicago.edu. You can now specify genre as well. Stay tuned for further stats, meant to provide a friendly reminder of Zipf's Law.

Note: the counts are raw frequency counts, without lemmatization.
I have edited the search form a tiny bit - let me know if you encounter any problems.

Wednesday, November 18, 2009

Do LDA generated topics match human identified topics?

I've been experimenting lately on how LDA generated topics and the Encyclopédie classes of knowledge match. The experiment was conducted in the following way:
- I chose 100 classes of knowledge in the Encyclopédie, and picked 50 articles of each.
- I then ran a first LDA topic trainer choosing 100 topics.
- I then proceeded to identify each generated topic and name after the Encyclopédie classes of knowledge.
- My plan was then to look at the topic proportions per article and see if the top topic would correspond to its class of knowledge. Would the computer manage to classify the articles in the same way the encyclopedists had?
I was not able to get that far when choosing 100 topics for my first LDA run. This is because LDA will always generate a couple topics which aren't really topics, but are just lists of very common words and they just happen to be used in the same documents. Therefore, one should always disregard these topics and focus on the others. What this means is that I had to add a couple more topics to my LDA run in order to get 100 identifiable topics. So I settled with 103 topics. I found 3 distributions of words which were unidentifiable, so I dismissed them.
The results show that LDA topics and the Encyclopédie classes of knowledge do not match (see links to results below). Some do very well, like Artillerie, for which the corresponding distribution of words is :
canon piece poudre artillerie boulet fusil ligne calibre mortier bombe feu charge culasse livre met chambre pouce lumiere roue affut diametre coup batterie levier bouche ame flasque balle tourillon tire
Other distribution of words make sense in themselves but do not match any of the original classes of knowledge. For instance, there is no topic on 'teinture', 'peinture'. What we get instead is a mixture of both classes of knowledge which could be identified as colors :
couleur rouge blanc bleu tableau jaune verd peinture ombre teinture noir toile tableaux nuance papier etoffe bien teint peintre pinceau trait teinturier melange veut figure teindre feuille beau sert colle
Now the topic modeler is not wrong here. It's telling us that these words tend to occur together, which is true. Another significant example is the one with 'Boutonnier', 'Soie', and 'Rubanier' :
soie fil rouet corde brin tour main bouton gauche longueur boutonnier droite attache bout fils tourner sert molette noeud cordon doigt piece emerillon moule broche ouvrage ruban rochet branche aiguille
What we get here is a topic about the art of making clothes, which is more general than 'Boutonnier' or 'Rubanier'.
For this to actually work, the philosophes would have had to have been extremely rigorous in their choice of vocabulary, because this is what LDA expects. Also, another problem is that LDA considers that each document is a mixture of topics, and not made out of one topic. So if one document is exclusively focused on one topic, LDA will still try to extract a certain number of topics out of it. If this is the case, then you are going to get some topics which are mere subdivisions of the class of knowledge in this document. The reason why our experiment broke down could be that the LDA topic trainer created new subdivisions for some classes of knowledge, or regrouped several classes of knowledge. These are all valid as topics, but do not correspond to human identified topics.

Link to results