Natural language queries are now possible on Perseus under PhiloLogic. Previously, Richard had implemented searching for various parts of speech in various forms. For instance, as noted in the About page for Perseus, a search for 'pos:v*roa*' will return all the instances of perfect active optative verbs in the selected corpus. Now, a search for 'form:could-I-please-have-some-perfect-active-optatives?' will return the same results. In fact, searching for 'form:perf-act-opt', 'form:perfect-active-optative', 'form:perfection-of-action-optimizations',...
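The prefix-matching behavior suggested by those examples can be sketched as follows. This is a hypothetical illustration, not PhiloLogic's actual parser: the prefix table and the tense/mood/voice slot layout are assumptions made for the example.

```python
import re

# Hypothetical sketch: normalize a natural-language "form:" query into a
# Perseus-style pos code. The prefix table and slot order are assumptions.
PREFIXES = {
    "perf": ("tense", "r"),  # perfect
    "opt": ("mood", "o"),    # optative
    "act": ("voice", "a"),   # active
}

def form_to_pos(query: str) -> str:
    slots = {"tense": "*", "mood": "*", "voice": "*"}
    for token in re.split(r"[-\s]+", query.lower().rstrip("?")):
        for prefix, (slot, code) in PREFIXES.items():
            if token.startswith(prefix):
                slots[slot] = code
    return "v*" + slots["tense"] + slots["mood"] + slots["voice"] + "*"

# All three phrasings normalize to the same code, 'v*roa*'
print(form_to_pos("perf-act-opt"))
print(form_to_pos("could-I-please-have-some-perfect-active-optatives?"))
print(form_to_pos("perfection-of-action-optimizations"))
```

Because matching is by prefix, deliberately silly queries like 'form:perfection-of-action-optimizations' collapse to the same code as the terse 'form:perf-act-opt'.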
Encyclopédie: Similar Article Identification II
After doing a series of revisions as part of my last post on this subject (link), I thought it might be helpful to provide an update post. We have been interested in teasing out how the VSM handles small vs. large articles and in getting some sense of why various similar articles are selected. Over the weekend, I reran the vector space similarity function on 39,218 articles, which took some 29 hours. I excluded some 150 surface forms of words in a stopword list, all sequences of numbers (and Roman numerals), as well as features...
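The kind of feature filtering described here can be sketched as below. The stopword list is an illustrative subset, not the actual 150-form list used in the experiment:

```python
import re

# Sketch of feature filtering: drop stopwords, digit sequences, and
# Roman numerals. The stopword set here is an illustrative subset.
STOPWORDS = {"le", "la", "les", "de", "et", "un", "une"}
NUMBER = re.compile(r"^\d+$")
ROMAN = re.compile(r"^[ivxlcdm]+$")  # crude: also matches a few real words

def keep_feature(token: str) -> bool:
    t = token.lower()
    return not (t in STOPWORDS or NUMBER.match(t) or ROMAN.match(t))

tokens = ["Raison", "1751", "xviii", "de", "philosophie"]
print([t for t in tokens if keep_feature(t)])  # ['Raison', 'philosophie']
```

The Roman-numeral pattern is deliberately crude; in practice it over-filters short words made only of the letters i, v, x, l, c, d, m.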
Mapping Encyclopédie classes of knowledge to LDA generated topics
As described in my previous blog entry, I've been working on comparing the results given by LDA-generated topics with the classes of knowledge identified by the philosophes in the Encyclopédie. My initial experiment tested whether, given 5,000 articles belonging to 100 classes of knowledge (50 articles per class), I would find those 100 topics using an LDA topic modeler. My conclusion was that it didn't find all of them, but still found quite a few. Since then, I have played a bit more with this dataset and...
Index Design Notes 1: PhiloLogic Index Overview
I've been playing around with some Perl code in response to several questions about the structure of PhiloLogic's main word index--I'll post it soon, but in the meantime, I thought I'd try to give a conceptual overview of how the index works. As you may know, PhiloLogic's main index data structure is a hash table supporting O(1) lookup of any given keyword. You may also know that PhiloLogic only stores integers in the index: all text objects are represented as hierarchical addresses, something like a normalized, fixed-width...
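As a rough illustration of such fixed-width hierarchical addresses, consider packing an object's position into a binary record. The field layout below is invented for the example; it is not PhiloLogic's actual on-disk format:

```python
import struct

# Illustrative sketch: pack a PhiloLogic-style hierarchical object
# address (document, division, paragraph, sentence, word, byte offset)
# into a fixed-width binary record. Field layout is an assumption.
FIELDS = ("doc", "div", "para", "sent", "word", "byte")

def pack_address(*addr):
    return struct.pack(">6I", *addr)   # 6 big-endian unsigned 32-bit ints

def unpack_address(blob):
    return dict(zip(FIELDS, struct.unpack(">6I", blob)))

blob = pack_address(12, 3, 45, 2, 7, 102400)
print(len(blob))                # 24: fixed width, 6 fields x 4 bytes
print(unpack_address(blob))
```

Fixed-width records are what make the O(1) lookup cheap: given a keyword's slot in the hash table, every hit can be read by simple offset arithmetic, with no parsing.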
Encyclopédie: Similar Article Identification
The Vector Space Model (VSM) is a classic approach to information retrieval. We integrated this as a standard function in PhiloMine and have used it for a number of specific research projects, such as identifying borrowings from the Dictionnaire de Trévoux in the Encyclopédie, which is described in our forthcoming paper "Plundering Philosophers" and related talks[1]. While originally developed by Gerard Salton[2] in 1975 as a model for classic information retrieval, where a user submits a query and gets results in a ranked...
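The core of the model can be sketched in a few lines: weight terms by tf-idf and compare documents by cosine similarity. The toy documents and the exact weighting scheme below are for illustration; PhiloMine's implementation details differ:

```python
import math
from collections import Counter

# Minimal vector space model sketch: tf-idf weighting plus cosine
# similarity. Documents and weighting details are illustrative only.
docs = [
    "treason against the sovereign",
    "treason against the nation",
    "frequencies of greek verbs",
]
tokenized = [d.split() for d in docs]
N = len(docs)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

vecs = [tfidf(d) for d in tokenized]
print(cosine(vecs[0], vecs[1]))  # high: shared vocabulary
print(cosine(vecs[0], vecs[2]))  # zero: no terms in common
```

Note that a term appearing in every document (here "the") gets idf zero, so it contributes nothing to similarity, which is the weighting doing the work a stopword list would otherwise do.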
Frequencies in the Greek and Latin texts
Earlier this year Mark built a frequency query for the French texts (affectionately named wordcount.pl). Kristin has now implemented this for our Greek and Latin texts. If you wonder what's new about this: word counts for individual documents have always been there in PhiloLogic loads, but the difference here is that you can see frequencies over the entire corpus, or a subset of works/authors. You can find the forms here:
http://perseus.uchicago.edu/LatinFrequency.html
http://perseus.uchicago.edu/GreekFrequency.html
Update: Forms moved...
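The corpus-wide counting with an author subset can be sketched like this; the mini-corpus and its metadata keys are invented for the example:

```python
from collections import Counter

# Sketch of corpus-wide frequency counting with an optional restriction
# to a subset of authors. Corpus contents and keys are invented.
corpus = {
    ("Caesar", "bellum_gallicum"): "gallia est omnis divisa in partes tres",
    ("Cicero", "in_catilinam"): "quo usque tandem abutere catilina patientia nostra",
    ("Caesar", "bellum_civile"): "caesar in partes tres copias divisit",
}

def frequencies(corpus, authors=None):
    counts = Counter()
    for (author, _title), text in corpus.items():
        if authors is None or author in authors:
            counts.update(text.split())
    return counts

print(frequencies(corpus).most_common(3))               # whole corpus
print(frequencies(corpus, authors={"Caesar"})["partes"]) # author subset
```

The same function answers both questions the post distinguishes: frequency over the entire corpus (no filter) and frequency over a chosen subset of authors or works.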
Do LDA generated topics match human identified topics?
I've been experimenting lately on how well LDA-generated topics and the Encyclopédie classes of knowledge match. The experiment was conducted in the following way:
- I chose 100 classes of knowledge in the Encyclopédie, and picked 50 articles of each.
- I then ran a first LDA topic trainer choosing 100 topics.
- I then proceeded to identify each generated topic and name it after the Encyclopédie classes of knowledge.
- My plan was then to look at the topic proportions per article and see if the top topic would correspond to its class...
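The matching step in the last two bullets can be sketched as follows. The topic proportions, class labels, and topic-to-class assignments below are invented stand-ins for real model output:

```python
# Sketch of the matching step: given per-article topic proportions
# (faked here) and each article's class of knowledge, check whether the
# top topic corresponds to its class. All numbers are illustrative.
articles = [
    {"class": "Grammaire", "topics": {"t1": 0.7, "t2": 0.2, "t3": 0.1}},
    {"class": "Grammaire", "topics": {"t1": 0.5, "t2": 0.4, "t3": 0.1}},
    {"class": "Chimie",    "topics": {"t1": 0.1, "t2": 0.2, "t3": 0.7}},
]
# Hand-assigned topic names, as in the bullet above
topic_to_class = {"t1": "Grammaire", "t3": "Chimie"}

def top_topic(article):
    return max(article["topics"], key=article["topics"].get)

matches = sum(topic_to_class.get(top_topic(a)) == a["class"] for a in articles)
print(f"{matches}/{len(articles)} articles' top topic matches their class")
```

In the real experiment the hard part is of course the hand-labeling step (deciding which generated topic corresponds to which class), not this final tally.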
Section Highlighting in Philologic
In many of the Perseus texts currently loaded under PhiloLogic, the section labels would overlap and be unreadable. These labels come from the milestone tags in the XML text and are placed along the edge of the text. One particularly problematic text in this regard was the New Testament, as the sections were verses and were thus often small sections of text. In order to fix the overlapping issue, I wrote a little bit of JavaScript to hide the tags which would be placed in the same position as a previous tag. I also added a function...
Towards PhiloLogic4
Earlier this year I wrote a long discussion paper called "Renovating PhiloLogic" which provided an overview of the system architecture, a frank review of the strengths and (many) failings of the current implementation of the PhiloLogic 3 series, and proposed a general design model for what would effectively be a complete reimplementation of the system, retaining only selected portions of the existing code base. While we are still discussing this, often in great detail, a few general objectives for any future renovation...
Encyclopédie under KinoSearch
One of the things that I have wanted to do for a while is to examine implementations of Lucene, both as a search tool to complement PhiloLogic and possibly as a model for future PhiloLogic renovations. Late this summer, Clovis identified a particularly nice open-source Perl implementation of Lucene called KinoSearch. This looks like it will fit both bills very nicely indeed. As a little experiment, I loaded 73,000 articles (and other objects) from the Encyclopédie and cooked up a super simple query script. This allows...
back to comparing similar documents
I mentioned a little while ago some work I did on comparing one document with the rest of the corpus it belongs to (the examples I used in that blog post will not give the same results anymore, and the results might not be as good, since I haven't optimized the new code for the Encyclopédie yet). The idea behind it was to use the topic proportions for each article generated from LDA, and to come up with a set of calculations to decide which document(s) were closest to the original document. The reason I'm mentioning it here once more...
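One reasonable way to compare documents by their LDA topic proportions is Jensen-Shannon divergence; I'm not claiming this is the set of calculations the post used, and the distributions below are invented:

```python
import math

# Jensen-Shannon divergence between topic-proportion vectors: one
# standard choice for "which document is closest"; the post's actual
# calculations may differ. Distributions are illustrative.
def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

doc     = [0.6, 0.3, 0.1]   # topic proportions of the query article
similar = [0.5, 0.4, 0.1]
distant = [0.1, 0.1, 0.8]
print(js_divergence(doc, similar))  # small: close in topic space
print(js_divergence(doc, distant))  # larger: far in topic space
```

JSD is symmetric and bounded (between 0 and 1 in base 2), which makes it more convenient than raw KL divergence for ranking nearest documents.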
Supervised LDA: Preliminary Results on Homer
While Clovis has been running LDA tests on Encyclopédie texts using the Mallet code, I have been running some tests using the sLDA algorithm. After a few minor glitches, Richard and I managed to get the sLDA code, written by Chong Wang and David Blei, from Blei's website up and running. Unlike LDA, sLDA (supervised latent Dirichlet allocation) requires a training set of documents paired with corresponding class labels or responses. As Blei suggests, these can be categories, responses, ratings, counts or many other things....
Encyclopédie Renvois Search/Linker
During the summer (2009), a user (UofC PhD, tenured elsewhere) wrote to ask if there was any way to search the Encyclopédie and "generate a list of all articles that cross-reference a given article". We went back and forth a bit, and I slapped a little toy together and let him play with it, to which his reply was "Oh, this is cool! Five minutes of playing with the search engine and I can tell you it shows fun stuff...". This is, of course, an excellent suggestion which we have talked about in the past, usually in the context...
Archives Parlementaires: lèse (more)
As I mentioned in my last post in this thread, I was a bit surprised to see just how prevalent the construction lèse nation had become early in the Revolution. The following is a sorted KWIC of lèse in the AP, with the object type restricted to "cahiers", resulting in 38 occurrences. These are, of course, the complaints sent to the King, reflecting relatively early developments of Revolutionary discourse. Keeping in mind all of the caveats regarding this data, we can see some interesting and possibly contradictory uses: CAHIER:...
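A keyword-in-context display sorted on the right context, as described above, can be sketched in a few lines; the sample text is invented and unaccented for simplicity:

```python
import re

# Minimal KWIC (keyword-in-context) sketch, sorted on the right
# context like the concordance described above. Sample text is invented.
text = ("le crime de lese nation est un crime contre la patrie et le "
        "crime de lese majeste un crime contre le roi")

def kwic(text, pattern, width=20):
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left.rjust(width), m.group(), right))
    return sorted(rows, key=lambda r: r[2])  # sort on right context

for left, kw, right in kwic(text, r"lese \w+"):
    print(f"{left} | {kw} | {right}")
```

Sorting on the right context groups recurring continuations together, which is what makes patterns like the lèse nation construction jump out of a long concordance.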
Topic Based Text Segmentation Goodies
As you may recall, Clovis ran some experiments this summer (2009) applying a Perl implementation of Marti Hearst's TextTiling algorithm to perform topic based text segmentation on different French documents (see his blog post and related files). Clovis reasonably suggests that some types of literary documents, such as epistolary novels, may be more suitable candidates than other types, because they do not have the same degree of structural cohesion. Now, as I mentioned in my first discussion of the Archives Parlementaires,...
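The intuition behind TextTiling can be shown with a highly simplified sketch: score lexical overlap between adjacent blocks and place a boundary at the weakest gap. Hearst's actual algorithm uses token sequences, smoothing, and depth scores; this only illustrates the idea, and the sentences are invented:

```python
# Highly simplified TextTiling-style sketch: lexical similarity between
# adjacent blocks, boundary at the weakest gap. Hearst's real algorithm
# adds token sequences, smoothing, and depth scores.
def block_sim(a, b):
    wa, wb = set(" ".join(a).split()), set(" ".join(b).split())
    return len(wa & wb) / (len(wa | wb) or 1)  # Jaccard overlap

sentences = [
    "the king taxed the peasants",
    "the peasants resented the king",
    "wheat prices rose sharply",
    "bread and wheat grew scarce",
]
# similarity across each gap between adjacent one-sentence blocks
gaps = [block_sim([sentences[i]], [sentences[i + 1]])
        for i in range(len(sentences) - 1)]
boundary = gaps.index(min(gaps)) + 1   # weakest gap -> topic boundary
print(gaps, "boundary after sentence", boundary)
```

The point Clovis makes follows directly: documents with strong structural cohesion reuse vocabulary across sections, flattening these similarity valleys and making boundaries harder to detect.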
Archives Parlementaires: lèse collocations

The collocation table function of PhiloLogic is a quick way to look at changes in word use. Lèse majesté, treason or injuries against the dignity of the sovereign or state, is a common expression. The collocation table below shows terms around "lese | leze | lèse | lèze | lése | léze" in ARTFL Frantext (550 documents, 1700-1787), with majesté being by far the most common. It is interesting to note that the construction "lèse nation"...
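Building such a collocation table amounts to counting the words that fall within a window around each hit of the spelling variants. A sketch, with invented sentences:

```python
import re
from collections import Counter

# Sketch of a collocation table: count words within a window around any
# spelling variant of "lèse". Sample sentences are invented.
VARIANTS = re.compile(r"\b(lese|leze|lèse|lèze|lése|léze)\b")

def collocates(texts, window=3):
    counts = Counter()
    for text in texts:
        words = text.split()
        for i, w in enumerate(words):
            if VARIANTS.match(w):
                lo, hi = max(0, i - window), i + window + 1
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

texts = ["crime de lèse majesté contre le roi",
         "un crime de lèse nation"]
print(collocates(texts).most_common(4))
```

Sorting the resulting counter is all the "table" is; ranking by frequency is what surfaces majesté as the dominant collocate and makes a newcomer like nation visible.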
Archives Parlementaires (I)
A couple of weeks ago, some ARTFL folks discussed the notion of outlining some research and/or development projects that we will be, or would like to be, working on in the coming months. We discussed a wide range of possibilities that could involve substantive work, using some of the systems we have already developed or are working on, or more purely technical work. Everyone came up with some pretty interesting projects and proposals, and we decided that it might be entertaining and useful for each of us to outline a specific...
EPUB to TEI Lite converter
This is just to let you know that we now have an EPUB to TEI converter. It can be found here: http://artfl.googlecode.com/files/epub_parser.tar As you'll notice, there are three files in this archive. The first one is epub_parser.sh; it's the only one you need to edit. Specify the paths (where the epub files are and where you want your TEI files to go) without slashes, then just execute epub_parser.sh. The second one is parser.pl, which is called by epub_parser.sh. The third one is entities.pl, which handles HTML entities and...
Text segmentation code and usage
Here's a quick explanation of how to use the text segmentation Perl module called Lingua-FR-Segmenter. You can find it here: http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz It's not available on CPAN as it's just a hacked version of Lingua::EN::Segmenter::TextTiling made to work with French. The first thing to do before installing it is to install Lingua::EN::Segmenter::TextTiling, which will get you all the required dependencies (cpan -i Lingua::EN::Segmenter::TextTiling). When you install the French segmenter,...
Classifying the Echo de la Fabrique
I've been working lately on trying to classify the Echo de la Fabrique, a 19th century newspaper, using LDA. The official website is located at http://echo-fabrique.ens-lsh.fr/. The installation I used is strictly meant for experimentation on topic modeling.
The dataset I used is significantly smaller than the Encyclopédie, which means that the algorithm has fewer articles with which to generate topics. This makes the whole process trickier since choosing the right number of topics suddenly becomes more important. I suspect...
Some Classification Experiments
Since Clovis has been running some experiments to see how well topic modeling using LDA might be used to predict topics on unseen instances, I thought I would backtrack a bit and write about some experiments I ran last year which may be salient for future comparative experimentation, or even for beginning to think about putting some of our classification work into some level of production. I am presuming that you are basically familiar with some of the classifiers and the problems with the Encyclopédie ontology. These are described...