Since Clovis has been running some experiments to see how well Topic Modeling using LDA might be used to predict topics on unseen instances, I thought I would backtrack a bit and write about some experiments I ran last year which may be salient for future comparative experimentation or even to begin thinking about putting some of our classification work into some level of production. I am presuming that you are basically familiar with some of the classifiers and problems with the Encyclopédie ontology. These are described...
Collocation Notes

Since we are planning to use collocation as a main component of yet another grant/project proposal, I thought I would give some background notes for future reference. One of the more popular reporting features in PhiloLogic is the collocation table. This is a very simple mechanism. It counts the words around a search term or list of terms (the user sets the span and can turn off function word filtering) and reports...
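To make the counting step concrete, here is a minimal sketch in Python of a collocation count over a tokenized text. It is not PhiloLogic's actual code: the function name `collocates`, the tiny `FUNCTION_WORDS` stop list, and the default span are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in"}


def collocates(tokens, targets, span=5, filter_function_words=True):
    """Count the words found within `span` tokens of any target term."""
    targets = set(targets)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        # Up to `span` words to the left and to the right of the hit,
        # excluding the hit itself.
        start = max(0, i - span)
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        for word in window:
            if filter_function_words and word in FUNCTION_WORDS:
                continue
            counts[word] += 1
    return counts


if __name__ == "__main__":
    text = "the rights of man and the rights of the citizen".split()
    # Most frequent neighbours of "rights" within two words on either side.
    print(collocates(text, ["rights"], span=2).most_common(5))
```

Sorting the resulting counts by frequency gives something like the collocation table; passing `filter_function_words=False` corresponds to turning the function word filter off.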
Finding related articles using topic modeling
While still working on the topic inferencer, I started hacking at another possibility opened up by topic modeling: finding closely related texts within a corpus. There are several ways of doing this. What I chose to do was to consider the top three topics in each article and their respective proportions, and weigh them against the whole corpus. Here's a link to a search form where you can search for similar articles in the Encyclopédie: http://robespierre.uchicago.edu/topic_modeling/search.form.html In order...
Some Notes on Theme-Rheme in PhiloLogic

One of the more arcane, and probably rarely used, functions in PhiloLogic is an experimental reporting scheme that I rather tentatively named "word in clause position analysis" or "theme-rheme," which is briefly described in the PhiloLogic user manual. I proposed this in a talk titled "Making Space: Women's Writing in France, 1600-1950," which I gave at the ACH-ALLC and COCH/COSH conferences in 2004 (and drafted a good chunk of a paper...
Topic inference using the Encyclopédie trained model
While trying to use the Encyclopédie trained topic model on the Mémoires de Trévoux, something quite unexpected happened: the topic modeler was finding it hard to find topics that matched the Trévoux articles. You can see those results here: http://robespierre.uchicago.edu/topic_modeling/inference/encyclo2trevoux.txt Since the topic inference feature in Mallet is relatively new, I thought of creating a model out of the Trévoux, and then comparing the topic proportion generated from the topic trainer with the one generated using...
Proportions of topics in Encyclopédie articles
This is a follow-up to my previous blog entry about topic modeling in the Encyclopédie. As the title of this post suggests, I will be showing here the proportions of topics per article. Instead of just posting those results without any further comment, I would like to focus on 12 random articles to see what kind of results one could get. My feeling about this is that the best results are in the 300 topic model. What do you think? Note that there is still a lot of room for some refinement. Examples from the 42 topic model: http://docs.google.com/View?id=dgrbcw9z_69gk9w5tgc Examples...
The PhiloLogic Data Architecture
For the last year or so, I've been arguing that it's time for a round of maintenance work on PhiloLogic's various retrieval sub-systems. In a later post, I'll examine some of the newer data store components out there in the open-source world. First, however, I'd like to enumerate what PhiloLogic's main storage components are, where they live, and how they work, for clarity and economy of reference. The Main Word Index: PhiloLogic's central data store is a GDBM hashtable called index that functions, basically, the same way as...
Preliminary results on topic modeling in the Encyclopédie
Following up on Mark's comments on topic modeling using Latent Dirichlet Allocation, or LDA, I went on to explore some implementations of this algorithm to see what type of results we would get on some of the data sets we have. I first started using David Blei's code, but it ended up being too complex to use, and the documentation was very elusive. So I started to look at another tool, Mallet, which also includes an implementation of LDA. Here are the first results I've come up with when running it against the Encyclopédie. The main...