While Clovis has been running LDA tests on Encyclopédie texts using the Mallet code, I have been running some tests using the sLDA algorithm. After a few minor glitches, Richard and I managed to get the sLDA code, written by Chong Wang and David Blei, from Blei's website up and running.
Unlike LDA, sLDA (Supervised Latent Dirichlet Allocation) requires a training set of documents paired with corresponding class labels or responses. As Blei suggests, these can be categories, responses, ratings, counts or many other things. In my experiments on Homeric texts, I used only two classes, corresponding to Homer's two major works: the Iliad and the Odyssey. As in LDA, topics are inferred from the given texts and a model is built from the data. This model, having seen the class labels of the texts it was trained on, can then be used to infer the class labels of previously unseen data.
For my experiments, I modified the XML versions of the Homer texts that we have on hand using a few simple Perl scripts. Getting the XML transformed into an acceptable format for Wang's code required a bit of finagling, but was not too terrible. My scripts first split the XML into books (24 for the Iliad and 24 for the Odyssey), then stripped the XML tags from the text. Setting aside four books from each text for the inference step, I took the remaining 40 books and wrote out the data file required as input to the algorithm (data format here).
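The preparation steps above can be sketched roughly as follows. This is an illustration in Python rather than the Perl scripts I actually used, and it assumes the sparse "[M] [term_id]:[count] ..." line layout of Blei's LDA-C data files, with the class labels written to a separate file, one per line. The function names and the choice of held-out book numbers are hypothetical.

```python
from collections import Counter

def split_train_heldout(books, heldout_numbers):
    """Split {book_number: token_list} into training and held-out dicts.

    heldout_numbers is a hypothetical choice; the post does not say
    which four books of each poem were set aside.
    """
    train = {n: toks for n, toks in books.items() if n not in heldout_numbers}
    heldout = {n: toks for n, toks in books.items() if n in heldout_numbers}
    return train, heldout

def to_sparse_line(tokens, vocab_index):
    """Format one document as '[M] [term_id]:[count] ...'.

    Assumed layout: M is the number of distinct vocabulary terms in the
    document; words missing from vocab_index are silently dropped.
    """
    counts = Counter(vocab_index[t] for t in tokens if t in vocab_index)
    pairs = " ".join(f"{term_id}:{n}" for term_id, n in sorted(counts.items()))
    return f"{len(counts)} {pairs}".strip()
```

One line of this kind would be written per training book, with a parallel labels file holding 0 for each Iliad book and 1 for each Odyssey book.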
I played around a bit with leaving out words that occurred extremely frequently or extremely rarely. The English vocabulary was vast, so for the results I am posting here I cut it down to words that occurred between 10 and 60 times. This probably cuts it down too much, though, so it would be good to try some variations. Richard has also suggested cutting out the proper nouns before running sLDA in order to focus more on the semantic topics. For the Greek vocabulary, I used the words occurring between 3 and 100 times, after stripping out the accents.
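For anyone wanting to try other cutoffs, the filtering could look something like this in Python. It is only a sketch: I am assuming the bounds are inclusive and that counts are totals over the whole corpus, and the accent stripping shown here (Unicode decomposition, then dropping combining marks) also removes breathings and iota subscripts, which may or may not match what my own scripts did. Both function names are hypothetical.

```python
import unicodedata
from collections import Counter

def strip_accents(word):
    """Remove combining diacritics (accents, breathings, iota subscript)."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def select_vocabulary(documents, min_count, max_count):
    """Keep words whose corpus-wide count lies in [min_count, max_count].

    documents is a list of token lists; inclusive bounds are an assumption.
    """
    totals = Counter(tok for doc in documents for tok in doc)
    return sorted(w for w, n in totals.items() if min_count <= n <= max_count)
```

With this, the English run would correspond to select_vocabulary(docs, 10, 60), and the Greek run to select_vocabulary(docs, 3, 100) after applying strip_accents to every token.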
Running the inference part of sLDA on the 8 books that I had set aside seemed to work quite well. All 8 were correctly labeled as belonging to the Iliad or to the Odyssey. In a reverse run, the inference again achieved 100 percent accuracy, labeling the other 40 books after having been trained on only those 8 books.
The raw results of the trials give a matrix of betas with a column for each word and a row for each topic. These betas thus give a log-based weighting of each word in each topic. Following this are the etas, with a column for each topic and a row for each class. These etas give the weightings of each topic in each class, as far as I understand it. Richard and I slightly altered the sLDA code to output an eta for each class, rather than for one fewer than the number of classes, as it had been giving us. As far as we understand the algorithm as presented in Blei's paper, it should be giving us an eta for each class. Our modification didn't seem to break anything, and the results look sensible, so we are assuming that it worked. Using the final model data, I have a Perl script that outputs the top words in each topic along with the top topics in each class. These are the results that I am giving below.
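The ranking step itself is simple. Here is a rough Python equivalent of what my Perl script does, assuming the betas have already been read into a list of rows (one per topic, one column per word) and the etas into a list of rows (one per class, one column per topic). The function names and the default cutoffs are made up for illustration.

```python
def top_indices(weights, n):
    """Indices of the n largest values, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]

def top_words_per_topic(betas, vocab, n=10):
    """For each topic row of log weights, list the n highest-weighted words."""
    return [[vocab[i] for i in top_indices(row, n)] for row in betas]

def top_topics_per_class(etas, n=5):
    """For each class row of etas, list the n highest-weighted topic ids."""
    return [top_indices(row, n) for row in etas]
```

Because the betas are log weights, the largest (least negative) values mark the most prominent words, so no exponentiation is needed just to rank them.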
Results of my sLDA Experiments on Homer:
Also, samples of the output from Blei and Wang's code, corresponding to the English Text with 100 topics:
Final Model: gives the betas and the etas which I used to output my results
Likelihood: the likelihood of these documents, given the model
Inferred Labels: Iliad has label '0', Odyssey has label '1'.
Inferred Likelihood: the likelihood of the previously unseen texts
I have not played around much with the gammas, but they seem to give a weighting of each topic in each document. Thus you could figure out in which book of the Iliad or the Odyssey a specific topic is most prevalent. It would be interesting to see whether this correctly pinpoints the book in which the Cyclops appears, for instance, as this is a fairly easily identifiable topic in most of the trials.
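If the gammas really are per-document topic weights, the lookup might go something like the sketch below. I am assuming one row per document and one column per topic, with non-negative entries, and I normalize each row so that long and short books can be compared. None of this has been checked against the actual gamma output, and the function names are hypothetical.

```python
def topic_proportions(gamma_row):
    """Normalize one document's topic weights so they sum to 1."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]

def most_prevalent_document(gammas, topic):
    """Index of the document in which the given topic has the largest share."""
    shares = [topic_proportions(row)[topic] for row in gammas]
    return max(range(len(shares)), key=lambda d: shares[d])
```

Given the topic id that looks like the Cyclops topic, most_prevalent_document(gammas, topic_id) would return a row index, which then has to be mapped back to a book number using the order in which the books were written to the data file.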