Following up on Mark's comments on topic modeling using Latent Dirichlet Allocation (LDA), I went on to explore some implementations of this algorithm to see what kind of results we would get on some of our data sets. I first started with David Blei's code, but it ended up being too complex to use, and the documentation was hard to follow. So I started looking at another tool, Mallet, which also includes an implementation of LDA.
Here are the first results I've come up with from running it against the Encyclopédie. The main issue with topic modeling, as described in this article, is choosing the right number of topics, since the results differ quite a bit depending on this number. I haven't settled on a particular number yet. Below are the topics I've come up with. Let me know what you think, and which version(s) seem the most accurate. I would argue that the question comes down to how focused we want each topic to be, or how broad we can let the topics be without losing accuracy. Please let me know if there are words you think I could eliminate (less noise, more accuracy). Several kinds of hints would be useful, such as pointing out a topic that doesn't make sense or a word that seems out of place (probably noise to be eliminated in another run). Note that the list of words I delete from the articles (a little over 300 so far) could very well be reused for other 18th-century French texts, and, with some tweaks here and there, for texts from other periods between 1650 and today. Thanks.
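For anyone who wants to try the word-elimination step on their own texts, here is a minimal Python sketch of that kind of filtering. The stop list, the tokenizing regex, and the function name are all assumptions made up for the example; they are not the actual list of 300 or so words, nor the way Mallet handles stop words internally.

```python
import re

# Hypothetical stop list, for illustration only (the real list is much longer).
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "est", "que", "qui", "dans"}

# Assumed tokenizer: runs of letters, accented letters included.
TOKEN_RE = re.compile(r"[^\W\d_]+")

def filter_article(text, stopwords=STOPWORDS, min_length=2):
    """Lowercase the text, tokenize it, and drop stop words and very short tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_length]

sample = "La graine est la semence des plantes et des arbres."
print(filter_article(sample))
```

Adding a noisy word to the set and rerunning the model is then all it takes to test whether a topic gets cleaner.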
Version with 42 topics:
http://robespierre.uchicago.edu/topic_modeling/42topics-encyclo.txt
Version with 100 topics:
http://robespierre.uchicago.edu/topic_modeling/100topics-encyclo.txt
Version with 150 topics:
http://robespierre.uchicago.edu/topic_modeling/150topics-encyclo.txt
Version with 200 topics:
http://robespierre.uchicago.edu/topic_modeling/200topics-encyclo.txt
Version with 250 topics:
http://robespierre.uchicago.edu/topic_modeling/250topics-encyclo.txt
Version with 300 topics:
http://robespierre.uchicago.edu/topic_modeling/300topics-encyclo.txt
Version with 350 topics:
http://robespierre.uchicago.edu/topic_modeling/350topics-encyclo.txt
These results are just a preliminary step. The interesting part is the topic proportions per document. I'll show some results in another post.
This looks really interesting. After quickly surveying the results, I'd go with 150 topics, but that's just intuition. How often do you see words repeated among topics?
That's an interesting point. What LDA will do for you is determine the context in which a word is used, so the word 'graine' could appear in a topic about trees as well as in a topic about natural remedies. No word is bound to one topic, though some words will only come up in a specific topic. Note that I cut the word list for each topic at 20 words; many more follow.
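To get a rough count of how often words repeat among topics, something like the following Python sketch would do. It assumes the top words of each topic have already been loaded into a dictionary keyed by topic number; the function name and the sample topics are made up for illustration and are not taken from the actual results.

```python
from collections import defaultdict

def shared_words(topics):
    """Map each word found in more than one topic to the sorted ids of those topics."""
    seen = defaultdict(set)
    for topic_id, words in topics.items():
        for word in words:
            seen[word].add(topic_id)
    return {w: sorted(ids) for w, ids in seen.items() if len(ids) > 1}

# Invented topics for the example, not real output.
topics = {
    0: ["arbre", "graine", "feuille", "fruit"],
    1: ["remede", "graine", "plante", "maladie"],
    2: ["roi", "loi", "etat"],
}
print(shared_words(topics))
```

Keep in mind that with the lists cut at 20 words, this only measures overlap among the most prominent words of each topic.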