Mapping Encyclopédie classes of knowledge to LDA generated topics

As described in my previous blog entry, I've been comparing LDA generated topics with the classes of knowledge identified by the philosophes in the Encyclopédie. My initial experiment took 5000 articles belonging to 100 classes of knowledge, with 50 articles per class, and asked whether an LDA topic modeler would recover those 100 classes as topics. My conclusion was that it didn't find all of them, but still found quite a few. Since then, I have played a bit more with this dataset and have come up with better results.
Since a topic modeler gives you the topic proportions for each article (I just use the top three), what I tried to do this time was to draw up a table listing, for each class of knowledge, the topics the topic modeler assigned to its articles. Before looking at this, it's important to keep in mind that in the sample of articles I used, there are 50 articles per class of knowledge. Therefore, the closer the number of articles assigned to the dominant topic of a class gets to 50, the better the topic modeler has done at identifying that class of knowledge and reproducing the human classification.
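To make the table idea concrete, here is a minimal sketch of how such a tally could be built. It is not the script I actually ran: the function name `dominant_topics`, the input layout (a class label paired with a dict of topic proportions), and the rule that an article counts toward a topic whenever that topic is among its top three are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def dominant_topics(articles, top_n=3):
    """Return {class_label: (dominant_topic_id, article_count)}.

    `articles` is an iterable of (class_label, proportions) pairs, where
    `proportions` maps a topic id to that article's topic proportion.
    Assumption: an article counts toward a topic if the topic is among
    its `top_n` highest proportions.
    """
    per_class = defaultdict(Counter)
    for class_label, proportions in articles:
        # Keep only the top_n topics for this article.
        top = sorted(proportions, key=proportions.get, reverse=True)[:top_n]
        per_class[class_label].update(top)
    # The dominant topic of a class is the one covering the most articles.
    return {label: counts.most_common(1)[0] for label, counts in per_class.items()}
```

With 50 articles per class, a count close to 50 for the dominant topic means the class was recovered almost perfectly.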
Of course, the classification of articles in the Encyclopédie can at times be a little puzzling. The articles were written by a large number of people, and the classification is therefore not always consistent. With that in mind, one should not expect to get perfect matches using a topic modeler. Moreover, since the topic modeler assumes that each article is a mixture of several topics rather than a member of a single class, the match might be further off.
For my experiment, I settled on 107 topics, of which I eliminated 7 that were identified as stopword lists, leaving 100. In the results, there are 41 classes of knowledge in which 40 or more of the 50 articles are grouped within the same LDA topic. This means that 41% of the classes of knowledge were identified with a high level of accuracy. If we lower the bar to classes in which more than 25 articles fall within the same topic, we get up to 83 classes (or 83%).
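The two percentages above are simple threshold counts over the per-class tallies. As a rough sketch (the function name `share_of_classes` and the dict layout are hypothetical, and the numbers in the usage note are toy values, not my results):

```python
def share_of_classes(dominant_counts, min_articles):
    """Count classes whose dominant topic covers at least `min_articles` articles.

    `dominant_counts` maps a class label to the number of its articles
    that fall within its dominant topic. Returns (count, percentage).
    """
    matched = [label for label, n in dominant_counts.items() if n >= min_articles]
    return len(matched), 100.0 * len(matched) / len(dominant_counts)
```

"40 or more" corresponds to `min_articles=40`, while "more than 25" corresponds to `min_articles=26`. For instance, `share_of_classes({"a": 48, "b": 40, "c": 12, "d": 26}, 40)` yields `(2, 50.0)`.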
Looking at those results, there are some strange gaps: classes such as physique and divination don't seem to be identified at all. This might be due to a miscalculation, but I have yet to figure out what it could be. Highly specialized classes, such as corroyerie, poésie, or astronomie, get excellent matches, which is to be expected.
This experiment also gave us an idea of what percentage of LDA topics should be considered stopword lists. Here, 7 of the 107 topics (about 6.5%) fell into that category, which suggests that between 5 and 10% of the topics should be discarded when using an LDA classifier.
Finally, we should consider that LDA generated topics do not systematically match human identified topics. An unsupervised model is bound to give different results. It would be interesting to see how well supervised LDA (sLDA) would do in our particular test case.
