Do LDA generated topics match human identified topics?

Clovis, Wednesday, November 18, 2009

I've been experimenting lately with how well LDA-generated topics match the Encyclopédie classes of knowledge. The experiment was conducted as follows:

- I chose 100 classes of knowledge in the Encyclopédie, and picked 50 articles from each.

- I then ran a first pass of the LDA topic trainer, set to 100 topics.

- I then tried to identify each generated topic and name it after one of the Encyclopédie classes of knowledge.

- My plan was then to look at the topic proportions per article and see whether each article's top topic corresponded to its class of knowledge. Would the computer manage to classify the articles in the same way the encyclopedists had?
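The last step of that plan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layout (one list of topic proportions per article) and the toy numbers are all invented here, and are not taken from the actual run or from any particular LDA toolkit.

```python
# Hypothetical sketch: none of these names or numbers come from the real experiment.

def top_topic(proportions):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(proportions)), key=lambda k: proportions[k])


def top_topic_agreement(doc_topics, doc_classes, topic_names):
    """Fraction of articles whose top topic carries the name of their class.

    doc_topics:  one list of topic proportions per article
    doc_classes: the class of knowledge assigned by the encyclopedists
    topic_names: the class name given by hand to each topic
                 (None for a topic that could not be identified)
    """
    hits = 0
    for proportions, cls in zip(doc_topics, doc_classes):
        if topic_names[top_topic(proportions)] == cls:
            hits += 1
    return hits / len(doc_classes)


# Toy data, invented purely for illustration.
topic_names = ["Artillerie", "Peinture", None]
doc_topics = [
    [0.80, 0.10, 0.10],  # top topic 0
    [0.20, 0.70, 0.10],  # top topic 1
    [0.30, 0.30, 0.40],  # top topic 2 (unidentified)
    [0.60, 0.30, 0.10],  # top topic 0
]
doc_classes = ["Artillerie", "Peinture", "Teinture", "Peinture"]

agreement = top_topic_agreement(doc_topics, doc_classes, topic_names)
print(agreement)
```

A single agreement figure like this hides which classes fail and why, so it is best read alongside the topic word lists themselves.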

I was not able to get that far with 100 topics in my first LDA run. LDA tends to generate a couple of topics which aren't really topics: they are just lists of very common words that happen to be used in the same documents. These should be disregarded in favor of the others. In practice, this meant adding a few more topics to the run in order to get 100 identifiable ones. I settled on 103 topics, found 3 distributions of words that were unidentifiable, and dismissed them.
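The unidentifiable topics here were picked out by reading the word lists. As a rough sketch of how such topics might be flagged automatically, one could check what share of a topic's top words occur in a large fraction of the documents. The heuristic, the function names and both cutoffs below are assumptions made for illustration, not part of the experiment.

```python
# Hypothetical heuristic: the cutoffs and the toy corpus are invented for illustration.

def document_frequency(docs):
    """Map each word to the fraction of documents that contain it."""
    counts = {}
    for doc in docs:
        for word in set(doc):
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(docs) for word, n in counts.items()}


def looks_like_common_word_topic(top_words, doc_freq,
                                 df_cutoff=0.5, share_cutoff=0.5):
    """True if at least share_cutoff of the topic's top words
    appear in more than df_cutoff of the documents."""
    common = [w for w in top_words if doc_freq.get(w, 0.0) > df_cutoff]
    return len(common) / len(top_words) >= share_cutoff


# Toy corpus: each document is a list of tokens.
docs = [
    ["canon", "poudre", "bien", "sert", "fait"],
    ["couleur", "rouge", "bien", "sert", "fait"],
    ["soie", "fil", "bien", "fait", "main"],
    ["canon", "boulet", "sert", "fait", "main"],
]
doc_freq = document_frequency(docs)

topics = [
    ["bien", "sert", "fait", "main"],      # mostly very common words
    ["canon", "poudre", "boulet", "fait"],  # mostly specific words
]
flags = [looks_like_common_word_topic(t, doc_freq) for t in topics]
print(flags)
```

A check like this can only suggest candidates; whether a distribution of words is identifiable as a topic remains a judgment call.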

The results show that LDA topics and the Encyclopédie classes of knowledge do not match one-to-one (see the link to the results below). Some do very well, like Artillerie, for which the corresponding distribution of words is:

canon piece poudre artillerie boulet fusil ligne calibre mortier bombe feu charge culasse livre met chambre pouce lumiere roue affut diametre coup batterie levier bouche ame flasque balle tourillon tire

Other distributions of words make sense in themselves but do not match any of the original classes of knowledge. For instance, there is no separate topic for 'teinture' or for 'peinture'. What we get instead is a mixture of the two classes, which could be identified as colors:

couleur rouge blanc bleu tableau jaune verd peinture ombre teinture noir toile tableaux nuance papier etoffe bien teint peintre pinceau trait teinturier melange veut figure teindre feuille beau sert colle

Now the topic modeler is not wrong here. It's telling us that these words tend to occur together, which is true. Another significant example involves 'Boutonnier', 'Soie', and 'Rubanier':

soie fil rouet corde brin tour main bouton gauche longueur boutonnier droite attache bout fils tourner sert molette noeud cordon doigt piece emerillon moule broche ouvrage ruban rochet branche aiguille

What we get here is a topic about the art of making clothes, which is more general than 'Boutonnier' or 'Rubanier'.

For the match to actually work, the philosophes would have had to be extremely rigorous in their choice of vocabulary, because that is what LDA expects. Another problem is that LDA treats each document as a mixture of topics rather than as an instance of a single topic. So even if a document is exclusively focused on one subject, LDA will still try to spread it over several topics, and some of those will be mere subdivisions of the document's class of knowledge. Our experiment could have broken down because the LDA topic trainer created new subdivisions for some classes of knowledge, or regrouped several classes into one. These are all valid as topics, but they do not correspond to human-identified topics.
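One way to make the two failure modes concrete would be to cross-tabulate each article's class of knowledge against its top topic, then look for topics shared by several classes (regrouping) and classes spread over several topics (subdivision). The sketch below assumes the top topic per article is already known. The function names, the 0.3 threshold and the toy counts are invented for illustration and are not results from the experiment.

```python
from collections import Counter, defaultdict

# Hypothetical sketch: names, threshold and data are assumptions, not real results.

def cross_tabulate(doc_classes, doc_top_topics):
    """Count how many articles of each class have each topic as their top topic."""
    table = defaultdict(Counter)
    for cls, topic in zip(doc_classes, doc_top_topics):
        table[cls][topic] += 1
    return table


def regrouped_classes(table):
    """Topics that are the most frequent top topic of more than one class."""
    by_topic = defaultdict(list)
    for cls, counts in table.items():
        by_topic[counts.most_common(1)[0][0]].append(cls)
    return {t: sorted(classes) for t, classes in by_topic.items()
            if len(classes) > 1}


def subdivided_classes(table, min_share=0.3):
    """Classes whose articles are spread over two or more topics,
    each holding at least min_share of the class's articles."""
    result = {}
    for cls, counts in table.items():
        total = sum(counts.values())
        big = sorted(t for t, n in counts.items() if n / total >= min_share)
        if len(big) > 1:
            result[cls] = big
    return result


# Toy data: topic ids and counts are invented; 'Marine' is only a placeholder class.
doc_classes = ["Peinture", "Peinture", "Teinture", "Teinture",
               "Marine", "Marine", "Marine", "Marine"]
doc_top_topics = [7, 7, 7, 7, 2, 2, 5, 5]

table = cross_tabulate(doc_classes, doc_top_topics)
regrouped = regrouped_classes(table)
subdivided = subdivided_classes(table)
print(regrouped)
print(subdivided)
```

On the toy data, topic 7 absorbs both 'Peinture' and 'Teinture', while the placeholder class is split between topics 2 and 5. The same table would also give a per-class count of how many articles end up grouped together.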

Link to results

1 comment:

Mark, November 19, 2009 at 2:18 PM
Thank you, Clovis. Very interesting, indeed. Would it make sense to run the topic model on the articles from which you derived the topics and see how many for each class of knowledge get grouped together? A poor man's cross validation. We might then have a way to measure this.

ARTFL Project Research Blog
