Finding related articles using topic modeling

While still working on the topic inferencer, I started hacking at another possibility opened up by topic modeling: finding closely related texts within a corpus. There are several ways of doing this. What I chose to do was to consider the top three topics in each article and their respective proportions, and weigh them against the whole corpus. Here's a link to a search form where you can search for similar articles in the Encyclopedie:
To use it, paste the URL of the article you're looking at. You'll then get a list of links to articles that should be similar in content to the one you selected. A lot of tinkering can be done with the similarity calculation, so I may very well have made some bad judgment calls here and there. This is work in progress, and you might get strange results. But if you go through the whole list of results you might see some interesting things.
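To make the idea more concrete, here is a minimal Python sketch of one way such a calculation could look. It is not the exact code behind the search form. It assumes each article comes with a topic distribution (topic id mapped to proportion), keeps the top three topics, and reads "weighing against the whole corpus" as an IDF-like weight, so that topics appearing in few articles' top lists count more. The function names, the weighting formula, and the toy data are all made up for illustration.

```python
import math
from collections import Counter

def top_topics(dist, n=3):
    """Keep the n topics with the largest proportion for one article."""
    return dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n])

def corpus_weights(tops):
    """Assumed IDF-like weight: rarer top topics get a larger weight."""
    counts = Counter(topic for top in tops.values() for topic in top)
    n_docs = len(tops)
    return {topic: math.log(1 + n_docs / c) for topic, c in counts.items()}

def similarity(a, b, weights):
    """Sum over shared top topics of the smaller proportion times the weight."""
    return sum(min(a[t], b[t]) * weights[t] for t in a.keys() & b.keys())

def related(query, doc_topics, n=3):
    """Rank all other articles by similarity to the query article."""
    tops = {doc: top_topics(dist, n) for doc, dist in doc_topics.items()}
    weights = corpus_weights(tops)
    scores = [(similarity(tops[query], tops[doc], weights), doc)
              for doc in tops if doc != query]
    # Drop articles that share no top topic with the query.
    return [doc for score, doc in sorted(scores, reverse=True) if score > 0]

# Toy topic distributions (invented, not taken from the Encyclopedie model).
doc_topics = {
    "article_a": {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05},
    "article_b": {0: 0.4, 1: 0.4, 4: 0.2},
    "article_c": {3: 0.6, 4: 0.3, 5: 0.1},
    "article_d": {2: 0.7, 5: 0.2, 0: 0.1},
}

print(related("article_a", doc_topics))
```

Swapping `similarity` for another measure (cosine, for instance) or changing `n` is exactly the kind of tinkering mentioned above.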
I would like to give you two examples I've tried that work really well. The first one is the article Economie by Rousseau (which gives very good results). If you look at link 24, which according to my (flawed) calculation is the 24th closest article, you'll see an example of an article that would have been hard to find and link to Rousseau. The second example is Question by Jaucourt. Among the top 20 results, many concern various methods of torture, spread out across different classes of knowledge. Let me know what you think.


  1. This is very interesting indeed. I ran the article Firmament, and got 7 links: Arc-en-ciel, Pié de vent, Reflet, ETINCELLEMENT, DIFFRACTION, DÉFLEXION, Apparence.

    If I understand you correctly, we could quite reasonably compute the top N topics for each article, run all of the articles against one another at the topic level, store these results, and offer users an "if you like this article, here are some others that may be relevant" feature. We could, of course, do the same with vector space calculations. I suspect that LDA is probably better and would certainly allow a much more rapid set of calculations than raw vector space.

    Nice work!!

  2. Thanks. Yes this is exactly what I had in mind. I actually have several more examples of articles that work really well. Then there's the problem of optimizing all the calculations.
    Another idea I had was to generate even more topics (I'm using 300 at the moment), something like 500, and use them only for similarity search. The topics themselves might not make much sense, but we might get better results for similarity. Something to investigate.
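
The precomputation discussed in these comments (compare every article with every other at the topic level, then store the results) could be sketched roughly as follows. This is only an illustration under assumptions: it uses cosine similarity over each article's top N topics, keeps the k best neighbours per article, and serializes the result as JSON. The names `build_index`, `n_topics`, `k` and the toy data are hypothetical.

```python
import json
import math
from itertools import combinations

def top_n(dist, n):
    """Keep the n topics with the largest proportion for one article."""
    return dict(sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:n])

def cosine(a, b):
    """Cosine similarity between two sparse topic vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_index(doc_topics, n_topics=3, k=10):
    """Compare every pair of articles once and keep the k closest per article."""
    tops = {doc: top_n(dist, n_topics) for doc, dist in doc_topics.items()}
    neighbours = {doc: [] for doc in tops}
    for a, b in combinations(tops, 2):
        score = cosine(tops[a], tops[b])
        if score > 0:  # skip pairs with no topic in common
            neighbours[a].append((score, b))
            neighbours[b].append((score, a))
    return {doc: [name for _, name in sorted(pairs, reverse=True)[:k]]
            for doc, pairs in neighbours.items()}

# Toy topic distributions (invented for illustration).
toy_topics = {
    "x": {0: 0.6, 1: 0.4},
    "y": {0: 0.5, 1: 0.5},
    "z": {1: 0.2, 2: 0.8},
    "w": {3: 1.0},
}

index = build_index(toy_topics)
stored = json.dumps(index)  # could be written to a file or a database
print(stored)
```

The pairwise loop is quadratic in the number of articles, which is where the optimization problem mentioned above comes in. Restricting each article to a few top topics keeps every single comparison cheap.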