back to comparing similar documents

A little while ago I mentioned some work I did on comparing one document with the rest of the corpus it belongs to. (The examples I used in that blog post will no longer give the same results, and the results might not be as good, since I haven't optimized the new code for the Encyclopédie yet.) The idea behind it was to use the topic proportions that LDA generates for each article, and come up with a set of calculations to decide which document(s) were closest to the original document. The reason I'm mentioning it once more is that I've been through that code again, cleaned it up quite a bit, improved its performance, and tweaked the calculations. Basically, I made it usable by people other than myself. Last time I built a basic search form to use with Encyclopédie articles. This time I'm going to show the command line version, which has a couple more options than the web version.
In the web version, I was using both the top three topics in each document and their individual proportions within that document. For instance, Document A would have topics 1, 2 and 3 as its main topics. Topic 1 would have a proportion of 0.36, Topic 2 0.12, and Topic 3 0.09. In the command line version, there's the option of using only the topics, without the proportions. The order of importance of each topic is of course still respected. Depending on the corpus you're looking at, you might want to use one model rather than the other, since they do give different results. One could of course tweak this some more and decide to take only the proportion of the most prominent topic, thereby giving it more importance. There is definitely room for improvement.
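To make the difference between the two models concrete, here is a minimal sketch in Python. It is not the script's actual calculation, which this post doesn't spell out: the function names (`top_topics`, `rank_only_score`, `proportion_score`) and both scoring formulas are assumptions made up for illustration. The first score looks only at which topics are shared and where they rank; the second also uses the proportions.

```python
from typing import List, Tuple

# (topic id, proportion) pairs, most prominent topic first
Topics = List[Tuple[int, float]]


def top_topics(proportions: List[float], n: int = 3) -> Topics:
    """Return the n most prominent topics of one document."""
    ranked = sorted(enumerate(proportions), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def rank_only_score(query: Topics, other: Topics) -> float:
    """Hypothetical score that ignores proportions and uses topic order only."""
    n = len(query)
    other_ranks = {topic: rank for rank, (topic, _) in enumerate(other)}
    score = 0.0
    for rank, (topic, _) in enumerate(query):
        if topic in other_ranks:
            weight = n - rank                         # first topic counts most
            mismatch = abs(rank - other_ranks[topic])  # same topic, different rank
            score += weight / (1 + mismatch)
    best_possible = sum(n - rank for rank in range(n))
    return score / best_possible if best_possible else 0.0


def proportion_score(query: Topics, other: Topics) -> float:
    """Hypothetical score based on how much proportion the shared topics overlap."""
    other_props = dict(other)
    shared = sum(min(p, other_props[t]) for t, p in query if t in other_props)
    total = sum(p for _, p in query)
    return shared / total if total else 0.0


# Document A from the example above: topics 1, 2 and 3 at 0.36, 0.12 and 0.09.
# List positions stand in for topic ids; the other values are invented.
doc_a = top_topics([0.02, 0.36, 0.12, 0.09, 0.01])
doc_b = top_topics([0.05, 0.30, 0.02, 0.20, 0.10])

print(rank_only_score(doc_a, doc_b))
print(proportion_score(doc_a, doc_b))
```

With these made-up formulas, the two scores disagree about how close Document B is to Document A, which is the kind of difference between the two models mentioned above.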
There was also another option that was left out of the web version. By default, I set a tolerance level, that is, the score each document needs in order to be returned as a result of the query. In the command line version, I made it possible to define this tolerance in order to get more or fewer results. This option is currently only available with the refined model (the one with topic proportions). The code is currently living in
It's called There's some documentation in the header to explain how to use it. It's fairly simple. I might do some more work on it, and will update the script accordingly.
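The tolerance option mentioned above can be pictured with a short Python sketch. This is an illustration under assumptions, not the script itself: the function name `filter_by_tolerance`, the default of 0.5, the document names and the scores are all invented, and I'm assuming here that a higher score means a closer document.

```python
from typing import Dict, List, Tuple


def filter_by_tolerance(
    scores: Dict[str, float], tolerance: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep the documents whose score reaches the tolerance, best match first."""
    kept = [(doc, score) for doc, score in scores.items() if score >= tolerance]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented scores of three documents against one query document.
scores = {"doc_b": 0.68, "doc_c": 0.41, "doc_d": 0.90}

strict = filter_by_tolerance(scores, tolerance=0.5)  # fewer results
loose = filter_by_tolerance(scores, tolerance=0.4)   # more results
```

Lowering the tolerance lets weaker matches through, raising it keeps only the closest documents.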
There are other applications of this script besides using it on a corpus made of well-defined documents. One could very well imagine applying it to a corpus subdivided into chunks of text using a text segmentation algorithm. One could then try to find passages on the same topic(s) using a combination of LDA and this script. The Archives parlementaires could be a good test case.
Another option would be to run every document of a corpus against the whole corpus and store all the results in a SQL database. This would allow having a corpus where each document can be linked to various others according to the mixture of topics they are made of.
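As a rough idea of what storing those results could look like, here is a minimal Python sketch using the standard `sqlite3` module. The table name `similar_docs`, its columns, the helper names and the sample scores are all hypothetical. Nothing like this exists in the script yet.

```python
import sqlite3
from typing import Iterable, List, Tuple


def store_links(
    conn: sqlite3.Connection, links: Iterable[Tuple[str, str, float]]
) -> None:
    """Save (source document, similar document, score) rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS similar_docs (
               source_id TEXT NOT NULL,
               target_id TEXT NOT NULL,
               score     REAL NOT NULL,
               PRIMARY KEY (source_id, target_id)
           )"""
    )
    # Re-running a document against the corpus overwrites its earlier rows.
    conn.executemany(
        "INSERT OR REPLACE INTO similar_docs (source_id, target_id, score) "
        "VALUES (?, ?, ?)",
        links,
    )
    conn.commit()


def related_to(
    conn: sqlite3.Connection, doc_id: str, limit: int = 10
) -> List[Tuple[str, float]]:
    """Return the documents linked to doc_id, closest first."""
    rows = conn.execute(
        "SELECT target_id, score FROM similar_docs "
        "WHERE source_id = ? ORDER BY score DESC LIMIT ?",
        (doc_id, limit),
    )
    return rows.fetchall()


# In-memory database with invented document ids and scores.
conn = sqlite3.connect(":memory:")
store_links(conn, [("a", "b", 0.7), ("a", "c", 0.9), ("b", "a", 0.7)])
links_for_a = related_to(conn, "a")
```

Each document then becomes a lookup key: one query returns the other documents it is linked to, ordered by score.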
I will try to give more concrete results some time soon.