Classifying the Echo de la Fabrique

I've been working lately on classifying the Echo de la Fabrique, a 19th-century newspaper, using LDA. The official website is located at http://echo-fabrique.ens-lsh.fr/. The installation I used is meant strictly for experimenting with topic modeling.
The dataset I used is significantly smaller than the Encyclopédie, which means that the algorithm has fewer articles from which to generate topics. This makes the whole process trickier, since choosing the right number of topics becomes more important. I suspect that adding more articles to this dataset would yield better results. I settled on 55 topics, and gave each one a name corresponding to the general idea conveyed by its distribution of words. I then added those topics to each TEI file and loaded the files into PhiloLogic. I chose to include 4 topics per article, or fewer when a topic's weight in the article didn't reach the 0.1 mark.
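The selection rule described above (at most 4 topics per article, dropping any that fall short of 0.1) can be sketched roughly as follows. This is only an illustration and not the script I actually ran: the function name `select_topics`, its parameters, and the choice to treat a weight of exactly 0.1 as passing are all assumptions.

```python
def select_topics(names, weights, max_topics=4, threshold=0.1):
    """Return up to max_topics (name, weight) pairs, strongest first.

    names and weights are parallel lists describing one article's
    topic distribution. Treating a weight equal to the threshold as
    passing is an assumption made for this sketch.
    """
    # Rank topics by weight, highest first.
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    # Keep only the top candidates that reach the threshold.
    return [(name, w) for name, w in ranked[:max_topics] if w >= threshold]


# Hypothetical article: only two topics reach the 0.1 mark.
article_topics = select_topics(
    ["justice", "Hygiène", "Petites annonces", "Undetermined1", "Misère ouvrière"],
    [0.62, 0.21, 0.07, 0.06, 0.04],
)
print(article_topics)
```

With these made-up weights, the article would be tagged with 'justice' and 'Hygiène' only, since the remaining topics stay below 0.1.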
The work I've done so far on LDA has already shown several things about its accuracy in generating meaningful topics and in properly classifying text. It tends to work really well with topics that are concept-driven. For instance, in the Echo de la Fabrique, the topic 'justice' works really well. The same goes for 'Hygiène', which is associated with words like 'choléra' or 'eau'. On the other hand, some distributions of words were not identifiable as topics. Those have been marked as 'Undetermined', with a number appended (such as 'Undetermined1') to distinguish each undetermined topic. And then there are topics like 'Petites annonces' or 'Misère ouvrière', which are not as concept-driven and are therefore subject to more inaccuracies. Once again, I believe that having more articles from the same source would partially mitigate this problem: more documents, more training for the topic modeler, reduced dependency on concepts.
Each topic has a number attached to it. This number represents the rank of the topic within each article. To find articles where 'justice' is the most prominent topic, search for 'justice 1'; use 'justice 2' for the second topic, 'justice 3' for the third topic, and 'justice 4' for the fourth topic. If you want to search across all four, just type 'justice'. Note that the classification tends to be more accurate for the first topic than for the other three, but that's not always the case.
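To make the numbering scheme concrete, here is a small sketch of how ranked labels such as 'justice 1' could be built and matched. It does not show how PhiloLogic handles the search internally; `rank_labels` and `matches` are hypothetical helpers written only to illustrate the convention.

```python
def rank_labels(topic_names):
    """Turn topic names, ordered by prominence, into 'name rank' labels."""
    return [f"{name} {rank}" for rank, name in enumerate(topic_names, start=1)]


def matches(query, labels):
    """Return True if a query matches any label of an article.

    A query with a rank ('justice 2') must match a label exactly;
    a bare topic name ('justice') matches that topic at any rank.
    """
    query = query.strip()
    for label in labels:
        # Split off the trailing rank; names may themselves contain spaces.
        name, _, _rank = label.rpartition(" ")
        if query == label or query == name:
            return True
    return False


# Hypothetical article whose second topic is 'justice'.
labels = rank_labels(["Hygiène", "justice", "Petites annonces"])
print(labels)
print(matches("justice", labels), matches("justice 1", labels))
```

In this made-up case, searching for 'justice' or 'justice 2' would find the article, while 'justice 1' would not.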
Anyway, without further ado, here is the search form:
http://artfl-project.uchicago.edu/node/95
Please let me know if you have any comments or suggestions. Any feedback is much appreciated.