This is just to let you know that we now have an EPUB-to-TEI converter. It can be found here: http://artfl.googlecode.com/files/epub_parser.tar As you'll notice, there are three files in this archive. The first one is epub_parser.sh. It's the only one you need to edit. Specify the paths (where the EPUB files are and where you want your TEI files to go) without slashes, then just execute epub_parser.sh. The second one is parser.pl, which is called by epub_parser.sh. The third one is entities.pl, which handles HTML entities and...
Text segmentation code and usage
Here's a quick explanation of how to use the text segmentation Perl module called Lingua-FR-Segmenter. You can find it here: http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz It's not available on CPAN, as it's just a hacked version of Lingua::EN::Segmenter::TextTiling made to work with French. Before installing it, first install Lingua::EN::Segmenter::TextTiling, which will pull in all the required dependencies (cpan -i Lingua::EN::Segmenter::TextTiling). When you install the French segmenter,...
Classifying the Echo de la Fabrique
I've recently been trying to classify the Echo de la Fabrique, a 19th-century newspaper, using LDA. The official website is located at http://echo-fabrique.ens-lsh.fr/. The installation I used is meant strictly for experimentation with topic modeling.
The dataset I used is significantly smaller than the Encyclopédie, which means that the algorithm has fewer articles from which to generate topics. This makes the whole process trickier, since choosing the right number of topics becomes more important. I suspect...