ARTFL Project Research Blog

EPUB to TEI Lite converter

Clovis, Friday, September 25, 2009

This is just to let you know that we now have an EPUB-to-TEI converter. It can be found here:
http://artfl.googlecode.com/files/epub_parser.tar
As you'll notice, there are three files in this archive. The first, epub_parser.sh, is the only one you need to edit: specify the paths (where the EPUB files are and where you want the TEI files to be written) without slashes, then execute epub_parser.sh. The second, parser.pl, is called by epub_parser.sh. The third, entities.pl, handles HTML entities and is also called by epub_parser.sh. Before running the converter, make sure all three scripts are in the same directory.
A sample PhiloLogic load can be found here:
http://artflx.uchicago.edu/philologic/epubtest.whizbang.form.html
Of course, this is just a proof of concept, and the output will be used only for text search and machine learning purposes. Some things will still have to be tuned. Note that I put a div1 every ten pages, since there is no way to recognize chapters in the original EPUB files.
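As a rough illustration of the "div1 every ten pages" idea, here is a minimal Python sketch. It is not the code in parser.pl: the function name, the use of `<pb>` and `<p>` elements, and the page numbering are all assumptions made for the example.

```python
from xml.sax.saxutils import escape

PAGES_PER_DIV = 10  # the post's choice: a new div1 every ten pages


def wrap_pages_in_div1(pages, pages_per_div=PAGES_PER_DIV):
    """Group page texts into <div1> blocks of at most `pages_per_div` pages.

    `pages` is a list of plain-text strings, one per page (hypothetical input).
    """
    blocks = []
    for start in range(0, len(pages), pages_per_div):
        chunk = pages[start:start + pages_per_div]
        # One page break marker and one paragraph per page (an assumption).
        body = "\n".join(
            f'<pb n="{start + offset + 1}"/>\n<p>{escape(text)}</p>'
            for offset, text in enumerate(chunk)
        )
        blocks.append(f"<div1>\n{body}\n</div1>")
    return "\n".join(blocks)
```

With 25 pages this yields three div1 blocks: two of ten pages and one of five. Fixed-size divisions are arbitrary with respect to content, which is why the post treats them as a stopgap for missing chapter information.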

Text segmentation code and usage

Clovis, Friday, September 25, 2009

Here's a quick explanation of how to use the text segmentation Perl module called Lingua-FR-Segmenter. You can find it here: http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz It's not available on CPAN, as it's just a hacked version of Lingua::EN::Segmenter::TextTiling made to work with French. Before installing it, first install Lingua::EN::Segmenter::TextTiling, which will pull in all the required dependencies (cpan -i Lingua::EN::Segmenter::TextTiling). When you install the French segmenter, make test will fail, so don't run it. That's expected, since I haven't changed the test example, which is written for the English version of the module. Here is an example of how it can be used:

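    #!/usr/bin/perl
    use strict;
    use warnings;
    use lib '.';   # placed before the module import so a local copy can be found
    use Lingua::FR::Segmenter::TextTiling qw(segments);

    # Read the whole input (files named on the command line, or STDIN).
    my $text = '';
    while (<>) {
        $text .= $_;
    }

    # Safe number so that we don't run out of segment breaks.
    my $num_segment_breaks = 100000;
    my @segments = segments($num_segment_breaks, $text);

    # Print each segment, with a marker between consecutive segments.
    my $count = 0;
    foreach (@segments) {
        $count++;
        print;
        print "\n----------SEGMENT_BREAK----------\n" if exists $segments[$count];
    }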

There are other possibilities, but this is the basic one, which segments the text wherever there is a topic shift. Some massaging of the input is necessary to get good results, and the changes needed differ from one text to the next. At a minimum, separate paragraphs with a newline.
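The kind of massaging mentioned above can be sketched as a small Python preprocessing step. This is only an illustration, not part of the module: it assumes the raw text marks paragraphs with blank lines and hard-wraps the lines inside them, and the function name and `separator` parameter are made up for the example.

```python
import re


def normalize_paragraphs(raw_text, separator="\n"):
    """Rejoin hard-wrapped lines and emit one paragraph per line."""
    # Treat one or more blank lines as a paragraph boundary (an assumption
    # about the input, not something the segmenter requires).
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Collapse internal line breaks and runs of whitespace into single spaces.
        flat = " ".join(paragraph.split())
        if flat:
            cleaned.append(flat)
    return separator.join(cleaned)
```

The cleaned text can then be piped to the Perl script above on standard input. If a given text needs a different paragraph delimiter, pass another `separator`.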

Classifying the Echo de la Fabrique

Clovis, Friday, September 18, 2009

I've been working lately on classifying the Echo de la Fabrique, a 19th-century newspaper, using LDA (Latent Dirichlet Allocation). The official website is located at http://echo-fabrique.ens-lsh.fr/. The installation I used is strictly meant for experimentation with topic modeling.
The dataset I used is significantly smaller than the Encyclopédie, which means that the algorithm has fewer articles from which to generate topics. This makes the whole process trickier, since choosing the right number of topics becomes more important. I suspect that adding more articles to this dataset would yield better results. I settled on 55 topics and gave each one a name corresponding to the general idea conveyed by its distribution of words. I then added those topics to each TEI file and loaded the files into PhiloLogic. I chose to include four topics per article, or fewer if topics didn't reach the 0.1 mark.
The work I've done so far on LDA has already shown several things about its accuracy in generating meaningful topics and in properly classifying text. It tends to work really well with topics that are concept driven. For instance, in the Echo de la Fabrique, the topic 'justice' works really well. The same goes for 'Hygiène', which is associated with words like 'choléra' or 'eau'. On the other hand, some distributions of words were not identifiable as topics. Those have been marked as 'Undetermined' followed by a number, such as 'Undetermined1', to distinguish each undetermined topic. There are also topics like 'Petites annonces' or 'Misère ouvrière' which are not as concept driven and are therefore subject to more inaccuracies. Once again, I believe that having more articles from the same source would partially alleviate this problem: more documents, more training for the topic modeler, reduced dependency on concepts.
Each topic has a number attached to it. This number represents the rank of the topic within each article, by importance. To find articles where 'justice' is the most prominent topic, search for 'justice 1'; use 'justice 2' for the second topic, 'justice 3' for the third, and 'justice 4' for the fourth. If you want to search across all four, just type 'justice'. Note that the classification tends to be more accurate for the first topic than for the other three, but that's not always the case.
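The selection rule described above (at most four topics, each reaching 0.1) can be sketched in Python as follows. This is an illustration under assumptions, not the code actually used: the function name and the dictionary input are hypothetical, and treating a weight of exactly 0.1 as passing is a guess.

```python
MAX_TOPICS = 4    # topics kept per article, as in the post
MIN_WEIGHT = 0.1  # the "mark of 0.1" from the post


def select_topics(weights, max_topics=MAX_TOPICS, min_weight=MIN_WEIGHT):
    """Return up to `max_topics` (name, weight) pairs, heaviest first.

    `weights` maps a topic name to its weight in one article (hypothetical
    input format). Topics below `min_weight` are dropped.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(name, w) for name, w in ranked if w >= min_weight][:max_topics]
```

An article with only two topics at or above the threshold ends up with two topics rather than four, which matches the "or fewer" case in the text.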
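To make the numbering scheme concrete, here is a small Python sketch of how ranked labels such as 'justice 1' could be built and matched. It only mimics the behavior described in this post; it is not how PhiloLogic indexes or searches the field, and both function names are hypothetical.

```python
def rank_labels(ordered_topics):
    """Turn topic names ordered by importance into labels like 'justice 1'."""
    return [f"{name} {rank}" for rank, name in enumerate(ordered_topics, start=1)]


def label_matches(query, label):
    """True if `query` is the bare topic name or the exact 'name rank' label."""
    # Split on the last space so multi-word names such as
    # 'Petites annonces' stay intact.
    name, _, _rank = label.rpartition(" ")
    return query == label or query == name
```

Under this sketch, the query 'justice' matches 'justice 1' through 'justice 4', while 'justice 2' matches only articles where justice is the second topic.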
Anyway, without further ado, here is the search form:
https://artflsrv03.uchicago.edu/philologic4/echofabrique/ (Update: this is under PhiloLogic4 and has only Topics 1 enabled at this time.)
Please let me know if you have any comments or suggestions. Any feedback is much appreciated.
