Text segmentation code and usage

Here's a quick explanation of how to use the text segmentation Perl module called Lingua-FR-Segmenter. You can find it here: http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz It's not available on CPAN, as it's just a hacked version of Lingua::EN::Segmenter::TextTiling made to work with French. Before installing it, first install Lingua::EN::Segmenter::TextTiling, which will pull in all the required dependencies (cpan -i Lingua::EN::Segmenter::TextTiling). When you install the French segmenter, make test will fail, so don't run it. That's expected: I haven't changed the bundled example, which is still written for the English version of the module. Here is an example of how it can be used:

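#!/usr/bin/perl
use strict;
use warnings;
use lib '.'; # must come before the module is loaded if it lives in the current directory
use Lingua::FR::Segmenter::TextTiling qw(segments);

# Read the whole input (files named on the command line, or STDIN) into one string
my $text = '';
while (<>) {
    $text .= $_;
}
my $num_segment_breaks = 100000; # deliberately large so that we don't run out of segment breaks
my @segments = segments($num_segment_breaks, $text);
# Print each segment, with a marker between consecutive segments (none after the last)
foreach my $i (0 .. $#segments) {
    print $segments[$i];
    print "\n----------SEGMENT_BREAK----------\n" if $i < $#segments;
}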

There are other possibilities, but this is the basic one, which segments the text wherever there's a topic shift. Some massaging of the input is necessary to get good results, and the changes needed differ from one text to the next. At a minimum, make sure paragraphs are separated by a newline.
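As a rough illustration of the kind of massaging mentioned above, here is a minimal sketch of a clean-up step. It assumes the raw text marks paragraphs with blank lines and may contain hard-wrapped lines. The helper name normalize_paragraphs is hypothetical and not part of the module, and the right clean-up will depend on your text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::FR::Segmenter). Assumes blank
# lines mark paragraph boundaries in the raw text. Unwraps hard-wrapped
# lines inside each paragraph, then joins the paragraphs with $sep
# (a single newline by default).
sub normalize_paragraphs {
    my ($text, $sep) = @_;
    $sep = "\n" unless defined $sep;
    $text =~ s/\r\n?/\n/g;              # normalize Windows / old Mac line endings
    my @paragraphs = grep { length } map {
        my $p = $_;
        $p =~ s/\s+/ /g;                # collapse whitespace, including wrapped lines
        $p =~ s/^ | $//g;               # trim the leading and trailing space
        $p;
    } split /\n\s*\n/, $text;
    return join($sep, @paragraphs) . "\n";
}

# Small demonstration on an inline sample
my $sample = "Premier paragraphe,\ncoupé sur deux lignes.\n\n\nSecond paragraphe.\n";
print normalize_paragraphs($sample);
```

The cleaned string could then be passed to segments() in place of the raw $text. If your text needs a different separator between paragraphs, pass it as the second argument.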
