Topic Based Text Segmentation Goodies

Leave a Comment
As you may recall, Clovis ran some experiments this summer (2009) applying a perl implementation of Marti Heart's TextTiling algorithm to perform topic based text segmentation on different French documents (see his blog post and related files). Clovis reasonably suggests that some types of literary documents, such as epistolary novels, may be more suitable candidates than other types, because they do not have the same degree of structural cohesion. Now, as I mentioned in my first discussion of the Archives Parlementaires, I suspect that this collection may be particularly well to topic based segmentation. At the end of his post, Clovis also suggests that we might be able to test how well a particular segmentation approach is working by using a clustering algorithm, such as LDA Topic Modeling, to see if the segments can be shown to be reasonably cohesive. Both topic segmentation and modeling are difficult to assess because human readers/evaluators can have rather different opinions, leading to problems in "inter-rater reliability", which is probably a more vexing problem in the humanities and related areas of textual studies than in other domains.

Earlier this year (and a bit last year), I also ran some experiments on some 18th century English materials, such as Hume's History of England and the Federalist Papers. Encouraged by these results, particularly on the Federalist Papers, I have accumulated a number of newer algorithms, packages, and papers which may be useful for future work in this area. These are on my machine (for ARTFL folks, let me know if you want to know where), but I will not redistribute here as a couple of packages require non-redistribution or other limitations. I am putting links to some of the source files, when I have them.

Since Heart's original work, there have been a number of different approaches to topic based text segmentation. Clovis and I have tried to make note of much of this work on our CiteULike references (segmentation). There is some overlap with Shlomo's list. In no particular order of preference or chronology, here is what I have so far. I will also try to provide some details on using these when I have a chance to run them up.

From the Columbia NLP group (, we have both Min-Yan Kan's Segmenter and Michael Galley's LCSeg. These required signing a use agreement, which I have in my office. The release archives for both have papers, some test data,

I spent some time trying to track down Freddy Choi's C99 algorithm and implementation described in some papers in the early part of this decade. I finally tracked it all down on the WayBack Machine at Internet Archive (link, thank you!!), which also has some papers, software, data and implementations of TextTiling and other approaches from that period. It appears several of the packages below use C99 and some of the code from this.

I was going to reference Utiyama and Isihara's implementation (TextSeg), but in the few months since I assembled this list, the link has (also) gone dead:
This appears to be a combination of approaches.

Igor Malioutov's MinCut code (2006) is available from his page:

There appears to be some info on TextTiling in Simon Cozens (2006), "Advanced Perl Programming".

We also want to check out Beeferman et. al. (link) since I recall that this group had done some interesting work. I have Beeferman's implementation of TextTiling in C, but don't think I have run across anything else.

If you run across anything useful, please blog it here or let me know. Papers should be noted on our CiteUlike. Thanks!!
Next PostNewer Post Previous PostOlder Post Home


Post a Comment