Modeling Revolutionary Discourse


As part of our lead work on ARTFL’s NEH-funded Intertextual Bridges project, we are pleased to release a prototype build of the Newberry Library’s French Revolution Collection (FRC), which integrates topic-model browsing and search, ranked-relevance searching, and full PhiloLogic4 services as a set of interrelated functions. This post describes the current state of this work, documents some of its functionalities, and outlines our next steps of development.

In 2017, the Newberry Library released digital copies of more than 35,000 pamphlets, totaling approximately 850,000 pages of its extremely rich holdings related to the French Revolution. Shortly thereafter, the ARTFL Project released versions of this unparalleled resource under PhiloLogic4. In a subsequent post, we described the collection, some of the capabilities of that initial installation, and preliminary results obtained with the tools deployed in it.

We have two builds of the FRC under PhiloLogic. The first is simply a load of the entire collection of 38,377 documents as it was downloaded towards the end of 2017, to which we applied some error-correction functions, recently modified slightly and reapplied to the installation (search form). The bulk of our work has been aimed at the subset of the FRC covering 1787-1799, with the aim of improving the data and metadata as well as removing duplicate documents. The 2017 release of the FRC at ARTFL contained 26,455 documents, with duplicates identified by metadata comparison. Using data generated by our new sequence alignment package, TextPair, which identifies both similar passages and possibly duplicated documents, we further reduced the collection to 25,935 documents.
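TextPair's actual alignment works at a much larger scale than this, but the underlying idea of flagging candidate duplicates by pairwise text similarity can be sketched in a few lines of standard-library Python. The toy texts and the similarity threshold below are our own illustration, not TextPair's algorithm or settings:

```python
from difflib import SequenceMatcher

def near_duplicates(docs, threshold=0.9):
    """Return pairs of documents whose character-level similarity
    exceeds the given threshold (candidate duplicates)."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            ratio = SequenceMatcher(None, docs[i], docs[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, ratio))
    return pairs

# Hypothetical pamphlet titles: the second is a near-identical reprint.
docs = [
    "Adresse de la Société des amis de la Constitution",
    "Adresse de la Societé des amis de la Constitution",
    "Rapport sur les finances de la nation",
]
dupes = near_duplicates(docs)  # flags the (0, 1) pair only
```

In practice, duplicate detection at this scale also has to tolerate OCR noise and differing front matter, which is why an alignment-based approach that compares shared passages, rather than whole strings, is required.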

We currently have three entry points to the collection. The basic component underlying the whole system is PhiloLogic, our corpus query engine, which houses the word index, the structure, and the metadata of the collection.

To facilitate the discovery of documents relevant to search queries, we added a ranked-relevance engine, Whoosh, built on top of the PhiloLogic index.
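Whoosh scores hits with BM25F by default. As an illustration of the idea behind ranked relevance, here is a minimal pure-Python BM25 ranker; the toy documents, tokenization, and parameter values are our own sketch, not the FRC configuration:

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents against a query using the BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for i, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if df[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, i))
    return sorted(scores, reverse=True)

docs = [
    "la nation et la liberté",
    "les finances de la nation",
    "le roi et la cour",
]
ranking = bm25_rank(["nation", "liberté"], docs)
# The first document matches both query terms and ranks highest.
```

Unlike a plain Boolean search, which only reports whether a document matches, this kind of scoring orders the hits so that the most query-relevant documents surface first, which matters in a collection of tens of thousands of pamphlets.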

Finally, as an additional way of exploring the topics and discourses that run through the FRC, we built a topic-modeling browser called TopoLogic, which also leverages the PhiloLogic instance.

While all three systems have specific capabilities and reporting features and function as discrete units, they share a single data feed (built from the PhiloLogic index) and are thus designed to be interoperable, providing links across one another. It is our belief that there is no all-encompassing algorithmic approach to text analysis: topic modeling provides one view that may be worth exploring, but no more so than other methods.

TopoLogic is the latest entry in our quest to build value-added services on top of the standard PhiloLogic index, and it leverages topic-modeling techniques to offer an alternate way of exploring text collections. Topic modeling, the algorithmic technique behind this new navigational tool, is an unsupervised machine learning approach designed to facilitate the exploration of large collections of texts for which no topical information is provided. As such, this computational method can be a truly useful way of gaining a sense of the topical structure of a corpus -- i.e. of finding out what's in there -- and of how words cluster together to form meaningful discourses.

TopoLogic builds upon the topics and semantic fields generated by the algorithm to provide a web-based navigation system which lets users explore topics and discourses across time, as well as word usage within different contexts. The interaction of the three different schemes allows the user to navigate between alternative ways of considering topics across the collection. The following slides are designed to give some idea of how users may navigate between topics, word searches and other capabilities provided by these different systems.

In our experience, there are a number of caveats to consider when using this algorithmic approach to text analysis. First, while topic modeling is able to uncover relationships between words and documents without a training corpus (hence its unsupervised nature), it does require a certain number of priors, such as the number of topics to uncover, in order to function. In other words, the user of such a method needs to determine (through trial and error) what he or she deems the most meaningful representation of the corpus. Our experience has shown us that slight changes in the underlying texts (such as adding or removing a couple of texts), or in the preprocessing steps (such as removing additional function words), can lead to drastically different results. All in all, we have always taken a very measured approach to interpreting topic models, and we strongly discourage relying on them as the sole basis for text analysis.
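The "number of topics" prior mentioned above is concrete in any standard implementation. In scikit-learn's LatentDirichletAllocation, for instance, it is the `n_components` parameter, which the analyst must fix before fitting; the four-document toy corpus below is our own illustration, not the FRC preprocessing pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two documents about finance, two about war.
docs = [
    "assignats finances impôts trésor dette",
    "armée soldats guerre bataille frontière",
    "finances impôts assignats dette trésor",
    "guerre armée bataille soldats frontière",
]

# Build a document-term matrix of raw token counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# n_components is the prior the user must choose: change it
# (or the corpus, or the preprocessing) and the topics change.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # one topic distribution per document
```

Rerunning this with `n_components=3`, or after dropping a single document, can reshuffle which words cluster together, which is precisely why we treat any single topic model as one provisional view of the corpus rather than a ground truth.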

The systems complement each other by providing checks on the results of particular functions. For example, in slide X above, we present the top 50 documents for topic 19 as measured by topic weight. Using a ranked-relevance search for the top 10 tokens of topic 19, we arrive at a rather different list. The differences are due to the interaction of weighting schemes and relevancy measures: both are useful approaches, but they do, by design, deliver somewhat different results.

It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.

Clovis & Mark