Topic Models in the Intertextual Hub

ARTFL’s NEH-funded Intertextual Bridges project is an effort to facilitate distant and close readings across a large, heterogeneous set of collections of 18th-century French documents. These range from Revolutionary pamphlets and newspapers to the great works of the Enlightenment in the original French, as well as translations of many English texts. This post and the associated slide show (see below) provide an overview of the ways in which we use topic models to search and navigate the collections. In two previous blog posts, Tracing Revolutionary Discourses and Modeling Revolutionary Discourse, we provided an overview of some of the development and implementation work and offered some initial observations arising from our use of topic models in this effort. While the descriptions of procedures and implementation in both posts remain reasonably current, we have made significant progress in the intervening months. Our discussion of Topic Models in this post therefore builds upon those earlier posts.

The Intertextual Hub (https://intertextual-hub.uchicago.edu/) makes extensive use of Topic Models to provide search services, analytics, and one form of document navigation[1]. This functionality is an extension of the TopoLogic package, which functions as an add-on to ARTFL's PhiloLogic4 text analysis system. Topic Models are generated by invoking the ARTFL Text Preprocessing Library (ATPL) to extract metadata and word data from the standard representations generated by PhiloLogic4. This allows us to use PhiloLogic4 services to support navigation back to the text. The ATPL supports the treatment of files either as entire documents or as collections of sub-units, depending on the available data markup, and has a variety of NLP, normalization, and other parameters that can be adjusted for tasks such as Topic Modeling.

For Hub Topic Models, we use modernized unigram nouns longer than 2 letters. These are passed to the TopoLogic generator, which supports another layer of vector parameters, typically using NMF vectors with TF-IDF weightings. For the primary topic model in the Hub, we chose 150 topics across all of the collections, which seems to give the best balance between reasonably coherent topics and the number of obscure or meaningless ones. In addition, we generated two Topic Models of 100 topics each, using the same parameters, based on documents from 1700-1788 and 1789-1799, which we believe will facilitate exploration of topics from each period.
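To make the TF-IDF plus NMF step more concrete, here is a minimal pure-Python sketch run on a toy corpus. It is not the ATPL or TopoLogic code: the function names, the particular TF-IDF variant, the multiplicative-update NMF, and the two-topic toy corpus are all assumptions chosen for illustration, and noun selection and modernization are assumed to have happened upstream.

```python
import math
import random

def keep_token(token):
    # Rule from the post: keep words longer than 2 letters. Noun selection
    # and modernization are assumed to have been done upstream.
    return len(token) > 2

def tfidf_matrix(docs):
    """Build a documents x terms matrix using one common TF-IDF variant."""
    filtered = [[t for t in doc if keep_token(t)] for doc in docs]
    vocab = sorted({t for doc in filtered for t in doc})
    n_docs = len(filtered)
    doc_freq = {t: sum(1 for doc in filtered if t in doc) for t in vocab}
    matrix = [[doc.count(t) * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t in vocab] for doc in filtered]
    return vocab, matrix

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Reconstruction error ||V - WH|| (Frobenius norm)."""
    wh = matmul(w, h)
    return math.sqrt(sum((v[i][j] - wh[i][j]) ** 2
                         for i in range(len(v)) for j in range(len(v[0]))))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)
    using multiplicative updates; both factors stay non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_terms)] for k in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_topics)] for i in range(n_docs)]
    return w, h

def top_terms(h, vocab, n=3):
    """Return the n most heavily weighted terms for each topic."""
    return [[vocab[j] for j in sorted(range(len(vocab)),
                                      key=lambda j: -row[j])[:n]]
            for row in h]

# Toy corpus (hypothetical documents built from words quoted in this post).
docs = [
    ["citoyen", "patrie", "petition", "commune", "de"],
    ["citoyen", "patrie", "magistrat", "la"],
    ["pont", "canal", "riviere", "navigation"],
    ["canal", "pont", "ingenieur", "chaussee"],
]
vocab, matrix = tfidf_matrix(docs)
doc_topics, topic_terms = nmf(matrix, n_topics=2)
for k, terms in enumerate(top_terms(topic_terms, vocab)):
    print(k, terms)
```

In practice a library implementation of NMF would be used on a corpus of this size; the sketch only shows how the token filter, the weighting, and the factorization fit together, and how the number of topics enters as a parameter.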

It is important to note that tuning Topic Models involves selecting and applying a large number of parameters, from the number of topics to which words to use, and that these choices change the nature of the resulting topics significantly. These judgements are based to a certain degree on what we expect to observe.

For example, a topic whose most heavily weighted terms are "citoyen patrie petition commune concitoyen secours moyen defenseur arrete magistrat" (accents removed) is, quite reasonably, most heavily weighted during the years of the Revolution, as shown in the graph. This reliance on expected results, even though they may be perfectly reasonable, does point to a significant limitation of the approach. Topic Models are extremely useful heuristics which can help summarize and navigate the contents of large collections, but they should be used with due care, as they can reflect parameter selection in ways that skew results.
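A topic-over-time curve of the kind described above can be approximated by averaging a topic's weight over the documents of each year. The following is a minimal sketch under that assumption; the function name, the data layout, and the weights are hypothetical, and the Hub's graph may well be computed differently.

```python
from collections import defaultdict

def mean_topic_weight_by_year(docs, topic_id):
    """Average the weight of one topic over all documents of each year.

    `docs` is a list of (year, topic_weights) pairs (hypothetical layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, weights in docs:
        totals[year] += weights[topic_id]
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Hypothetical weights for a two-topic model; topic 0 stands in for the
# "citoyen patrie petition ..." topic.
toy_docs = [
    (1785, [0.0, 1.0]),
    (1791, [0.25, 0.75]),
    (1791, [0.75, 0.25]),
    (1793, [1.0, 0.0]),
]
by_year = mean_topic_weight_by_year(toy_docs, topic_id=0)
print(by_year)
```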

The Intertextual Hub offers several ways to use Topic Models. Working from the top down, as it were, users can navigate the collections starting from topics, or select the top weighted terms from any of the 150 topics and restrict the search by any available bibliographic data (dates, authors, collections, etc.). This returns a list of documents (which may be parts of documents or entire texts, depending on the available encoding) ordered by relevance to the query. Just as important, however, is the ability to identify the most important topics for any document and to find other texts that share the same topic distributions, which is another way to measure how similar the documents are.
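The document-level view can be sketched as follows: given each document's topic distribution, list its top topics and rank other documents by how close their distributions are. Cosine similarity is used here as an assumption, since the measure the Hub actually uses is not specified in this post; the document identifiers and weights are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_topics(distribution, n=3):
    """Indices of the n most heavily weighted topics for one document."""
    order = sorted(range(len(distribution)),
                   key=lambda k: distribution[k], reverse=True)
    return order[:n]

def most_similar(doc_id, topic_dists, n=5):
    """Rank the other documents by similarity of topic distribution."""
    target = topic_dists[doc_id]
    scored = [(other, cosine(target, dist))
              for other, dist in topic_dists.items() if other != doc_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical three-topic distributions for three documents.
topic_dists = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.05, 0.15, 0.8],
}
print(top_topics(topic_dists["doc_a"], n=2))
print(most_similar("doc_a", topic_dists))
```

Other measures over distributions, such as Jensen-Shannon divergence, could be swapped in for `cosine` without changing the rest of the sketch.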

As shown in the last few slides above, we have included two 100-topic Models derived using the same parameters, one from documents predating the Revolution and one from documents dated 1789-1799.

Pre-Revolutionary 100: https://intertextual-hub.uchicago.edu/topologic/prerev100
Revolutionary 100: https://intertextual-hub.uchicago.edu/topologic/rev100/

These are both full installations of TopoLogic and are not directly linked to the Intertextual Hub. Users may copy the topic words from one Model and apply them to the full set of documents using the Search and Retrieval functions of the Hub. Some topics, such as topic 77 from the Revolutionary Model (pont, canal, ingenieur, navigation, riviere, chaussee, travail, construction, reparation, devis), probably do not differ significantly from ancien régime concerns. Other topics, however, are more clearly identified with Revolutionary concerns. Topic 46 of the Revolutionary 100 (election, scrutin, nomination, electeur, suffrage, majorite, liste, membre, votant, pluralite) reflects contemporary concerns. Searching for this list of words in documents from 1700-1787 (run search) returns an interesting list of documents, the first six of which are chapters from La Rochefoucauld's Constitutions des treize États-Unis de l'Amérique (1783).

Running one's eye down the list of documents suggests that the discourse regarding elections found its origins in a number of examples from England, the emerging US states, and some other European states. There is also an interesting mix of well-known names such as Rousseau and Voltaire, authors who would become better known during the Revolution such as Brissot, and numerous lesser-known writers.
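The kind of search described above, in which the top words of a topic from the Revolutionary Model are applied to earlier documents, can be sketched as a date-restricted ranking. This is a minimal illustration only: the scoring (share of a document's tokens that are topic words), the function name, and the toy documents are assumptions, not the Hub's actual relevance ranking or data.

```python
from collections import Counter

def search_topic_words(documents, topic_words, start_year, end_year):
    """Return titles of documents in the date range, ordered by the share
    of their tokens that belong to the topic word list."""
    query = set(topic_words)
    results = []
    for doc in documents:
        if not (start_year <= doc["year"] <= end_year):
            continue  # bibliographic restriction: outside the date range
        counts = Counter(doc["tokens"])
        hits = sum(counts[word] for word in query)
        if hits:
            results.append((hits / len(doc["tokens"]), doc["title"]))
    return [title for _, title in sorted(results, reverse=True)]

# Top words of topic 46 of the Revolutionary 100, as quoted in this post.
topic_46 = ["election", "scrutin", "nomination", "electeur", "suffrage",
            "majorite", "liste", "membre", "votant", "pluralite"]

# Hypothetical documents, already reduced to token lists.
documents = [
    {"title": "hypothetical text A", "year": 1783,
     "tokens": ["election", "suffrage", "electeur", "loi"]},
    {"title": "hypothetical text B", "year": 1762,
     "tokens": ["election", "roi", "peuple", "loi"]},
    {"title": "hypothetical text C", "year": 1791,
     "tokens": ["scrutin", "election", "votant"]},
    {"title": "hypothetical text D", "year": 1750,
     "tokens": ["pont", "canal"]},
]
results = search_topic_words(documents, topic_46, 1700, 1787)
print(results)
```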

The Intertextual Hub is designed to offer potentially interesting texts to consider. We employ Topic Models to provide granular search across the collections as well as to point to similar documents based on the current context. Finally, we can trace topics derived from documents of a later period back to earlier instances, potentially revealing connections that can offer new evaluations of these texts.

Notes

[1] There is an extensive literature on the use of topic models in digital humanities, including JDH 2012.

ARTFL Project Research Blog
