Tracing Revolutionary Discourses

In our previous blog post in this series, Modeling Revolutionary Discourse, we outlined the integration of various analytic services and entry points to one of the collections -- the French Revolutionary Collection (FRC) -- we are using as part of ARTFL’s NEH-funded Intertextual Bridges project. This provided three distinct ways to approach the richness of the Newberry Library collection: through PhiloLogic4 search and analysis capabilities, through our new TopoLogic instance, and via a ranked relevance retrieval model. We demonstrated the utility of different models of access and analysis, and ways that combining their results could be used to pose different kinds of questions. For example, using lists of topic words as the basis of a ranked relevance search can reveal unexpected relationships between documents and discourses.

The Intertextual Bridges project is based on building ways to visualize and navigate relationships between disparate sets of collections. For this project, we have started with seven different collections, representing a wide array of documentary materials concerning the French Revolution. These include the Newberry FRC, the Archives Parlementaires (AP), the Baudouin Collection of Revolutionary Laws, the Journaux de Marat, as well as 18th century holdings from the ARTFL Frantext Collection, the Goldsmith-Kress Collection, and French holdings of ECCO. The collections differ from each other in important ways and require specific search and retrieval schemes to allow for proper handling. The individual speakers of the AP are searchable as part of particular sessions, whereas the Newberry does not have such data identified. Simply building all of the collections into one database instance would reduce the analytic capabilities to the lowest common denominator. Proper collection integration requires initial builds reflecting the specifics of each dataset, followed by abstraction to a top level interface.


The first stage of database integration is the development of a top level search and retrieval scheme. For this preliminary work, we built each of the target collections as a separate PhiloLogic4 instance. We then used the ARTFL Text Preprocessing Library to extract metadata and word data from the standard representations generated by PhiloLogic4. This allows us to use PhiloLogic4 services to support navigation back to the text. The data extraction program allows the treatment of files as either entire documents or as collections of sub-units, depending on the available data markup. The FRC, for example, does not have internal subdivisions, so each document is treated as one text element. By contrast, the Revolutionary Laws are tagged with divisions reflecting specific laws and other elements. The Frantext selections and ECCO selections are typically divided into chapters. Indexing and accessing text elements at this finer level significantly improves search and retrieval tasks.
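The choice between whole documents and sub-units can be pictured with a short Python sketch. This is only an illustration of the idea, not the ARTFL Text Preprocessing Library's interface: the record layout (`id`, `title`, `text`, `divisions`) and the names `TextUnit` and `text_units` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class TextUnit:
    """One indexable element: a whole document or one of its divisions."""
    doc_id: str
    unit_id: str
    title: str
    head: Optional[str]  # heading of the division, None for whole documents
    text: str


def text_units(doc: dict) -> Iterator[TextUnit]:
    """Yield one unit per marked division, or the whole document if none exist.

    `doc` is a hypothetical record with keys "id", "title", "text" and
    "divisions", the last being a (possibly empty) list of
    {"head": ..., "text": ...} dictionaries.
    """
    divisions = doc.get("divisions") or []
    if not divisions:
        # No internal markup (the FRC case): one text element per document.
        yield TextUnit(doc["id"], doc["id"], doc["title"], None, doc["text"])
        return
    for n, div in enumerate(divisions, start=1):
        # Marked divisions (laws, chapters, sessions) become separate units.
        yield TextUnit(doc["id"], f"{doc['id']}.{n}", doc["title"],
                       div["head"], div["text"])


# Made-up sample records, for illustration only.
pamphlet = {"id": "frc1", "title": "Adresse", "text": "citoyens ...", "divisions": []}
laws = {
    "id": "baudouin7",
    "title": "Collection des lois",
    "text": "",
    "divisions": [
        {"head": "Décret A", "text": "prix des grains"},
        {"head": "Décret B", "text": "police des marchés"},
    ],
}
units = [u for d in (pamphlet, laws) for u in text_units(d)]
print([u.unit_id for u in units])
```

Each unit keeps the identifier of its parent document, which is what would let a search result point back to the right place in the full text.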


For the purposes of our prototype, we are using the Python Whoosh indexing and search library. We expect to move to a more scalable ranked-relevance search engine for the final product. We have released an instance of our Whoosh-based search tool at:

     https://artflsrv03.uchicago.edu/mark/hub/multipledb.whoosh.html
The search form allows the user to input a list of terms to find and to limit results to the specific collections and/or to time periods.  Results are ordered by a standard relevancy calculation and we have appended a simple count of authors and titles at the bottom of the report. Note that we have turned links to the full text off at this time, since the underlying PhiloLogic4 instances are on an internal research machine which we expect to be updating in the future.  A full implementation will have full links to the documents and other functions, such as TopoLogic, as outlined in our previous blog post.  
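To make the behaviour of such a search form concrete, here is a minimal standard-library sketch of ranked retrieval with collection and date limits and a final tally of authors. It uses a simplified BM25 formula as one common relevancy calculation; that the deployed instance scores this way is an assumption, and the function `bm25_search`, its parameters, and the record keys are all hypothetical rather than Whoosh's API.

```python
import math
from collections import Counter


def bm25_search(units, query, collections=None, before=None,
                limit=100, k1=1.2, b=0.75):
    """Rank units matching ANY query term (OR semantics) with a basic BM25 score.

    `units` is a list of dicts with hypothetical keys:
    "collection", "author", "title", "year", "text".
    """
    # Apply the metadata limits first, then score only what remains.
    pool = [
        u for u in units
        if (collections is None or u["collection"] in collections)
        and (before is None or u["year"] < before)
    ]
    if not pool:
        return []
    tokens = [u["text"].lower().split() for u in pool]
    avg_len = (sum(len(t) for t in tokens) / len(pool)) or 1.0
    terms = set(query.lower().split())
    # Document frequency of each query term within the filtered pool.
    df = {t: sum(1 for toks in tokens if t in toks) for t in terms}

    scored = []
    for unit, toks in zip(pool, tokens):
        tf = Counter(toks)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (len(pool) - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, unit))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Made-up sample units, for illustration only.
units = [
    {"collection": "laws", "author": "Convention", "title": "Décret A",
     "year": 1793, "text": "prix des grains et farines"},
    {"collection": "frc", "author": "Anonyme", "title": "Adresse",
     "year": 1790, "text": "la liberte et la nation"},
    {"collection": "frc", "author": "Anonyme", "title": "Sur le pain",
     "year": 1789, "text": "le pain le prix du pain et le bled"},
]
hits = bm25_search(units, "grain prix pain bled", collections={"frc", "laws"})
for score, unit in hits:
    print(f"{score:.3f}  {unit['title']}")

# Simple count of authors among the results, as appended to the report.
author_counts = Counter(unit["author"] for _, unit in hits)
print(author_counts.most_common())
```

Note that the sketch matches exact word forms only, so the query word "grain" does not match "grains" in the sample data. A production engine would normally handle tokenization and normalization more carefully.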


For the query "grain subsistance recolte marche farine quantite pain denree prix bled" the search will return many results, displaying the first 100 instances by default and showing the relevance score of each document as well as optional snippets, as shown on the left. The snippets may be omitted from the report, which then generates a list of corresponding documents. The search will retrieve and score subsections of documents, such as chapters or sessions, in the same way as entire documents. On the right one finds the continuation of the query for "grain subsistance...". Limiting the query to the Revolutionary Laws collection will find specific laws on this subject, such as "Décret sur la police du commerce des grains l'approvisionnement des marchés des armées. Du 7 vendémiaire" of Year IV followed by (again in order of relevance to the query words) "Décret qui fixe un maximum du prix des grains, farines et fourrages, et prononce des peines contre l'exportation. [11-9-1793]".

Ranked relevance retrieval across multiple collections is a useful way to identify documents and passages of interest. We are also finding that combining this type of query with word vectors representing Revolutionary topics is a powerful tool for tracing aspects of Revolutionary discourses to often unexpected sources. We have included two topic models generated from the 26,000-document Newberry French Revolution collection. As described earlier, topic models are unsupervised techniques for identifying topics in collections of documents. Topic models identify the topic mix for every document in a collection, as well as lists of weighted words that are associated with each topic. The TopoLogic instance of the 50 topic model can be found at:
     https://artflsrv03.uchicago.edu/topic-modeling-browser/frc1787_99/



We have included the top ten words in each topic with a link to the ranked relevance search for that topic across all of the collections. Clicking on Search will query the words in this list against the Whoosh database. The parameters are set to display the top 200 documents or sections from the entire collection. We have also included the same data for a 100 topic model instance (click here). No single topic model can properly capture the complexity of Revolutionary discourses. Comparing the lists of 50 and 100 topics, you will find that some topics are complementary, while others emerge only in the 100 topic model.
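Turning a topic into a search is a small step: take the highest-weighted words and join them into a query string. The sketch below assumes a topic is available as a mapping from word to weight. The function name `topic_query` and the weights shown are made up for illustration and do not come from the actual models.

```python
def topic_query(word_weights, n=10):
    """Return the n highest-weighted words of a topic as a space-separated query."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return " ".join(word for word, _ in ranked[:n])


# Hypothetical weights; only the words themselves appear in the post.
topic = {
    "culte": 0.043,
    "religion": 0.051,
    "eglise": 0.030,
    "pretre": 0.038,
    "dieu": 0.027,
}
query = topic_query(topic, n=3)
print(query)
```

The resulting string can then be pasted into the search box, or passed to whatever ranked-relevance engine is in use.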

While the static searches (clicking on Search with the set parameters) are useful, we recommend that you examine topics in more detail.  You can block copy the words from any of the topics to the search box and set the parameters as you see fit.  We have included on the search form one example.  This is a query for the words of Topic 4 (in the 50 topic model) "constitution pouvoir droit liberte nation peuple autorite homme principe propriete" in documents published before 1789, using the OR operator, and displaying the top 500 instances (click here to run this search).  This will return a list of documents or sections from pre-Revolutionary sources as shown on the right, led off by a translation of David Ramsay's History of the American Revolution and including the state constitution of Massachusetts. Scrolling down to the list of authors, one finds an interesting list of expected and rather unexpected authors including:


  • Du Buat, M. le comte (Louis-Gabriel): 21
  • Mirabeau, Victor de Riquetti, marquis de: 20
  • Holbach, Paul Henri Thiry, baron d': 15
  • De Lolme, Jean Louis: 14
  • Helvetius: 13
  • Chamfort, Sébastien Roch Nicholas: 11
  • Le Trosne, M. (Guillaume François): 10
  • Mirabeau, Gabriel-Honoré de Riquetti, comte de: 9
  • Le Mercier de La Rivière, Pierre-Paul: 8
  • Bodin, Jean: 8
  • Hume, David: 7
  • Brissot de Warville, J.-P. (Jacques-Pierre): 7
  • Franklin, Benjamin: 6
  • Condorcet, Jean-Antoine-Nicolas de Caritat, Marquis de: 6
Taking the words from Topic 43: "religion culte pretre eglise dieu fanatisme morale autel clerge divinite" and restricting the results to the 18th century holdings of ARTFL Frantext reveals the strong showing of Holbach (accounting for seven of the top ten most relevant sections) and Helvétius. The list of top titles, recalling that sections are counted individually, is also suggestive:


  • Lettres juives : 52
  • De l'homme : de ses facultés intellectuelles et de son éducation : 35
  • Essay sur l'hist. génèrale / Voltaire. : 25
  • Le christianisme dévoilé, ou, Examen des principes et des effets de la religion Chrétienne : 17
  • Dictionnaire philosophique : Comprenant les 118 articles parus sous ce titre du vivant de Voltaire, avec leurs suppléments parus dans les Questions sur l'Encyclopédie. : 15
  • Le comte de Valmont : 12
  • Système de la nature, ou, Des loix du monde physique du monde moral : 12
  • Voyage du jeune anacharsis : 11
  • Histoire critique de Jésus-Christ ou analyse raisonnée des Évangiles : 10
  • De la philosophie de la nature : 10
  • Les helviennes : 10
  • Les Incas, ou, La destruction de l'empire du Pérou : 10
  • La contagion sacrée ou Histoire naturelle de la superstition OU Tableau des effets que les opinions religieuses ont produits sur la terre. Tome I : 9
  • Le compère Mathieu : 8
  • Traité sur la tolérance : 8
Moving this time to the 100 topic model, we can look for traces of topic 80 "convention jugement mort royaute inviolabilite souverainete peine tyran crime depute" in pre-1789 texts. In essence, we are asking whether this topic on the tyrannical nature of the sovereignty of the king, so prevalent in revolutionary discourse, has any echoes in earlier texts. It is interesting to see in the results a mix of theoretical works (such as Bodin's De la république, or Pufendorf's Droit de la nature et des gens), historical accounts (Raynal's Histoire du parlement d'Angleterre, or Boulainvilliers's Etat de la France), and literary sources (Voltaire's Cromwell, or Mercier's L'an deux mille quatre cent quarante), thus providing researchers with a broad and diverse overview of discussions of this topic in the pre-revolutionary period.

In highlighting the possibility of using word vectors that emerge from topic models of Revolutionary discourses, we might be guilty of teleological readings of these earlier texts. This one approach is simply meant to demonstrate the possibility of combining mixtures of algorithms to propose unexpected texts of potentially related interest. As we move forward, we will be including topic models of the 18th century collections, to allow tracing of earlier topics into the Revolutionary era. This is another level of navigation that we believe will help guide researchers through large collections, providing access to smaller segments of text that are more tightly focused on specific issues and topics.


-- The ARTFL Team




Modeling Revolutionary Discourse


As part of our lead work on ARTFL’s NEH-funded Intertextual Bridges project, we are pleased to release a prototype build of the Newberry Library’s French Revolution Collection (FRC), which integrates topic model browsing and search, relevancy searching, and full PhiloLogic4 services in a set of interrelated functions. This post will describe the current state of this work, document some of the functionalities, and provide an outline of our next steps of development.

In 2017, the Newberry Library released digital copies of more than 35,000 pamphlets, totalling approximately 850,000 pages, from its extremely rich holdings related to the French Revolution. Shortly thereafter, the ARTFL Project released versions of this unparalleled resource under PhiloLogic4. In a subsequent post, we described the collection, some of the capabilities of this initial installation, and preliminary results using the tools deployed in this build.

We have two builds of the FRC under PhiloLogic. The first is simply a load of the entire collection of 38,377 documents as it was downloaded towards the end of 2017. We applied some error correction functions, which we recently modified slightly and applied to the installation (search form). The bulk of our work has been aimed at the FRC works from 1787-1799, with the aim of improving the data and metadata as well as removing duplicate documents. The 2017 release of the FRC at ARTFL contained 26,455 documents, where duplicates were identified by metadata comparison. Using data generated by our new sequence alignment package TextPair, which identified both similar passages and possibly duplicated documents, we further reduced the collection to 25,935 documents.
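As a rough illustration of how text overlap can flag possible duplicates, the following sketch compares documents by their shared three-word sequences (shingles) and reports pairs above a similarity threshold. This is a deliberately simple stand-in, not TextPair's sequence alignment method, and the function names, threshold, and sample texts are assumptions made for the example.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_duplicates(docs, threshold=0.8):
    """Return pairs of document ids whose shingle overlap meets the threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    return [
        (x, y) for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Made-up sample texts, for illustration only.
docs = {
    "a": "decret de la convention nationale sur le prix des grains",
    "b": "decret de la convention nationale sur le prix des grains",
    "c": "adresse aux citoyens sur la liberte de la presse",
}
pairs = likely_duplicates(docs)
print(pairs)
```

Comparing every pair is quadratic in the number of documents, so at the scale of the FRC a real pipeline would need indexing or alignment techniques rather than this brute-force loop.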

We currently have three entry points to the collection. The basic component which underlies the whole system is PhiloLogic, our corpus query engine, which houses the word index, the structure, and the metadata of the collection:
          https://artflsrv03.uchicago.edu/philologic4/frc1787-99rev2b/
To facilitate the discovery of documents relevant to search queries, we added on a ranked-relevance engine, called Whoosh, which is built on top of the PhiloLogic index:
          https://artflsrv03.uchicago.edu/mark/frc/frc1787-99.whoosh.html
Finally, as an additional way of exploring the topics and discourses that run through the FRC, we built a topic-modeling browser called TopoLogic, which also leverages the PhiloLogic instance:
          https://artflsrv03.uchicago.edu/topic-modeling-browser/frc1787_99/.
While all three systems have specific capabilities and reporting features and function as discrete units, because they share a single data feed model (built from the PhiloLogic index), they are designed to be interoperable, and hence provide links across one another. It is our belief that there is no all-encompassing algorithmic approach to text analysis, and that topic-modeling provides one view that may be worth exploring, but no more so than other methods.

TopoLogic is the latest entry in our quest to build value-added services on top of the standard PhiloLogic index, and leverage topic-modeling techniques to offer an alternate way of exploring text collections. Topic-modeling, the algorithmic technique which we use for this new navigational tool, is an unsupervised machine learning approach designed to facilitate the exploration of large collections of texts where no topical information is provided. As such, this computational method can be a truly useful way of gaining a sense of the topical structure of a corpus -- i.e. to find out what's in there -- and how words are clustered together to form meaningful discourses.

TopoLogic builds upon the topics and semantic fields generated by the algorithm to provide a web-based navigation system which lets users explore topics and discourses across time, as well as word usage within different contexts. The interaction of the three different schemes allows the user to navigate between alternative ways of considering topics across the collection. The following slides are designed to give some idea of how users may navigate between topics, word searches and other capabilities provided by these different systems.



In our experience, there are a number of caveats to consider when using this algorithmic approach to text analysis. First, while topic-modeling is able to uncover relationships between words and documents without a training corpus (hence its unsupervised nature), it does require a certain number of priors, such as the number of topics to uncover, in order to function. In other words, the user of such a method needs to determine (through trial and error) which representation of the corpus that user deems most meaningful. Our experience has shown us that slight changes in the underlying texts (such as adding or removing a couple of texts), or in the preprocessing steps (such as removing additional function words), can lead to drastically different results. All in all, we have always taken a very measured approach to our interpretation of topic models, and we strongly discourage relying upon them as the sole source for text analysis.

The systems complement each other by providing checks on the results of particular functions. For example, in slide X above, we present the top 50 documents for topic 19 as measured by topic weight. In using a rank relevancy search for the top 10 tokens for topic 19, we arrive at a rather different list. The differences are due to the interaction of weighting schemes and relevancy measures. Both are useful approaches, but do, by design, deliver somewhat different results.
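One simple way to quantify how far two such result lists diverge is to measure the overlap of their top entries. The sketch below does this with a Jaccard ratio; the function name `top_k_overlap` and the document ids are hypothetical and are not drawn from the actual topic 19 results.

```python
def top_k_overlap(ranking_a, ranking_b, k=50):
    """Jaccard overlap between the top-k items of two ranked lists of ids."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty lists are treated as identical
    return len(top_a & top_b) / len(top_a | top_b)


# Made-up rankings, for illustration only.
by_topic_weight = ["d1", "d2", "d3", "d4"]
by_relevance = ["d3", "d5", "d1", "d6"]
overlap = top_k_overlap(by_topic_weight, by_relevance, k=4)
print(round(overlap, 3))
```

A low overlap is not a sign that either list is wrong, only that the two weighting schemes are answering somewhat different questions.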

It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.

Clovis & Mark