Topic Models in the Intertextual Hub



ARTFL’s NEH-funded Intertextual Bridges project is an effort to facilitate distant and close readings across a large, heterogeneous set of collections of 18th-century French documents. These range from Revolutionary pamphlets and newspapers to the great works of the Enlightenment in the original French, as well as translations of many English texts. This post and the associated slide show (see below) provide an overview of the many ways in which we use topic models to search and navigate the collections. In two previous blog posts, Tracing Revolutionary Discourses and Modeling Revolutionary Discourse, we provided an overview of some of the development and implementation work and offered some initial observations arising from our use of topic models in this effort. While the descriptions of procedures and implementation in both posts are reasonably current, we have made significant progress in the intervening months. Our discussion of Topic Models in this post thus builds upon those earlier posts.

The Intertextual Hub (https://intertextual-hub.uchicago.edu/) makes extensive use of Topic Models to provide search services, analytics, and one form of document navigation[1]. This is an extension of the TopoLogic package, which functions as an add-on to ARTFL's PhiloLogic4 text analysis system. Topic Models are generated by invoking the ARTFL Text Preprocessing Library (ATPL) to extract metadata and word data from the standard representations generated by PhiloLogic4, which allows us to use PhiloLogic4 services to support navigation back to the text. The ATPL can treat files either as entire documents or as collections of sub-units, depending on the available markup, and has a variety of NLP, normalization, and other parameters that can be adjusted for tasks such as Topic Modeling. For Hub Topic Models, we use modernized unigram nouns longer than 2 letters. These are passed to the TopoLogic generator, which supports another layer of vector parameters, typically using NMF vectors with TF-IDF weightings. For the primary topic model in the Hub, we chose 150 topics across all of the collections, which seems to give the best balance between reasonably coherent topics and the number of obscure or meaningless ones. In addition, we generated two Topic Models of 100 topics each, using the same parameters, from documents dated 1700-1788 and 1789-1799 respectively, which we believe will facilitate exploration of topics from each period.
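The word filtering and TF-IDF weighting described above can be illustrated with a minimal sketch. This is not the ATPL or TopoLogic code: the function names, the toy documents, and the particular TF-IDF formula (term frequency times log(N / document frequency)) are assumptions made for illustration, and the noun selection, modernization, and NMF steps are not shown.

```python
import math
import unicodedata
from collections import Counter


def strip_accents(word):
    """Remove diacritics, e.g. 'pétition' -> 'petition'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def filter_tokens(tokens, min_length=3):
    """Lowercase, strip accents, and keep alphabetic words longer than 2 letters."""
    cleaned = (strip_accents(t.lower()) for t in tokens)
    return [t for t in cleaned if len(t) >= min_length and t.isalpha()]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical toy documents, already tokenized.
raw_docs = [
    ["citoyen", "patrie", "pétition", "de", "la", "commune"],
    ["pont", "canal", "ingénieur", "de", "la", "rivière"],
    ["citoyen", "élection", "scrutin", "suffrage"],
]
docs = [filter_tokens(d) for d in raw_docs]
weights = tf_idf(docs)
```

In this toy example, a term that occurs in only one document (patrie) receives a higher weight than one shared by two documents (citoyen), which is the property that makes TF-IDF weighting useful as input to a factorization such as NMF.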

It is important to note that tuning Topic Models depends on the selection and application of a large number of parameters, from the number of topics to the choice of words, each of which can significantly change the nature of the resulting topics. These judgements are based to a certain degree on what we expect to observe. For example, a topic whose most heavily weighted terms are "citoyen patrie petition commune concitoyen secours moyen defenseur arrete magistrat" (accents removed) is, quite reasonably, found to be most heavily weighted during the years of the Revolution, as shown in the graph. This reliance on expected results, even though they may be perfectly reasonable, does point to a significant limitation of the approach. Topic Models are extremely useful heuristics that can help summarize and navigate the contents of large collections, but they should be used with due care, since parameter selection can skew the results.
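A distribution of a topic over time, such as the one described above, can be approximated by averaging the topic's weight over the documents of each year. The following is only a sketch under stated assumptions: the record layout, the function name, and the toy values are hypothetical, and the Hub may compute its graphs differently.

```python
from collections import defaultdict


def mean_topic_weight_by_year(records, topic_id):
    """Average the weight of one topic over all documents of each year.

    records: iterable of (year, {topic_id: weight}) pairs, one per document.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, topic_weights in records:
        # A document with no weight recorded for the topic counts as 0.
        totals[year] += topic_weights.get(topic_id, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Hypothetical documents: (year, topic weights keyed by topic number).
records = [
    (1762, {12: 0.0, 40: 0.5}),
    (1762, {40: 0.25}),
    (1791, {12: 0.5, 40: 0.0}),
    (1791, {12: 0.25}),
]
by_year = mean_topic_weight_by_year(records, topic_id=12)
```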

The Intertextual Hub offers several ways to use Topic Models. From the top down, as it were, users can navigate the collections starting from topics, or select the top weighted terms from any of the 150 topics and restrict the search by any available bibliographic data (dates, authors, collections, etc.). This returns a list of documents (which may be parts of documents or entire texts, depending on the available encoding) ordered by relevance to the query. Just as important, however, is the ability to identify the most important topics for any document and to find other texts that share the same topic distributions, which is another way to measure how similar documents are.
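One common way to compare topic distributions is cosine similarity. The post does not say which measure the Hub uses, so the sketch below is an assumption for illustration; the document identifiers and the three-topic vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no topic weight matches nothing
    return dot / (norm_a * norm_b)


def most_similar(target_id, distributions, top_n=5):
    """Rank all other documents by similarity to the target document."""
    target = distributions[target_id]
    scored = [
        (doc_id, cosine_similarity(target, vector))
        for doc_id, vector in distributions.items()
        if doc_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical topic distributions over three topics.
distributions = {
    "doc_a": [0.8, 0.1, 0.1],
    "doc_b": [0.7, 0.2, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
}
ranking = most_similar("doc_a", distributions)
```

Here doc_b ranks above doc_c for doc_a, because both doc_a and doc_b place most of their weight on the first topic.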



As shown in the last few slides above, we have included two 100-topic Models derived using the same parameters, one from documents predating the Revolution and one from documents of 1789-1799. Both are full installations of TopoLogic and are not directly linked to the Intertextual Hub. Users may copy topic words from one Model and apply them to the full set of documents using the Search and Retrieval functions of the Hub. Some topics, such as 77 from the Revolutionary Model (pont, canal, ingenieur, navigation, riviere, chaussee, travail, construction, reparation, devis), are probably not significantly different from ancien régime considerations. Other topics, however, are more clearly identified with Revolutionary concerns. Topic 46 of the Revolutionary 100 (election, scrutin, nomination, electeur, suffrage, majorite, liste, membre, votant, pluralite) reflects contemporary concerns. Searching for this list of words in documents from 1700-1787 (run search) returns an interesting list of documents, the first six of which are chapters from La Rochefoucauld's Constitutions des treize États-Unis de l'Amérique (1783).


Running one's eye down the list of documents suggests that the discourse regarding elections found its origins in a number of examples from England, the emerging US states, and some other European states. There is also an interesting mix of well-known names, such as Rousseau and Voltaire, authors who would become better known during the Revolution, such as Brissot, and numerous lesser-known writers.

The Intertextual Hub is designed to offer potentially interesting texts to consider. We employ Topic Models to provide granular search across the collections as well as to point to similar documents based on the current context. Finally, we can track topics derived from documents of a later period back to earlier instances, potentially revealing connections that can offer new evaluations of these texts.



Notes

[1] There is an extensive literature on the use of topic models in digital humanities including JDH 2012.  




Reading the Bibliothèque de l'homme public in the Hub


The Intertextual Hub (https://intertextual-hub.org/) is an NEH-funded project to develop a reading environment that aims to situate specific documents in their broader context of intertextual relations, whether in the form of direct or indirect borrowings, shared topics with other texts or parts of texts, or other kinds of lexical similarity. Relationships discovered by text mining algorithms among texts in large, heterogeneous collections can fruitfully inform and guide traditional close-reading approaches.


The document collections in the Intertextual Hub can be approached in several ways. Viewed from the top, or most abstract, level, one may search the entire set of collections for specific topics or themes (see related discussion). What follows here is an examination of a specific document or set of documents from, as it were, the bottom up. Using the Bibliothèque de l’homme public (BHP) as a point of departure, we are interested in aspects of reading the document which include:
  • similar passage identification, such as reuses, citations, paraphrasing,
  • identification of similar chapters, parts and selections, and,
  • thematic and semantic relationships between documents. 
All of these relationships are established from wider patterns identified by techniques generally known as distant reading. The slides shown below present a step-by-step itinerary of how one can navigate in the Hub starting from a single document.

The BHP was published between February 1790 and April 1792 by Condorcet and several others, spanning some 28 tomes. The full title gives an indication of the nature of the project: Bibliothèque de l'homme public et Analyse raisonnée des principaux ouvrages français et étrangers sur la politique en général, la législation, les finances, la police, l'agriculture et le commerce en particulier, et sur le droit naturel et public (BNF Link).
It was one of numerous efforts by Condorcet to contribute to public instruction, alongside a number of other pieces he published, most notably his Cinq Mémoires sur l'instruction publique (1791) and the discussion of Smith referenced below. As Tourneux notes, however, his role was not clearly defined:
 
Barbier l'attribue à l'abbé Balestrier de Canilhac, dont le nom ne figure ni sur les titres, ni dans les avant-propos. Celui de Peyssonnel disparait au tome VI et Condorcet est seul nommé à partir du tome XI. Ce recueil, qui avait pour but de mettre autant que possible la science du gouvernement et de l'administration à la portée de tout le monde.... (Tourneux, Vol 2 p. 648).

While the BHP was aimed at educating and raising the awareness of newly minted French citizens by publishing the "analysis of well-known works, both ancient and modern" (Faccarello-Steiner 2002, p. 82), it was not always well received, as noted in the Journal des révolutions (1790, VII, p. 9-10, link):

Bibliothèque de l'homme public, par MM. de Condorcet, Chapelier et Peyssonnel ; le premier n'y travaillera point, le second n'y travaillera guère ; le dernier est vieux et cacochyme, il est froid et lent, deux qualités que n'avaient point Bayle, le Clerc et l'abbé Prévost.

It featured extended discussions of and extracts from numerous French, English, and classical authors, including major figures such as Aristotle, Machiavel, Bodin, Hobbes, Locke, Smith, Montesquieu, and Hume, as well as contemporary figures such as Mirabeau and Raynal and lesser-known authors such as Guicciardini. While generally expository, not all of the discussions were intended to be positive:

La vivacité naturelle à l'esprit françois, l'économie du tems , l'ennui qu'entraîne un long ouvrage sur des matières, aussi sérieuses, le caractère national, tout concourt à nous faire adopter la méthode Analytique. [...]  On fera connoître aussi tous les ouvrages relatifs à ce plan, à mesure qu'ils paroîtront: on se permettra même des réflexions critiques, sans toutefois blesser l'amour-propre des auteurs: la malignité aigrit, & n'éclaire pas mieux qu'elle ne corrige.  (Bib homme public, 1790, vol 1 pp. vi & viii)
Smith's Wealth of Nations, for example, is extensively covered, taking up some 220 pages of the BHP. Diatkine (1993) argues that the summary is "very inaccurate", going on to suggest:
[T]he summary published by Bibliotheque de l'Homme Public is the Wealth of Nations minus the 'Invisible Hand'. This shortcoming is too systematic to be attributed to a casualness of approach or to technical difficulties. We are in the presence [of a] paradox: here is a book which seems to be very important, yet completely misunderstood. (pp 219-220)
The BHP is a highly intertextual collection with a significant number of direct and indirect references to a large number of major authors as well as relatively minor texts. It reflects a distillation and selection of late Enlightenment views on the nature of government and society. Reading the BHP in the context of the Intertextual Hub allows one to navigate this collection with an eye to the intellectual inheritance of its authors and texts as well as the influence they later had during the Revolution.






There are, of course, a great number of texts in the collections deployed in the Intertextual Hub that contain many borrowed, reused, or paraphrased passages that can be identified. For example, the two-volume Les délassemens d'un homme d'esprit, ou nouveau recueil de pensées amusantes, extraites des meilleurs auteurs (1780) is made up of numerous extracts (link to search) organized by theme or subject, such as chapters on SPECTACLES and JALOUSIE.

This post will be followed by others which, we hope, will outline the various search and navigation facilities of the Intertextual Hub, with a focus on step-by-step itineraries from specific starting points.

Please do post comments below or email us at artfl@artfl.uchicago.edu.  

References

Diatkine D. (1993), "A French Reading of the Wealth of Nations in 1790". In: Mizuta H., Sugiyama C. (eds) Adam Smith: International Perspectives. Palgrave Macmillan, London.  (DOI)

Faccarello, Gilbert and Steiner, Philippe. 2002. The diffusion of the work of Adam Smith in French Language. In Tribe, Keith (ed.), A Critical Bibliography of Adam Smith, London, Pickering and Chatto, pp. 61-119 (link)

Tourneux, M., Bibliographie de l'histoire de Paris pendant la Révolution française, Paris 1890-1913 (BNF)





