In our previous blog post in this series, Modeling Revolutionary Discourse, we outlined the integration of various analytic services and entry points to the French Revolutionary Collection (FRC), one of the collections we are using as part of ARTFL’s NEH-funded Intertextual Bridges project. This provided three distinct ways to approach the richness of the Newberry Library collection: through PhiloLogic4 search and analysis capabilities, through our new TopoLogic instance, and via a ranked relevance retrieval model. We demonstrated the utility of different models of access and analysis and showed how combining their results can be used to pose different kinds of questions. For example, using lists of topic words as the basis of a ranked relevance search can reveal unexpected relationships between documents and discourses.
The Intertextual Bridges project is based on building ways to visualize and navigate relationships between disparate sets of collections. For this project, we have started with seven different collections, representing a wide array of documentary materials concerning the French Revolution. These include the Newberry FRC, the Archives Parlementaires (AP), the Baudouin Collection of Revolutionary Laws, the Journaux de Marat, as well as 18th-century holdings from the ARTFL Frantext Collection, the Goldsmith-Kress Collection, and French holdings of ECCO. The collections differ from each other in important ways and require specific search and retrieval schemes to allow for proper handling. The individual speakers of the AP, for example, are searchable as part of particular sessions, whereas the Newberry collection does not have such data identified. Simply building all of the collections into a single database instance would reduce the analytic capabilities to the lowest common denominator. Proper collection integration requires initial builds reflecting the specifics of each dataset, followed by abstraction to a top-level interface.
The first stage of database integration is the development of a top-level search and retrieval scheme. For this preliminary work, we built each of the target collections as a separate PhiloLogic4 instance. We then used the ARTFL Text Preprocessing Library to extract metadata and word data from the standard representations generated by PhiloLogic4. This allows us to use PhiloLogic4 services to support navigation back to the text. The data extraction program can treat files either as entire documents or as collections of sub-units, depending on the available data markup. The FRC, for example, does not have internal subdivisions, so it is treated as one text element per document. By contrast, the Revolutionary Laws are tagged with divisions reflecting specific laws and other elements. The Frantext and ECCO selections are typically divided into chapters. Indexing and accessing these smaller text elements significantly improves search and retrieval tasks.
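The choice between document-level and division-level text units can be sketched as follows. This is only a minimal illustration and not the ARTFL Text Preprocessing Library's API: the `TextUnit` record, the `to_text_units` function, and the dictionary layout of the input are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextUnit:
    """One indexable unit: a whole document or one of its divisions."""
    unit_id: str
    collection: str
    title: str
    head: str          # division heading; empty for whole-document units
    words: List[str]


def to_text_units(doc: Dict) -> List[TextUnit]:
    """Split a document into its divisions when the markup provides them;
    otherwise return the whole document as a single unit."""
    divisions = doc.get("divisions") or []
    if not divisions:
        # No internal subdivisions (the FRC case): one unit per document.
        return [TextUnit(doc["id"], doc["collection"], doc["title"], "", doc["words"])]
    # Tagged divisions (laws, chapters, sessions): one unit per division.
    return [
        TextUnit(f'{doc["id"]}.{n}', doc["collection"], doc["title"],
                 div["head"], div["words"])
        for n, div in enumerate(divisions, start=1)
    ]
```

Keeping the parent document identifier inside each unit identifier is one simple way to preserve a path back to the full text in the underlying database.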
For the purposes of our prototype, we are using the Python Whoosh indexing and search library. We expect to move to a more scalable ranked-relevance search engine for the final product. We have released an instance of our Whoosh-based search tool at:
https://artflsrv03.uchicago.edu/mark/hub/multipledb.whoosh.html
The search form allows the user to input a list of terms to find and to limit results to specific collections and/or time periods. Results are ordered by a standard relevancy calculation, and we have appended a simple count of authors and titles at the bottom of the report. Note that we have turned off links to the full text at this time, since the underlying PhiloLogic4 instances are on an internal research machine which we expect to be updating in the future. A full implementation will have full links to the documents and other functions, such as TopoLogic, as outlined in our previous blog post.
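To make "ordered by a standard relevancy calculation" more concrete, here is a plain-Python sketch of BM25 scoring with OR semantics, collection and date limits, and a tally of authors. The prototype itself relies on Whoosh (whose default scoring is BM25F); nothing below is Whoosh's API or the project's code, and the record fields (`collection`, `year`, `author`, `words`) and parameter values (`k1`, `b`) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def bm25_search(units: List[Dict], query: str,
                collections: Optional[Set[str]] = None,
                before_year: Optional[int] = None, limit: int = 100,
                k1: float = 1.2, b: float = 0.75) -> List[Tuple[float, Dict]]:
    """Rank text units that contain any of the query terms (OR semantics)."""
    terms = query.lower().split()
    n_units = len(units)
    avg_len = sum(len(u["words"]) for u in units) / max(n_units, 1)
    # Document frequency of each query term over the whole index.
    df = {t: sum(1 for u in units if t in u["words"]) for t in terms}

    hits = []
    for u in units:
        # Apply the collection and date limits before scoring.
        if collections is not None and u["collection"] not in collections:
            continue
        if before_year is not None and u["year"] >= before_year:
            continue
        tf = Counter(u["words"])
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_units - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(u["words"]) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        if score > 0:
            hits.append((score, u))
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:limit]


def author_counts(hits: List[Tuple[float, Dict]]) -> Counter:
    """Tally how many retrieved units belong to each author."""
    return Counter(u["author"] for _, u in hits)
```

Because scoring operates on text units rather than files, a chapter or a single law competes on equal terms with a whole pamphlet, and the length normalization keeps long units from dominating simply by containing more words.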
For the query "grain subsistance recolte marche farine quantite pain denree prix bled" the search will return many results, displaying the first 100 instances (by default) and showing the relevance score of each document as well as optional snippets, as shown on the left.
The snippets may be omitted from the report, which then generates a list of corresponding documents. The search will retrieve and score subsections of documents, such as chapters or sessions, in the same way as entire documents. On the right one finds the continuation of the query for "grain subsistance...". Limiting the query to the Revolutionary Laws collection will find specific laws on this subject, such as "Décret sur la police du commerce des grains l'approvisionnement des marchés des armées. Du 7 vendémiaire" of Year IV, followed by (again in order of relevance to the query words) "Décret qui fixe un maximum du prix des grains, farines et fourrages, et prononce des peines contre l'exportation. [11-9-1793]".
Ranked relevance retrieval across multiple collections is a useful way to identify documents and passages of interest. We are also finding that combining this type of query with word vectors representing Revolutionary topics is a powerful tool for tracing aspects of Revolutionary discourses to often unexpected sources. We have included two topic models generated from the 26,000-document Newberry French Revolution collection. As described earlier, topic models are unsupervised techniques to identify topics in collections of documents. Topic models identify the topic mix for every document in a collection as well as lists of weighted words that are associated with each topic. The TopoLogic instance of the 50-topic model can be found at:
https://artflsrv03.uchicago.edu/topic-modeling-browser/frc1787_99/
We have included the top ten words in each topic with a link to the ranked relevance search for that topic across all of the collections. Clicking on Search will query the words in this list against the Whoosh database. The parameters are set to display the top 200 documents or sections from the entire collection. We have also included the same data for a 100-topic model instance (click here). No single topic model can properly capture the complexity of Revolutionary discourses. Comparing the lists of 50 and 100 topics, you will find that some topics are complementary, while others emerge only in the 100-topic model.
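As a rough sketch of how a topic becomes a search, the helper below turns a topic's weighted word list into a ten-word query string, and a second helper measures the overlap between the top words of two topics, which is one simple way to compare a 50-topic and a 100-topic model. The function names and the word weights in the usage example are invented for illustration; they are not taken from TopoLogic or from the actual models.

```python
from typing import Dict, Iterable


def topic_query(word_weights: Dict[str, float], top_n: int = 10) -> str:
    """Return the top_n highest-weighted words of a topic as a query string."""
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(word for word, _ in ranked[:top_n])


def topic_overlap(words_a: Iterable[str], words_b: Iterable[str]) -> float:
    """Jaccard overlap between the top-word lists of two topics (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Usage with made-up weights (not values from the actual models):
weights = {"pouvoir": 0.03, "constitution": 0.05, "droit": 0.02}
query = topic_query(weights, top_n=2)
```

A topic in the 100-topic model whose overlap with every topic in the 50-topic model is low would be a candidate for one that "emerges only" at the finer granularity.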
While the static searches (clicking on Search with the set parameters) are useful, we recommend that you examine topics in more detail. You can copy the words from any of the topics into the search box and set the parameters as you see fit. We have included one example on the search form. This is a query for the words of Topic 4 (in the 50-topic model) "constitution pouvoir droit liberte nation peuple autorite homme principe propriete" in documents published before 1789, using the OR operator, and displaying the top 500 instances (click here to run this search). This will return a list of documents or sections from pre-Revolutionary sources as shown on the right, led off by a translation of David Ramsay's History of the American Revolution and including the state constitution of Massachusetts. Scrolling down to the list of authors, one finds an interesting list of expected and rather unexpected authors including:
- Du Buat, M. le comte (Louis-Gabriel) : 21
- Mirabeau, Victor de Riquetti, marquis de : 20
- Holbach, Paul Henri Thiry, baron d' : 15
- De Lolme, Jean Louis : 14
- Helvetius : 13
- Chamfort, Sébastien Roch Nicholas : 11
- Le Trosne, M. (Guillaume François) : 10
- Mirabeau, Gabriel-Honoré de Riquetti, comte de : 9
- Le Mercier de La Rivière, Pierre-Paul : 8
- Bodin, Jean : 8
- Hume, David : 7
- Brissot de Warville, J.-P. (Jacques-Pierre) : 7
- Franklin, Benjamin : 6
- Condorcet, Jean-Antoine-Nicolas de Caritat, Marquis de : 6
The accompanying list of titles includes:
- Lettres juives : 52
- De l'homme : de ses facultés intellectuelles et de son éducation : 35
- Essay sur l'hist. génèrale / Voltaire. : 25
- Le christianisme dévoilé, ou, Examen des principes et des effets de la religion Chrétienne : 17
- Dictionnaire philosophique : Comprenant les 118 articles parus sous ce titre du vivant de Voltaire, avec leurs suppléments parus dans les Questions sur l'Encyclopédie. : 15
- Le comte de Valmont : 12
- Système de la nature, ou, Des loix du monde physique du monde moral : 12
- Voyage du jeune anacharsis : 11
- Histoire critique de Jésus-Christ ou analyse raisonnée des Évangiles : 10
- De la philosophie de la nature : 10
- Les helviennes : 10
- Les Incas, ou, La destruction de l'empire du Pérou : 10
- La contagion sacrée ou Histoire naturelle de la superstition OU Tableau des effets que les opinions religieuses ont produits sur la terre. Tome I : 9
- Le compère Mathieu : 8
- Traité sur la tolérance : 8
Moving this time to the 100-topic model, we can look for traces of Topic 80 "convention jugement mort royaute inviolabilite souverainete peine tyran crime depute" in pre-1789 texts. In essence, we are asking whether this topic on the tyrannical nature of the sovereignty of the king, so prevalent in Revolutionary discourse, has any echoes in earlier texts. It is interesting to see in the results a mix of theoretical works (such as Bodin's De la république, or Pufendorf's Droit de la nature et des gens), historical accounts (Raynal's Histoire du parlement d'Angleterre, or Boulainvilliers's Etat de la France), and literary sources (Voltaire's Cromwell, or Mercier's L'an deux mille quatre cent quarante), which together give researchers a broad and diverse overview of discussions of this topic in the pre-Revolutionary period.
In highlighting the possibility of using word vectors that emerge from topic models of Revolutionary discourses, we might be guilty of teleological readings of these earlier texts. This approach is simply meant to demonstrate the possibility of combining mixtures of algorithms to propose unexpected texts of potentially related interest. As we move forward, we will be including topic models of the 18th-century collections, to allow tracing of earlier topics into the Revolutionary era. This is another level of navigation that we believe will help guide researchers through large collections, providing access to smaller segments of text that are more tightly focused on specific issues and topics.
-- The ARTFL Team