From Cyclopaedia to Encyclopédie: Experiments in Machine Translation and Sequence Alignment


It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers first began as a modest project to translate Ephraim Chambers' Cyclopaedia in 1745 [1]. Over the next few years, Diderot and d'Alembert would replace the original editors, and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world's knowledge. Over the course of their editorial work, Diderot, and most notably d'Alembert, were not shy about incorporating translations of the Cyclopaedia, many of them inherited from the earlier project, as filler for the Encyclopédie. Indeed, "ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes" ("they left a good portion of these articles almost unchanged, or with insignificant modifications") [2]. The philosophes were nonetheless conscious of their debt to their English predecessor. Chambers' name appears some 1,154 times in the text of the Encyclopédie, and he is referenced as the sole or a contributing source for 1,081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, a systematic evaluation of the extent of the philosophes' use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely:

So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot's Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article. [3]

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the "arduous toil" of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers' Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources of the original translation project, and because we were able to work from high-quality page images provided by the University of Chicago Library [4]. In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL's Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as to the relevant passages in the Encyclopédie.

For the English to French machine translation of Chambers, we examined two of the most widely-used resources in this domain, Google Translate and DeepL. Both systems provide useful APIs as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader's perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions. 

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here, though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text, or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a "pivot-text" between the English Chambers and the French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces [5].
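For the curious, the loop described above can be sketched roughly as follows, using Google's cloud translation client library. This is a minimal illustration rather than our production script: the block size, delay, and retry settings shown here are assumptions.

```python
# Minimal sketch of the translation loop: segment the text at paragraph breaks,
# send each block to the Google Translate API, pause briefly between requests,
# and resend a block when the API throws a transient error. The delay and retry
# values are illustrative assumptions, not our exact settings.
import time
from google.cloud import translate_v2 as translate  # pip install google-cloud-translate

client = translate.Client()

def translate_blocks(paragraphs, delay=1.0, max_retries=3):
    """Translate a list of English paragraph blocks into French."""
    translated = []
    for block in paragraphs:
        for attempt in range(max_retries):
            try:
                result = client.translate(block, source_language="en", target_language="fr")
                translated.append(result["translatedText"])
                break
            except Exception:
                # Timeout or quota error: wait a little longer and resend the query.
                time.sleep(delay * (attempt + 1))
        time.sleep(delay)  # short delay between blocks
    return translated
```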

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and the machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters, such as n-gram construction (i.e., the number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as match length increases (what we call a "flex gap"), among others, on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters [6], which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article Compulseur is attributed by Mallet to Chambers, but the machine translation of Compulsor is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing the matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
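To give a concrete, if much simplified, sense of how this kind of shingled n-gram matching works, here is a toy sketch. It is not Text-PAIR's code, only an illustration of the general principle; the tokenisation, the bigram size, and the thresholds are assumptions standing in for the parameters discussed above.

```python
# Toy illustration of n-gram sequence matching: build word bigrams for two
# passages, locate shared bigrams, and accept a match only if enough shared
# bigrams occur in a run whose gaps never exceed `max_gap`. This mimics the
# spirit of parameters like minmatchingngrams and maxgap, not Text-PAIR itself.
import re

def ngrams(text, n=2):
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def is_match(passage_a, passage_b, n=2, min_matching_ngrams=5, max_gap=12):
    grams_b = set(ngrams(passage_b, n))
    positions = [i for i, g in enumerate(ngrams(passage_a, n)) if g in grams_b]
    if not positions:
        return False
    # Length of the longest run of shared n-grams whose positions in passage A
    # are never more than `max_gap` apart.
    best = run = 0
    for prev, cur in zip(positions, positions[1:]):
        run = run + 1 if cur - prev <= max_gap else 0
        best = max(best, run)
    return best + 1 >= min_matching_ngrams
```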

All things considered, we were quite happy with the aligner's performance given the complexity of the comparison task and the multiple potential variations between historical texts and modern machine translations. To give an example of how fine-grained and at the same time highly flexible our matching parameters needed to be, see the article 'Gynaecocracy' below, which is a fairly direct translation on a rather specialised subject, but which nonetheless matched on only 8 content words.

Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams; see for example the small article on "Occult" lines in geometry below, where the 6 matching words were not enough to constitute a match for the aligner.

Obviously, this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless, the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches. 

Once settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used in this project. The alignment database (link) contains some 7,304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as on words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarise results by the various metadata fields [7]. Each aligned passage is presented in a facing-page representation, and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive.


Text-PAIR also contextualises results back to the original document(s). For example, the following is the article "Almanach" by d'Alembert, showing the aligned passage from Chambers in blue.  



In this instance, d'Alembert reused almost all of Chambers' original article Almanac, with some minor variations, but does not appear to have indicated the source of the first part of his article (page image).

The alignment database is a useful first pass for examining the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in the same article pair: thus we find 5 separately identified shared passages between the two articles entitled "Constellation". The interface also does not attempt to evaluate the alignments or to identify passages shared between articles with different headwords. For example, d'Alembert's article ATMOSPHERE does indeed contain a passage from Chambers' article "Atmosphere", but also many longer passages from the article Generation.

To accumulate results and to refine the evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm was two tables, one for matches likely to be valid and one for those less likely to be valid, based on our simple heuristics (see a selection of the 'YES' table below). We are, of course, making this distinction based on a comparison of the machine-translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid.
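In outline, the heuristic looks something like the sketch below. The specific thresholds, the field names of the alignment records, and the use of a simple string-similarity ratio for headwords are illustrative assumptions rather than our exact scoring.

```python
# Illustrative pass over alignment output: flag an aligned pair as a likely
# ("YES") match when the passage is reasonably long and the machine-translated
# Chambers headword closely resembles the Encyclopédie headword. The field
# names, thresholds, and similarity measure are assumptions for illustration.
from difflib import SequenceMatcher

def headword_similarity(hw_chambers_fr, hw_encyclopedie):
    return SequenceMatcher(None, hw_chambers_fr.lower(), hw_encyclopedie.lower()).ratio()

def classify(alignment, min_passage_words=25, min_headword_sim=0.7):
    passage_len = len(alignment["source_passage"].split())
    sim = headword_similarity(alignment["source_head"], alignment["target_head"])
    # Long passages with closely matching headwords go to the "YES" table;
    # everything else is set aside for human review of possible false negatives.
    return "YES" if passage_len >= min_passage_words and sim >= min_headword_sim else "MAYBE"
```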



The next phase of the project was the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to fully eliminate the "arduous toil" of comparison referenced by Lough. More than 5,000 potential matches were scrutinised, looking in essence for 'false negatives', i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work were then merged into a single table of what we consider to be valid matches, a list that includes some 3,700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.

CONCLUSIONS

In all, we found some 3,778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers' Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to draw blanket conclusions about the editorial practices of the encyclopédistes from this limited experiment [7]. What we can say, however, is that of the 1,081 articles that include a "Chambers" reference in the Encyclopédie, we found only 689 with at least one matching passage. Obviously, this recall rate of 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching parameters to the sample, a discrepancy which warrants further investigation. But, beyond testing this ground truth, we are also left with the rather astounding fact of 3,089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.

The overall evaluation of these results remains ongoing, and the "arduous toil" of traditional textual comparison continues apace, albeit guided somewhat by the machine's heavy hand. Indeed, the use of machine translation as a bridge between documents in order to find similar passages, be they reuses, plagiarisms, or the like, is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers --> Encyclopédie task outlined above is fairly well constrained and historically bounded; more general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges [8] between multilingual corpora may yet become a reality.

- Glenn Roe & Mark Olsen


Notes

1. The page image of the title page from the 1745 prospectus is taken from ARTFL's "18th" volume of the Encyclopédie.

2. Paolo Quintili, "D'Alembert « traduit » Chambers. Les articles de mécanique de la Cyclopædia à l'Encyclopédie", Recherches sur Diderot et sur l'Encyclopédie 21 (1996):75. [link]

3. John Lough, "The Encyclopédie and the Chambers' Cyclopaedia", in SVEC 185, Oxford: Voltaire Foundation (1980): 221. 

4. On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, "Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s) ?", Recherches sur Diderot et sur l'Encyclopédie 40-41 (2006): 287-92. [link]

5. See Clovis Gladstone, Russ Horton, and Mark Olsen, "TextPAIR (Pairwise Alignment for Intertextual Relations)", ARTFL Project, University of Chicago, 2008-2021.

6. See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5.  Consult the TextPair documentation and configuration file for a description of these values.  

7. The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles--so matches we find between the Chambers and the Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back into French for the Encyclopédie--in this case, which article is the 'source' and which the 'translation'? For more on these particular aspects of dictionary-making, see our previous article "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing 13.1 (Spring 2010) [link] and Marie Leca-Tsiomis' response, "The Use and Abuse of the Digital Humanities in the History of Ideas: How to Study the Encyclopédie", History of European Ideas 39.4 (2013): 467-76.

8. For more on 'intertextual bridges' in French, see our current NEH project [link].




Federated Search and PhiloLogic -- from works to (someday) words

Over the past several years, the ARTFL Project has been developing the code infrastructure for the Intertextual Hub reading environment that federates heterogeneous text collections, extracting data from individual PhiloLogic4 instances and exposing that data to text analysis algorithms in order to allow users to navigate between individual and larger groups of texts related through shared themes, ideas, and passages.

We have now adapted components of this infrastructure to enable federated bibliographic searching on all of the text collections running under PhiloLogic. With the PhiloLogic Federated Bibliography Search database, we offer a simple, yet flexible search system that allows users to search for texts across approximately 90 individual collections in nearly a dozen languages. We currently allow search by author, title, and collection language. Searches can be further delimited by access type and by date range. So for example, a search for titles containing the word “slavery” written in English between 1750 and 1800 yields 38 results from the American Archives Collection, ECCO-TCP, and the Evans Early American Imprint Collection:


https://artflsrv03.uchicago.edu/cgi-bin/federated_bibliography/federated_bib_search.py?author=&title=slavery&language=english&start_date=1750&end_date=1800&sort_by=
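The query string above maps directly onto the search form's fields, so the same search can be reproduced programmatically. A minimal sketch (not an official ARTFL client; the requests library is assumed to be available):

```python
# Build and issue the same federated bibliography query shown above by
# encoding the form fields as URL parameters.
import requests

BASE = "https://artflsrv03.uchicago.edu/cgi-bin/federated_bibliography/federated_bib_search.py"

params = {
    "author": "",
    "title": "slavery",
    "language": "english",
    "start_date": "1750",
    "end_date": "1800",
    "sort_by": "",
}
response = requests.get(BASE, params=params)
print(response.url)          # reproduces the query string shown above
print(response.status_code)  # the response body is the HTML results page
```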

Search results contain links to work titles and collections. In results, we note the access status of the collection, whether open or limited to subscribing institutions or to users at the University of Chicago. This same search can be expanded across French and English collections by using a Boolean “OR” and entering “slavery OR esclavage” in the title field:

https://artflsrv03.uchicago.edu/cgi-bin/federated_bibliography/federated_bib_search.py?author=&title=slavery+OR+esclavage&language=&start_date=1750&end_date=1800&sort_by=

This search yields several titles in the open-access Newberry French Revolution Collection, one in the Frantext collection, and one -- a play entitled “L’Esclavage des Noirs, ou L’Heureux Naufrage, Drame” -- in the Théâtre Classique collection.

We envision this bibliographic search system as the first of many such tools that permit search across the entirety of our collections. In the Intertextual Hub, users can conduct word or topic-vector searches across all seven of the 18th-century French collections included in it, with results returned ranked by relevance. For example, see these results for a search using a topic vector that contains astronomical terms:

https://intertextual-hub.uchicago.edu/search?limit=100&stemmed=yes&words=soleil%20lune%20rayon%20etoile%20chaleur%20nuit%20montagne%20ciel%20astre%20lumiere&binding=OR

Taking inspiration from this federated search approach, we would like to create a mechanism that enables combined metadata and full-text queries across all PhiloLogic instances -- or at least a logically coherent subset thereof -- at once, in real time. Users would no longer be constrained to working inside single collections, but could conduct searches across multiple collections and potentially in multiple languages. For example, instead of searching for "slavery OR esclavage" only in titles, users could search for those terms in any number of collections running under PhiloLogic.

The technical details of such a search scheme remain to be hashed out, of course. But the great thing about PhiloLogic4 is that its fundamental architecture makes it possible to create standalone widgets or external apps that query database instances via an API and then repackage and render search results independently. For example, ARTFL’s PhiloReader apps for both Android and iOS work in exactly this way, and from the beginning were meant to be a demonstration of PhiloLogic’s server capabilities (download the Encyclopédie reader apps here and here).

Encyclopédie app search suggestions (left); Encyclopédie app metadata query results (right)

These screenshots illustrate a simple example of the Encyclopédie app interacting with the PhiloLogic4 API. In the left screenshot, the app gets metadata search suggestions dynamically, in this case "Astronomie | Géographie". Query results for articles with that classification appear in the right screenshot.

For a federated search system, a client would send queries to any number of PhiloLogic instances; gather and sort the query results, or links to them; and then present those results to the user. Again, certain details would have to be worked out before creating a search system like this, such as determining the exact nature of query results, whether and how to perform relevance ranking on results, and whether certain kinds of reporting features would need to be integrated into PhiloLogic as a parallel development activity.
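As a rough sketch of what such a client might look like, the fragment below fans a query out to a handful of PhiloLogic instances in parallel and collects whatever each returns. The instance URLs, the endpoint path, and the query parameters are all placeholders, to be replaced by the actual PhiloLogic API details for each database.

```python
# Sketch of a federated client: send the same query to several PhiloLogic
# instances concurrently and gather the responses. Endpoint paths and query
# parameters below are hypothetical placeholders, not a documented API.
from concurrent.futures import ThreadPoolExecutor
import requests

INSTANCES = [
    # hypothetical PhiloLogic instance base URLs
    "https://example.uchicago.edu/philologic4/collection_one",
    "https://example.uchicago.edu/philologic4/collection_two",
]

def query_instance(base_url, term):
    # Placeholder endpoint and parameters -- adjust to the real instance's API.
    resp = requests.get(f"{base_url}/query",
                        params={"report": "concordance", "q": term},
                        timeout=20)
    resp.raise_for_status()
    return base_url, resp.json()

def federated_search(term):
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda url: query_instance(url, term), INSTANCES))
    # Merging, relevance ranking, and presentation are left open, as in the post.
    return dict(results)
```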

However we proceed, the experience of building the Intertextual Hub has taught us that we can tap into the indexing, processing, and reporting capabilities of PhiloLogic to draw together many individual, heterogeneous text collections and create larger-scale research environments that allow users to engage in text analysis of an incredibly broad scope.

Topic Models and Word Vectors


 










The Intertextual Hub is built around several different algorithms to facilitate document search, similarity, and navigation. In previous posts in this series, I have examined the applications of sequence alignment, topic modeling, and document similarity in various contexts. A primary objective of the Hub is to direct attention to particular documents that may be of interest. Upon arriving at a specific document, the user is offered two views. One is a document browse mode, which provides links to similar documents and to borrowed passages, if detected. The second is the Topic Distribution report for the document.

The left side of the image above shows the top element of the Topic Model report for the Dénonciation a toutes les puissances de l'Europe : d'un plan de conjuration contre sa tranquilité général (link to text), an anonymous attack on the Club de 1789 published in 1790. As mentioned in an earlier post in this series, the first topic, number 123, is clearly about elections, which does indeed reflect a section describing elections in the club's constitution. The lesser-weighted topics in the document, 114, 111, 128 and so on, are all plausible topics for this document. The right side of the image shows a word cloud, with size reflecting weight, of the most distinctive vocabulary identified in the document. This simple list is a considerably better guide to its specific content: a denunciation of a conspiracy against the sovereigns of Europe, to which are appended extracts from the constitution of the club.

Below the list of topics and the word cloud of the most distinctive tokens in the Topic Model report are two lists of 20 documents. Below the topics are the top 20 documents identified by similarity of topic distributions, while below the word cloud are the top 20 documents as measured by similar vocabulary.


The first two entries in the right-hand column are parts of Sieyès' Ébauche d'un nouveau plan de société patriotique, adopté par le Club de mil sept cent quatre-vingt-neuf (BNF), found in the Dénonciation, followed by Condorcet's constitutional proposal of 1793. The two lists represent two different ways of identifying similar documents. It is useful to note the overlaps between the two lists, since these documents are identified as relevant by both measures:

The contrast between topics and most distinctive words can be very significant. Mercier's brief chapter on Vaches in the Tableau de Paris is striking. There are no overlaps between the similar-document links, and only two words, animal and compagnie, appear in the topic words, and then only for low-weighted topics. Other documents are marked by the relative alignment of topics and distinctive words. The topic/word report for Lettres écrites à M. Cérutti par M. Clavière, sur les prochains arrangemens de finance (1790, text link) shows that the distinctive tokens appear frequently in the top topic-model word lists, and that there is more overlap between the lists of most similar documents.
  
It is hardly surprising to find that the representation of the contents of a specific document under a Topic Model and under a Word Vector (most distinctive vocabulary) approach can be significantly different. Topic models attempt to identify the best fit of a document within an arbitrary number of groups. Many documents about specific things, like cows in Paris, may well fall between these groups and be assigned to topics which are only tangentially related to their contents. This weak relationship to topics is reflected in the limited number of tokens shared between the most heavily weighted topic terms and the distinctive vocabulary of a document, as well as in the limited or non-existent overlap between the lists of similar documents. Topic models are an effective technique for identifying large patterns of topic development for search and analysis, and for classifying documents within these large patterns. By contrast, identifying documents related by similar vocabulary, generally falling under the rubric of "nearest neighbor search" (NNS), can identify and leverage the particularities of a specific document to find others closely related to it, but cannot by itself be used to aid with larger classifications or themes.
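To make the distinction concrete, here is a schematic sketch of how the two similarity lists could be computed with off-the-shelf tools (scikit-learn). The vectoriser settings, number of topics, and list lengths are placeholders; the Hub's actual models and parameters are not reproduced here.

```python
# Two ways of ranking "similar documents": (1) by proximity of topic
# distributions from a topic model, (2) by nearest neighbors in a TF-IDF
# vocabulary space. Parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def similarity_lists(docs, doc_index, n_topics=150, top_n=20):
    # Topic-based similarity: cosine similarity between topic distributions.
    counts = CountVectorizer(max_features=20000).fit_transform(docs)
    doc_topics = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    topic_sims = cosine_similarity(doc_topics[doc_index:doc_index + 1], doc_topics)[0]

    # Vocabulary-based similarity: nearest neighbors in TF-IDF space.
    tfidf = TfidfVectorizer(max_features=20000).fit_transform(docs)
    vocab_sims = cosine_similarity(tfidf[doc_index], tfidf).ravel()

    rank = lambda sims: [i for i in sims.argsort()[::-1] if i != doc_index][:top_n]
    return rank(topic_sims), rank(vocab_sims)
```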

Thus we provide the user of the Intertextual Hub with these two distinct views of a document, identifying the topics in which it is situated, its most distinctive vocabulary, and the other documents which most closely resemble it. A quick examination of the topics, words, and document lists gives the reader a pretty good sense of the degree to which a specific document falls coherently into one of the 150 topics in this model.

Beyond the suggestion that humans consider both measures and make their own determination of a document's goodness of fit to the topic model, it may be worth experimenting with the use of NNS measures as a way of evaluating topic models. As shown above, a topic model, say of 150 topics generated using a given set of parameters, can cover some documents more compellingly than others. In the example, the finance documents are specific enough to be well covered by several related topics. This suggests the possibility of establishing a quantitative measure, using the two somewhat independent measures, topics and word vectors, to assess the validity of a particular topic model. For each document in a collection, this would be assessed by:
  • the number of common tokens in the top N topics with the most distinctive words;
  • number of common documents in the two lists;
  • number of matching topics (say top 3) for each document in the two lists of documents.
For every document, one would calculate how well the topic model approximates the nearest neighbors of that document, from 0 for no overlap to 1 for perfect identity. We have two ways of dividing up an information space: topic models work effectively from the top down (we decide in advance that there will be 150 buckets), while nearest-neighbor search works from the bottom up (we don't know how many buckets there will be). As in a Venn diagram, the more these overlap, the better the coverage for that document.
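As a rough illustration of how such a 0-to-1 coverage score might be computed for one document from the three overlap measures listed above, consider the following sketch; the equal weighting of the three components, and the exact form of each overlap ratio, are arbitrary assumptions for illustration.

```python
# Per-document coverage score, 0 (no overlap) to 1 (perfect identity),
# combining the three overlap measures listed above. Equal weighting of the
# components is an arbitrary choice for illustration.
def coverage_score(topic_terms, distinctive_words, topic_doc_list, nns_doc_list,
                   doc_top_topics):
    """
    topic_terms       -- set of tokens from the document's top N topics
    distinctive_words -- set of the document's most distinctive tokens
    topic_doc_list    -- top-20 documents by topic-distribution similarity
    nns_doc_list      -- top-20 documents by vocabulary (NNS) similarity
    doc_top_topics    -- mapping from document id to its top 3 topic ids
    """
    token_overlap = len(topic_terms & distinctive_words) / max(len(distinctive_words), 1)
    doc_overlap = len(set(topic_doc_list) & set(nns_doc_list)) / max(len(topic_doc_list), 1)
    # Share of NNS neighbors whose own top topics intersect the top topics of
    # the topic-similarity neighbors.
    topics_a = {t for d in topic_doc_list for t in doc_top_topics.get(d, [])}
    topic_agreement = sum(bool(set(doc_top_topics.get(d, [])) & topics_a)
                          for d in nns_doc_list) / max(len(nns_doc_list), 1)
    return (token_overlap + doc_overlap + topic_agreement) / 3
```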

Summing this measure across all of the documents, one would arrive at a single value for the model as a whole, and possibly a single value for every topic. One could then adjust parameters accordingly. Of course, this could be overfit, simply by specifying as many topics as there are documents. But it might also give a measure of how many topics is best, by observing the point at which coverage begins to decrease, which would in theory happen when the topics are spread too thin across the information space.

 


  
