From Cyclopaedia to Encyclopédie: Experiments in Machine Translation and Sequence Alignment
It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began first as a modest translation project of Ephraim Chambers' Cyclopaedia in 1745 [1]. Over the next few years, Diderot and d'Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world's knowledge. Over the course of their editorial work, Diderot, and most notably d'Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie, many of which were inherited from the earlier project. Indeed, "ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes" [2]. The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1,154 times in the text of the Encyclopédie and he is referenced as sole or contributing source to 1,081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes' use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely:
So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot's Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article. [3]
Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the "arduous toil" of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.
Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers' Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources for the original translation project and we were able to work from high-quality page images provided by the University of Chicago Library [4]. In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL's Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.
For the English to French machine translation of Chambers, we examined two of the most widely-used resources in this domain, Google Translate and DeepL. Both systems provide useful APIs as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader's perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions.
Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here - though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text, or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a "pivot-text" between the English Chambers and French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces [5].
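The segment, translate, pause, and resend loop described above can be sketched roughly as follows. This is a minimal illustration rather than the code we actually ran: the function names, the blank-line paragraph split, the retry count, and the delay value are all assumptions, and the `translate` argument stands in for whatever client function wraps the translation service.

```python
import time
from typing import Callable, List


def split_into_blocks(text: str) -> List[str]:
    """Split text at blank-line paragraph breaks, dropping empty blocks."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def translate_blocks(
    blocks: List[str],
    translate: Callable[[str], str],  # hypothetical wrapper around the translation service
    delay: float = 1.0,               # assumed pause, in seconds
    max_retries: int = 3,             # assumed number of attempts per block
) -> List[str]:
    """Translate each block in order, pausing between calls and resending on error."""
    translated = []
    for block in blocks:
        for attempt in range(1, max_retries + 1):
            try:
                translated.append(translate(block))
                break
            except Exception:
                # Timeouts and other transient errors trigger a resend;
                # give up only after the final attempt.
                if attempt == max_retries:
                    raise
                time.sleep(delay)
        time.sleep(delay)  # short delay between blocks
    return translated
```

Passing the translation function in as an argument keeps the sketch independent of any particular service, and makes it easy to test the retry logic with a stand-in function that fails on purpose.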
The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source, varying matching parameters such as n-gram construction (e.g., the number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as a match length increased (what we call a "flex gap"), among others. It is important to note that even with the best parameters [6], which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article Compulseur is attributed by Mallet to Chambers, but the machine translation of Compulsor is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
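To make the role of these parameters more concrete, here is a deliberately simplified sketch of n-gram matching between two passages. It is not Text-PAIR's actual algorithm: it builds bigrams from words of at least three letters and counts how many are shared, leaving out stemming, gap handling, and the flex gap entirely. The function names and the reading of "word len" as a minimum word length are assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

Ngram = Tuple[str, ...]


def make_ngrams(text: str, n: int = 2, min_word_len: int = 3) -> List[Ngram]:
    """Lowercase the text, keep words of at least min_word_len letters, build n-grams."""
    words = [w for w in re.findall(r"[^\W\d_]+", text.lower()) if len(w) >= min_word_len]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngrams(a: str, b: str, n: int = 2, min_word_len: int = 3) -> Set[Ngram]:
    """Return the set of n-grams that occur in both passages."""
    return set(make_ngrams(a, n, min_word_len)) & set(make_ngrams(b, n, min_word_len))


def is_match(a: str, b: str, min_matching_ngrams: int = 5) -> bool:
    """Treat two passages as aligned if they share enough distinct n-grams."""
    return len(shared_ngrams(a, b)) >= min_matching_ngrams
```

Even in this toy form the trade-off is visible: lowering `min_matching_ngrams` lets short articles through, but also lets unrelated passages that happen to share a few common word pairs count as matches.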
All things considered, we were quite happy with the aligner's performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly flexible our matching parameters needed to be, see the article 'Gynaecocracy' below, which is a fairly direct translation on a rather specialised subject, but which nonetheless matched on only 8 content words.
Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams. See, for example, the small article on "Occult" lines in geometry below, where the 6 matching words weren't enough to constitute a match for the aligner.
Obviously, this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless, the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches.
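For readers less familiar with the terminology, recall here is simply the share of known Chambers-derived articles that the aligner recovered. A minimal sketch of the calculation, using made-up article identifiers rather than our actual sample:

```python
from typing import Set


def recall(expected: Set[str], found: Set[str]) -> float:
    """Share of expected articles that the aligner actually recovered."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(expected & found) / len(expected)


# Hypothetical identifiers standing in for the 100 sample articles,
# of which 81 were recovered by the aligner.
sample = {f"article_{i}" for i in range(100)}
recovered = {f"article_{i}" for i in range(81)}
print(recall(sample, recovered))
```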
Once settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used for this project. The alignment database (link) contains some 7,304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as on words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata [7]. Each aligned passage is presented as a facing-page representation, and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive.
Text-PAIR also contextualises results back to the original document(s). For example, the following is the article "Almanach" by d'Alembert, showing the aligned passage from Chambers in blue.
Notes
1. The page image of the title page from the 1745 prospectus is taken from ARTFL's "18th" volume of the Encyclopédie.
2. Paolo Quintili, "D'Alembert « traduit » Chambers. Les articles de mécanique de la Cyclopædia à l'Encyclopédie", Recherches sur Diderot et sur l'Encyclopédie 21 (1996):75. [link]
3. John Lough, "The Encyclopédie and the Chambers' Cyclopaedia", in SVEC 185, Oxford: Voltaire Foundation (1980): 221.
4. On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, "Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s) ?", Recherches sur Diderot et sur l'Encyclopédie 40-41 (2006): 287-92. [link]
5. See Clovis Gladstone, Russ Horton, and Mark Olsen, "TextPAIR (Pairwise Alignment for Intertextual Relations)", ARTFL Project, University of Chicago, 2008-2021.
6. See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5. Consult the TextPair documentation and configuration file for a description of these values.
7. The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles--so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie--in this case, which article is the 'source' and which the 'translation'? For more on these particular aspects of dictionary-making, see our previous article "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing 13.1 (Spring 2010) [link] and Marie Leca-Tsiomis' response, "The Use and Abuse of the Digital Humanities in the History of Ideas: How to Study the Encyclopédie", History of European Ideas 39.4 (2013): 467-76.
8. For more on 'intertextual bridges' in French, see our current NEH project [link].