ARTFL Project Research Blog

**From Cyclopaedia to Encyclopédie: Experiments in Machine Translation and Sequence Alignment**

It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began first as a modest translation project of Ephraim Chambers' Cyclopaedia in 1745 [1]. Over the next few years, Diderot and d'Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world's knowledge. Over the course of their editorial work, Diderot, and most notably d'Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie, many of which were inherited from the earlier project. Indeed, "ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes" [2]. The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1,154 times in the text of the Encyclopédie and he is referenced as sole or contributing source to 1,081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes' use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely:

So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot's Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article. [3]

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the "arduous toil" of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers' Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources for the original translation project and we were able to work from high-quality page images provided by the University of Chicago Library [4]. In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL's Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.

For the English to French machine translation of Chambers, we examined two of the most widely-used resources in this domain, Google Translate and DeepL. Both systems provide useful APIs as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader's perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions.

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here, though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a "pivot-text" between the English Chambers and French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces [5].
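The segment, translate, and resend loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the project: the function names, the delay values, and the retry limit are all assumptions, and the translation call itself is passed in as a placeholder `translate_fn` rather than tied to any particular API client.

```python
import time
from typing import Callable, List


def split_into_blocks(article_text: str) -> List[str]:
    """Split an article into short blocks at blank-line paragraph breaks."""
    return [p.strip() for p in article_text.split("\n\n") if p.strip()]


def translate_with_retry(
    block: str,
    translate_fn: Callable[[str], str],
    max_attempts: int = 3,
    retry_wait: float = 2.0,
) -> str:
    """Send one block for translation, resending it if the call raises an error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return translate_fn(block)
        except Exception:
            # Timeouts and other transient errors: resend, unless out of attempts.
            if attempt == max_attempts:
                raise
            time.sleep(retry_wait)
    raise RuntimeError("unreachable")


def translate_article(
    article_text: str,
    translate_fn: Callable[[str], str],
    delay: float = 0.5,
    max_attempts: int = 3,
    retry_wait: float = 2.0,
) -> str:
    """Translate an article block by block, pausing briefly between blocks."""
    translated = []
    for block in split_into_blocks(article_text):
        translated.append(
            translate_with_retry(block, translate_fn, max_attempts, retry_wait)
        )
        time.sleep(delay)  # short delay between requests
    # Rejoin on paragraph breaks so the document structure is preserved.
    return "\n\n".join(translated)
```

Keeping the translation call as an injected function means the same loop can be exercised with a stub in tests and pointed at a real service in production.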

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters, such as n-gram construction (e.g., the number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as match length increased (what we call a "flex gap"), among others, on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters [6], which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article Compulseur is attributed by Mallet to Chambers, but the machine translation of Compulsor is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
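To make the role of these parameters more concrete, here is a deliberately simplified Python sketch of n-gram matching. It is not Text-PAIR's algorithm: it omits stemming and the "flex gap", it only checks shared n-grams in the order they occur in the first text, and the reading of the parameters (a minimum word length of 3, bigrams, a maximum gap of 12 n-gram positions, at least 5 matching n-grams) is an assumption based loosely on the values listed in note 6.

```python
import re
from typing import List, Tuple


def content_words(text: str, min_word_len: int = 3) -> List[str]:
    """Lowercase the text and keep alphabetic words of at least min_word_len letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if len(w) >= min_word_len]


def ngrams(words: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Build overlapping n-grams (bigrams by default) from a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_ngram_positions(
    source_text: str, target_text: str, n: int = 2, min_word_len: int = 3
) -> List[int]:
    """Positions in the source n-gram sequence whose n-gram also occurs in the target."""
    target_set = set(ngrams(content_words(target_text, min_word_len), n))
    source_grams = ngrams(content_words(source_text, min_word_len), n)
    return [i for i, gram in enumerate(source_grams) if gram in target_set]


def is_match(
    source_text: str,
    target_text: str,
    n: int = 2,
    min_word_len: int = 3,
    max_gap: int = 12,
    min_matching_ngrams: int = 5,
) -> bool:
    """True if a run of shared n-grams, never separated by more than max_gap
    positions, reaches min_matching_ngrams."""
    run = 0
    previous = None
    for pos in shared_ngram_positions(source_text, target_text, n, min_word_len):
        if previous is not None and pos - previous > max_gap:
            run = 0  # gap too large: start a new candidate passage
        run += 1
        if run >= min_matching_ngrams:
            return True
        previous = pos
    return False
```

Even in this toy form the trade-off is visible: lowering `min_matching_ngrams` or raising `max_gap` lets short or heavily reworded articles through, but also admits more chance overlaps between unrelated texts.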

All things considered, we were quite happy with the aligner's performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly-flexible our matching parameters needed to be, see the below article 'Gynaecocracy', which is a fairly direct translation on a rather specialised subject, but that nonetheless matched on only 8 content words.

Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams. See, for example, the small article on "Occult" lines in geometry below, where the 6 matching words weren't enough to constitute a match for the aligner.

Obviously, this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless, the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches.

Once settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used for this project. The alignment database (link) contains some 7,304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata [7]. Each aligned passage is presented as a facing page representation and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive.

Text-PAIR also contextualises results back to the original document(s). For example, the following is the article "Almanach" by d'Alembert, showing the aligned passage from Chambers in blue.

In this instance, d'Alembert reused almost all of Chambers' original article Almanac, with some minor variations, but does not appear to have indicated the source of the first part of his article (page image).

The alignment database is a useful first pass to examine the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in article pairs. Thus we find 5 shared passages between the articles "Constellation". The interface also does not attempt to evaluate the alignments or identify passages that occur between different articles. For example, D'Alembert's article ATMOSPHERE indeed has a passage from Chambers' article "Atmosphere", but also many longer passages from the article Generation.

To accumulate results and to refine evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm resulted in two tables, one for matches that were likely to be valid and one that was less likely to be valid, based on our simple heuristics (see a selection of the 'YES' table below). We are, of course, making this distinction based on the comparison of the machine translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid.
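A heuristic of this kind might look something like the following Python sketch. It is an illustration only, not the evaluation algorithm used here: the `Alignment` fields, the string-similarity measure (Python's `difflib`), and both thresholds are hypothetical choices made for the example.

```python
import unicodedata
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


@dataclass
class Alignment:
    chambers_headword_fr: str   # machine-translated Chambers headword
    encyclopedie_headword: str  # headword of the Encyclopédie article
    passage_length: int         # length of the matching passage, in words


def normalize(headword: str) -> str:
    """Lowercase a headword and strip accents so 'Générique' equals 'GENERIQUE'."""
    decomposed = unicodedata.normalize("NFD", headword.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c)).strip()


def headword_similarity(a: str, b: str) -> float:
    """Similarity between two headwords, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_valid(
    alignment: Alignment,
    min_similarity: float = 0.8,  # hypothetical threshold
    long_passage: int = 100,      # hypothetical threshold, in words
) -> bool:
    """Accept close headword matches, or long passages whatever the headwords.
    Short matches between articles with different headwords are rejected."""
    similarity = headword_similarity(
        alignment.chambers_headword_fr, alignment.encyclopedie_headword
    )
    return similarity >= min_similarity or alignment.passage_length >= long_passage


def split_tables(
    alignments: Iterable[Alignment],
) -> Tuple[List[Alignment], List[Alignment]]:
    """Sort alignments into a 'YES' table and a 'NO' table."""
    yes, no = [], []
    for alignment in alignments:
        (yes if likely_valid(alignment) else no).append(alignment)
    return yes, no
```

Because the headword comparison runs on machine-translated headwords, a rule like this will inevitably send some valid matches to the 'NO' table, which is why that table still needs to be checked by hand.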

The next phase of the project included the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to eliminate fully the "arduous toil" of comparison referenced by Lough. More than 5,000 potential matches were scrutinised, looking in essence for 'false negatives', i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work were then merged into a single table of what we consider to be valid matches, a list that includes some 3,700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.

CONCLUSIONS

In all, we found some 3,778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers' Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to make blanket conclusions about the editorial practices of the encyclopédistes based on this limited experiment [7]. What we can say, however, is that of the 1,081 articles that include a "Chambers" reference in the Encyclopédie, we only found 689 with at least one matching passage. Obviously, this recall rate of 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching algorithm to the sample, which warrants further investigation. But, beyond testing this ground truth, we are also left with the rather astounding fact of 3,089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.
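For readers who want to check the arithmetic, the figures in this paragraph can be reproduced in a few lines of Python. The numbers are those reported in this post; only the variable and function names are illustrative.

```python
def recall(found: int, expected: int) -> float:
    """Share of expected items that were actually found."""
    return found / expected


# Sample corpus: 81 of 100 Chambers-attributed articles were matched.
sample_recall = recall(81, 100)

# Full corpus: 689 of the 1,081 articles citing Chambers had a matching passage.
corpus_recall = recall(689, 1081)

# Matched articles that carry no Chambers reference at all.
unattributed = 3778 - 689

print(f"sample recall: {sample_recall:.1%}")   # 81.0%
print(f"corpus recall: {corpus_recall:.1%}")   # 63.7%
print(f"unattributed matches: {unattributed}") # 3089
```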

The overall evaluation of these results remains ongoing, and the "arduous toil" of traditional textual comparison continues apace, albeit guided somewhat by the machine's heavy hand. Indeed, the use of machine translation as a bridge between documents to find similar passages, be they reuses, plagiarisms, etc., is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers --> Encyclopédie task outlined above is fairly well constrained and historically bounded. More general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges [8] between multilingual corpora may yet become a reality.

- Glenn Roe & Mark Olsen

Notes

1. The page image of the title page from the 1745 prospectus is taken from ARTFL's "18th" volume of the Encyclopédie.

2. Paolo Quintili, "D'Alembert « traduit » Chambers. Les articles de mécanique de la Cyclopædia à l'Encyclopédie", Recherches sur Diderot et sur l'Encyclopédie 21 (1996):75. [link]

3. John Lough, "The Encyclopédie and the Chambers' Cyclopaedia", in SVEC 185, Oxford: Voltaire Foundation (1980): 221.

4. On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, "Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s) ?", Recherches sur Diderot et sur l'Encyclopédie 40-41 (2006): 287-92. [link]

5. See Clovis Gladstone, Russ Horton, and Mark Olsen, "TextPAIR (Pairwise Alignment for Intertextual Relations)", ARTFL Project, University of Chicago, 2008-2021.

6. See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5. Consult the TextPair documentation and configuration file for a description of these values.

7. The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles, so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie. In this case, which article is the 'source' and which the 'translation'? For more on these particular aspects of dictionary-making, see our previous article "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing 13.1 (Spring 2010) [link] and Marie Leca-Tsiomis' response, "The Use and Abuse of the Digital Humanities in the History of Ideas: How to Study the Encyclopédie", History of European Ideas 39.4 (2013): 467-76.

8. For more on 'intertextual bridges' in French, see our current NEH project [link].
