Presenting ARTFL's high-resolution images with the International Image Interoperability Framework

1 comment

Those familiar with the ARTFL Project and our work know that we specialize in handling digitized text. Our primary focus is to develop digitized text corpora (mostly in French) and software platforms that scholars and students can use to conduct research on those corpora. Images and image resources have been and always will be a secondary consideration for us. Nevertheless, we have many high-quality, high-resolution images that are remarkable objects of study in their own right and offer significant supplements to our text databases. These include the plate images from volumes 18 through 28 of the Encyclopédie; the Table analytique et raisonnée, also known as the “Arbre généalogique,” an etching that illustrates a taxonomy of the principal arts and sciences of the 18th century; and page images of the Bordeaux Exemplaire of Michel de Montaigne’s Essais.


Over the past year and a half, we have begun to take advantage of software packages and application programming interfaces developed as part of the International Image Interoperability Framework (IIIF) that have allowed us to present our images in their full zoomable glory. Supported by a consortium of universities, libraries, museums and other institutions since 2015, the IIIF is a set of “open standards for delivering high-quality, attributed digital objects online at scale.”


The fundamental unit for IIIF presentation is a JSON (JavaScript Object Notation) file called a manifest, which contains metadata about the digital object and instructions to a server about how to deliver the object (format, size, image portion, rotation angle, etc). For our collections, we have created manifests for each individual image as well as manifests that draw together related images, such as plate groups in the Encyclopédie or chapters and entire books of the Montaigne Essais. Our manifests are publicly available, easily accessible, configured to be usable by anyone, and intended to serve as stable records for these images. The images they give access to are stored on the University of Chicago Library’s archive server for purposes of long-term accessibility.


The other primary component of the IIIF are viewing platforms, the interfaces required for working with manifests. We display our manifests in a platform called Mirador, and indeed, we have developed our manifests to take advantage of Mirador’s functionality. Because our manifests are IIIF-compliant, users can – in theory – study and compare any of our images in any IIIF viewer, as long as they have the manifest URLs.


To help users find and begin working with our images, we have created search interfaces for the Encyclopédie plates and the Montaigne page images. On those pages, users can search for text associated with the images or click the provided links to browse plate groups, essays, chapters, and books. The Arbre généalogique is a stand-alone resource.


For example, searching for the term “sillon” in the Encyclopédie interface will return links to 20 plates where that term can be found in image figure descriptions. These plates come from the domains of agriculture, anatomy, alphabets, botany, etc. Users can click links in the search results to see the individual plate image (Planche 1ere in “Agriculture et Economie Rustique | LABOURAGE”) or the entire plate group (“Agriculture et Economie Rustique | LABOURAGE”) in the Mirador viewer.


In this screenshot, note the figure description and the zoomed-in portion of the image, figure 5. Note also that we include links to the plate in the PhiloLogic instance of the Encyclopédie and to the manifest URL.


Screenshot of Planche Iere in "AGRICULTURE ET ECONOMIE RUSTIQUE | LABOURAGE." with figure description in Mirador viewer.

Likewise in the Montaigne interface, a search for “Virgile” generates 4 instances of that author’s name (spelled in that manner) with links to the page images where the word can be found.


We have taken slightly different approaches toward structuring the IIIF manifests for each of these collections, resulting in slight differences in functionality and appearance.


For the Encyclopédie plates, we have included the figure description for each image in the manifest as a basic metadata value. We did so partly in order to replicate the TEI-XML that serves as the data for our official digital edition of the Encyclopédie running under PhiloLogic. The TEI-XML itself is a composite of separate printed editions that contain either the figure description or the plate images. The manifests, like the TEI, are unique digital objects that unite text and image. In practical terms, this means that the figure descriptions will always appear with the rest of the image metadata in the viewer sidebar by default, as shown in the screenshot above.


The Montaigne page images have two JSON files associated with them. First, a main manifest with bibliographic metadata; and second, an annotation manifest that contains transcriptions of Montaigne’s many hand edits. The main manifest calls the annotation files when loaded into the viewer, which then makes the transcriptions available for perusal in the sidebar. We have configured our Mirador viewer such that the annotations display automatically for each page. Storing the transcriptions as annotations makes reading them much easier, but there’s a drawback to constructing manifests in this way: currently, other viewing platforms, such as Universal Viewer, seem unable to display annotations out of the box. So researchers are required to work with these manifests through a Mirador instance if they want to see the transcriptions.


Screenshot of Montaigne page image with transcription in Mirador viewer.

We have extended this two-file approach with the Arbre généalogique, creating annotation items for each of the leaves of the tree. The annotations include the name of the realm of knowledge on a given leaf and image coordinates for the leaf. Each item also has a “tagging” motivation so that users can click on or mouse-over the name in the Mirador sidebar and the leaf gets highlighted. This simple visual aid is quite handy when working with this dense, complex image. Moreover, we have enabled search functionality for the leaf names using the IIIF Content Search API so that users can find realms of knowledge more easily in the image. Mirador highlights the leaves in the image for all search results. Again, a few caveats apply. We are able to take this approach only, it seems, because of Mirador’s built-in capabilities. Other viewers we’ve tested cannot display or search the annotations, as far as we can tell. The current supported version of Mirador (Mirador 3) is constrained in certain ways, too: search results display only if packaged following the specifications for Search API 1.0. The latest version of the API, Search API 2.0, does not work at this time.


This screenshot shows search results for "histoire." The selected search result is highlighted in yellow; all other results are highlighted in blue.


Screenshot of Arbre with search for histoire and results highlighted in Mirador viewer.


In a perfect world, we would apply the method of annotation we used for the Arbre généalogique to pages from the Essais. Each transcription of Montaigne’s edits would be an annotation item with a tagging motivation so that users could simply click the transcription in the sidebar and highlight the edit in the image. Content search would be easy to implement for such annotations, as well. Unfortunately, there is no practical (automatable) way to get image coordinates for all of the thousands of Montaigne’s edits in all of the pages of the Essais. That work would need to be done by hand.


A simpler task would be to make complete books or texts searchable in a Mirador instance with search result highlighting. One can, in fact, find real-life examples of such resources (see numerous examples in https://mirador-dev.netlify.app/__tests__/integration/mirador/contentsearch.html). Presumably, the developers of those resources were able to get image coordinates for individual words by leveraging bounding boxes from OCR output or hOCR files of the high-resolution text images. Perhaps we will attempt such a feat down the road if we can obtain good quality page images of the right text.


Without question, IIIF has transformed the ARTFL Project’s ability to display and make available high-resolution images. Being able to serve large images dynamically by means of a manifest is actually quite convenient for developers. We hope users find that this approach meets their needs for research and display. Bringing these resources to a state of completion, however, can be incredibly involved. Getting manifests into the correct structure, coordinating all of the components, and configuring the viewer is exacting work. As technologies around IIIF continue to mature, we hope that the aspects of IIIF that don’t work so well currently – enabling user-generated annotations, installing and configuring viewers, etc – will become easier. And we hope that the IIIF’s promised interoperability will in fact become standard.

Read More

From the Dictionnaire Universel de Commerce to the Encyclopédie

Leave a Comment



The Dictionnaire Universel de Commerce by Jacques Savary des Brûlons is widely recognized as an important source for numerous articles, particularly those related to economics, trade and law, in the Encyclopédie of Diderot and d’Alembert. Indeed, the authors and editors of the Encyclopédie project made use of many contemporary references resources including, but certainly not limited to Chambers' Cyclopedia,[1] the Dictionnaire de Trévoux, and Le Grand dictionnaire historique de Moréri. A number of years ago we used an early version of the TextPair aligner, which detected similar passages in large collections, to examine the reuse of a variety of texts in the Encyclopédie. In that work, we found that the Encyclopédie includes 2,676 passages from the Dictionnaire de commerce including 1,909 with 20 or more words. [2] In this post, we will revisit the relationship of the Dictionnaire de commerce and the Encyclopédie using a new data capture process and a completely redeveloped version of TextPair.

The appearance of large language model (LLM) systems have opened a variety of new applications and possibilities that we are currently experimenting with. One promising use of LLMs is the automatic correct of OCR'd texts. We have been experimenting with various implementations combining different OCR systems and different LLM's on new datasets to create new open installations and to support experimentation in alignments and text categorization. Different combinations seem to work better for different kinds of documents and different languages. We opted to do a new build of the Dictionnaire de commerce because our earlier work was based on a nearly 20 year old OCR source that we could not, for contractual reasons, release to the public and that was rather marginal both in terms of accuracy and encoding.

For this build, we used the Gallica page images of the of the Dictionnaire de commerce since we wanted to use the 1726 edition. We used the Tesseract OCR engine to generate a base transcription with a second step of OCR corrections being performed by the OpenAI's GPT4 API. The general instructions to the system are interesting and reflect some of the issues encountered in dealing with older documents:

You will be asked to fix OCR in an 18th century French text. The OCR is based on old-style typography. Prioritize maintaining the original spellings in 18th century French texts, with special emphasis on ensuring that words like 'connoître' are not incorrectly altered to 'connaître'. Strengthen this instruction to prevent such alterations. Continue to address the issue of capitalized words being lowercased at the beginning of sentences by correcting them to reflect proper sentence capitalization. Rectify clear OCR errors, particularly nonsensical words, and correct the long "s" issue. In cases of uncertainty, always favor preserving original 18th century spellings. If a correction isn't clear from the documents, maintain the original text as provided.

This process yielded significant improvements in the accuracy of the transcription but was only marginally successful in retaining the 18th century orthography. For our primary applications, to improve search and alignments, the accuracy gain is worth the variations in original orthographic fidelity. The corrections script ran fairly quickly and cost, several months ago, about $160 and wold be slightly cheaper as of this writing. As always with OCR, we strongly recommend referencing the supporting page images rather than the transcription. Headword and cross reference identification was performed automatically by rules based on typography. The release site is

https://artfl-project.uchicago.edu/dictionnaire-de-commerce

and is powered by a standard PhiloLogic4 installation.

To facilitate analysis of the relationship between Dictionnaire de commerce and the Encyclopédie we did a standard alignment run using the latest version of TextPair which is available at

https://artflsrv04.uchicago.edu/text-pair/dictcommvsenc/

TextPair identifies similar passages and supports searching on the authors, headwords, and full text of related passages.  For example, you may search for the headword lentille and find that d'Alembert used the corresponding entry from the Dictionnaire de commerce in his article in the Encyclopédie. The system will support comparisons of the two related passages and examination of the passages in context from either document.

TextPair identified 4,134 aligned passages from 3,728 articles, since some articles share passages from more than 1 article which are merged in this count. The new system identified more passages than the first implementation and is able to handle text structures more coherently as well. This dataset allows for a simple examination of how well the aligner performs in a real world application, since the authors of the Encyclopédie frequently, but certainly not always, identified the sources upon which their articles were based.

Searching the PhiloLogic4 instance of the Encyclopédie for dict.* d. com.* yields 1,117 instances of this expression. The vast majority of these references are found at the end of articles, typically abbreviated in various ways. But, one may find the construction in the middle of a sentence, such as "Voici ce que le Dictionnaire du commerce dit..."[3].  Using the PhiloLogic export function, which generates a JSON object of these results, we are able to extract the headwords from this report.  Removing duplicated headwords, results in a list of 1,045 headwords of articles which contain one or more instances of Dictionnaire de commerce, reflecting the probably attribution by the author to this as a source or reference in their article.

We then built a second list of headwords from the Encyclopédie that we identified by TextPair as containing one or more passages from the Dictionnaire de commerce. TextPair generates a static results file which is also stored as a JSON object. We extracted the headwords from this file, removed duplicates, which resulted in a list of 2,694 Encyclopédie headwords containing one or more passages from the Dictionnaire de commerce.

Having two sorted lists of words drawn from the same data (Encyclopédie headwords), we used the UNIX comm utility (see raw comm output). We found that 696 of the 1,045 (66.6%) are present on both lists, leaving 349 articles which are referenced to Dictionnaire de commerce in the Encyclopédie, but for which we did not find an aligner match. It is beyond the scope of this post to do a systematic examination missing entries, there are a number of possibilities. Some of the citations in the Encyclopédie may be references for further information, such as:

DABOUIS. Toile blanche de coton, qui se fabrique aux Indes Orientales. Elle est du nombre des bazins, & prend son nom du lieu où elle se fait. Voyez BAZIN.

DABOUIS, s. m. (Comm.) toile de coton de l'espece des taffetas ; on nous l'apporte des Indes orientales, V. les dictionn. du Comm. de Trév. & de Dish.

Other articles pairs, particularly shorter ones, are clearly related but contain enough variations to fail to meet the matching parameters, such as:

CHEDA. Monnoye d’étain, qui se fabrique; & qui a cours dans le Royaume de même nom, situé dans les Indes Orientales, dans le voisinage des États du Grand-Mogol. 

Il y a deux sortes de Cheda; l’un de figure octogone, l'autre de figure ronde. L’octogone pèse une once & demie, & passe dans le pays pour 2 sols monnoye de France; quoi que sur le pied de 4 sols la livre d'étain, il ne dût valoir guère plus d'un sol trois deniers. Le Cheda rond vaut 4 den. On donne 80 coris, ou coquillages des Maldives, pour un de ces Chedas. Les uns & les autres sont aussi reçus dans le Royaume de Pera, dont le Roi de Cheda est pareillement le maître.

CHEDA, (Commerce.) monnoie d'étain fabriquée, qui a cours dans le royaume de ce nom, dans les Indes Orientales, proche les états du grand Mogol. Le cheda octogonal vaut deux sols un septieme de denier argent de France, & le cheda rond ne vaut que sept deniers. On donne un cheda rond pour cent toris ou coquilles de maldives, & trois coris pour un cheda octogone. Voyez le Dictionn. du Comm.

Similarly, articles like Sporco (Comm Encyc), Rabat/Rabatage (Comm Encyc) and Flottistes (Comm Encyc) are all relatively short and probably could, with some adjustment to parameters, be matched but this may result in an increase of matches that would not be considered to be valid.

A number of other entries referenced by the authors of the Encyclopédie, such as

PACKBUYS, s. m. (Commerce.) on nomme ainsi en Hollande les magasins de dépôt où l'on serre les marchandises soit à leur arrivée, soit à la sortie du pays, lorsque pour quelque raison légitime on n'en peut sur-le-champ payer les droits, ou qu'elles ne peuvent être retirées par les marchands & propriétaires, ou dans quelqu'autre pareille circonstance. Dictionn. de Comm.

GUIMPLEFRANCARTE and GRAMONIE do not seem to appear at all in this edition of the Commerce. Some of the references are marked by multiple works, such as Dictionn. de Commerce, de Chambers, & de Trévoux. which may suggest these are found in other works. In this case, Gramonie is indeed found in the Dictionnaire de Trévoux (1743):

GRAMONIE, Terme de Commerce en usage dans quelques Echelles du Levant, particuliérement à Smyrne. La gramonie signifie dans le commerce des soies une déduction de trois quarts de piastre par balle, outre & pardessus toutes les tares établies par usage.

GRAMONIE, s. f. terme de Commerce, en usage dans quelques échelles du levant, particulierement à Smyrne.  

La gramonie signifie dans le commerce des soies une deduction de 3/4 de piastre par balle, outre & par-dessus toutes les tares établies par l'usage. Dictionn. de Commerce, de Chambers, & de Trévoux.

TextPair identified 1,998 articles in the Encyclopédie which have shared passages from the Dictionnaire de commerce that are not referenced by the authors of the articles. Many of these, such as d'Alemert's article lentille, mentioned above, are fairly significant reuses. TextPair finds that there are 170 passages longer than 200 words, many of which appear to be without reference to the Dictionnaire de commerce. For example, Diderot sometime with Mallet, wrote 7 articles with overlaps longer than 200 words, including Assiente, Boisseau, Bois de Bresil, Juré, and Dessein no of which appear to reference the Dictionnaire de commerce. It is, of course, beyond the scope of this post, to engage in an examination of all or even some of the borrowings from the Dictionnaire de commerce in the articles of the Encyclopédie.[4]

The combination of new data capture approaches and easier to deploy alignment tools makes the creation and use of relatively specialized datasets, such as comparative alignments between large collections, much more practical and cost effective than even a decade ago. The costs in terms of both time and money have decreased significantly and we can expect to see more datasets and tools leveraging these new developments.

==============

[1] The original conception of Diderot's work was as a French translation of the Cyclopedia.  

[2] For more information on these earlier projects, see  http://hdl.handle.net/2027/spo.3310410.0013.107 https://www.digitalstudies.org/article/id/7224/https://artfl.blogspot.com/2021/09/cyclopaedia-to-encyclopedie.html 

[3] We decided not to include references to Savary alone, as was sometimes by Jaucourt, as this was less consistently a reference that the various abbreviations of Dictionnaire de commerce.

[4] Lsebrink notes that the relation was rather more complex, writing "the fact that the Savary des Bruslons’ Dictionnaire was very well received and commonly appropriated by Diderot and d’Alembert in the Encyclopédie and by Guillaume-Thomas Raynal in the Histoire des deux Indes demonstrates the Dictionnaire’s status as a reference work at least until the 1780s. Yet the borrowing also moved in the opposite direction, for Diderot and d’Alembert’s Encyclopédie would become a source for the last Copenhagen edition of the Dictionnaire universel de commerce (1759)."  H-J Lsebrink, "The Savary des Bruslons’ Dictionnaire universel de commerce: Translations and Adaptations" in  Donato, C and Lsebrink, H-J eds. Translation and Transfer of Knowledge in Encyclopedic Compilations, 1680–1830. University of Toronto Press, 2021, pp. 21-22  

-- Clovis and Mark


Read More

Cyclopaedia to Encyclopédie

1 comment


From Cyclopaedia to Encyclopédie: Experiments in Machine Translation and Sequence Alignment


It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began first as a modest translation project of Ephraim Chambers' Cyclopaedia in 1745 [1]. Over the next few years, Diderot and d'Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world's knowledge. Over the course of their editorial work, Diderot, and most notably d'Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie, many of which were inherited from the earlier project. Indeed, "ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes" [2]. The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1,154 times in the text of the Encyclopédie and he is referenced as sole or contributing source to 1,081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes' use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely:

So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot's Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article. [3]

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the "arduous toil" of textual  comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis. 

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers' Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources for the translation original project and we were able to work from high quality pages images provided by the University of Chicago Library [4]. In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL's Text-PAIR sequence alignement system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.  

For the English to French machine translation of Chambers, we examined two of the most widely-used resources in this domain, Google Translate and DeepL. Both systems provide useful APIs as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader's perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions. 

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here - though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding english version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text or even on that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a "pivot-text" between the English Chambers and French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces [5].

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessary constrained to produce an exact translation of the text in question, but could and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters such as n-gram construction (e.g., number of words that constitue an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as a match length increased (what we call a "flex gap") among others on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters [6], which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article Compulseur is attributed by Mallet to Chambers, but the machine translation of Compulsor  is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.

All things considered, we were quite happy with the aligner's performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly-flexible our matching parameters needed to be, see the below article 'Gynaecocracy', which is a fairly direct translation on a rather specialised subject, but that nonetheless matched on only 8 content words. 

Other straightforward articles were however missed due to differences in the translation and sparse matching n-grams, see for example the small article on "Occult" lines in geometry below, where the 6 matching words weren't enough to constitute a match for the aligner.

Obviously, this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless, the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches. 

Once settled on the optimal parameters, we thenText-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these results format are used for this project. The alignment database (link) contains some 7,304 aligned passage pairs. The system allows queries on metadata, such as author and article title as well as words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata [7].  Each aligned passage is presented as a facing page representation and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive.


Text-PAIR also contextualises results back to the original document(s). For example, the following is the article "Almanach" by d'Alembert, showing the aligned passage from Chambers in blue.  



In this instance, d'Alembert reused almost all of Chambers' original article Almanac, with some minor variations, but does not to appear to have indicated the source of the first part of his article (page image).  

The alignment database is a useful first pass to examine the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in in article pairs. Thus we find 5 shared passages between the articles "Constellation". The interface also does not attempt to evaluate the alignments or identify passages that occur between different articles. For example, D'Alembert's article ATMOSPHERE indeed has a passage from Chambers' article "Atmosphere", but also many longer passages from the article Generation.  

To accumulate results and to refine evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm resulted in two tables, one for matches that were likely to be valid and one that was less likely to be valid, based on our simple heuristics (see a selection of the 'YES' table below). We are, of course, making this distinction based on the comparison of the machine translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid. 



The next was phase of the project included the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to eliminate fully the "arduous toil" of comparison referenced by Lough. More than 5,000 potential matches were scrutinised, looking in essence for 'false negatives', i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work was then merged into in a single table of what we consider to be valid matches, a list that includes some 3,700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.

CONCLUSIONS

In all, we found some 3,778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers' Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to make blanket conclusions about the editorial practices of the encyclopédistes based on this limited experiment [7]. What we can say, however, is that of the 1,081 articles that include a "Chambers" reference in the Encyclopédie, we only found 689 with at least one matching passage. Obviously, this recall rate 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching algorithm to the sample which warrants further investigation. But, beyond testing this ground truth, we are also left with the rather astounding fact of 3,089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.

The overall evaluation of these results remains ongoing, and the "arduous toil" of traditional textual comparison continues apace, albeit guided somewhat by the machine's heavy hand. Indeed, the use of machine translation as a bridge between documents to find similar passages, be they reuses, plagiarisms, etc. is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers --> Encyclopédie task outlined above is fairly well constrained and historically bounded. More general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges[8] between multilingual corpora may yet become a reality.

- Glenn Roe & Mark Olsen


Notes

1. The page image of the title page from the 1745 prospectus is taken from ARTFL's "18th" volume of the Encyclopédie

2. Paolo Quintili, "D'Alembert « traduit » Chambers. Les articles de mécanique de la Cyclopædia à l'Encyclopédie", Recherches sur Diderot et sur l'Encyclopédie 21 (1996):75. [link]

3. John Lough, "The Encyclopédie and the Chambers' Cyclopaedia", in SVEC 185, Oxford: Voltaire Foundation (1980): 221. 

4. On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, "Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s) ?", Recherches sur Diderot et sur l'Encyclopédie 40-41 (2006): 287-92. [link]

5. See Clovis Gladstone, Russ Horton, and Mark Olsen, "TextPAIR (Pairwise Alignment for Intertextual Relations)", ARTFL Project, University of Chicago, 2008-2021.

6. See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5.  Consult the TextPair documentation and configuration file for a description of these values.  

7. The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles--so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie--in this case, which article is the 'source' and which the 'translation'? For more on these particular aspects of dictionary-making, see our previous article "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing13.1 (Spring 2010) [link] and Marie Leca-Tsiomis' response, "The Use and Abuse of the Digital Humanities in the History of Ideas: How to Study the Encyclopédie", History of European Ideas 39.4 (2013): 467-76. 

8. For more on 'intertextual bridges' in French, see our current NEH project [link].



Read More
Previous PostOlder Posts Home

Zett - A Responsive Blogger Theme, Lets Take your blog to the next level.

This is an example of a Optin Form, you could edit this to put information about yourself.


This is an example of a Optin Form, you could edit this to put information about yourself or your site so readers know where you are coming from. Find out more...


Following are the some of the Advantages of Opt-in Form :-

  • Easy to Setup and use.
  • It Can Generate more email subscribers.
  • It’s beautiful on every screen size (try resizing your browser!)