ARTFL Project Research Blog

Presenting ARTFL's high-resolution images with the International Image Interoperability Framework

Ch. Cooney Wednesday, November 13, 2024 1 comment

Those familiar with the ARTFL Project and our work know that we specialize in handling digitized text. Our primary focus is to develop digitized text corpora (mostly in French) and software platforms that scholars and students can use to conduct research on those corpora. Images and image resources have been and always will be a secondary consideration for us. Nevertheless, we have many high-quality, high-resolution images that are remarkable objects of study in their own right and offer significant supplements to our text databases. These include the plate images from volumes 18 through 28 of the Encyclopédie; the Table analytique et raisonnée, also known as the “Arbre généalogique,” an etching that illustrates a taxonomy of the principal arts and sciences of the 18th century; and page images of the Bordeaux Exemplaire of Michel de Montaigne’s Essais.

Over the past year and a half, we have begun to take advantage of software packages and application programming interfaces developed as part of the International Image Interoperability Framework (IIIF) that have allowed us to present our images in their full zoomable glory. Supported by a consortium of universities, libraries, museums and other institutions since 2015, the IIIF is a set of “open standards for delivering high-quality, attributed digital objects online at scale.”

The fundamental unit for IIIF presentation is a JSON (JavaScript Object Notation) file called a manifest, which contains metadata about the digital object and instructions to a server about how to deliver the object (format, size, image portion, rotation angle, etc). For our collections, we have created manifests for each individual image as well as manifests that draw together related images, such as plate groups in the Encyclopédie or chapters and entire books of the Montaigne Essais. Our manifests are publicly available, easily accessible, configured to be usable by anyone, and intended to serve as stable records for these images. The images they give access to are stored on the University of Chicago Library’s archive server for purposes of long-term accessibility.

The other primary component of the IIIF are viewing platforms, the interfaces required for working with manifests. We display our manifests in a platform called Mirador, and indeed, we have developed our manifests to take advantage of Mirador’s functionality. Because our manifests are IIIF-compliant, users can – in theory – study and compare any of our images in any IIIF viewer, as long as they have the manifest URLs.

To help users find and begin working with our images, we have created search interfaces for the Encyclopédie plates and the Montaigne page images. On those pages, users can search for text associated with the images or click the provided links to browse plate groups, essays, chapters, and books. The Arbre généalogique is a stand-alone resource.

For example, searching for the term “sillon” in the Encyclopédie interface will return links to 20 plates where that term can be found in image figure descriptions. These plates come from the domains of agriculture, anatomy, alphabets, botany, etc. Users can click links in the search results to see the individual plate image (Planche 1ere in “Agriculture et Economie Rustique | LABOURAGE”) or the entire plate group (“Agriculture et Economie Rustique | LABOURAGE”) in the Mirador viewer.

In this screenshot, note the figure description and the zoomed-in portion of the image, figure 5. Note also that we include links to the plate in the PhiloLogic instance of the Encyclopédie and to the manifest URL.

Screenshot of Planche Iere in "AGRICULTURE ET ECONOMIE RUSTIQUE | LABOURAGE." with figure description in Mirador viewer.

Likewise in the Montaigne interface, a search for “Virgile” generates 4 instances of that author’s name (spelled in that manner) with links to the page images where the word can be found.

We have taken slightly different approaches toward structuring the IIIF manifests for each of these collections, resulting in slight differences in functionality and appearance.

For the Encyclopédie plates, we have included the figure description for each image in the manifest as a basic metadata value. We did so partly in order to replicate the TEI-XML that serves as the data for our official digital edition of the Encyclopédie running under PhiloLogic. The TEI-XML itself is a composite of separate printed editions that contain either the figure description or the plate images. The manifests, like the TEI, are unique digital objects that unite text and image. In practical terms, this means that the figure descriptions will always appear with the rest of the image metadata in the viewer sidebar by default, as shown in the screenshot above.

The Montaigne page images have two JSON files associated with them. First, a main manifest with bibliographic metadata; and second, an annotation manifest that contains transcriptions of Montaigne’s many hand edits. The main manifest calls the annotation files when loaded into the viewer, which then makes the transcriptions available for perusal in the sidebar. We have configured our Mirador viewer such that the annotations display automatically for each page. Storing the transcriptions as annotations makes reading them much easier, but there’s a drawback to constructing manifests in this way: currently, other viewing platforms, such as Universal Viewer, seem unable to display annotations out of the box. So researchers are required to work with these manifests through a Mirador instance if they want to see the transcriptions.

Screenshot of Montaigne page image with transcription in Mirador viewer.

We have extended this two-file approach with the Arbre généalogique, creating annotation items for each of the leaves of the tree. The annotations include the name of the realm of knowledge on a given leaf and image coordinates for the leaf. Each item also has a “tagging” motivation so that users can click on or mouse-over the name in the Mirador sidebar and the leaf gets highlighted. This simple visual aid is quite handy when working with this dense, complex image. Moreover, we have enabled search functionality for the leaf names using the IIIF Content Search API so that users can find realms of knowledge more easily in the image. Mirador highlights the leaves in the image for all search results. Again, a few caveats apply. We are able to take this approach only, it seems, because of Mirador’s built-in capabilities. Other viewers we’ve tested cannot display or search the annotations, as far as we can tell. The current supported version of Mirador (Mirador 3) is constrained in certain ways, too: search results display only if packaged following the specifications for Search API 1.0. The latest version of the API, Search API 2.0, does not work at this time.

This screenshot shows search results for "histoire." The selected search result is highlighted in yellow; all other results are highlighted in blue.

Screenshot of Arbre with search for histoire and results highlighted in Mirador viewer.

In a perfect world, we would apply the method of annotation we used for the Arbre généalogique to pages from the Essais. Each transcription of Montaigne’s edits would be an annotation item with a tagging motivation so that users could simply click the transcription in the sidebar and highlight the edit in the image. Content search would be easy to implement for such annotations, as well. Unfortunately, there is no practical (automatable) way to get image coordinates for all of the thousands of Montaigne’s edits in all of the pages of the Essais. That work would need to be done by hand.

A simpler task would be to make complete books or texts searchable in a Mirador instance with search result highlighting. One can, in fact, find real-life examples of such resources (see numerous examples in https://mirador-dev.netlify.app/__tests__/integration/mirador/contentsearch.html). Presumably, the developers of those resources were able to get image coordinates for individual words by leveraging bounding boxes from OCR output or hOCR files of the high-resolution text images. Perhaps we will attempt such a feat down the road if we can obtain good quality page images of the right text.

Without question, IIIF has transformed the ARTFL Project’s ability to display and make available high-resolution images. Being able to serve large images dynamically by means of a manifest is actually quite convenient for developers. We hope users find that this approach meets their needs for research and display. Bringing these resources to a state of completion, however, can be incredibly involved. Getting manifests into the correct structure, coordinating all of the components, and configuring the viewer is exacting work. As technologies around IIIF continue to mature, we hope that the aspects of IIIF that don’t work so well currently – enabling user-generated annotations, installing and configuring viewers, etc – will become easier. And we hope that the IIIF’s promised interoperability will in fact become standard.

From the Dictionnaire Universel de Commerce to the Encyclopédie

Mark Thursday, November 07, 2024 Leave a Comment

The Dictionnaire Universel de Commerce by Jacques Savary des Brûlons is widely recognized as an important source for numerous articles, particularly those related to economics, trade and law, in the Encyclopédie of Diderot and d’Alembert. Indeed, the authors and editors of the Encyclopédie project made use of many contemporary references resources including, but certainly not limited to Chambers' Cyclopedia,[1] the Dictionnaire de Trévoux, and Le Grand dictionnaire historique de Moréri. A number of years ago we used an early version of the TextPair aligner, which detected similar passages in large collections, to examine the reuse of a variety of texts in the Encyclopédie. In that work, we found that the Encyclopédie includes 2,676 passages from the Dictionnaire de commerce including 1,909 with 20 or more words. [2] In this post, we will revisit the relationship of the Dictionnaire de commerce and the Encyclopédie using a new data capture process and a completely redeveloped version of TextPair.

The appearance of large language model (LLM) systems have opened a variety of new applications and possibilities that we are currently experimenting with. One promising use of LLMs is the automatic correct of OCR'd texts. We have been experimenting with various implementations combining different OCR systems and different LLM's on new datasets to create new open installations and to support experimentation in alignments and text categorization. Different combinations seem to work better for different kinds of documents and different languages. We opted to do a new build of the Dictionnaire de commerce because our earlier work was based on a nearly 20 year old OCR source that we could not, for contractual reasons, release to the public and that was rather marginal both in terms of accuracy and encoding.

For this build, we used the Gallica page images of the of the Dictionnaire de commerce since we wanted to use the 1726 edition. We used the Tesseract OCR engine to generate a base transcription with a second step of OCR corrections being performed by the OpenAI's GPT4 API. The general instructions to the system are interesting and reflect some of the issues encountered in dealing with older documents:

You will be asked to fix OCR in an 18th century French text. The OCR is based on old-style typography. Prioritize maintaining the original spellings in 18th century French texts, with special emphasis on ensuring that words like 'connoître' are not incorrectly altered to 'connaître'. Strengthen this instruction to prevent such alterations. Continue to address the issue of capitalized words being lowercased at the beginning of sentences by correcting them to reflect proper sentence capitalization. Rectify clear OCR errors, particularly nonsensical words, and correct the long "s" issue. In cases of uncertainty, always favor preserving original 18th century spellings. If a correction isn't clear from the documents, maintain the original text as provided.

This process yielded significant improvements in the accuracy of the transcription but was only marginally successful in retaining the 18th century orthography. For our primary applications, to improve search and alignments, the accuracy gain is worth the variations in original orthographic fidelity. The corrections script ran fairly quickly and cost, several months ago, about $160 and wold be slightly cheaper as of this writing. As always with OCR, we strongly recommend referencing the supporting page images rather than the transcription. Headword and cross reference identification was performed automatically by rules based on typography. The release site is

https://artfl-project.uchicago.edu/dictionnaire-de-commerce

and is powered by a standard PhiloLogic4 installation.

To facilitate analysis of the relationship between Dictionnaire de commerce and the Encyclopédie we did a standard alignment run using the latest version of TextPair which is available at

https://artflsrv04.uchicago.edu/text-pair/dictcommvsenc/

TextPair identifies similar passages and supports searching on the authors, headwords, and full text of related passages. For example, you may search for the headword lentille and find that d'Alembert used the corresponding entry from the Dictionnaire de commerce in his article in the Encyclopédie. The system will support comparisons of the two related passages and examination of the passages in context from either document.

TextPair identified 4,134 aligned passages from 3,728 articles, since some articles share passages from more than 1 article which are merged in this count. The new system identified more passages than the first implementation and is able to handle text structures more coherently as well. This dataset allows for a simple examination of how well the aligner performs in a real world application, since the authors of the Encyclopédie frequently, but certainly not always, identified the sources upon which their articles were based.

Searching the PhiloLogic4 instance of the Encyclopédie for dict.* d. com.* yields 1,117 instances of this expression. The vast majority of these references are found at the end of articles, typically abbreviated in various ways. But, one may find the construction in the middle of a sentence, such as "Voici ce que le Dictionnaire du commerce dit..."[3]. Using the PhiloLogic export function, which generates a JSON object of these results, we are able to extract the headwords from this report. Removing duplicated headwords, results in a list of 1,045 headwords of articles which contain one or more instances of Dictionnaire de commerce, reflecting the probably attribution by the author to this as a source or reference in their article.

We then built a second list of headwords from the Encyclopédie that we identified by TextPair as containing one or more passages from the Dictionnaire de commerce. TextPair generates a static results file which is also stored as a JSON object. We extracted the headwords from this file, removed duplicates, which resulted in a list of 2,694 Encyclopédie headwords containing one or more passages from the Dictionnaire de commerce.

Having two sorted lists of words drawn from the same data (Encyclopédie headwords), we used the UNIX comm utility (see raw comm output). We found that 696 of the 1,045 (66.6%) are present on both lists, leaving 349 articles which are referenced to Dictionnaire de commerce in the Encyclopédie, but for which we did not find an aligner match. It is beyond the scope of this post to do a systematic examination missing entries, there are a number of possibilities. Some of the citations in the Encyclopédie may be references for further information, such as:

DABOUIS. Toile blanche de coton, qui se fabrique aux Indes Orientales. Elle est du nombre des bazins, & prend son nom du lieu où elle se fait. Voyez BAZIN.
DABOUIS, s. m. (Comm.) toile de coton de l'espece des taffetas ; on nous l'apporte des Indes orientales, V. les dictionn. du Comm. de Trév. & de Dish.

Other articles pairs, particularly shorter ones, are clearly related but contain enough variations to fail to meet the matching parameters, such as:

CHEDA. Monnoye d’étain, qui se fabrique; & qui a cours dans le Royaume de même nom, situé dans les Indes Orientales, dans le voisinage des États du Grand-Mogol.
Il y a deux sortes de Cheda; l’un de figure octogone, l'autre de figure ronde. L’octogone pèse une once & demie, & passe dans le pays pour 2 sols monnoye de France; quoi que sur le pied de 4 sols la livre d'étain, il ne dût valoir guère plus d'un sol trois deniers. Le Cheda rond vaut 4 den. On donne 80 coris, ou coquillages des Maldives, pour un de ces Chedas. Les uns & les autres sont aussi reçus dans le Royaume de Pera, dont le Roi de Cheda est pareillement le maître.

CHEDA, (Commerce.) monnoie d'étain fabriquée, qui a cours dans le royaume de ce nom, dans les Indes Orientales, proche les états du grand Mogol. Le cheda octogonal vaut deux sols un septieme de denier argent de France, & le cheda rond ne vaut que sept deniers. On donne un cheda rond pour cent toris ou coquilles de maldives, & trois coris pour un cheda octogone. Voyez le Dictionn. du Comm.

Similarly, articles like Sporco (Comm Encyc), Rabat/Rabatage (Comm Encyc) and Flottistes (Comm Encyc) are all relatively short and probably could, with some adjustment to parameters, be matched but this may result in an increase of matches that would not be considered to be valid.

A number of other entries referenced by the authors of the Encyclopédie, such as

PACKBUYS, s. m. (Commerce.) on nomme ainsi en Hollande les magasins de dépôt où l'on serre les marchandises soit à leur arrivée, soit à la sortie du pays, lorsque pour quelque raison légitime on n'en peut sur-le-champ payer les droits, ou qu'elles ne peuvent être retirées par les marchands & propriétaires, ou dans quelqu'autre pareille circonstance. Dictionn. de Comm.

GUIMPLE, FRANCARTE and GRAMONIE do not seem to appear at all in this edition of the Commerce. Some of the references are marked by multiple works, such as Dictionn. de Commerce, de Chambers, & de Trévoux. which may suggest these are found in other works. In this case, Gramonie is indeed found in the Dictionnaire de Trévoux (1743):

GRAMONIE, Terme de Commerce en usage dans quelques Echelles du Levant, particuliérement à Smyrne. La gramonie signifie dans le commerce des soies une déduction de trois quarts de piastre par balle, outre & pardessus toutes les tares établies par usage.

GRAMONIE, s. f. terme de Commerce, en usage dans quelques échelles du levant, particulierement à Smyrne.
La gramonie signifie dans le commerce des soies une deduction de 3/4 de piastre par balle, outre & par-dessus toutes les tares établies par l'usage. Dictionn. de Commerce, de Chambers, & de Trévoux.

TextPair identified 1,998 articles in the Encyclopédie which have shared passages from the Dictionnaire de commerce that are not referenced by the authors of the articles. Many of these, such as d'Alemert's article lentille, mentioned above, are fairly significant reuses. TextPair finds that there are 170 passages longer than 200 words, many of which appear to be without reference to the Dictionnaire de commerce. For example, Diderot sometime with Mallet, wrote 7 articles with overlaps longer than 200 words, including Assiente, Boisseau, Bois de Bresil, Juré, and Dessein no of which appear to reference the Dictionnaire de commerce. It is, of course, beyond the scope of this post, to engage in an examination of all or even some of the borrowings from the Dictionnaire de commerce in the articles of the Encyclopédie.[4]

The combination of new data capture approaches and easier to deploy alignment tools makes the creation and use of relatively specialized datasets, such as comparative alignments between large collections, much more practical and cost effective than even a decade ago. The costs in terms of both time and money have decreased significantly and we can expect to see more datasets and tools leveraging these new developments.

==============

[1] The original conception of Diderot's work was as a French translation of the Cyclopedia.

[2] For more information on these earlier projects, see http://hdl.handle.net/2027/spo.3310410.0013.107 , https://www.digitalstudies.org/article/id/7224/, https://artfl.blogspot.com/2021/09/cyclopaedia-to-encyclopedie.html

[3] We decided not to include references to Savary alone, as was sometimes by Jaucourt, as this was less consistently a reference that the various abbreviations of Dictionnaire de commerce.

[4] Lüsebrink notes that the relation was rather more complex, writing "the fact that the Savary des Bruslons’ Dictionnaire was very well received and commonly appropriated by Diderot and d’Alembert in the Encyclopédie and by Guillaume-Thomas Raynal in the Histoire des deux Indes demonstrates the Dictionnaire’s status as a reference work at least until the 1780s. Yet the borrowing also moved in the opposite direction, for Diderot and d’Alembert’s Encyclopédie would become a source for the last Copenhagen edition of the Dictionnaire universel de commerce (1759)." H-J Lüsebrink, "The Savary des Bruslons’ Dictionnaire universel de commerce: Translations and Adaptations" in Donato, C and Lüsebrink, H-J eds. Translation and Transfer of Knowledge in Encyclopedic Compilations, 1680–1830. University of Toronto Press, 2021, pp. 21-22

-- Clovis and Mark

ARTFL Project Research Blog

Presenting ARTFL's high-resolution images with the International Image Interoperability Framework

From the Dictionnaire Universel de Commerce to the Encyclopédie

Labels

Popular Posts

Blog Archive

Developed by ARTFL