Federated Search and PhiloLogic -- from works to (someday) words

Over the past several years, the ARTFL Project has been developing the code infrastructure for the Intertextual Hub, a reading environment that federates heterogeneous text collections. It extracts data from individual PhiloLogic4 instances and exposes that data to text analysis algorithms, allowing users to navigate between individual texts and larger groups of texts related through shared themes, ideas, and passages.

We have now adapted components of this infrastructure to enable federated bibliographic searching on all of the text collections running under PhiloLogic. With the PhiloLogic Federated Bibliography Search database, we offer a simple, yet flexible search system that allows users to search for texts across approximately 90 individual collections in nearly a dozen languages. We currently allow search by author, title, and collection language. Searches can be further delimited by access type and by date range. So for example, a search for titles containing the word “slavery” written in English between 1750 and 1800 yields 38 results from the American Archives Collection, ECCO-TCP, and the Evans Early American Imprint Collection:


Search results contain links to work titles and collections. In the results, we note the access status of each collection: open, limited to subscribing institutions, or limited to users at the University of Chicago. This same search can be expanded across French and English collections by using a Boolean “OR” and entering “slavery OR esclavage” in the title field:


This search yields several titles in the open-access Newberry French Revolution Collection, one in the Frantext collection, and one -- a play entitled “L’Esclavage des Noirs, ou L’Heureux Naufrage, Drame” -- in the Théâtre Classique collection.

We envision this bibliographic search system as the first of many such tools that permit search across the entirety of our collections. In the Intertextual Hub, users can conduct word or topic vector searches across all seven of the 18th-century French collections included in it, with results returned ranked by relevance. For example, see these results for a search using a topic vector that contains astronomical terms:


Taking inspiration from this federated search approach, we would like to create a mechanism that enables combined metadata and full-text queries across all PhiloLogic instances -- or at least a logically coherent subset thereof -- at once, in real time. Users would no longer be constrained to working inside single collections, but could conduct searches across multiple collections and potentially in multiple languages. For example, instead of searching for “slavery OR esclavage” only in titles, users could search for those terms in any number of collections running under PhiloLogic.

The technical details of such a search scheme remain to be hashed out, of course. But the great thing about PhiloLogic4 is that its fundamental architecture makes it possible to create standalone widgets or external apps that query database instances via an API and then repackage and render search results independently. For example, ARTFL’s PhiloReader apps for both Android and iOS work in exactly this way, and from the beginning were meant to be a demonstration of PhiloLogic’s server capabilities (download the Encyclopédie reader apps here and here).

[Screenshots: Encyclopédie app search suggestions (left); Encyclopédie app metadata query results (right)]

These screenshots illustrate a simple example of the Encyclopédie app interacting with the PhiloLogic4 API. In the left screenshot, the app gets metadata search suggestions dynamically, in this case "Astronomie | Géographie". Query results for articles with that classification appear in the right screenshot.

For a federated search system, a client would send queries to any number of PhiloLogic instances; gather and sort the query results, or links to query results; and then present those results to the user. Again, we would first have to work out certain details before creating a search system like this, such as determining the exact nature of query results; whether and how to perform relevance ranking on results; and whether we would need to integrate certain kinds of reporting features into PhiloLogic as a parallel development effort.
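The client-side gather-and-sort step described above can be sketched briefly. This is a minimal illustration, not PhiloLogic's actual response format: the `results`, `title`, and `score` fields are hypothetical, and in a live system each payload would come from an HTTP request to an instance's API (fanned out, for example, with `concurrent.futures.ThreadPoolExecutor`) rather than from the simulated responses used here.

```python
def merge_results(per_instance):
    """Flatten per-instance hit lists, tag each hit with its source
    collection, and sort by descending relevance score."""
    merged = []
    for collection, payload in per_instance.items():
        for hit in payload["results"]:
            merged.append({**hit, "collection": collection})
    return sorted(merged, key=lambda h: -h["score"])

# Simulated responses from two PhiloLogic instances (illustrative
# field names; a real client would fetch and parse JSON here).
responses = {
    "frantext": {"results": [{"title": "A", "score": 0.4}]},
    "newberry": {"results": [{"title": "B", "score": 0.9},
                             {"title": "C", "score": 0.2}]},
}
ranked = merge_results(responses)
```

How relevance scores from independently indexed collections should be made comparable is exactly one of the details that, as noted above, remains to be worked out.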

However we proceed, the experience of building the Intertextual Hub has taught us that we can tap into the indexing, processing, and reporting capabilities of PhiloLogic to draw together many individual, heterogeneous text collections and create larger-scale research environments that allow users to engage in text analysis of an incredibly broad scope.

Topic Models and Word Vectors



The Intertextual Hub is built around several different algorithms to facilitate document search, similarity, and navigation. In previous posts in this series, I have examined the applications of sequence alignment, topic modeling, and document similarity in various contexts. A primary objective of the Hub is to direct attention to particular documents that may be of interest. Upon arriving at a specific document, the user is offered two views. One is a document browse mode, which provides links to similar documents and to borrowed passages, if detected. The second is the Topic Distribution view of the document.

The left side of the image above shows the top element of the Topic Model report for the Dénonciation a toutes les puissances de l'Europe : d'un plan de conjuration contre sa tranquilité général (link to text), an anonymous attack on the Club de 1789 published in 1790. As mentioned in an earlier post in this series, the first topic, number 123, is clearly about elections, which indeed reflects a section describing elections in the club constitution. The lesser-weighted topics in the document -- 114, 111, 128, and so on -- are all plausible topics for this document. The right side of the image shows a word cloud, with size reflecting weight, of the most distinctive vocabulary identified in this document. This simple list is a considerably better guide to the specific content of the document: a denunciation of a conspiracy against the sovereigns of Europe, to which are appended extracts from the constitution of the club.

Below the list of topics and the word cloud of most distinctive tokens in the Topic Model report are two lists of 20 documents. Below the topics are the top 20 documents identified by similarity of topic distributions, while below the word cloud are the top 20 documents as measured by similar vocabulary.
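The two lists rest on the same underlying operation applied to two different document representations. The following sketch, using toy data and cosine similarity (a common choice for both representations, though the post does not specify the Hub's exact similarity measure), shows how a topic-distribution matrix and a term-frequency matrix can each yield a ranked neighbor list for the same document:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(target, matrix, k):
    """Indices of the k rows most similar to row `target`."""
    scores = [(i, cosine(matrix[target], row))
              for i, row in enumerate(matrix) if i != target]
    return [i for i, _ in sorted(scores, key=lambda t: -t[1])[:k]]

# Toy data: 4 documents over 3 topics and over 5 vocabulary terms.
topic_dist = np.array([[0.8, 0.1, 0.1],
                       [0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.2, 0.1, 0.7]])
term_freq = np.array([[3, 1, 0, 0, 1],
                      [2, 2, 0, 0, 1],
                      [0, 0, 3, 2, 0],
                      [1, 0, 0, 0, 3]])

by_topics = top_k(0, topic_dist, 2)   # neighbors by topic distribution
by_vocab = top_k(0, term_freq, 2)     # neighbors by shared vocabulary
overlap = set(by_topics) & set(by_vocab)
```

In this toy case both measures agree; as the examples below show, on real documents the two lists can diverge sharply.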

The first two entries in the right-hand column are parts of Sieyès' Ébauche d'un nouveau plan de société patriotique, adopté par le Club de mil sept cent quatre-vingt-neuf (BNF), found in the Dénonciation, followed by Condorcet's constitutional proposal of 1793. The two lists represent two different ways to identify similar documents. It is useful to note the overlaps between the two lists, since these documents are identified as relevant by both measures:

The contrast between topics and most distinctive words can be very significant. Mercier's brief chapter on Vaches in the Tableau de Paris is striking: there are no overlaps between the similar-document links, and only two words, animal and compagnie, appear among the topic words, and only for low-weighted topics. Other documents are marked by closer alignment of topics and distinctive words. The topic/word report for Lettres écrites à M. Cérutti par M. Clavière, sur les prochains arrangemens de finance (1790 text link) shows that the distinctive tokens appear frequently in the top topic model word lists, and there is more overlap between the lists of most similar documents.
It is hardly surprising that the representations of the contents of a specific document under a topic model and under a word vector (most distinctive vocabulary) can differ significantly. Topic models attempt to identify the best fit of a document within an arbitrary number of groups. Many documents about specific things, like cows in Paris, may well fall between these groups and be assigned to topics that are only tangentially related to their contents. This weak relationship to topics is reflected in the limited number of tokens shared between the most heavily weighted topic terms and the distinctive vocabulary of a document, as well as in limited or no overlap between the lists of similar documents. Topic models are an effective technique for identifying large patterns of topic development for search and analysis, and for classifying documents within these large patterns. By contrast, identifying documents related by similar vocabulary -- generally falling under the rubric of "nearest neighbor search" (NNS) -- can identify and leverage the particularities of a specific document to find others closely related to it, but cannot by itself aid with larger classifications or themes.

Thus we provide the user in the Intertextual Hub with these two distinct views of a document, identifying the topics in which it is situated and its most distinctive vocabulary, along with the other documents that most closely resemble it. A quick examination of the topics, words, and document lists gives the reader a pretty good sense of the degree to which a specific document falls coherently into one of the 150 topics in this model.

Beyond the suggestion that humans should consider both measures to determine the goodness of fit of a document to the topic model, it may be worth experimenting with NNS measures as a way to evaluate topic models. As shown above, a given topic model -- say, 150 topics generated using one set of parameters -- can cover some documents more compellingly than others. In the example, the finance documents are specific enough to be well covered by several related topics. This opens the possibility of establishing a quantitative measure by using the two somewhat independent measures, topics and word vectors, to assess the validity of a particular topic model. For each document in a collection, this would be assessed by
  • the number of common tokens between the top N topics and the most distinctive words;
  • the number of common documents in the two lists;
  • the number of matching topics (say, the top 3) for each document in the two lists of documents.
For every document, one would calculate how well the topic model approximates the nearest neighbors of that document, from 0 (not at all) to 1 (perfect identity). We have two ways of dividing up an information space: topic models work effectively from the top down (we decide in advance to have 150 buckets), while nearest neighbors work from the bottom up (we don't know how many buckets there are). As in a Venn diagram, the more the two overlap, the better the coverage for that document.
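The three components above could be sketched as set overlaps. This is only one possible operationalization: the choice of Jaccard overlap and the equal weighting of the three components are assumptions for illustration, not a settled design, and the toy document's field names and contents are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def coverage(doc):
    """Average the three overlap components for one document,
    yielding a score between 0 and 1."""
    components = [
        # tokens shared between top-N topic terms and distinctive words
        jaccard(doc["top_topic_terms"], doc["distinctive_words"]),
        # documents shared between the two similar-document lists
        jaccard(doc["topic_neighbors"], doc["vocab_neighbors"]),
        # top topics shared by the documents in the two lists
        jaccard(doc["topics_of_topic_neighbors"],
                doc["topics_of_vocab_neighbors"]),
    ]
    return sum(components) / len(components)

# A toy document record (all values illustrative).
doc = {
    "top_topic_terms": {"election", "vote", "assembly", "club"},
    "distinctive_words": {"conspiracy", "sovereign", "club", "vote"},
    "topic_neighbors": {"d1", "d2", "d3"},
    "vocab_neighbors": {"d2", "d4", "d5"},
    "topics_of_topic_neighbors": {123, 114, 111},
    "topics_of_vocab_neighbors": {123, 87, 45},
}
score = coverage(doc)
```

Averaging `coverage` over every document in the collection would then give the single per-model value discussed below.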

Summing this measure across all of the documents, one would arrive at a single value for the whole model, and possibly a single value for each topic. One could then adjust parameters. Of course, one could overfit this simply by specifying as many topics as there are documents. But it might even give a measure of how many topics is best, by observing the point at which coverage decreases, which would theoretically happen by spreading the topics too thin across the information space.


