The Intertextual Hub is built around several different algorithms to facilitate document search, similarity, and navigation. In previous posts in this series, I have examined the applications of sequence alignment, topic modeling, and document similarity in various contexts. A primary objective of the Hub is to direct attention to particular documents that may be of interest. On arriving at a specific document, the user is offered two views. One is a document browse mode, which provides links to similar documents and to borrowed passages, if any are detected. The second is the Topic Distribution report for the document.
The left side of the image above shows the top element of the Topic Model report for the Dénonciation a toutes les puissances de l'Europe : d'un plan de conjuration contre sa tranquilité général (link to text), an anonymous attack on the Club de 1789 published in 1790. As mentioned in an earlier post in this series, the first topic, number 123, is clearly about elections, which does indeed reflect a section describing elections in the club constitution. The lesser-weighted topics in the document, 114, 111, 128 and so on, are all plausible topics for this document. The right side of the image shows a word cloud, with size reflecting weight, of the most distinctive vocabulary identified in this document. This simple list is a considerably better guide to the specific content of the document, a denunciation of a conspiracy against the sovereigns of Europe, to which are appended extracts from the constitution of the club.
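To make the two views concrete, here is a minimal sketch, not the Hub's actual pipeline, of how one might produce them for a single document. The corpus, the 150-topic size, and the use of TF-IDF weights as a stand-in for "most distinctive vocabulary" are all assumptions for illustration.

```python
# Minimal sketch of the two views in the report: a document's topic
# distribution and its most distinctive tokens. Corpus, topic count, and the
# TF-IDF ranking are illustrative assumptions, not the Hub's implementation.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def document_report(docs, doc_id, n_topics=150, n_words=25):
    """Return (topics ranked by weight, top TF-IDF tokens) for one document."""
    # Topic view: fit an LDA model and read off this document's topic weights.
    counts = CountVectorizer(max_features=20000).fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    doc_topics = lda.transform(counts[doc_id])[0]
    ranked_topics = doc_topics.argsort()[::-1]      # e.g. topic 123 first

    # Word view: rank this document's tokens by TF-IDF weight as one way to
    # approximate "most distinctive vocabulary".
    tfidf_vec = TfidfVectorizer(max_features=20000)
    tfidf = tfidf_vec.fit_transform(docs)
    vocab = tfidf_vec.get_feature_names_out()
    weights = tfidf[doc_id].toarray()[0]
    distinctive = [vocab[i] for i in weights.argsort()[::-1][:n_words]]

    return ranked_topics, distinctive
```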
Below the list of Topics and the Word Cloud of most distinctive tokens in the Topic Model report are two lists of 20 documents. Below the Topics are the top 20 documents identified by similarity of topic distributions, while below the Word Cloud are the top 20 documents as measured by similarity of vocabulary.
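Both lists can be thought of as nearest-neighbor lookups over different document representations. A minimal sketch, assuming a document-topic matrix and a TF-IDF matrix like those produced in the snippet above, and using cosine similarity for both (the Hub's own ranking may differ in detail):

```python
# Sketch of the two 20-document lists: nearest neighbors by cosine similarity,
# measured once over topic distributions and once over TF-IDF vocabulary
# vectors. `doc_topic_matrix`, `tfidf_matrix`, and `doc_id` are assumed inputs.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def top_neighbors(matrix, doc_id, n=20):
    """Indices of the n documents most similar to doc_id (excluding itself)."""
    sims = cosine_similarity(matrix[doc_id:doc_id + 1], matrix).ravel()
    sims[doc_id] = -1.0                   # never return the document itself
    return np.argsort(sims)[::-1][:n]


# topic_neighbors = top_neighbors(doc_topic_matrix, doc_id)   # list under Topics
# vocab_neighbors = top_neighbors(tfidf_matrix, doc_id)       # list under Word Cloud
# both = set(topic_neighbors) & set(vocab_neighbors)          # flagged by both measures
```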
The first two entries in the right-hand column are parts of Sieyès' Ébauche d'un nouveau plan de société patriotique, adopté par le Club de mil sept cent quatre-vingt-neuf (BNF), found in the Dénonciation, followed by Condorcet's constitutional proposal of 1793. The two lists represent two different ways of identifying similar documents. It is useful to note the overlaps between the two lists, since these documents are identified as relevant by both measures:
- Mont-Gilbert, François-Agnès, b. 1747.●Avis au peuple, sur sa liberté et l'exercice de ses droits : contenu dans un projet de constitution républicaine●1793
- Daunou, P. C. F. (Pierre Claude François), 1761-1840.●Essai sur la Constitution●1793
- ARCHIVES PARLEMENTAIRES DE 1787 A 1860 PREMIÈRE SÉRIE (1787 à 1799)●QUARANTE-HUITIÈME ANNEXE (1)●1793 (reprint of Mont-Gilbert)
- ARCHIVES PARLEMENTAIRES DE 1787 A 1860 PREMIÈRE SÉRIE (1787 à 1799)●TOME LXII DU 13 AVRIL 1793 AU 19 AVRIL 1793●DOUZIÈME ANNEXE (1) A LA SÉANCE DE LA CONVENTION NATIONAL DU MERCREDI 17 AVRIL 1793.●1793 (reprint of Daunou).
The contrast between topics and most distinctive words can be very significant. Mercier's brief chapter on Vaches in the Tableau de Paris is striking: there are no overlaps between the similar-document links, and only two words, animal and compagnie, appear among the topic words, and those only for low-weighted topics. Other documents are marked by closer alignment of topics and distinctive words. The topic/word report for Lettres écrites à M. Cérutti par M. Clavière, sur les prochains arrangemens de finance (1790 text link) shows that the distinctive tokens appear frequently in the word lists of the top topics, and that there is more overlap between the two lists of most similar documents.
Thus the Intertextual Hub provides the user with these two distinct views of a document, identifying the topics in which it is situated, its most distinctive vocabulary, and the other documents which most closely resemble it. A quick examination of the topics, words, and document lists gives the reader a pretty good sense of the degree to which a specific document falls coherently into one of the 150 topics in this model.
Beyond suggesting that humans consider both measures and judge the goodness of fit of a document to the topic model, it may be worth experimenting with the use of nearest-neighbor search (NNS) measures as a way to evaluate topic models. As I have shown above, a topic model, say of 150 topics generated with a particular set of parameters, can cover some documents more convincingly than others. In the example above, the finance documents are specific enough to be well covered by several related topics. This suggests the possibility of building a quantitative measure from two somewhat independent measures, topics and word vectors, to assess the validity of a particular topic model. For each document in a collection, this would be assessed by the following checks (a rough sketch in code follows the list):
- the number of tokens shared between the word lists of the top N topics and the most distinctive words;
- the number of common documents in the two lists of similar documents;
- the number of matching topics (say, the top 3) for each document in the two lists of documents.
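A rough sketch of these three checks, assuming per-topic word lists, the document's distinctive tokens, the two neighbor lists from the earlier snippet, and a lookup of each neighbor's own top topics; all of the names and the top-3 cutoff are illustrative rather than the Hub's implementation:

```python
# Rough sketch of the three per-document checks listed above. All inputs are
# assumed to come from a pipeline like the ones sketched earlier.
def coverage_signals(doc_topics, topic_words, distinctive,
                     topic_neighbors, vocab_neighbors,
                     neighbor_top_topics, n_top=3):
    top_topics = set(doc_topics.argsort()[::-1][:n_top])

    # 1. tokens shared between the top topics' word lists and the distinctive words
    topic_vocab = {w for k in top_topics for w in topic_words[k]}
    shared_tokens = topic_vocab & set(distinctive)

    # 2. documents appearing in both neighbor lists
    shared_docs = set(topic_neighbors) & set(vocab_neighbors)

    # 3. neighbors whose own top-3 topics overlap this document's top-3 topics
    matching = sum(
        bool(set(neighbor_top_topics[d][:n_top]) & top_topics)
        for d in set(topic_neighbors) | set(vocab_neighbors)
    )
    return len(shared_tokens), len(shared_docs), matching
```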
For every document, one would calculate how well the topic model approximates the nearest neighbors of that document, on a scale from 0 (not at all) to 1 (perfect identity). We have two ways of dividing up an information space: topic models work effectively from the top down (we decide in advance that there will be 150 buckets), while the nearest-neighbor vocabulary measure works from the bottom up (we don't know in advance how many buckets there are). As in a Venn diagram, the more the two overlap, the better the coverage for that document.
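One simple way to express that 0-to-1 score, purely as an illustration, is the Jaccard overlap of the two neighbor lists; the token and topic checks above could be folded in as well.

```python
# Illustrative 0-to-1 per-document score: Jaccard overlap of the two neighbor
# lists (0 = no shared neighbors, 1 = identical lists).
def coverage_score(topic_neighbors, vocab_neighbors):
    a, b = set(topic_neighbors), set(vocab_neighbors)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```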
Summing this measure across all of the documents, one would arrive at a single value for the model as a whole, and possibly a single value for each topic. One could then adjust parameters. Of course, this could be overfit, simply by specifying the same number of topics as documents. But it might also yield a measure of how many topics is best, by observing where coverage begins to decrease as topics are spread too thin across the information space.
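As a sketch of that aggregation step, assuming hypothetical wrappers `fit_topic_model` and `neighbor_lists` around the snippets above, one could average the per-document score for each candidate topic count and watch where coverage starts to fall:

```python
# Sketch of the aggregation: average per-document coverage for one fitted
# model, then repeat across candidate topic counts. `fit_topic_model` and
# `neighbor_lists` are hypothetical wrappers around the earlier sketches.
def model_coverage(docs, n_topics):
    lda, counts, tfidf = fit_topic_model(docs, n_topics)              # hypothetical
    scores = [
        coverage_score(*neighbor_lists(lda, counts, tfidf, doc_id))   # hypothetical
        for doc_id in range(len(docs))
    ]
    return sum(scores) / len(scores)


# coverage_by_k = {k: model_coverage(docs, k) for k in (50, 100, 150, 200, 300)}
```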