Topic Models and Word Vectors

The Intertextual Hub is built around several different algorithms to facilitate document search, similarity, and navigation. In previous posts in this series, I have examined the applications of sequence alignment, topic modeling, and document similarity in various contexts. A primary objective of the Hub is to direct attention to particular documents that may be of interest. On arriving at a specific document, the user is offered two views. The first is a document browse mode, which provides links to similar documents and to borrowed passages, if any are detected. The second is the Topic Distribution report for the document.

The left side of the image above shows the top element of the Topic Model report for the Dénonciation a toutes les puissances de l'Europe : d'un plan de conjuration contre sa tranquilité général (link to text), an anonymous attack on the Club de 1789 published in 1790. As mentioned in an earlier post in this series, the first topic, number 123, is clearly about elections, which does indeed reflect a section describing elections in the club's constitution. The lesser-weighted topics in the document, 114, 111, 128, and so on, are all plausible topics for this document. The right side of the image shows a word cloud, with size reflecting weight, of the most distinctive vocabulary identified in this document. This simple list is a considerably better guide to the specific content of the document: a denunciation of a conspiracy against the sovereigns of Europe, to which extracts from the club's constitution are appended.

Below the list of topics and the word cloud of most distinctive tokens in the Topic Model report are two lists of 20 documents each. Below the topics are the top 20 documents identified by similarity of topic distributions, while below the word cloud are the top 20 documents as measured by similarity of vocabulary.


The first two entries in the right-hand column are parts of Sieyès' Ébauche d'un nouveau plan de société patriotique, adopté par le Club de mil sept cent quatre-vingt-neuf (BNF), which is found in the Dénonciation, followed by Condorcet's constitutional proposal of 1793. The two lists represent two different ways to identify similar documents. It is useful to note the overlaps between the two lists, since these documents are identified as relevant by both measures.

The contrast between topics and most distinctive words can be very significant. Mercier's brief chapter on Vaches in the Tableau de Paris is striking: there are no overlaps between the similar-document lists, and only two words, animal and compagnie, appear among the topic words, and those only for low-weighted topics. Other documents are marked by a relative alignment of topics and distinctive words. The topic/word report for Lettres écrites à M. Cérutti par M. Clavière, sur les prochains arrangemens de finance (1790 text link) shows that the distinctive tokens appear frequently in the top topic-model word lists, and that there is more overlap between the lists of most similar documents.
  
It is hardly surprising that the representations of a specific document's contents under a topic model and under a word vector (most distinctive vocabulary) approach can differ significantly. Topic models attempt to find the best fit of a document within an arbitrary number of groups. Many documents about specific things, like cows in Paris, may well fall between these groups and be assigned to topics that are only tangentially related to their contents. This weak relationship to topics is reflected in the limited number of tokens shared between the most heavily weighted topic terms and the distinctive vocabulary of a document, as well as in the limited or nonexistent overlap between the lists of similar documents. Topic models are an effective technique for identifying large patterns of topic development for search and analysis, and for classifying documents within these large patterns. By contrast, identifying documents related by similar vocabulary, an approach generally falling under the rubric of "nearest neighbor search" (NNS), can identify and leverage the particularities of a specific document to find others closely related to it, but cannot by itself aid with larger classifications or themes.
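To make the contrast concrete, here is a minimal sketch of the two measures, using scikit-learn rather than the Hub's actual pipeline: cosine similarity over LDA topic distributions (top-down) versus nearest-neighbor search over TF-IDF vocabulary vectors (bottom-up). The corpus and parameter values are placeholders.

```python
# Minimal sketch (not the Hub's pipeline): contrast similarity by topic
# distribution with nearest-neighbor search over TF-IDF vocabulary vectors.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = ["...", "..."]  # plain-text documents (placeholder corpus)

# Top-down view: a topic model with a fixed number of "buckets".
counts = CountVectorizer(max_features=5000).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=150, random_state=0)
doc_topics = lda.fit_transform(counts)  # document-topic distributions

# Bottom-up view: TF-IDF vectors capture each document's distinctive vocabulary.
tfidf = TfidfVectorizer(max_features=5000).fit_transform(docs)

def top_neighbors(matrix, doc_id, n=20):
    """Indices of the n documents most similar to doc_id, excluding itself."""
    sims = cosine_similarity(matrix[doc_id:doc_id + 1], matrix).ravel()
    order = sims.argsort()[::-1]
    return [i for i in order if i != doc_id][:n]

by_topics = top_neighbors(doc_topics, 0)  # neighbors by topic distribution
by_vocab = top_neighbors(tfidf, 0)        # neighbors by distinctive vocabulary
agreed = set(by_topics) & set(by_vocab)   # documents both measures return
```

The overlap set at the end corresponds to the cross-listed documents noted above: results that both measures flag as relevant.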

Thus we provide the user of the Intertextual Hub with these two distinct views of a document, identifying both the topics in which it is situated and its most distinctive vocabulary, along with the other documents that most closely resemble it under each measure. A quick examination of the topics, words, and document lists gives the reader a fairly good sense of the degree to which a specific document falls coherently into one of the 150 topics in this model.

Beyond suggesting that human readers consider both measures to judge the goodness of fit of a document to the topic model, it may be worth experimenting with NNS measures as a way to evaluate topic models themselves. As I have shown above, a topic model, say of 150 topics generated using a given set of parameters, can cover some documents more compellingly than others. In the examples above, the finance documents are specific enough to be well covered by several related topics. This raises the possibility of establishing a quantitative measure that uses two somewhat independent measures, topics and word vectors, to assess the validity of a particular topic model. For each document in a collection, this could be assessed by
  • the number of common tokens between the terms of the top N topics and the most distinctive words;
  • the number of common documents in the two similar-document lists;
  • the number of matching topics (say, the top 3) for each document in the two lists.
For every document, one would calculate how well the topic model approximates the nearest neighbors of that document, on a scale from 0 (no relation at all) to 1 (perfect identity). We have two ways of dividing up an information space: topic models work effectively from the top down (we decide in advance to have 150 buckets), while nearest neighbor search works from the bottom up (we do not know in advance how many buckets there are). As in a Venn diagram, the more these overlap, the better the coverage for that document.
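A minimal sketch of such a per-document coverage score follows, assuming we already have, for each document, its top topic terms, its most distinctive words, the two neighbor lists, and the top topics of every document; all of the input names are hypothetical.

```python
def coverage(topic_terms, distinctive_words,
             topic_neighbors, vocab_neighbors,
             doc_top_topics, top_topics_of):
    """Score, from 0 (no overlap) to 1 (perfect identity), how well the
    topic model approximates the nearest-neighbor view of one document."""
    # 1. Tokens shared between top-topic terms and distinctive vocabulary.
    tokens = len(set(topic_terms) & set(distinctive_words)) / len(distinctive_words)

    # 2. Documents shared between the two similar-document lists.
    docs_shared = len(set(topic_neighbors) & set(vocab_neighbors)) / len(topic_neighbors)

    # 3. Vocabulary neighbors whose top 3 topics intersect this document's top 3.
    matches = sum(
        1 for d in vocab_neighbors
        if set(top_topics_of[d][:3]) & set(doc_top_topics[:3])
    ) / len(vocab_neighbors)

    # Equal weighting is an assumption; the weights are themselves parameters.
    return (tokens + docs_shared + matches) / 3
```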

Summing this measure across all of the documents, one would arrive at a single value for the whole model, and possibly a value for each individual topic. One could then adjust the parameters accordingly. Of course, this could be overfit, simply by setting the number of topics equal to the number of documents. But it might also yield a measure of the best number of topics, by observing the point at which coverage begins to decrease, which should in theory happen once the topics are spread too thin across the information space.
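Building on the first sketch above (reusing its counts, tfidf, and top_neighbors), one could, for example, refit the model at several topic counts and watch how a simple proxy for aggregate coverage, here just the overlap between each document's two neighbor lists, responds. This is a sketch of the experiment, not a tested procedure.

```python
def mean_coverage(doc_topics, tfidf, n_docs, n=20):
    """Average overlap between topic-space and TF-IDF-space neighbor lists."""
    scores = []
    for d in range(n_docs):
        a = set(top_neighbors(doc_topics, d, n))
        b = set(top_neighbors(tfidf, d, n))
        scores.append(len(a & b) / n)
    return sum(scores) / n_docs

# Refit at several topic counts; a peak in coverage followed by a decline
# would suggest the point at which topics are being spread too thin.
for n_topics in (50, 100, 150, 200, 300):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)
    print(n_topics, mean_coverage(doc_topics, tfidf, counts.shape[0]))
```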
