Some Classification Experiments

Since Clovis has been running some experiments to see how well Topic Modeling using LDA might be used to predict topics on unseen instances, I thought I would backtrack a bit and write about some experiments I ran last year, which may be salient for future comparative experimentation or even for beginning to think about putting some of our classification work into some level of production. I am presuming that you are basically familiar with some of the classifiers and with the problems of the Encyclopédie ontology. These are described in varying levels of detail in some of our recent papers/talks and on the PhiloMine site.

The first set was a series of experiments classifying a number of 18th-century documents using a stand-alone Bayesian classifier, learning the ontology of the Encyclopédie and predicting the classes of chapters (divs) of selected documents. I have selected three for discussion here, since they are interesting and are segmented nicely into reasonably sized chunks. I ran these using the English classifications and did not exclude the particularly problematic classes, such as Modern Geography (which tends to be biographies about important folks, filed under where they were from) or Literature. Each document shows the Chapter or Article, which is linked to the text of the chapter, followed by one or more classifications assigned using the Multinomial Bayesian classifier. If I rerun these, I will simply pop the classification data right into each segment, for easier consultation. Right now, you will need to juggle between two windows:

Montesquieu, Esprit des Loix
Selected articles from Voltaire, Dictionnaire philosophique
Diderot, Elements de physiologie

PENDING: Discussion of some interesting examples and notable failures.

The second set of experiments compared a K-Nearest Neighbor (KNN) classifier to the Multinomial Bayesian (MNB) classifier in two tests, the first being cross-classification of the Encyclopédie and the second being multiple classifications, again using the Encyclopédie ontology, to predict classes of knowledge in Montesquieu's Esprit des Loix. The reason for these experiments was to examine the performance of linear (Bayesian) and non-linear (KNN) classifiers in the rather noisy information space that is the Encyclopédie ontology. By "noisy" I mean to suggest that it is not at all uniform in terms of the size of categories (which can range from several instances to several thousand), the size of articles processed, the degree of "abstractness" (some categories are very general and some are very specific), and a range of other considerations. We have debated, on and off, whether KNN or Bayesian classifiers (or other linear classifiers such as Support Vector Machines) are better suited to the kinds of noisy information spaces one encounters in retro-fitting historical resources such as the Encyclopédie. The distinction is not rigid. In fact, in a paper last year, on which Russ was the lead author, we argued that one could reasonably combine KNN and Bayesian classifiers by using a "meta-classifier" to determine which should be used to perform a classification task on a particular article in cases of a dispute (Cooney et al., "Hidden Roads and Twisted Paths: Intertextual Discovery using Clusters, Classifications, and Similarities", Digital Humanities 2008, University of Oulu, Oulu, Finland, June 25-29, 2008 [link]). We concluded, for example, that "KNN is most accurate when it classifies smaller articles into classes of knowledge with smaller membership".

Cross-classification of the classified articles in the Encyclopédie using MNB and KNN. I did a number of runs, varying the size of the training set and of the set to be classified. The result files for each of these runs, on an article-by-article basis, are quite large (and I'm happy to send them along), so I compiled the results into a summary table. I took 16,462 classified articles, excluding Modern Geography, and "trained" the classifiers on between 10% and 50% of the instances. I put "trained" in scare quotes because a KNN classifier is a lazy, instance-based learner with no real training phase, so what you are really doing is selecting a subset of comparison vectors along with their classes. The selection process resulted in between 276 and 708 classes of knowledge in the information space, depending on the size of the sample. As is shown in the table, KNN significantly outperforms MNB in this task. We know from previous work, and general background, that MNB tends to flatten out distinctions among smaller classes, but it has the advantage of being fast.
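For reference, here is a minimal sketch of this kind of MNB/KNN comparison, using scikit-learn as a modern stand-in for the Perl setup I actually used (more on that below); the load_encyclopedie_articles() loader is a hypothetical placeholder for pulling the classified articles and their English class labels out of the database, and the parameters are illustrative, not the ones used in these runs.

```python
# A rough stand-in for the comparison described above, using scikit-learn.
# load_encyclopedie_articles() is a hypothetical loader returning parallel
# lists of article texts and class labels; parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts, labels = load_encyclopedie_articles()   # hypothetical loader

X = CountVectorizer(max_features=20000).fit_transform(texts)
# "train" on a fraction of the classified articles, predict the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.25, random_state=0)

for name, clf in [("MNB", MultinomialNB()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_train, y_train)
    print(name, round(accuracy_score(y_test, clf.predict(X_test)), 3))
```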

The distinctions are at times fairly particular and many times the classifiers come up with quite reasonable predictions, even when they are wrong. A few examples (red shows a mis-classification):

Abaissé, Coat of arms (en terme de Blason)

KNN Best category = CoatOfArms
KNN All categories = CoatOfArms, ModernHistory
MNB Best category = ModernHistory
MNB All categories = ModernHistory, Geography

AGRÉMENS, Rufflemaker (Passement.)
KNN Best category = Ribbonmaker
KNN All categories = Ribbonmaker
MNB Best category = Geography
MNB All categories = Geography

TYPHON, Jaucourt: General physics (Physiq. générale)
KNN Best category = Geography
KNN All categories = Geography, GeneralPhysics, Navy, AncientGeography
MNB Best category = Geography
MNB All categories = Geography, AncientGeography

I applied the comparative classifiers in a number of runs using different parameters for Montesquieu, Esprit des Loix. All of the runs tended to give fairly similar results, so here is the last of the result sets. The results are all rather reasonable, within limits, given the significant variations in the size of chapters/sections in the EdL. The entire "section" 1:5:13 is
Idée du despotisme. Quand les sauvages de la Louisiane veulent avoir du fruit, ils coupent l'arbre au pied, et cueillent le fruit. Voilà le gouvernement despotique.
which gets classified as

KNN Best category = NaturalHistoryBotany
KNN All categories = NaturalHistoryBotany
MNB Best category = NaturalHistoryBotany
MNB All categories = NaturalHistoryBotany, Geography, Botany, ModernHistory

In certain other instances, KNN will pick classes like "Natural Law" or "Political Law" while the MNB will return the more general "Jurisprudence". I am particularly entertained by

PARTIE 2 LIVRE 12 CHAPITRE 5:
De certaines accusations qui ont particulièrement besoin de modération et de prudence
KNN Best category = Magic
KNN All categories =
MNB Best category = Jurisprudence
MNB All categories = Jurisprudence

Consulting the article, one finds a "Maxime importante: il faut être très circonspect dans la poursuite de la magie et de l'hérésie" and that the rest of the chapter is indeed a discussion of magic. While the differences are fun, and sometimes puzzling, one should also note the degree of agreement between the different classifiers, particularly if one discounts certain hard to determine differences between classes, such as Physiology and Medicine. The chapter "Combien les hommes sont différens dans les divers climats" (3:14:2) is classified by KNN as "Physiology" and MNB as "Medicine". Both clearly distinguish this chapter from others on Jurisprudence or Law.

I have tended to find KNN classifications to be rather more interesting than MNB's. But the jury is still out on that, and one can always perform the kinds of tests that Russ described in the Hidden Roads talk.

All of these experiments were run using Ken Williams' incredibly handy Perl module AI::Categorizer rather than PhiloMine (which also incorporates a number of Williams' modules), just because it was easier to construct and tinker with the modules. I will post some of these shortly, for future reference.



Collocation Notes

Since we are planning to use collocation as a main component of yet another grant/project proposal, I thought I would give some background notes for future reference. One of the more popular reporting features in PhiloLogic is the collocation table. This is a very simple mechanism. It counts the words around a search term or list of terms (the user sets the span and can turn off function word filtering) and reports the frequencies of terms to the left, to the right, and in total in a table. Richard recently added the "collocation cloud" feature to the current production version at ARTFL. The following is the collocation table and cloud for "tradition" in the current release of ARTFL-Frantext:

Collocation is a well established approach in Digital Humanities and other domains. Susan Hockey, for example, has a nice discussion of collocation in Electronic Texts in the Humanities, (Oxford, 2000), pp 90-91. She describes some work from the early 1970s and brings out the distinction between statistical calculations of collocation and very simple counts.
Berry-Rogghe (1973) discusses the relevance of collocations in lexical studies with reference to an investigation of the collocates of house, from which she is able to derive some notion of the semantic field of house. [...] Her program counts the total number of occurrences of the node, and the total number of occurrences of each collocate of the node within a certain span. It then attempts to indicate the probability of these collocates occurring if the words were distributed randomly throughout the text, and can thus estimate the expected number of collocates. It then compares the expected number with the observed number and generates a 'z-score', which indicates the significance of the collocate. The first table she presents shows the collocates of house based on a span of three words and in descending order of frequency. First is the, which co-occurs thirty-five times with house, but the total number of occurrences of the is 2,368. The is followed by this, a, of, I, in, it, my, is, have, and to, before the first significant collocate sold where six of the seven occurrences are within three words of house. Four words further on is commons, where all four occurrences collocate with house, obviously from the phrase House of Commons. When reordered by z-score, the list begins sold, commons, decorate, this, empty, buying, painting, opposite.

She goes on to suggest that "[f]or the non-mathematical or those who are suspicious of statistics, even simple counts of collocates can begin to show useful results, especially for comparative purposes." Which is, of course, precisely what PhiloLogic does now.

I have made extensive use of collocations over the years in my own work, both the z-score calculation and the very simple collocation by counts (filtering function words). These studies include American and French political discourse for my dissertation and subsequent papers, gender-marked discourse, and comparisons of notions of tradition over time and in English and French. Breaking collocations down over time gives a pretty handy way to look at changing meanings of words. I have an ancient paper, "Quantitative Linguistics and histoire des mentalités: Gender Representation in the Trésor de la langue française, 1600-1950," in Contributions to Quantitative Linguistics: Proceedings of QUALICO 1991, First Quantitative Linguistics Conference (Amsterdam: Kluwer, 1993): 351-71, which gives a write-up of the method, some math :-), and references to some salient papers, including Berry-Rogghe (1973). In more recent work, I have used pretty much the same working model: build a database split into half-century chunks and do collocations by half-century periods, using the z-score calculation (outlined in the paper). Indeed, I have a hacked version of PhiloLogic that does this.

As Hockey indicates, the statistical measure gives a rather different flavor for the collocates, since it attempts to measure the degree of relatedness between the two words. For example, the top collocates of "Platon" in a subset of Frantext shift around significantly.

Word         Rank by z-score   Rank by frequency
Speusippe           1                 78
Aristote            5                  2

The reason for this is clear: 4 of the 8 occurrences of Speusippe occur near Platon, while only 51 of the 793 occurrences of Aristote are near Platon. I think both techniques are valid, and I have used them to illuminate various tendencies. The z-score measures the relatedness of two words, while the simple count shows how, in general, the keyword is typically used. There is, of course, some overlap between the two, but the z-score tends to privilege more unique constructions and associations.

Now, the obvious question is: "why don't we have the z-score calculation as an option in the standard collocation function in PhiloLogic?" And the answer is speed. The z-score (and the other statistical models which I will mention below) compares the expected frequencies of the word distribution against the observed frequencies, where the expected frequency assumes a random distribution of words across a text, taking into account differences in frequencies. [Caveat: we know that "Language is never, ever, ever, random", but it is a useful heuristic, particularly for the kinds of simplistic comparisons I am doing.] The bottleneck for a real-time version of z-score collocations has been calculating baseline frequencies for any arbitrary range of documents. This may no longer be a significant problem. In a recent experiment, I built a script to sum the counts from arbitrary documents selected by bibliographic data (ARTFL-Frantext word frequency report). While we have had a few users express interest in having more global counts, it would appear that our latest servers have more than enough horsepower to do these kinds of additions very quickly, certainly fast enough to be bolted onto a collocation generator as an option. Certainly something to think about for a future revision of the old hopper.
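For future reference, here is a rough sketch of the calculation, more or less following the Berry-Rogghe formulation quoted above; the counts are assumed to come from whatever frequency machinery is at hand, and the treatment of the span is a simplification.

```python
# A sketch of the z-score calculation, roughly following Berry-Rogghe (1973).
# `span` is the total number of words examined around each occurrence of the node.
from math import sqrt

def collocation_zscore(cooc, node_freq, colloc_freq, corpus_size, span=6):
    """cooc: observed co-occurrences of the collocate within the span of the node."""
    p = colloc_freq / (corpus_size - node_freq)   # chance of drawing the collocate
    expected = p * node_freq * span               # expected hits over all span windows
    return (cooc - expected) / sqrt(expected * (1 - p))
```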

There are, of course, a huge number of ways to calculate collocations. I suspect that there are two major areas: 1) how to identify spans and 2) how to measure the relationships between words. I had this notion that rather than simply looking at spans as N words to the right and left, one would count words in pre-identified constructions (such as noun phrases, verb phrases, or even clauses). Given the power of modern NLP tools, this is certainly an option to think about. Related is the notion that one would rather do collocations on either lemmas or even "stems" (the results of a stemmer, which basically strips various characters), which are not words but can be related to sets of words. The other area of work is the possibility of using other statistical measures of association, such as log-likelihood and mutual information.
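As a quick sketch of what those two measures look like when computed from the usual 2x2 contingency table for a node/collocate pair (standard formulas, not code from PhiloLogic or any of the packages mentioned below):

```python
# Pointwise mutual information and Dunning's log-likelihood (G2) for a
# node/collocate pair, computed from corpus-level counts. Generic sketch.
from math import log

def pmi(cooc, node_freq, colloc_freq, corpus_size):
    # observed co-occurrence vs. expected under independence
    expected = node_freq * colloc_freq / corpus_size
    return log(cooc / expected, 2)

def log_likelihood(cooc, node_freq, colloc_freq, corpus_size):
    # G2 over the 2x2 table: node present/absent x collocate present/absent
    n = corpus_size
    observed = (cooc,
                node_freq - cooc,
                colloc_freq - cooc,
                n - node_freq - colloc_freq + cooc)
    expected = (node_freq * colloc_freq / n,
                node_freq * (n - colloc_freq) / n,
                (n - node_freq) * colloc_freq / n,
                (n - node_freq) * (n - colloc_freq) / n)
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)
```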

I'm pretty sure I've seen standalone packages that support more sophisticated statistical models. If we were going to do anything serious, the first place to start is reading. Reading? What? Yes, indeed. The chapter on Collocation in Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA: May 1999 is a great place to start. Other titles may include Sabine Bartsch, Structural and functional properties of collocations in English: a corpus study of lexical and pragmatic constraints on lexical co-occurrence (Gunter Narr Verlag, 2004). There is also software. Of course, Martin's WordHoard has an array of collocation measures (documentation) and we should not forget other goodies, such as Collocate (commercial) and the Cobuild Collocation Sampler.

Finding related articles using topic modeling

While still working on the topic inferencer, I started hacking at another possibility opened up by topic modeling, namely finding closely related texts within a corpus. There are several ways of doing this. What I chose to do was to consider the top three topics in each article and their respective proportions, and weigh them against the whole corpus. Here's a link to a search form where you can search for similar articles in the Encyclopédie:
http://robespierre.uchicago.edu/topic_modeling/search.form.html
In order to use it, you should paste in the URL of the article you're looking at. You'll then get a list of links to various articles that should be similar in content to the one you selected. A lot of tinkering can be done with the calculation of similarity, so I may very well have made some bad judgment calls here and there. This is still a work in progress, so you might get strange results. But if you go through the whole list of results you might see some interesting things.
I would like to give you two examples I've tried that work really well. The first one is the article Economie by Rousseau (which gives very good results); if you look at link 24, which is, according to my (flawed) calculation, the 24th closest article, you'll see an example of an article that would have been hard to find and link to Rousseau. The second example is Question by Jaucourt. Among the top 20, a lot concern various methods of torture, spread out across different classes of knowledge. Let me know what you think.
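For the curious, here is one plausible reading of that "top three topics" weighting in code; it is a sketch, not necessarily the exact calculation behind the search form, and it assumes the doc-topic proportions have already been parsed into a dictionary keyed by article id.

```python
# One plausible reading of the "top three topics" similarity described above.
# doc_topics maps each article id to its full vector of topic proportions.
import numpy as np

def similar_articles(query_id, doc_topics, n_results=25):
    q = np.asarray(doc_topics[query_id])
    top3 = q.argsort()[-3:][::-1]                 # the query's three strongest topics
    scores = {}
    for art_id, props in doc_topics.items():
        if art_id == query_id:
            continue
        p = np.asarray(props)
        # weight the candidate's proportions on those topics by the query's own
        scores[art_id] = float(sum(q[t] * p[t] for t in top3))
    return sorted(scores, key=scores.get, reverse=True)[:n_results]
```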

Some Notes on Theme-Rheme in PhiloLogic

One of the more arcane, and probably rarely used, functions in PhiloLogic is an experimental reporting scheme that I rather tentatively named "word in clause position analysis" or "theme-rheme," which is briefly described in the PhiloLogic user manual. I proposed this in a talk titled "Making Space: Women's Writing in France, 1600-1950," which I gave at the ACH-ALLC and COCH/COSH conferences in 2004 (and drafted a good chunk of a paper about), and implemented it in PhiloLogic around that time. Since we are now thinking of using this kind of analysis as a possible way to identify "interesting" or "illustrative" uses of words as part of another project, I thought it might be helpful to backtrack a bit, give a bit more of an overview of how it works, outline some of the theoretical background as I understand it, and provide some useful links and papers.

As noted in the user manual entry, the "theme-rheme" function generates a standard concordance, which it then attempts to sort by where your search term occurs in a clause, where a clause is defined by punctuation. It segregates the occurrences into front of clause, back of clause, middle of clause, and instances where the clause is too short. By default, it displays only those occurrences that are clause-initial. In the current implementation of ARTFL-Frantext a search for "tradition" results in 4,692 occurrences, which roughly break down as follows:

Front of Clause: 571 out of 4692 [12.16%] Avg. Clause length: 9.58
Last of Clause: 1056 out of 4692 [22.50%] Avg. Clause length: 8.68
Middle of Clause: 2348 out of 4692 [50.04%] Avg. Clause length: 9.56
Too Short: 717 out of 4692 [15.28%] Avg. Clause length: 2.40

The system further identifies specific documents in which your search term exceeds, by a certain percentage, the front of clause rate (in this case 12.16%), such as

55.55% (10/18): Montalembert, Charles Forbes, [1836], Histoire de Sainte Elisabeth de Hongrie, duchese de Thuringe...
28.20% (11/39): Bossuet, Jacques Bénigne, 1627-1704. [1681], Discours sur l'histoire universelle

and it, of course, displays these in different colors, such as:

  • L'Europe ainsi déracinée s'est plus tard déracinée davantage en se séparant, dans une large mesure, de la tradition chrétienne elle-même sans pouvoir renouer aucun lien spirituel avec l'Antiquité.
  • Oui, sans doute, si cette tradition était tout entière dans Aristote et dans l'enseignement péripatéticien de la scolastique.
  • La tradition attribue à Pythagore un séjour à Babylone.
The basic notion is that clause-initial instances of words are probably more important, since they tend to be the "subject" of the rest of the clause. And authors who tend to use your favorite word in more clause-initial positions than is average might be doing something of particular note. In other words, can we use the machine to try to isolate, from the thousands of hits, those that might be particularly noteworthy? In this case, we have isolated a small subset (12%) of the occurrences of "tradition" in a clause-initial position and some authors/documents who tend to privilege this word. I also identified clause-ending uses, since (I suspect) end-of-clause words provide a bridge to the next clause (or sentence).
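To make the mechanism concrete, here is a toy version of the report; it only sketches the idea (clauses split on punctuation, hits tagged by position), it is not the code PhiloLogic actually runs, and the punctuation set and minimum clause length are arbitrary choices.

```python
# A toy clause-position report: split on punctuation, tag each hit on the
# search term as front / middle / end of clause, or "too short".
import re
from collections import Counter

def clause_positions(text, term, min_len=4):
    counts, lengths = Counter(), Counter()
    for clause in re.split(r"[.;:!?,()«»]+", text.lower()):
        words = clause.split()
        for i, w in enumerate(words):
            if w.strip("'\"") != term:
                continue
            if len(words) < min_len:
                slot = "too short"
            elif i == 0:
                slot = "front"
            elif i == len(words) - 1:
                slot = "end"
            else:
                slot = "middle"
            counts[slot] += 1
            lengths[slot] += len(words)
    total = sum(counts.values()) or 1
    for slot in ("front", "end", "middle", "too short"):
        if counts[slot]:
            print(f"{slot}: {counts[slot]}/{total} [{100 * counts[slot] / total:.2f}%] "
                  f"avg clause length {lengths[slot] / counts[slot]:.2f}")
```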

I set out two "intertwingled" problems in the paper: women's writing and, more salient to this post, the increasing need to arrive at higher orders of generalization to make sense of the results coming from ever-increasing datasets. Obviously, one solution to this is the work we have been doing over the last few years in the areas of machine learning, document summarization, and text data mining (see PhiloMine and related papers). What I proposed in this paper was a move from traditional text analysis techniques towards analytical notions based on functional linguistics or functional grammar, which are related in various ways to text linguistics or discourse analysis. This is a huge area of work and I would not begin to characterize it. Helma, of course, is a functional linguist and proposes that this is a branch of "linguistics that takes the communicative functions of language as primary as opposed to seeing form as primary." And as you might imagine, there are schools and competing views. I have to admit I like the name "West Coast Functionalists." :-)

My take on this is that meaning arises from choices, or chains of choices, made with sets of goals and objectives. I also suspect that many "functionalists" would agree on a few other basic notions, such as the idea that lexis and grammar are inseparable in meaning creation; indeed, the term "'lexico-grammar' is now often used in recognition of the fact that lexis and grammar are not separate and discrete, but form a continuum." (cite) It also appears that many functionalists would agree with the notion that the clause is the basic building-block unit. There are probably other points of general agreement about just how different layers might work or be defined. For example, Simon Dik (not related to Helma) identified three layers in his Functional Grammar:
  • SEMANTIC FUNCTIONS (Agent, Patient, Recipient, etc.) which define the roles that participants play in states of affairs, as designated by predications.
  • SYNTACTIC FUNCTIONS (Subject and Object) which define different perspectives through which states of affairs are presented in linguistic expressions.
  • PRAGMATIC FUNCTIONS (Theme and Tail, Topic and Focus) which define the informational status of constituents of linguistic expressions. They relate to the embedding of the expression in the ongoing discourse, that is, they are determined by the status of the pragmatic information of Speaker and Addressee as it develops in verbal interaction.


Of course, other folks will carve these things up differently. Robert de Beaugrande, whose extensive web site and papers are well worth the visit, represents the various levels of functional linguistics, from nerves to text, as outlined in the image taken from his "Functionalism and Corpus Linguistics in the 'Next Generation'." In another paper, he argues: "Corpus data are so eminently suited to informing us about 'networks' because they offer concrete displays of the constraints upon how sets of choices can interact. In the 'lexicon' part of the 'lexicogrammar' of English, these constraints constitute the collocability in the virtual system, and the textual actualisations are the lexical collocations. In the 'grammar' part of the 'lexicogrammar', these constraints constitute the colligability in the virtual system, and the textual actualisations are the grammatical colligations", and he goes on, in the following image, to represent the series of "dialectics" running between text and language.



Ok, they are fun images ... now back to work... and I wanted to see how embedding images would work...

It is the level of pragmatics that I suspect interests us in this particular case. As I noted above, I borrowed the "theme-rheme" nomenclature from MAK Halliday's Introduction to Functional Grammar. Again:

Theme: "starting point of the message, what the clause is going to be about".
Rheme: everything not the Theme: new information/material

The Theme contains given information, i.e. information which has already been mentioned somewhere in the text or is familiar from the context. There is an accessible description of this, with some nice examples, in Theme and Rheme in the Thematic Organization of Text.

In English (and French), identification of the Theme is based primarily on word order; thus, the Theme is the element which comes first in the clause (Eggins, An Introduction to Systemic Functional Linguistics, p. 275). There are plenty of problems in identifying the exact boundaries of different kinds of themes.

The take-away point from all of this is that the theme/rheme distinction is important because it is the way you get thematic development across a longer span of text. Obviously, the Rheme in one clause can become the Theme in the next.

One other take away: Halliday makes the argument that one can use punctuation in written texts to identify clauses, which is not the same for spoken texts.

More later????? I can track down a few more bibliographic entries....



Topic inference using the Encyclopédie trained model

While trying to use the Encyclopédie-trained topic model on the Mémoires de Trévoux, something quite unexpected happened: the topic modeler had a hard time finding topics that matched the Trévoux articles. You can see those results here:
http://robespierre.uchicago.edu/topic_modeling/inference/encyclo2trevoux.txt
Since the topic inference feature in Mallet is relatively new, I thought of creating a model out of the Trévoux and then comparing the topic proportions generated by the topic trainer with those generated using the model through inference. So basically, I tested the model against the corpus of articles from which it originated. In all likelihood, the results were going to be excellent. Well, they weren't, which suggests that the topic inferencer is not yet fully operational (it is a new feature after all). On the other hand, I did notice something: if you compare the results, you'll see that (mostly) the same topics are prominent in both; only the proportion measure is off, approximately divided by ten when using topic inference. Here are those results:
when using topic training:
http://robespierre.uchicago.edu/topic_modeling/inference/proportions.txt
when using topic inference:
http://robespierre.uchicago.edu/topic_modeling/inference/proportions_itself.txt
The question is: can I trust those results? My initial analysis tends to show that it does work, but it's definitely not as accurate as the first experiments I did with topic modeling. Some more digging is needed, and eventually getting in touch with the Mallet developers.
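For what it's worth, here is the kind of quick check I have in mind for lining up the two result files above; it assumes the older row format written by Mallet's --output-doc-topics option (doc index, doc name, then topic/proportion pairs), which may need adjusting for other versions.

```python
# Compare trained vs. inferred doc-topic proportions: do the top topics agree,
# and by what factor are the inferred proportions off?
def parse_doc_topics(path):
    docs = {}
    for line in open(path):
        if line.startswith("#"):
            continue
        fields = line.split()
        pairs = fields[2:]
        docs[fields[1]] = {int(t): float(p) for t, p in zip(pairs[::2], pairs[1::2])}
    return docs

trained = parse_doc_topics("proportions.txt")
inferred = parse_doc_topics("proportions_itself.txt")

same_top, ratios = 0, []
for name, topics in trained.items():
    top = max(topics, key=topics.get)
    same_top += top == max(inferred[name], key=inferred[name].get)
    ratios.append(topics[top] / max(inferred[name].get(top, 0.0), 1e-9))

print("same top topic:", same_top, "of", len(trained))
print("median trained/inferred proportion ratio:", sorted(ratios)[len(ratios) // 2])
```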

Proportions of topics in Encyclopédie articles

This is a follow-up to my previous blog entry about topic modeling in the Encyclopédie. As the title of this post suggests, I will be showing here the proportions of topics per article. Instead of just posting those results without any further comment, I would like to focus on 12 random articles to see what kind of results one could get. My feeling about this is that the best results are in the 300 topic model. What do you think? Note that there is still a lot of room for some refinement.

Examples from the 42 topic model :
http://docs.google.com/View?id=dgrbcw9z_69gk9w5tgc
Examples from the 100 topic model:
http://docs.google.com/View?id=dgrbcw9z_70c2n79kgv
Examples from the 150 topic model:
http://docs.google.com/View?id=dgrbcw9z_71cx73tsch
Examples from the 200 topic model:
http://docs.google.com/View?id=dgrbcw9z_724t5x9mfm
Examples from the 250 topic model:
http://docs.google.com/View?id=dgrbcw9z_73fvznkb7j
Examples from the 300 topic model:
http://docs.google.com/View?id=dgrbcw9z_74chqfgsct
Examples from the 350 topic model:
http://docs.google.com/View?id=dgrbcw9z_75chsw8gcp

If you wish to look at the results yourself, here they are. The first number is the topic, with the proportion measure in parentheses; the article number is the div number of the article:

http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_42.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_100.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_150.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_200.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_250.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_300.txt
http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_350.txt

The PhiloLogic Data Architecture

For the last year or so, I've been arguing that it's time for a round of maintenance work on PhiloLogic's various retrieval sub-systems. In a later post, I'll examine some of the newer data store components out there in the open-source world. First, however, I'd like to enumerate what PhiloLogic's main storage components are, where they live, and how they work, for clarity and economy of reference.

The Main Word Index:

PhiloLogic's central data store is a GDBM hashtable called index that functions, basically, the same way as a Perl hash, but on disk, rather than in memory. It has a set of keys, in this case each unique word in the database. Each key corresponds to a short-ish byte-string value, which can come in two different formats:

For low-frequency words, each key word corresponds to a packed binary data object that contains three components:
  1. A short header that says, "I'm a low-frequency word!"
  2. Total frequency for this word. This is used by the query optimizer.
  3. A compressed binary hitlist for the word, containing the byte offset and object address of every occurrence of the word.
For high-frequency words, the structure is similar. A type header is followed by the total frequency, which is followed by an address into the raw block index, called index.1. If you've ever looked at a database directory, you may have noticed that this index.1 file is typically two or three times the size of the main index. That's because it contains the binary hitlists for all the high-frequency words in the database, divided into 2-kilobyte chunks. That's important because, as Zipf's law will tell us, the most frequent words in a language can be very, very frequent, and thus the hits for a single word could go on for tens or hundreds of megabytes. By dividing large hitlists into chunks, we can put a limit on memory usage. On a modern system, we could set a higher ceiling; 64K might be reasonable. But architecturally, the chunking algorithms are vital for frequent words or large databases.
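To illustrate the two value formats (and only to illustrate: this is not PhiloLogic's actual byte layout, and the real hitlists are bit-compressed rather than stored as plain 32-bit integers), a sketch along these lines:

```python
# Illustrative packing/unpacking of the two value formats described above,
# using Python's struct module. Header = type tag + total frequency.
import struct

LOW_FREQ, HIGH_FREQ = 0, 1
HEADER = struct.Struct("<BI")          # 1-byte type tag + 4-byte total frequency

def pack_low(freq, hits):
    # hits: (byte_offset, object_address) pairs for every occurrence of the word
    payload = b"".join(struct.pack("<II", off, addr) for off, addr in hits)
    return HEADER.pack(LOW_FREQ, freq) + payload

def pack_high(freq, block_address):
    # high-frequency words point at their first chunk in index.1 instead
    return HEADER.pack(HIGH_FREQ, freq) + struct.pack("<I", block_address)

def unpack(value):
    kind, freq = HEADER.unpack_from(value)
    if kind == HIGH_FREQ:
        (block,) = struct.unpack_from("<I", value, HEADER.size)
        return kind, freq, block
    body = value[HEADER.size:]
    hits = [struct.unpack_from("<II", body, i) for i in range(0, len(body), 8)]
    return kind, freq, hits
```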

The upside of this admittedly complex architecture is PhiloLogic's raison d'être: its blindingly fast search performance. The downside is that GDBM doesn't support some of the features that we expect, particularly ones that involve more complicated searches than simple keyword lookups.

Thus, we added a plain-text token table, words.R, that we can grep through quite quickly to get a list of all valid keys that match a specified pattern, and a tab delimited token table, words.R.wom, that we can grep through for various secondary attributes and normalizations of the indexed tokens.

Both of these functions are very fast, due to the high throughput of GNU grep. The only downside is the opacity of the index construction process, which can make modifications to this structure very difficult. That said, it's capable of handling unexpectedly rich data if you understand where everything goes. I'll go into this in more depth in my Perseus whitepaper.

Document Metadata:

Traditional 3-series PhiloLogic keeps the most important information about its XML document store in a file called docinfo, which contains the filename, size, date, and a few other book-keeping tidbits. The reporting systems use this file for the basic tasks of opening an XML file and reading a section out of it, whether for search results context or for browsing with getobject.

All other data go in a file called bibliography, which has about 20 fields for author, title, publisher, language, etc., and which the gimme utilities search for bibliographic queries. Traditionally, this is done with GNU egrep, but more recent releases have preferred MySQL for its more sophisticated query language. The result of any query, regardless, is a list of binary document IDs to pass through to the search engine as a corpus file.

Text Object Metadata:

PhiloLogic tracks all objects below the document level as either division or paragraph objects, and stores them in two different tables: divindex.raw and subdivindex.raw respectively, and uses the subdocgimme utilities to query the metadata. As before, query results, via SQL or egrep, are pushed off to search3 as a packed binary corpus file. And again, the reporting and retrieval subsystems have their own data structure, in this case the toms, for contextualizing hits, or for retrieval with getobject. Finally, the loader builds several derived data structures called navigation, pagemarks, references, and dividxchild.tab for various internal functions.

As the reader may have noticed, the document and text objects are not as clean or as optimized as the main word index, and even harder to hack coherently. I'll detail my first attempt at a more dynamic object structure for the Perseus corpus in a later post. For now, though, I'll pose a question:

Is it possible to devise a single data structure that can handle all of the functions that our current gang of tables and packed binaries does? In short, these are:
  1. Query objects for arbitrary combinations of properties
  2. Efficiently retrieve file paths and byte offsets for retrieval
  3. Maintain the logical relationships of all these objects
  4. Resolve internal and external references to objects at any depth
My current prototypes fulfill about half of these requirements (#2 and #3). But a new object architecture would ideally be a single data structure that does it all. Can anyone think of needed features that I've missed? What about the kind of metadata that ASP or IWW use? Can anyone think of a component that's totally unnecessary and redundant, or notoriously buggy? Please, let me know.
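Just to make the four requirements concrete, here is a naive single-table sketch in SQLite; the field names are hypothetical, and this is emphatically not the structure I'll be proposing for Perseus.

```python
# A naive single-table sketch mapping the four requirements onto columns.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE objects (
        object_id  TEXT PRIMARY KEY,   -- hierarchical id, e.g. '12.3.2'
        parent_id  TEXT,               -- logical relationships (#3)
        obj_type   TEXT,               -- doc / div / para
        filename   TEXT,               -- retrieval (#2)
        byte_start INTEGER,
        byte_end   INTEGER,
        author     TEXT, title TEXT, head TEXT, pub_date TEXT   -- properties (#1)
    )""")

# 1. query objects by arbitrary combinations of properties
db.execute("SELECT object_id FROM objects WHERE author LIKE ? AND pub_date < ?",
           ("Voltaire%", "1760"))
# 2. fetch what a retrieval routine like getobject needs
db.execute("SELECT filename, byte_start, byte_end FROM objects WHERE object_id = ?",
           ("12.3.2",))
# 3. walk the hierarchy
db.execute("SELECT object_id FROM objects WHERE parent_id = ?", ("12.3",))
# 4. resolving references at any depth would mean mapping a stable citation
#    scheme onto object_id -- the part this sketch does not solve.
```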

Preliminary results on topic modeling in the Encyclopédie

Following up on Mark's comments on topic modeling using Latent Dirichlet Allocation, or LDA, I went on to explore some implementations of this algorithm to see what type of results we would get on some of the data sets we have. I first started with David Blei's code, but it ended up being too complex to use, and the documentation was very elusive. So I started looking at another tool, Mallet, which also includes an implementation of LDA.
Here are the first results I've come up with when running it against the Encyclopédie. The main issue when using topic modeling is, as described in this article, coming up with the right number of topics, as the results differ quite a bit depending on this number. I haven't quite settled on a particular number yet. Below are the topics I've come up with. Let me know what you think, and which version(s) seem the most accurate. I would argue that the question comes down to how focused we want each topic to be, or how broad we want those topics to be without losing any accuracy. Please let me know if there are some words you think I could eliminate (less noise, more accuracy). Several kinds of hints would be useful, such as pinpointing a topic that doesn't make sense, or a word that seems out of place somewhere (probably some noise to be eliminated in another run). Note that the list of words that I delete from the articles (so far a little over 300) could very well be used for other 18th-century French texts, if not for other periods from 1650 to today with some tweaks here and there. Thanks.
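For reference, the sweep over topic counts looks roughly like this; the sketch uses scikit-learn's LDA as a stand-in for Mallet (which is what I actually ran), and load_articles() and the stoplist filename are hypothetical placeholders.

```python
# A rough sketch of the topic-count sweep described above, with scikit-learn's
# LDA standing in for Mallet. Parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = load_articles()                              # hypothetical loader
stoplist = open("stoplist_fr.txt").read().split()    # the ~300-word deletion list
vec = CountVectorizer(stop_words=stoplist, max_features=30000)
X = vec.fit_transform(texts)
vocab = vec.get_feature_names_out()

for k in (42, 100, 150, 200, 250, 300, 350):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    # print the ten strongest words of each topic for eyeballing
    for t, weights in enumerate(lda.components_):
        top = [vocab[i] for i in weights.argsort()[-10:][::-1]]
        print(k, t, " ".join(top))
```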

Version with 42 topics:
http://robespierre.uchicago.edu/topic_modeling/42topics-encyclo.txt
Version with 100 topics:
http://robespierre.uchicago.edu/topic_modeling/100topics-encyclo.txt
Version with 150 topics:
http://robespierre.uchicago.edu/topic_modeling/150topics-encyclo.txt
Version with 200 topics:
http://robespierre.uchicago.edu/topic_modeling/200topics-encyclo.txt
Version with 250 topics:
http://robespierre.uchicago.edu/topic_modeling/250topics-encyclo.txt
Version with 300 topics:
http://robespierre.uchicago.edu/topic_modeling/300topics-encyclo.txt
Version with 350 topics:
http://robespierre.uchicago.edu/topic_modeling/350topics-encyclo.txt

These results are just the preliminary step. The interesting part is the topic proportions per document. I'll show some results in another post.