Encyclopédie under KinoSearch

One of the things that I have wanted to do for a while is to examine implementations of Lucene, both as a search tool to complement PhiloLogic and possibly as a model for future PhiloLogic renovations. Late this summer, Clovis identified a particularly nice open source, perl implementation of Lucene called KinoSearch. This looks like it will fit both bills very nicely indeed. As a little experiment, I loaded 73,000 articles (and other objects) from the Encyclopédie and cooked up a super simple query script. It allows you to type in query words and get links to articles sorted by their relevancy to your query (the italicized number next to the headword). At this time, I am limiting results to the top 100 "hits". Words should be lower case, accents are required, and words should be separated by spaces. Try it:

Query Words: or
Require all words

Here are a couple of examples which you can block copy in:
artisan laboureur ouvrier paysan
malade symptome douleur estomac
peuple pays nation ancien république décadence

The first thing to notice is search speed. Lucene is known to be robust, massively scalable, and fast. The KinoSearch implementation is certainly very fast. A six-term search returns in 0.35 seconds of real time and less than 1/10 of a second of system time, measured with time on the command line. I did not time the indexing run, but it took 10 minutes or so. [Addition: by reading 147 TEI files rather than 77,000 split files, the indexing time for the Encyclopédie falls to (using time) real 2m45.9s, user 2m33.8s, sys 0m11.1s.]


The KinoSearch developer, Marvin Humphrey, has a splendid slide show outlining how it works, with specific reference to the kinds of parameters, such as stemmers and stopwords, that one needs to consider, as well as an overview of the indexing scheme. Clovis and I thought this might be the easiest way to begin working with Lucene, since it is a perl module with C components, so it is easy to install and get running. Given the performance and utility of KinoSearch, I suspect that we will be using it extensively for projects where ranked relevancy results are of interest. These might include structured texts, such as newspaper and encyclopedia articles, and possibly large collections of uncorrected OCR materials which may not be suitable for the text analysis applications supported by PhiloLogic. Also, on first review, the code base is very nicely designed and, since it has many of the same kinds of functions as PhiloLogic, strikes me as being a really fine model of how we might want to renovate PhiloLogic.

For this experiment, I took the articles as individual documents in TEI, which Clovis had prepared for other work. For each article, I grabbed the headword and PhiloLogic document id, which are loaded as fielded data. The rest of the article is stripped of all encoding and loaded in. It would be perfectly simple to read the data from our normal TEI files. We could simply add a script that would load source data from a PhiloLogic database build, adding a different kind of search, which would need to have a different search box/form.
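The per-article preparation can be sketched roughly as follows. This is a Python illustration, not the actual perl loader; the tag names in the sample and the field names are hypothetical:

```python
import re

def strip_tags(tei):
    """Remove all encoding from a TEI fragment, collapsing whitespace."""
    text = re.sub(r"<[^>]+>", " ", tei)
    return re.sub(r"\s+", " ", text).strip()

def make_record(tei, headword, philo_id):
    """Build one fielded record: the headword and PhiloLogic document id
    are kept as separate fields; the stripped article body is the
    full-text search field."""
    return {"head": headword, "philo_id": philo_id, "text": strip_tags(tei)}

sample = "<div1><head>ESTOMAC</head><p>Organe de la digestion.</p></div1>"
record = make_record(sample, "ESTOMAC", "12.345")
```

The fielded data is what lets the query script report headwords and build links back into the PhiloLogic database.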

I have not played at all with parameters, and I can imagine that we would want to perform some functions on input, such as using simple rules for normalization, since it uses a stemmer package also by M. Humphrey. Please email me, post comments, or add a blog entry here if you see problems (particularly search oddities), have ideas about other use cases, or have more general interface notions. I will be writing a more generalized loader and query script -- with paging, numbers of hits per page, filtering by minimum relevancy scores, and a version of the PhiloLogic object fetch which would try to highlight matching terms -- and moving that over to our main servers.

Back to Comparing Similar Documents

I mentioned a little while ago some work I did on comparing one document with the rest of the corpus it belongs to (the examples I used in that blog post will not give the same results anymore, and the results might not be as good, since I haven't optimized the new code for the Encyclopédie yet). The idea behind it was to use the topic proportions for each article generated from LDA, and to come up with a set of calculations to decide which document(s) was closest to the original document. The reason why I'm mentioning it here once more is that I've been through that code again, cleaned it up quite a bit, improved its performance, and tweaked the calculations. Basically, I made it usable for people other than myself. Last time I built a basic search form to use with Encyclopédie articles. This time I'm going to show the command line version, which has a couple more options than the web version.
In the web version, I was using both the top three topics in each document and their individual proportions within that document. For instance, Document A would have topics 1, 2, and 3 as its main topics, with proportions of 0.36, 0.12, and 0.09 respectively. In the command line version, there's the option of only using the topics, without the proportions. The order of importance of each topic is of course still respected. Depending on the corpus you're looking at, you might want to use one model rather than the other; they do give different results. One could of course tweak this some more and decide to take only the proportion of the most prominent topic, therefore giving it more importance. There is definitely room for improvement.
There was also another option that was left out of the web version. By default, I set a tolerance level, that is the score needed by each document in order to be given as a result of the query. In the command line version, I made it possible to define this tolerance in order to get more or fewer results. This option is currently only possible with the refined model (the one with topic proportions). The code is currently living in
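The gist of the refined model with a tolerance threshold might look something like the sketch below. This is Python rather than the actual perl of compare_to_all.pl, the scoring rule is only a guess at the kind of calculation described, and every name is hypothetical:

```python
def similarity(doc_a, doc_b, use_proportions=True):
    """Score the topic overlap between two documents.
    Each document is a list of (topic_id, proportion) pairs, most
    important first. With use_proportions=False, only shared topic ids
    count, weighted by their ranks."""
    score = 0.0
    for rank_a, (topic_a, prop_a) in enumerate(doc_a):
        for rank_b, (topic_b, prop_b) in enumerate(doc_b):
            if topic_a == topic_b:
                if use_proportions:
                    score += min(prop_a, prop_b)
                else:
                    score += 1.0 / (1 + rank_a + rank_b)
    return score

def compare_to_all(query, corpus, tolerance=0.1):
    """Return (name, score) pairs meeting the tolerance, best first."""
    hits = [(name, similarity(query, topics)) for name, topics in corpus.items()]
    return sorted((h for h in hits if h[1] >= tolerance), key=lambda h: -h[1])

doc_a = [(1, 0.36), (2, 0.12), (3, 0.09)]
corpus = {"B": [(1, 0.30), (5, 0.20), (3, 0.10)],
          "C": [(7, 0.40), (8, 0.20), (9, 0.10)]}
```

Raising the tolerance prunes weakly related documents; lowering it returns more, noisier, matches.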
robespierre:/Users/clovis/LDA_scripts/
It's called compare_to_all.pl. There's some documentation in the header to explain how to use it. It's fairly simple. I might do some more work on it, and will update the script accordingly.
There are other applications of this script besides using it on a corpus made of well defined documents. One could very well imagine applying it to a corpus subdivided into chunks of text using a text segmentation algorithm. One could then try to find passages on the same topic(s) using a combination of LDA and this script. The Archives parlementaires could be a good test case.
Another option would be to run every document of a corpus against the whole corpus and store all the results in a SQL database. This would allow having a corpus where each document can be linked to various others according to the mixture of topics they are made of.
I will try to give more concrete results some time soon.

Supervised LDA: Preliminary Results on Homer

While Clovis has been running LDA tests on Encyclopédie texts using the Mallet code, I have been running some tests using the sLDA algorithm. After a few minor glitches, Richard and I managed to get the sLDA code, written by Chong Wang and David Blei, from Blei's website up and running.

Unlike LDA, sLDA (Supervised Latent Dirichlet Allocation) requires a training set of documents paired with corresponding class labels or responses. As Blei suggests, these can be categories, responses, ratings, counts, or many other things. In my experiments on Homeric texts, I used only two classes, corresponding to Homer's two major works: the Iliad and the Odyssey. Akin to LDA, topics are inferred from the given texts and a model is made of the data. This model, having seen the class labels of the texts it was trained on, can then be used to infer the class labels of previously unseen data.

For my experiments, I modified the xml versions of the Homer texts that we have on hand using a few simple perl scripts. Getting the xml transformed into an acceptable format for Wang's code required a bit of finagling, but was not too terrible. My scripts first took the xml and split it into books (the 24 books of the Iliad and likewise for the Odyssey), then stripped the xml tags from the text. Saving out four books from each text for applying the inference step, I took the rest of the books and output the corresponding data file necessary for input into the algorithm (data format here).
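As far as I can tell, the data format Wang's code expects is the familiar LDA-C style: one line per document, of the form "<number of distinct terms> <term id>:<count> ...". Assuming that format, emitting one document line is a few lines of code (a Python sketch, not my actual perl scripts; the tiny vocabulary is made up):

```python
from collections import Counter

def to_ldac(tokens, vocab):
    """One document as an LDA-C style line: the count of distinct
    vocabulary terms present, then id:count pairs. Tokens outside
    the vocabulary are silently dropped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{len(counts)} {pairs}"

vocab = {"wrath": 0, "achilles": 1, "sea": 2}
line = to_ldac("sing the wrath of achilles wrath".split(), vocab)
```

Filtering the vocabulary by frequency, as described below, just means building `vocab` from the words in the chosen frequency band.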

I played around a bit with leaving out words that occurred extremely frequently or extremely rarely. For the results I am posting here, the English vocabulary was vast and I cut it down to words that occurred between 10 and 60 times. This probably cuts it down too much though, so it would be good to try some variations. Richard has suggested also cutting out the proper nouns before running sLDA in order to focus more on the semantic topics. For the Greek vocabulary, I used the words occurring between 3 and 100 times, after stripping out the accents.

Running the inference part of sLDA on the 8 books that I had saved out seemed to work quite well. It got all 8 correctly labeled as to whether they belonged to the Iliad or to the Odyssey. In a reverse run, the inference was able to again achieve 100 percent accuracy on labeling the 40 books after having been trained on only the 8 remaining books.

The raw results of the trials give a matrix of betas with a column for each word, and a row for each topic. These betas thus give a log based weighting of each word in each topic. Following this are the etas, with a column for each topic and a row for each class. These etas give the weightings of each topic in each class, as far as I understand it. Richard and I slightly altered the sLDA code to output an eta for each class, rather than one less than the number of classes as it was giving us. As far as we understand the algorithm as presented in Blei's paper, it should be giving us an eta for each class. Our modification didn't seem to break anything, so we are assuming that it worked, as the results seem to be looking nice. Using the final model data, I have a perl script that outputs the top words in each topic along with the top topics in each class. These are the results that I am giving below.
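Reading off the top words per topic from the beta matrix amounts to sorting each row by weight. My script is in perl; this is an illustrative Python version with a made-up vocabulary and weights:

```python
def top_words(betas, vocab, n=3):
    """betas has one row per topic and one column per word, holding
    log-based weights. Return the n highest-weighted words per topic."""
    topics = []
    for row in betas:
        ranked = sorted(range(len(row)), key=lambda i: -row[i])
        topics.append([vocab[i] for i in ranked[:n]])
    return topics

vocab = ["ship", "sword", "sea", "shield"]
betas = [[-2.0, -5.1, -1.3, -6.0],   # a "seafaring" topic
         [-6.2, -1.1, -5.9, -1.8]]   # a "battle" topic
```

The same sort applied to the eta rows gives the top topics per class.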


Results of my sLDA Experiments on Homer:

English Text: 10 Topics
Greek Text: 10 Topics

Also, samples of the output from Blei and Wang's code, corresponding to the English Text with 100 topics:

Final Model: gives the betas and the etas which I used to output my results
Likelihood: the likelihood of these documents, given the model
Gammas
Word-assignments

Inferred Labels: Iliad has label '0', Odyssey has label '1'.
Inferred Likelihood: the likelihood of the previously unseen texts
Inferred Gammas

I have not played around much with the gammas, but they seem to give a weighting of each topic in each document. Thus you could figure out for which book of the Iliad or the Odyssey a specific topic was the most prevalent. It would be interesting to see if this correctly pinpoints which book the Cyclops comes in, for instance, as this is a fairly easily identifiable topic in most of the trials.



Encyclopédie Renvois Search/Linker

During the summer (2009), a user (UofC PhD, tenured elsewhere) wrote to ask if there was any way to search the Encyclopédie and "generate a list of all articles that cross-reference a given article". We went back and forth a bit, and I slapped a little toy together and let him play with it, to which his reply was "Oh, this is cool! Five minutes of playing with the search engine and I can tell you it shows fun stuff...". This is, of course, an excellent suggestion which we have talked about in the past, usually in the context of visualizing relationships of articles in various ways. At the highest level, visualizing the relationships of the renvois is what Gilles and I attempted to do in our general "cartography paper"[1] and, more recently, Robert and Glenn (et al.) tried, in a radically different way, to do in their work on "centroids"[2].

The current implementation of the Encyclopédie under PhiloLogic will allow users to follow renvois links (within operational limits to be outlined below), but does not support searching and navigating the renvois in any kind of systematic fashion. Since this is something I think warrants further consideration, I thought it might be helpful to document this toy, give some examples, let folks play with it, outline some of the current issues, and conclude with some ideas about what might be done going forward.

To construct this toy, I wrote a recognizer to extract metadata for each article in the Encyclopédie which has one or more renvois. As part of the original development of the Encyclopédie, each cross-reference was automatically detected from certain typographic and lexical clues. This resulted in roughly 61,000 cross-references; accordingly, the extracted database has 61,000 records. I loaded these into a simple MySQL database and used a standard script to support searching and reporting. The search parameters may include article headwords, authors, normalized and English classes of knowledge, as well as the term(s) being cross-referenced. For example, there are 39 cross-referenced article pairs for the headword estomac. As you can see from the output, I'm listing the headword, author, classes of knowledge, and the cross-referenced term. You can get the article of the cross-referenced term or the cross-references in that article. Thus, the second example shows the link to Digestion:

ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==> Digestion || renvois
[The renvois of Digestion find 56 article pairs, including one to intestins]
DIGESTION (Venel: Economie animale, Animal economy ) ==> Intestins || renvois
Intestins (unknown: Anatomie, Anatomy ) ==> Chyle || renvois


and so on ==> lymphe ==> sang ==> ad nauseam. No, there is no ad nauseam; that's just how you might feel after going round and round.
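A sketch of the kind of table and lookup involved, using sqlite3 in Python as a stand-in for the actual MySQL database; the column names and sample rows are illustrative only:

```python
import sqlite3

# Hypothetical schema: one row per (article, cross-referenced term) pair.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE renvois (
    headword TEXT, author TEXT, classification TEXT, target TEXT)""")
rows = [("ESTOMAC, ventriculus", "Tarin", "Anatomie", "Digestion"),
        ("DIGESTION", "Venel", "Economie animale", "Intestins"),
        ("Intestins", "unknown", "Anatomie", "Chyle")]
db.executemany("INSERT INTO renvois VALUES (?, ?, ?, ?)", rows)

def cross_refs(headword):
    """All terms cross-referenced from articles matching a headword."""
    cur = db.execute("SELECT target FROM renvois WHERE headword LIKE ?",
                     (headword + "%",))
    return [r[0] for r in cur]
```

The same table, queried in the other direction (WHERE target LIKE ...), answers the user's original question: which articles cross-reference a given article.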

Now, there are problems, but please go ahead and play with this now using the submit form, as long as you promise to come back and read thru the rest of this and let me know about any other problems.

Problems

As noted above, the renvois were identified automatically, and as with most of these things, it worked reasonably well. But you will see link errors and other things which indicate problems. Glenn reported these to me and I was going to eliminate them. On second thought, this little toy lets us consider the renvois rather more systematically. Where you see a link error, it is (probably) a recognizer error, which either failed to get a string to link or got confused by some typography. The linking mechanism itself is based on string searches. In other words, whenever you click on a renvois, you are in fact performing a search on the headwords. This simple heuristic works reasonably well, returning string-matched headwords. In some cases, you get nothing because there is no headword that matches the renvois word(s); at other times you will get quite a list of articles, which may or may not include what the authors/editors intended. It is, of course, well known that many renvois simply don't correspond to an article, and many others differ in various ways from the article headwords. I am also applying a few rules to renvois searching to try to improve recall and reduce noise, which adds another level of indirection.

Now, ideally, one would go through the entire database, examine each renvois and build a direct link to the one article that the authors/editors intended. But we're talking 60,000+ renvois against 72,000 (or so) articles and it is not clear that humans could resolve this in many instances. When Gilles and I worked on this, we used a series of (long forgotten) heuristics to filter out noise and errors. So, this simple toy works within operational limits and gives us a way to more systematically identify possible errors and ways to improve it.

Future Work

Aside from being a quick and dirty way to get some notion of the errors in the renvois, we might be able to make this more presentable. Please feel free to play with this and suggest ways to think about it. In the long haul, I would love a totally cool visualization: a clickable directed graph, so you could click on a node and re-center it on another article, class of knowledge, or author. Maybe something like Tricot's representation of the classes of knowledge, or maybe something like DocuBurst. Marti Hearst's chapter on visualizing text analysis is a treasure-trove of great ideas.

For the immediate term, I would like to recast this simple model to allow the user to specify a number of steps. Set the number of iterations to follow, and you would get something like:

ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==> Digestion || renvois
DIGESTION (Venel: Economie animale, Animal economy ) ==> Intestins || renvois
Intestins (unknown: Anatomie, Anatomy ) ==> Viscere || renvois
ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==> Chyle || renvois
CHYLE (Tarin: Anatomie | Physiologie, Anatomy. Physiology ) ==> Sanguification || renvois
SANGUIFICATION (unknown: Physiologie, Physiology ) ==> Respiration || renvois
RESPIRATION (unknown: Anatomie | Physiologie, Anatomy | Physiology ) ==> Air || renvois


The script would follow these chains of renvois either until they run out or until it hits the iteration limit. I will try to follow this up with the multi-iteration model and see if I can recover some of what Liz tried to do using GraphViz to generate clickable directed graphs.
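The multi-step model reduces to something like the following sketch. Real renvois resolution goes through headword string searches, which I flatten here into a simple lookup table with made-up entries:

```python
def follow_chain(start, links, max_steps=5):
    """Follow renvois from article to article until the chain runs out,
    revisits an article, or reaches the iteration limit."""
    chain, seen = [start], {start}
    while len(chain) < max_steps:
        nxt = links.get(chain[-1])
        if nxt is None or nxt in seen:
            break          # dead end, or we have come full circle
        chain.append(nxt)
        seen.add(nxt)
    return chain

# A cycle, as in the estomac example above.
links = {"ESTOMAC": "Digestion", "Digestion": "Intestins",
         "Intestins": "Chyle", "Chyle": "ESTOMAC"}
```

The `seen` set is what keeps the round-and-round cases from looping forever even when the iteration limit is generous.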

References

[1] Gilles Blanchard and Mark Olsen, "Le système de renvoi dans l'Encyclopédie: Une cartographie des structures de connaissances au XVIIIe siècle", Recherches sur Diderot et sur l'Encyclopédie, no. 31-32, L'Encyclopédie en ses nouveaux atours électroniques: vices et vertus du virtuel (2002), published online 16 March 2008.

[2] Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Re-engineering the tree of knowledge: Vector space analysis and centroid-based clustering in the Encyclopédie", Digital Humanities 2008, University of Oulu, Oulu, Finland, June 25-29, 2008

Archives Parlementaires: lèse (more)

As I mentioned in my last post in this thread, I was a bit surprised to see just how prevalent the construction lèse nation had become early in the Revolution. The following is a sorted KWIC of lEse in the AP, with the object type restricted to "cahiers", resulting in 38 occurrences. These are, of course, the complaints sent to the King, reflecting relatively early developments of Revolutionary discourse. Keeping in mind all of the caveats regarding this data, we can see some interesting and possibly contradictory uses:
CAHIER: (p.319)sent être, comme criminels de lèse-humanité au premier chef, et ils se joindront au
CAHIER GÉN...: (p.77)manière de juger, qui lèse les droits les plus sacrés des citoyens, doit av
CAHIER: (p.697)r individus, cette concession lèse les et avoir eu d'autre mo dre une r {La partie d
CAHIER: (p.108)e, excepté dans les crimes de lèse-majesté au premier chef. Art. 33. Qu'aucun jugem
CAHIER: (p.791) si ce n'est pour le crime de lèse-majesté au premier chef, et réduite aux seuls c
CAHIER: (p.448)té seulement pour le crime de lèse-majesté au premier chef ou pour celui de haute t
CAHIER: (p.409)s choses saintes, et crime de lèse-majesté, dans tous les cas spécifiés par l'ord
CAHIER: (p.260)istériels, sauf pour crime de lêse-majesté, de haute trahison et autres cas, qui se
CAHIER: (p.42)e, à l'exception des crimes de lèse-majesté, de péculat et de concussion; mais, dan
CAHIER: (p.780), si ce n'était pour crime de lèse-majesté divine et humaine. Art. 9. Qu'ii soit as
CAHIER: (p.476)ée, si ce n'est pour crime de lèse-majesté divine et humaine. Art. 8. Qu'il soit as
CAHIER: (p.584)our le meurtre et le crime de lèse-majesté divine ou humaine, et que hors de ce cas
CAHIER: (p.378)ont seuls juges des crimes de lèse-majesté et de lèse-nation. Art. 8. Le compte de
CAHIER: (p.42)re précise ce qui est crime de lèse-majesté. Et que l'on établisse quels sont les c
CAHIER.: (p.117)déclaré coupable du crime de lèse-majesté etnation. et comme tel, puni des peines
CAHIER GÉN...: (p.671) excepté le crime de lèse-majesté, le poison, l'incendie et assassinat sur
CAHIER: (p.660) les cas, excepté le crime de lèse majesté, le poison, l'incendie et assassinat sur
CAHIER: (p.532)hommes coupables elu crime de lèse-majesté nationale; l'exemple elu passé nous a m
CAHIER: (p.645)poursuivis comme criminels de lèse-majesté nationale; que visite soit faite dans le
CAHIER: (p.383)s par elle comme criminels de lèse-majesté, quand ils tromperont la confiance du so
CAHIER: (p.286)s crimes de lèse-nation ou de lèse-majesté seulement; et que, dans ce cas, l'accus
CAHIER GÉN...: (p.210)ni comme criminel de lèse-majesté; 4° Cette loi protectrice de la libert
CAHIER: (p.35)rrémissibles comme le crime de lese-majesté. 13° 'Qu'en matière civile comme en mat
CAHIER: (p.378) crimes de lèse-majesté et de lèse-nation. Art. 8. Le compte des finances imprimé a
CAHIER: (p.359) crimes de lèsemajesté, et de lèse-nation, ce qui comprend les crimes d'Etat. 7° En
CAHIER: (p.301)ort infâme, comme coupable de lèse-nation, celui qui sera convaincu d'avoir violé c
CAHIER.: (p.536) et punis comme coupables de lèse nation. 17" De demander 1 aliénation irrévocabl
CAHIER: (p.82)x, sera déclarée criminelle de lèse-nation et poursuivie comme telle, soit par les Et
CAHIER: (p.402)tte règle seront coupables de lèse-nation et poursuivis comme tels dès qu'ils auron
CAHIER: (p.285) patrie, coupable du crime de lèse-nation, et puniecomme telle par le tribunal qu'é
CAHIER: (p.544) coupables de rébellion et de lèse-nation, favoriser la violation de la constitution
CAHIER: (p.42)lisse quels sont les crimes de lèse-nation. Le vœu des bailliages est que les ressor
CAHIER: (p.285)n user que pour {es crimes de lèse-nation ou de lèse-majesté seulement; et que, da
CAHIER: (p.402)s généraux, comme coupable de lèse-nation; que les impositions seront réparties dan
CAHIER: (p.320)e défendre, c'est un crime de lèse-nation. Qui pourrait nier que dans la génératio
CAHIER: (p.388)-mêmes; déclarant criminel de lèse-nation tous ceux qui pourraient entreprendre dire
CAHIER.: (p.249)sions. Ce serait vu crime de lèse-patrie de ne pas correspondre à sa confiance pat
CAHIER GÉN...: (p.221)i serait un crime de lèse-patrie. 2° De demander l'abolition de la gabelle
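A KWIC display of this sort is simple to produce. Here is a minimal sketch (Python, not the PhiloLogic implementation, and without PhiloLogic's accent-insensitive lEse pattern matching):

```python
def kwic(text, term, width=30):
    """Keyword-in-context lines: each occurrence of term with up to
    `width` characters of left and right context."""
    lines, lower, start = [], text.lower(), 0
    while True:
        i = lower.find(term.lower(), start)
        if i == -1:
            break
        left = text[max(0, i - width):i]
        hit = text[i:i + len(term)]
        right = text[i + len(term):i + len(term) + width]
        lines.append(left + hit + right)
        start = i + len(term)
    return lines
```

The "sorted" part of a sorted KWIC is then just sorting these lines on the context to the right of the keyword.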
These include "lèse-majesté nationale", "lèse-majesté et nation" (OCR error fixed), "crimes de lèse-majesté et de lèse-nation", and (my favorite) "crime de lèse-majesté divine et humaine". Kelly suggests that notions of royal authority had been trimmed over the 18th century, and with this reduction came a restriction of just what would constitute lèse-majesté and of what kinds of crimes it would apply to. He argues that it was only in 1787, with the Assembly of Notables, that the idea of the nation "begins to take shape in a public glare", and further suggests that the decrees of September 1789 establishing the punishments for lèse-nation (and subsequent events) show the "confused and arbitrary genesis of lèse-nation".

See also the 11 entries in our Dictionnaires d'autrefois for lese which stress lèse-majesté through the entire period with lèse-nation being left as an after-thought, such as in the DAF (8th edition): "Il se joint quelquefois, par analogie, à d'autres noms féminins. Crime de lèse-humanité, de lèse-nation, de lèse-patrie." One should not construe this as excessively conservative, however, since lèse-majesté is, by far, the most common construction in the 19th and 20th centuries (at least as represented in ARTFL-Frantext).

Topic Based Text Segmentation Goodies

As you may recall, Clovis ran some experiments this summer (2009) applying a perl implementation of Marti Hearst's TextTiling algorithm to perform topic-based text segmentation on different French documents (see his blog post and related files). Clovis reasonably suggests that some types of literary documents, such as epistolary novels, may be more suitable candidates than other types, because they do not all have the same degree of structural cohesion. Now, as I mentioned in my first discussion of the Archives Parlementaires, I suspect that this collection may be particularly well suited to topic-based segmentation. At the end of his post, Clovis also suggests that we might be able to test how well a particular segmentation approach is working by using a clustering algorithm, such as LDA Topic Modeling, to see if the segments can be shown to be reasonably cohesive. Both topic segmentation and modeling are difficult to assess because human readers/evaluators can have rather different opinions, leading to problems of "inter-rater reliability", which is probably a more vexing problem in the humanities and related areas of textual studies than in other domains.

Earlier this year (and a bit last year), I also ran some experiments on some 18th century English materials, such as Hume's History of England and the Federalist Papers. Encouraged by these results, particularly on the Federalist Papers, I have accumulated a number of newer algorithms, packages, and papers which may be useful for future work in this area. These are on my machine (ARTFL folks, let me know if you want to know where), but I will not redistribute them here, as a couple of the packages come with non-redistribution or other limitations. I am putting in links to some of the source files when I have them.

Since Hearst's original work, there have been a number of different approaches to topic-based text segmentation. Clovis and I have tried to make note of much of this work on our CiteULike references (segmentation). There is some overlap with Shlomo's list. In no particular order of preference or chronology, here is what I have so far. I will also try to provide some details on using these when I have a chance to run them up.
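Most of these approaches share the same core move as TextTiling: score the lexical cohesion between adjacent blocks of text and place boundaries at the low points. A toy illustration of that move, using raw word overlap instead of Hearst's smoothed cosine of term vectors (purely a sketch, with made-up blocks):

```python
def cohesion_scores(blocks):
    """Lexical-cohesion score between each pair of adjacent text blocks:
    shared words over total words (a crude stand-in for TextTiling's
    cosine similarity). Segment boundaries fall at the low points."""
    scores = []
    for a, b in zip(blocks, blocks[1:]):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        scores.append(len(wa & wb) / len(wa | wb) if wa | wb else 0.0)
    return scores

blocks = ["the king taxed the towns", "the towns paid the king",
          "ships sailed the wine dark sea"]
```

On this toy input, the score drops sharply between the second and third blocks, which is exactly where a segmenter would propose a topic boundary.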

From the Columbia NLP group (http://www1.cs.columbia.edu/nlp/tools.cgi), we have both Min-Yan Kan's Segmenter and Michael Galley's LCSeg. These required signing a use agreement, which I have in my office. The release archives for both include papers and some test data.

I spent some time trying to track down Freddy Choi's C99 algorithm and implementation described in some papers in the early part of this decade. I finally tracked it all down on the WayBack Machine at Internet Archive (link, thank you!!), which also has some papers, software, data and implementations of TextTiling and other approaches from that period. It appears several of the packages below use C99 and some of the code from this.

I was going to reference Utiyama and Isihara's implementation (TextSeg), but in the few months since I assembled this list, the link has (also) gone dead:
http://www2.nict.go.jp/x/x161/members/mutiyama/software.html#textseg
This appears to be a combination of approaches.

Igor Malioutov's MinCut code (2006) is available from his page:
http://people.csail.mit.edu/igorm/acl06code.html

There appears to be some info on TextTiling in Simon Cozens (2006), "Advanced Perl Programming".

We also want to check out Beeferman et al. (link), since I recall that this group had done some interesting work. I have Beeferman's implementation of TextTiling in C, but don't think I have run across anything else.

If you run across anything useful, please blog it here or let me know. Papers should be noted on our CiteUlike. Thanks!!

Archives Parlementaires: lèse collocations

The collocation table function of PhiloLogic is a quick way to look at changes in word use. Lèse majesté, treason or injuries against the dignity of the sovereign or state, is a common expression. The collocation table below shows terms around "lese | leze | lèse | lèze | lése | léze" in ARTFL Frantext (550 documents, 1700-1787) with majesté being by far the most common.
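A collocation table is essentially a span-restricted frequency count. The general idea can be sketched as follows (in Python; PhiloLogic's actual implementation, filtering, and function-word handling differ):

```python
import re
from collections import Counter

def collocates(text, targets, span=5):
    """Count words occurring within `span` words of any target form."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, w in enumerate(words):
        if w in targets:
            window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = "crime de lèse majesté au premier chef et crime de lèse majesté divine"
c = collocates(text, {"lèse"}, span=2)
```

Passing the whole set of spelling variants ("lese", "leze", "lèse", "lèze", "lése", "léze") as `targets` reproduces the accent-tolerant search described above.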



It is interesting to note that the construction "lèse nation" does not appear once in this report. Searching for "lèse nation" before the Revolution in ARTFL-Frantext finds a single occurrence, in Mirabeau's [1780] Lettres écrites du donjon de Vincennes, where he complains that "toute invocation de lettre-de-cachet me paraît un crime de lèse-nation". The collocation table for lEse in the current sample of the Archives Parlementaires (there are no instances of lEze in this dataset) shows the lèse nation construction to be far more frequent.




There have been discussions* of the transition from lèse majesté to lèse nation, which is clearly shown here. Now, a reasonable objection is that this report includes the entire revolutionary period (or as much of it as we have at the moment). But we see roughly the same rates and ranking for lèse in 1789 alone.

It would appear -- though I would not put too much stock in these numbers -- that the shift from majesty to nation, and all that this implies in terms of the way the state is envisaged, was well under way by 1789. Either this happened very quickly in the years leading up to the Revolution, since the construction appears just once in ARTFL-Frantext before then, or it was a development that took place in types of documents not found in the rather more literary/canonical sample in ARTFL-Frantext, such as journals, pamphlets, and other more ephemeral materials. I guess data entry projects will never end.

One other observation: I like the collocation cloud as a graphic. But if you examine the table, you may notice that the cloud does not really represent the frequency differences all that well. The second table -- covering all of the AP -- shows that nation occurs more than 6 times as frequently as majesté, but differences of that magnitude tend to be rather difficult to show in a cloud. So, the compromise of providing both is probably the best approach.

* G. A. Kelly, "From Lèse Majesté to Lèse nation: Treason in 18th century France", Journal of the History of Ideas, 42 (1981): 269-286 (JStor).



Archives Parlementaires (I)

A couple of weeks ago, some ARTFL folks discussed the notion of outlining some research and/or development projects that we will be, or would like to be, working on in the coming months. We discussed a wide range of possibilities that could involve substantive work, using some of the systems we have already developed or are working on, or more purely technical work. Everyone came up with some pretty interesting projects and proposals, and we decided that it might be entertaining and useful for each of us to outline a specific project or two and write periodic entries here as things move forward. In the cold light of sobriety, this sounds like a pretty good idea. So, let me be the first to give this a whirl.

Our colleagues at the Stanford University Library have been digitizing the Archives Parlementaires using the DocWorks system. During a recent visit, Dan Edelstein was kind enough to deliver 43 volumes of OCRed text, which represents about half of the entire collection. Dan and I very hastily assembled an alpha text build of this sample under PhiloLogic. I converted the source data into a light TEI notation and attempted to identify probable sections in the data, such as "cahiers", "séances", and other plausible divisions, using an incredibly simple approach. Dan built a table to identify volumes and years, which we used to load the dataset in (hopefully) coherent order. This is a very alpha test build. It is uncorrected OCR (much of which is surprisingly good) without links to page images. The volumes are being scanned in no particular order, so we have volumes from a large swath of the collection. We are hoping to get the rest of the volumes from Stanford in the relatively near future and will be working up a more coherent and user-friendly site, with page images and the like. So, with these caveats, here is the PhiloLogic search form.

The Archives Parlementaires are the official, printed record of French legislative assemblies from the beginning of the Revolution (1787) through 1860. We are interested in the first part of the first series (82 volumes), out of copyright, ending in January 1794, which contains records pertaining to the Constituent Assembly, the Legislative Assembly, and the Convention. The first seven volumes of the AP are the general Cahiers de doléances, which are organized by locality and estate (clergy, nobility, and third). The rest contain debates, speeches, draft legislation, reports, and many other kinds of materials, typically organized by legislative session, often twice daily (morning and evening).

There will be some general housekeeping required to start. Some of this will involve writing a better division recognizer, particularly for the Cahiers, which currently do not include the place name and estate. I will also need to decide how to handle annexes, editorial materials, notes, etc. I suspect that it may also be worth some effort to try to correct some of the errors automatically, by simple replacement rules and identification of impossible character sequences. I am also thinking of using proximity measures to try to correct some proper names, such as Bobespierre, Kobespierre, etc. I would also want to concentrate some effort on terms that may reflect structural divisions. Dan has suggested identification of speakers, where possible, so one could search the speeches (full and in debates) of specific individuals like Robespierre, but this appears to be fairly problematic, since it is not clear how to identify just where a given speech stops.
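The proximity idea for proper names can be sketched very simply: match suspect tokens against a gazetteer of known names by edit-distance similarity. This is only an illustration of the approach, not the actual AP pipeline; the name list and cutoff below are invented for the example.

```python
# Proximity-based correction of OCR'd proper names: a candidate token
# is replaced by the closest gazetteer entry when the string-similarity
# ratio clears a cutoff. Gazetteer and cutoff are illustrative only.
from difflib import get_close_matches

known_names = ["Robespierre", "Mirabeau", "Danton", "Condorcet"]

def correct_name(token, names=known_names, cutoff=0.8):
    """Return the closest known name, or the token unchanged."""
    match = get_close_matches(token, names, n=1, cutoff=cutoff)
    return match[0] if match else token

for ocr in ["Bobespierre", "Kobespierre", "Paris"]:
    print(ocr, "->", correct_name(ocr))
```

A cutoff this high keeps ordinary words like "Paris" from being pulled toward an unrelated name, at the cost of missing more heavily damaged tokens.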

Loading this data, particularly the complete (or at least out-of-copyright) dataset, will probably be of general utility to Revolutionary historians, especially when linked to page images and given some other enhancements. This will be done in conjunction with our colleagues at Stanford and other researchers.

I have several rather distinct research efforts in mind. There are a series of technical enhancements which I think fit the nature of the data fairly well:
  • sequence alignment to identify borrowed passages from earlier works, such as Rousseau and Montesquieu,
  • topic-based text segmentation, to split individual sessions into parts, and
  • topic modeling or clustering to attempt to identify the topics of the parts identified by topic-based segmentation.
We have already run experiments using PhiloLine, the many-to-many sequence aligner which we are using for various other applications. As we have found, this works relatively well on uncorrected OCR. For example, Condorcet, in the Séance du vendredi 3 septembre 1790 [note the OCR error below], borrows a passage from Voltaire's Épitres in his

Nouvelles réflexions sur le projet de payer la dette exigible en papier forcé, par M. GoNDORCET.

Un maudit Écossais, chassé de son pays, Vint changer tout en France et gâter nos esprits. L'espoir trompeur et vain, l'avarice au teint blême, Sous l'abbé Terrasson calculaient son système, Répandaient à grands flols les papiers imposteurs, Vidaient nos coffres-forts et corrompaient no s mœurs.

Un maudit écossais, chassé de son pays,
vint changer tout en France, et gâta nos esprits.
L'espoir trompeur et vain, l'avarice au teint blême,
sous l'abbé Terrasson calculant son système,
répandaient à grands flots leurs papiers imposteurs,
vidaient nos coffres-forts, et corrompaient nos
moeurs;
without specific reference to Voltaire (that I could find). This is generally pretty decent OCR. The alignments also work on poorer-quality text and where there are significant insertions or deletions. For example:

Rousseau, Jean-Jacques, [1758], Lettre à Mr. d'Alembert sur les spectacles:
autrui des accusations qu'elles croient fausses; tandis qu'en d'autres pays les femmes, également coupables par leur silence et par leurs discours, cachent, de peur de représailles, le mal qu'elles savent, et publient par vengeance celui qu'elles ont inventé. Combien de scandales publics ne retient pas la crainte de ces sévères observatrices? Elles font presque dans notre ville la fonction de censeurs. C'est ainsi que dans les beaux tems de Rome , les citoyens, surveillans les uns des autres, s'accusoient publiquement par zele pour la justice; mais quand Rome fut corrompue et qu'il ne resta plus rien à faire pour les bonnes moeurs que de cacher les mauvaises, la haine des vices qui les démasque en devint un. Aux citoyens zélés succéderent des délateurs infames; et au lieu qu'autrefois les bons accusoient les méchans, ils en furent accusés à leur tour . Grâce au ciel, nous sommes loin d'un terme si funeste. Nous ne sommes point réduits à nous cacher à nos propres yeux, de peur de nous faire horreur. Pour moi, je n'en aurai pas meilleure opinion des femmes, quand elles seront plus circonspectes: on se ménagera davantage, quand on
Séance publique du 30 avril 1793, l'an II de la:
son tribunal n'exerce pas, d'ailleurs, une autorité aussi 1 mu soire qu'on pourrait le croire ; il se fait J"_ tice d'une partie de la violation des lois «j ciales ; ses vengeances sont terribles p l'homme libre, puisque la censure o lst "°" la honte et le mépris : et combien cle st* § dales publics ne retient pas la crainte m. châtiments ? Dans les beaux temps cle n°*ji les citoyens, surveillants nés les uns a es» s'accusaient publiquement par zèle p % justice. Mais quand Rome fut corrompu^ citoyens zélés succédèrent des oeiai •„ t fâmes; au lieu qu'autrefois les bons accu- -^ les méchants, ils en furent accuses tour . -, rla méEn Egypte, la censure ssu_ v moire des morts ; la comédie eut o*" B^^ des un pouvoir plus étendu sur la rep vivants. „ •* i„ t-Ole niani^ 1 * L'esprit de l'homme est fait ae te ut rtr-c, encore plus du ridicule que d'un ,»ïl u
The Rousseau passage is found in a speech titled Nécessité d'établir une censure publique par J.-P. Picqué, which does not appear to mention the title and possibly not Rousseau at all (as far as I can tell). As you can see, this is fairly messy OCR and is significantly truncated. We have a preliminary database running and will probably release this once we have the entire set and have experimented further with alignment parameters.
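The reason this kind of alignment survives OCR noise can be seen in a toy version of the underlying idea: compare passages by their overlapping word n-grams, so that shared spans still register even when individual words are garbled or dropped. PhiloLine's actual matching and filtering are considerably more elaborate; the functions and the similarity measure below are invented for illustration.

```python
# Toy shingling comparison: two passages share word trigrams, so a
# single OCR error (gâta -> gâter) only removes the trigrams that
# touch it, leaving the rest of the overlap intact.
import re

def shingles(text, n=3):
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=3):
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / max(1, min(len(sa), len(sb)))

voltaire = "vint changer tout en France, et gâta nos esprits"
ocr_text = "Vint changer tout en France et gâter nos esprits"
print(round(overlap(voltaire, ocr_text), 2))  # → 0.57
```

Even with one word garbled, more than half of the trigrams still match, which is why a threshold on shared shingles can flag borrowings in messy OCR like the Picqué speech above.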

Preliminary work that I have done on topic-based text segmentation, which Clovis followed up on in more detail (link), suggests that the individual séances may be particularly good candidates for topic segmentation, since the topics can shift around radically. Running text without clear topic shifts tends not to segment as well. There are a number of newer approaches than the Hearst TextTiling implementation (which I will blog about when I run them up) that may be more effective.
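The core of a TextTiling-style segmenter is easy to sketch: score each gap between blocks of sentences by the lexical cosine similarity of the windows on either side, and treat deep valleys as candidate topic boundaries. The window size and the tiny example sentences below are assumptions for illustration, not Hearst's exact parameters.

```python
# TextTiling-style gap scoring: low lexical similarity across a gap
# between sentence windows suggests a topic boundary (e.g., between
# agenda items within a séance).
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def gap_scores(sentences, w=2):
    """Lexical similarity across each gap, using w sentences per side;
    low scores mark likely topic boundaries."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [cosine(sum(bags[i - w:i], Counter()), sum(bags[i:i + w], Counter()))
            for i in range(w, len(bags) - w + 1)]

sents = ["the king spoke of taxes", "taxes on grain and salt",
         "the army needs new recruits", "recruits for the army"]
print(gap_scores(sents))  # the single gap scores low: a likely boundary
```

A full implementation would add smoothing and a depth-score threshold over these raw gap scores, but the valley-finding intuition is the same.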

Finally, on the technical side, I want to experiment with LDA topic modeling. Again, Clovis' initial work on topic identification for the articles of Écho de la fabrique indicates that, if one can get good topic segments, the modeling algorithm may be fairly effective. Oddly enough, I cannot recall anyone doing the "topic two-step", where one would apply topic modeling to parts of documents split up by a topic-based segmentation algorithm. Or I may have missed some important papers. The idea behind all of this is to build the ability to search for relatively coherent topics, either for browsing or searching.

So far, I have been talking about some more technical experimentation to see if certain algorithms, or general approaches, might be effective on a large and fairly complex document space. While I used the AP for significant work when I was doing Revolutionary studies, my initial systematic interest was in the general Cahiers de doléances. For my dissertation, and some later articles ("The Language of Enlightened Politics: The Société de 1789 in the French Revolution," Computers and the Humanities 23 (1989): 357-64), I keyboarded a small sample of the Cahiers (don't ever, ever do that as a poor graduate student :-) to serve as a baseline corpus for looking at changes in Revolutionary discourse over time, with specific reference to the materials published by the Société de 1789. I suspect that a statistical analysis of the language in the cahiers may bring to light interesting differences between the Estates, urban/rural, and north/south. For this set of tasks, I am planning to use the comparative functions of PhiloMine to examine the degree to which these divisions can be identified using machine learning approaches and, if so, what kinds of lexical differences can be identified. It would be equally interesting to compare a more linguistic analysis to the content analysis results found in Gilbert Shapiro et al., Revolutionary Demands: A Content Analysis of the Cahiers de doléances of 1789.
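The comparative question, whether the Estates can be told apart by their language, reduces to a text-classification experiment. PhiloMine is its own tool; this is only a sketch of the general approach, assuming scikit-learn, with invented placeholder snippets standing in for cahier texts.

```python
# Classify (placeholder) cahier snippets by estate with a bag-of-words
# model; if held-out prediction works, the model's weights point at
# the discriminating vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["dime clerge paroisse culte", "noblesse privilege fief chasse",
         "impot corvee taille pain", "clerge benefice dime paroisse",
         "privilege noblesse fief banalite", "taille gabelle corvee impot"]
labels = ["clergy", "nobility", "third", "clergy", "nobility", "third"]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
print(clf.predict(vec.transform(["dime paroisse clerge"])))
```

On real cahiers one would cross-validate and then read off the highest-weighted terms per class, which is where the interesting lexical differences between the Estates would show up.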

I will, as promised (or threatened) above, try to blog good results and failures of these efforts here -- remember, Edison is credited with saying, while trying to invent the lightbulb, "I have not failed. I've just found 10,000 ways that won't work." -- so we can all consider them.