Our colleagues at the Stanford University Library have been digitizing the Archives Parlementaires using the DocWorks system. During a recent visit, Dan Edelstein was kind enough to deliver 43 volumes of OCRed text, which represents about half of the entire collection. Dan and I very hastily assembled an alpha text build of this sample under PhiloLogic. I converted the source data into a light TEI notation and attempted to identify probable sections in the data, such as "cahiers", "séances", and other plausible divisions, using an incredibly simple approach. Dan built a table to identify volumes and years, which we used to load the dataset in (hopefully) coherent order. This is a very alpha test build. It is uncorrected OCR (much of which is surprisingly good) without links to page images. The volumes are being scanned in no particular order, so we have volumes from a large swath of the collection. We are hoping to get the rest of the volumes from Stanford in the relatively near future and will be working up a more coherent and user-friendly site, with page images and the like. So, with these caveats, here is the PhiloLogic search form.
The Archives Parlementaires are the official, printed record of French legislative assemblies from the beginning of the Revolution (1787) through 1860. We are interested in the first part of the first series (82 volumes), which is out of copyright, ends in January 1794, and contains records pertaining to the Constituent Assembly, Legislative Assembly, and the Convention. The first seven volumes of the AP are the General Cahiers de doléances, which are organized by locality and estate (clergy, nobility, and third). The rest contain debates, speeches, draft legislation, reports, and many other kinds of materials, typically organized by legislative session, often twice daily (morning and evening).
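Since the collection is organized by cahier and by session, a very simple division recognizer can be sketched as a pass over the OCR lines that looks for heading-like openings. This is only an illustration under stated assumptions, not the recognizer used for the alpha build: the two patterns, the `DIVISION_PATTERNS` table, and the `tag_divisions` function are hypothetical names invented here.

```python
import re

# Hypothetical heading patterns; the real data would need many more variants.
DIVISION_PATTERNS = [
    ("seance", re.compile(r"^\s*S[ée]ance\b", re.IGNORECASE)),
    ("cahier", re.compile(r"^\s*Cahier\b", re.IGNORECASE)),
]


def tag_divisions(lines):
    """Return (line_number, division_type, text) for lines that open like a heading."""
    found = []
    for number, line in enumerate(lines, start=1):
        for kind, pattern in DIVISION_PATTERNS:
            if pattern.match(line):
                found.append((number, kind, line.strip()))
                break  # one label per line is enough for this sketch
    return found


sample = [
    "Séance publique du 30 avril 1793",
    "M. le Président donne lecture d'une lettre.",
    "CAHIER de doléances du tiers état",
    "dans la séance d'hier",
]
for number, kind, text in tag_divisions(sample):
    print(number, kind, text)
```

Anchoring the patterns at the start of the line keeps running text that merely mentions a séance from being tagged, at the cost of missing headings that the OCR has mangled.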
There will be some general housekeeping required to start. Some of this will involve writing a better division recognizer, particularly for the Cahiers, whose divisions currently do not include the place name and estate. I will also need to decide how to handle annexes, editorial materials, notes, etc. I suspect that it may also be worth some effort to try to correct some of the errors automatically, by simple replacement rules and identification of impossible sequences. I am also thinking of using proximity measures to try to correct some misrecognized proper names, such as Bobespierre, Kobespierre, etc. I would also want to concentrate some effort on terms that may reflect structural divisions. Dan has suggested identification of speakers, where possible, so one could search the speeches (full and in debates) of specific individuals like Robespierre, but this appears to be fairly problematic, since it is not clear how to identify just where these speeches stop.
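As a rough sketch of the proximity idea for proper names, the standard library's `difflib.get_close_matches` can map a misrecognized token onto the closest entry of a name list. The `KNOWN_NAMES` list, the `correct_name` helper, and the 0.85 cutoff are assumptions made for illustration; a real authority list and threshold would have to come from the data.

```python
import difflib

# Hypothetical authority list of speaker names.
KNOWN_NAMES = ["Robespierre", "Danton", "Marat"]


def correct_name(token, known_names=KNOWN_NAMES, cutoff=0.85):
    """Return the closest known name if it is similar enough, else the token unchanged."""
    matches = difflib.get_close_matches(token, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else token


for token in ["Bobespierre", "Kobespierre", "citoyens"]:
    print(token, "->", correct_name(token))
```

A high cutoff keeps ordinary words from being pulled toward a name, but it will also miss names with more than one or two damaged characters, so the threshold is a trade-off to tune.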
Loading this data, particularly the complete (or at least out-of-copyright) dataset, will probably be of general utility to Revolutionary historians, particularly when linked to page images and given some other enhancements. This will be done in conjunction with our colleagues at Stanford and other researchers.
I have several rather distinct research efforts in mind. There is a series of technical enhancements which I think fit the nature of the data fairly well:
- sequence alignment to identify borrowed passages from earlier works, such as Rousseau and Montesquieu,
- topic-based text segmentation, to split individual sessions into parts, and,
- topic modeling or clustering to attempt to identify the topics of parts identified by topic-based segmentation.
Archives Parlementaires, Nouvelles réflexions sur le projet de payer la dette exigible en papier forcé, par M. GoNDORCET:
Un maudit Écossais, chassé de son pays, Vint changer tout en France et gâter nos esprits. L'espoir trompeur et vain, l'avarice au teint blême, Sous l'abbé Terrasson calculaient son système, Répandaient à grands flols les papiers imposteurs, Vidaient nos coffres-forts et corrompaient no s mœurs.
Aligned source passage (Voltaire):
Un maudit écossais, chassé de son pays,
vint changer tout en France, et gâta nos esprits.
L'espoir trompeur et vain, l'avarice au teint blême,
sous l'abbé Terrasson calculant son système,
répandaient à grands flots leurs papiers imposteurs,
vidaient nos coffres-forts, et corrompaient nos moeurs;
The Archives Parlementaires text quotes these lines without specific reference to Voltaire (that I could find). This is generally pretty decent OCR. The alignments also work for poorer-quality OCR and where there are significant insertions or deletions. For example:
Rousseau, Jean-Jacques, [1758], Lettre à Mr. d'Alembert sur les spectacles:
autrui des accusations qu'elles croient fausses; tandis qu'en d'autres pays les femmes, également coupables par leur silence et par leurs discours, cachent, de peur de représailles, le mal qu'elles savent, et publient par vengeance celui qu'elles ont inventé. Combien de scandales publics ne retient pas la crainte de ces sévères observatrices? Elles font presque dans notre ville la fonction de censeurs. C'est ainsi que dans les beaux tems de Rome, les citoyens, surveillans les uns des autres, s'accusoient publiquement par zele pour la justice; mais quand Rome fut corrompue et qu'il ne resta plus rien à faire pour les bonnes moeurs que de cacher les mauvaises, la haine des vices qui les démasque en devint un. Aux citoyens zélés succéderent des délateurs infames; et au lieu qu'autrefois les bons accusoient les méchans, ils en furent accusés à leur tour. Grâce au ciel, nous sommes loin d'un terme si funeste. Nous ne sommes point réduits à nous cacher à nos propres yeux, de peur de nous faire horreur. Pour moi, je n'en aurai pas meilleure opinion des femmes, quand elles seront plus circonspectes: on se ménagera davantage, quand on
Séance publique du 30 avril 1793, l'an II de la:
son tribunal n'exerce pas, d'ailleurs, une autorité aussi 1 mu soire qu'on pourrait le croire ; il se fait J"_ tice d'une partie de la violation des lois «j ciales ; ses vengeances sont terribles p l'homme libre, puisque la censure o lst "°" la honte et le mépris : et combien cle st* § dales publics ne retient pas la crainte m. châtiments ? Dans les beaux temps cle n°*ji les citoyens, surveillants nés les uns a es» s'accusaient publiquement par zèle p % justice. Mais quand Rome fut corrompu^ citoyens zélés succédèrent des oeiai •„ t fâmes; au lieu qu'autrefois les bons accu- -^ les méchants, ils en furent accuses tour . -, rla méEn Egypte, la censure ssu_ v moire des morts ; la comédie eut o*" B^^ des un pouvoir plus étendu sur la rep vivants. „ •* i„ t-Ole niani^ 1 * L'esprit de l'homme est fait ae te ut rtr-c, encore plus du ridicule que d'un ,»ïl u
The Rousseau passage is found in a speech titled Nécessité d'établir une censure publique par J.-P. Picqué, which does not appear to mention the title and possibly not Rousseau at all (as far as I can tell). As you can see, this is fairly messy OCR and is significantly truncated. We have a preliminary database running and will probably release this once we have the entire set and experiment further with alignment parameters.
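To give a feel for how borrowed passages can be detected despite differences in case, accents, and punctuation, here is a minimal sketch based on shared word trigrams ("shingles"). It is a crude stand-in chosen for brevity, not the alignment method behind the results above; the function names and the choice of n = 3 are assumptions.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, expand the oe ligature, strip accents, and return the word tokens."""
    text = text.lower().replace("œ", "oe")
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.findall(r"[a-z]+", plain)


def shingles(tokens, n=3):
    """Return the set of overlapping word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_shingles(source, target, n=3):
    """Return the word n-grams that occur in both passages after normalization."""
    return shingles(normalize(source), n) & shingles(normalize(target), n)


source = ("Un maudit écossais, chassé de son pays, "
          "vint changer tout en France, et gâta nos esprits.")
target = ("Un maudit Écossais, chassé de son pays, "
          "Vint changer tout en France et gâter nos esprits.")
shared = shared_shingles(source, target)
print(len(shared), "shared trigrams")
```

Exact trigram matching breaks wherever the OCR damages a word, so very noisy passages like the second example would need shorter shingles or approximate matching on top of this.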
Preliminary work that I have done on topic-based text segmentation, which Clovis followed up on in more detail (link), suggests that the individual séances may be particularly good candidates for topic segmentation, since the topics can shift around radically. Segmentation tends to do less well on continuous running text than on text with clear shifts in topic. There are a number of newer approaches than Hearst's TextTiling implementation (which I will blog about when I run them) that may be more effective.
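The core intuition behind this kind of segmentation can be sketched in a few lines: compare the vocabulary on either side of each gap between sentences and treat the gap with the lowest similarity as a likely topic shift. This is a deliberate simplification, not the full TextTiling algorithm (which smooths the scores and uses depth scores rather than a raw minimum); the function names, the window size, and the toy sentences are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_similarities(sentences, window=2):
    """Score each gap by the similarity of the word windows before and after it."""
    bags = [Counter(s.lower().split()) for s in sentences]
    scores = {}
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores[gap] = cosine(left, right)
    return scores


def weakest_gap(sentences, window=2):
    """Return the index of the sentence that starts the most likely new segment."""
    scores = gap_similarities(sentences, window)
    return min(scores, key=scores.get)


toy = ["grain tax heavy", "tax grain farmers", "army soldiers war", "soldiers army march"]
print(weakest_gap(toy))
```

On real séances one would want stopword removal and several boundaries per session rather than a single minimum.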
Finally, on the technical side, I want to experiment with LDA topic modeling. Again, Clovis' initial work on topic identification for the articles of Echo de la fabrique indicates that, if one can get good topic segments, the modeling algorithm may be fairly effective. Oddly enough, I cannot recall anyone doing the "topic two-step", where one would apply topic modeling to parts of documents split up by a topic-based segmentation algorithm. Or, I may have missed some important papers. The idea behind all of this is an attempt to build the ability to search for relatively coherent topics, either for browsing or searching.
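To make the second step of the "topic two-step" concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python so that it is self-contained. It treats each segment as a document. Everything here is an assumption for illustration: the `gibbs_lda` name, the hyperparameters `alpha` and `beta`, the iteration count, and the invented toy segments. A real experiment would use an established library and actual segments.

```python
import random
from collections import Counter


def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (document, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Invented toy segments standing in for the output of a segmentation step.
segments = [["grain", "impot", "grain"], ["impot", "grain"],
            ["armee", "guerre", "armee"], ["guerre", "armee"]]
doc_topic, topic_word = gibbs_lda(segments, n_topics=2, n_iter=50)
for row in doc_topic:
    print(row)
```

Each row of `doc_topic` counts how many tokens of a segment were assigned to each topic, which is the raw material for labeling segments by their dominant topic.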
So far, I have been talking about some more technical experimentation to see if certain algorithms, or general approaches, might be effective on a large and fairly complex document space. While I used the AP for significant work when I was doing Revolutionary studies, my initial systematic interest was in the General Cahiers de doléances. For my dissertation, and some later articles ("The Language of Enlightened Politics: The Société de 1789 in the French Revolution" in Computers and the Humanities 23 (1989): 357-64), I keyboarded a small sample of the Cahiers (don't ever, ever do that as a poor graduate student :-) to serve as a baseline corpus to look at changes in Revolutionary discourse over time, with specific reference to the materials published by the Société de 1789. I suspect that a statistical analysis of the language in the cahiers may bring to light interesting differences between the Estates, urban/rural, and north/south. For this set of tasks, I am planning to use the comparative functions of PhiloMine to examine the degree to which these divisions can be identified using machine learning approaches and, if so, what kinds of lexical differences can be identified. It would be equally interesting to compare a more linguistic analysis to the content analysis results found in Gilbert Shapiro et al., Revolutionary Demands: A Content Analysis of the Cahiers de Doléances of 1789.
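As a small illustration of what a lexical comparison between two groups of cahiers could look like, the sketch below ranks words by a smoothed log-ratio of their relative frequencies in two sub-corpora. This is not PhiloMine's method; the `log_ratio` function, the add-one smoothing, and the tiny word lists are all invented for the example.

```python
import math
from collections import Counter


def log_ratio(corpus_a, corpus_b):
    """Smoothed log2 ratio of relative word frequencies (positive means typical of corpus_a)."""
    counts_a, counts_b = Counter(corpus_a), Counter(corpus_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = len(corpus_a) + len(vocab)  # add-one smoothing over the shared vocabulary
    total_b = len(corpus_b) + len(vocab)
    return {
        w: math.log2(((counts_a[w] + 1) / total_a) / ((counts_b[w] + 1) / total_b))
        for w in vocab
    }


# Invented toy word lists, not drawn from the actual cahiers.
clergy = "dime eglise dime roi".split()
third = "impot roi impot pain".split()
scores = log_ratio(clergy, third)
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 2))
```

Words near zero are shared vocabulary, while the extremes are candidates for the kind of lexical differences between Estates that a classifier might pick up.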
I will, as promised (or threatened) above, try to blog good results and failures of these efforts here so we can all consider them. Remember, Edison is credited with saying, while trying to invent the light bulb, “I have not failed. I've just found 10,000 ways that won't work.”