Encyclopédie under KinoSearch
Mark Thursday, October 29, 2009
Here are a few examples which you can block-copy in:
artisan laboureur ouvrier paysan
malade symptome douleur estomac
peuple pays nation ancien république décadence
The first thing to notice is search speed. Lucene is known to be robust, massively scalable, and fast, and the KinoSearch implementation is certainly very fast. A six-term search returns in 0.35 seconds real time and under a tenth of a second of system time, measured with time on the command line. I did not time the indexing run, but it took around 10 minutes. [Addition: by reading 147 TEI files rather than 77,000 split files, the indexing time for the Encyclopédie falls to (using time) real 2m45.9s, user 2m33.8s, sys 0m11.1s.]
The KinoSearch developer, Marvin Humphrey, has a splendid slide show outlining how it works, with specific reference to the kinds of parameters, such as stemmers and stopwords, that one needs to consider, as well as an overview of the indexing scheme. Clovis and I thought this might be the easiest way to begin working with Lucene, since KinoSearch is a Perl module with C components, so it is easy to install and get running. Given its performance and utility, I suspect that we will be using KinoSearch extensively for projects where ranked relevancy results are of interest. These might include structured texts, such as newspaper and encyclopedia articles, and possibly large collections of uncorrected OCR materials which may not be suitable for the text analysis applications supported by PhiloLogic. Also, on first review, the code base is very nicely designed and, since it has many of the same kinds of functions as PhiloLogic, strikes me as a really fine model of how we might want to renovate PhiloLogic.
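To give a feel for what parameters like stemmers and stopwords actually do to a query or a document before it hits the index, here is a toy sketch in Python. The stopword list and suffix-trimming "stemmer" are deliberately crude stand-ins for illustration only -- KinoSearch's own analysis chain (and the stemmer package mentioned below) is far more careful.

```python
# Toy sketch of an analysis pipeline: lowercase -> tokenize -> drop
# stopwords -> stem. Word lists and stemming rules are illustrative,
# not KinoSearch's actual internals.

FRENCH_STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et"}

def crude_french_stem(token):
    # Trim a few common French suffixes. A real stemmer (e.g. Snowball's
    # French algorithm) handles many more cases and exceptions.
    for suffix in ("issement", "ement", "ation", "eurs", "eur", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()
    return [crude_french_stem(t) for t in tokens if t not in FRENCH_STOPWORDS]

print(analyze("Le laboureur et les ouvriers du pays"))
# → ['labour', 'ouvrier', 'pay']
```

The point of running both queries and documents through the same pipeline is that "laboureur" and "laboureurs" collapse to one index term, so either form matches.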
For this experiment, I took the articles as individual TEI documents, which Clovis had prepared for other work. For each article, I grabbed the headword and the PhiloLogic document id, which are loaded as fielded data. The rest of the article is stripped of all encoding and loaded in. It would be perfectly simple to read the data from our normal TEI files instead. We could simply add a script that loads source data from a PhiloLogic database build, giving us a different kind of search, which would need its own search box/form.
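The per-article preparation described above can be sketched roughly as follows. This is a hedged illustration in Python, not the actual loader: the sample markup and attribute names are invented, and real Encyclopédie articles vary considerably.

```python
# Sketch of per-article prep: pull the headword and a document id out of
# a TEI article (fielded data), then strip all remaining encoding so only
# plain text goes to the indexer. Sample markup is hypothetical.
import re
import xml.etree.ElementTree as ET

SAMPLE = """<div1 id="V7-1234" type="article">
  <head>LABOUREUR</head>
  <p>LABOUREUR, s. m. celui qui cultive la terre...</p>
</div1>"""

def prepare_article(tei):
    root = ET.fromstring(tei)
    doc = {
        "philo_id": root.get("id"),                  # fielded data
        "head": root.findtext("head", "").strip(),   # headword field
    }
    # Strip all encoding: keep only the text content of the article.
    body = " ".join(root.itertext())
    doc["content"] = re.sub(r"\s+", " ", body).strip()
    return doc

print(prepare_article(SAMPLE))
```

Each resulting record maps straight onto fielded documents in the index, with the headword and id searchable as fields and the stripped text as the main content.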
I have not played at all with the parameters, and I can imagine that we would want to perform some functions on input, such as applying simple normalization rules, since it uses a stemmer package also by Marvin Humphrey. Please email me, post comments, or add a blog entry here if you see problems, particularly search oddities, have ideas about other use cases, or have more general interface notions. I will be writing a more generalized loader and query script -- with paging, configurable numbers of hits per page, filtering by minimum relevancy score, and a version of the PhiloLogic object fetch that would try to highlight matching terms -- and moving that over to our main servers.
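The query-side plumbing I have in mind -- minimum-score filtering plus paging -- is simple enough to sketch. The hit structure and score scale below are made up for illustration; the real script would consume whatever the searcher returns.

```python
# Minimal sketch of the planned query-side features: filter hits by a
# minimum relevancy score, then slice the survivors into pages.

def page_hits(hits, min_score=0.0, page=1, per_page=10):
    kept = [h for h in hits if h["score"] >= min_score]
    start = (page - 1) * per_page
    return {
        "total": len(kept),
        "page": page,
        "hits": kept[start : start + per_page],
    }

# Fake ranked results with steadily decreasing scores.
hits = [{"doc": f"art{i}", "score": round(1.0 - i * 0.05, 2)} for i in range(25)]
result = page_hits(hits, min_score=0.5, page=2, per_page=5)
print(result["total"], [h["doc"] for h in result["hits"]])
# → 11 ['art5', 'art6', 'art7', 'art8', 'art9']
```

Filtering before paging matters here: the reported total reflects only hits above the relevancy floor, so the pager never offers pages of junk results.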