Encyclopédie under KinoSearch
Here are a few examples which you can block-copy in:
artisan laboureur ouvrier paysan
malade symptome douleur estomac
peuple pays nation ancien république décadence
The first thing to notice is search speed. Lucene is known to be robust, massively scalable, and fast, and the KinoSearch implementation is certainly very fast. A six-term search returns in 0.35 seconds real time and less than a tenth of a second of system time, measured with time on the command line. I did not time the indexing run, but I think it took 10 minutes or so. [Addition: by reading 147 TEI files rather than 77,000 split files, the indexing time for the Encyclopédie falls to (using time) real 2m45.9s, user 2m33.8s, sys 0m11.1s.]
The KinoSearch developer, Marvin Humphrey, has a splendid slide show outlining how it works, with specific reference to the kinds of parameters one needs to consider, such as stemmers and stopwords, as well as an overview of the indexing scheme. Clovis and I thought this might be the easiest way to begin working with Lucene: since it is a Perl module with C components, it is easy to install and get running. Given the performance and utility of KinoSearch, I suspect that we will be using it extensively for projects where ranked relevancy results are of interest. These might include structured texts, such as newspaper and encyclopedia articles, and possibly large collections of uncorrected OCR materials which may not be suitable for the text analysis applications supported by PhiloLogic. Also, on first review, the code base is very nicely designed and, since it has many of the same kinds of functions as PhiloLogic, strikes me as a really fine model for how we might want to renovate PhiloLogic.
For this experiment, I took the articles as individual TEI documents, which Clovis had prepared for other work. For each article, I grabbed the headword and the PhiloLogic document id, which are loaded as fielded data. The rest of the article is stripped of all encoding and loaded in. It would be perfectly simple to read the data from our normal TEI files. We could simply add a script that loads source data from a PhiloLogic database build, adding a different kind of search, which would need its own search box/form.
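For readers curious what the loading step looks like, here is a minimal sketch in the style of the KinoSearch 0.x documentation. The field names (headword, philo_id, text) and the @articles structure are my invention for illustration; the exact method names and parameters may differ depending on your KinoSearch version, so treat this as an outline rather than the actual loader.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;

# PolyAnalyzer chains a tokenizer, case folder, and stemmer;
# 'fr' selects the French stemmer.
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'fr' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/invindex',    # hypothetical path
    create   => 1,
    analyzer => $analyzer,
);

# Fielded data: headword and PhiloLogic document id, plus the
# article text stripped of encoding.
$invindexer->spec_field( name => 'headword' );
$invindexer->spec_field( name => 'philo_id', analyzed => 0 );
$invindexer->spec_field( name => 'text' );

# One document per article; @articles would come from the TEI parse.
my @articles = ();    # placeholder
for my $article (@articles) {
    my $doc = $invindexer->new_doc;
    $doc->set_value( headword => $article->{headword} );
    $doc->set_value( philo_id => $article->{philo_id} );
    $doc->set_value( text     => $article->{text} );
    $invindexer->add_doc($doc);
}

$invindexer->finish;
```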
I have not played at all with the parameters, and I can imagine that we would want to perform some functions on input, such as applying simple normalization rules, since it uses a stemmer package also by Mr. Humphrey. Please email me, post comments, or add a blog entry here if you see problems, particularly search oddities, or have ideas about other use cases or more general interface notions. I will be writing a more generalized loader and query script -- with paging, numbers of hits per page, filtering by minimum relevancy scores, and a version of the PhiloLogic object fetch which would try to highlight matching terms -- and moving that over to our main servers.
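The query side of that generalized script could be sketched as follows, again following the KinoSearch 0.x examples rather than any code from this post. The paging and minimum-score logic, and the field names, are assumptions of mine; check them against the Searcher and Hits documentation for your installed version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;

# Must match the analyzer used at index time.
my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'fr' );

my $searcher = KinoSearch::Searcher->new(
    invindex => '/path/to/invindex',    # hypothetical path
    analyzer => $analyzer,
);

my $hits = $searcher->search( query => 'peuple pays nation' );

# Paging: fetch one page of results, e.g. hits 0..9 for page one.
my ( $page, $per_page, $min_score ) = ( 1, 10, 0.1 );
$hits->seek( ( $page - 1 ) * $per_page, $per_page );

while ( my $hit = $hits->fetch_hit_hashref ) {
    next if $hit->{score} < $min_score;    # relevancy floor
    printf "%.3f  %s  (%s)\n",
        $hit->{score}, $hit->{headword}, $hit->{philo_id};
}
```

The philo_id field returned with each hit is what would let a wrapper script hand off to the PhiloLogic object fetch for display and term highlighting.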
Hi, Mark
All that I will say comes from some experience with Java Lucene and this excellent book, "Lucene in Action": http://www.manning.com/hatcher3/. I have never used KinoSearch; I hope the concepts are similar.
About the Porter stemmer.
I think it is not the best tool for the Encyclopédie's audience, for example the ATILF folks. Most users would probably want tighter control over their results, to avoid surprises like “jouer” finding “jouir”. Maybe the choice could be offered, so you would have two fulltext fields configured and indexed, one for exact match, the other analyzed by a stemmer. Lucene people now seem to prefer KStem, http://ciir.cs.umass.edu/pubfiles/ir-35.pdf; I have not tested it or verified whether there is something for French. I would be very happy to work with you on this topic.
Control the boost for better relevancy.
A little story will explain a lot. I participated in the Lucene tuning for a site aggregating resources about XML: specs, articles, courses, blog entries, and mailing lists. The goal was to put the XML spec first in the results for the query “XML”. Instead, the first result was a mail titled « Re: XML Re: XML Re: XML Re: XML... »; the word XML was definitely very frequent in that document field. The workaround was to modify the document boost at indexing time, according to a scale based on document type (higher for specs than for mails). For the Encyclopédie, maybe scores could be modified by size and, why not, author. There are messages on the Lucene list about relevancy algorithms that use dynamic data to compute a popularity count for the relevant items. A fast first step could be to provide a kind of abstract for long articles (the first paragraph?), so that a word in the abstract will automatically carry more weight.
Little bugs
* encoding
I think that this is really valuable, and I'm looking forward to running it up against PhiloLogic results in production.
A couple of search oddities:
- words appear to break on accented (high-bit) characters. Thus, a search for "moine" returns quite a few results for "témoin", "témoins", etc. In search results for "gré", the third result was "grêle". In search results for "pièce", quite a few of the hits were for things like "pié: ces" and "l' épi. Ces".
- on a related note, accented characters are not highlighted in search results when they appear at a word boundary. I noticed this in a search for "volonté"...
I was just backtracking and noticed that KStem appears to be available for English only. Too bad; it looks like a significant improvement.