Encyclopédie under KinoSearch

Mark, Thursday, October 29, 2009

One of the things that I have wanted to do for a while is to examine implementations of Lucene, both as a search tool to complement PhiloLogic and possibly as a model for future PhiloLogic renovations. Late this summer, Clovis identified a particularly nice open source Perl implementation of Lucene called KinoSearch. This looks like it will fit both bills very nicely indeed. As a little experiment, I loaded 73,000 articles (and other objects) from the Encyclopédie and cooked up a super simple query script. This allows you to type in query words and get links to articles sorted by their relevancy to your query (the italicized number next to the headword). At this time, I am limiting results to the top 100 "hits". Words should be lower case, accents are required, and words should be separated by spaces. Try it:

[Search form: a "Query Words" field with a "Require all words" option]

Here are a couple of examples which you can copy and paste in:

artisan laboureur ouvrier paysan
malade symptome douleur estomac
peuple pays nation ancien république décadence
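
To make the behaviour of the query script concrete (relevancy-sorted hits, an optional "require all words" mode, and a cap of 100 results), here is a minimal Python sketch. It is not KinoSearch's code or its scoring formula: the function name `rank`, the toy documents, and the generic TF-IDF weight are all assumptions made for illustration.

```python
import math
from collections import Counter

def rank(docs, query, require_all=False, limit=100):
    """Rank documents against space-separated query words.

    docs maps a headword to its plain text. Matching is exact on
    whitespace-separated tokens, so case and accents must match.
    The score is a generic TF-IDF sum (an assumption, not KinoSearch's formula).
    """
    terms = list(dict.fromkeys(query.split()))  # drop repeated query words
    counts_by_doc = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents does each query word occur?
    df = {t: sum(1 for c in counts_by_doc.values() if t in c) for t in terms}

    hits = []
    for doc_id, counts in counts_by_doc.items():
        matched = [t for t in terms if counts[t] > 0]
        if not matched or (require_all and len(matched) < len(terms)):
            continue
        score = sum(counts[t] * math.log(1 + n_docs / df[t]) for t in matched)
        hits.append((doc_id, score))

    hits.sort(key=lambda h: h[1], reverse=True)  # most relevant first
    return hits[:limit]

# Toy documents, invented for the example.
docs = {
    "ARTISAN": "artisan ouvrier ouvrier métier",
    "LABOUREUR": "laboureur paysan terre",
    "PEUPLE": "peuple nation artisan laboureur",
}
for headword, score in rank(docs, "artisan laboureur"):
    print(headword, round(score, 3))
```

With `require_all=True`, only documents containing every query word are kept, which corresponds to the "Require all words" option on the form.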

The first thing to notice is search speed. Lucene is known to be robust, massively scalable, and fast. The KinoSearch implementation is certainly very fast. A six-term search returns in 0.35 seconds of real time and less than 1/10 of a second of system time, as measured with time on the command line. I did not time the indexing run, but I think it took 10 minutes or so. [Addition: by reading 147 TEI files rather than 77,000 split files, the indexing time for the Encyclopédie falls to (using time) real 2m45.9s, user 2m33.8s, sys 0m11.1s.]

The KinoSearch developer, Marvin Humphrey, has a splendid slide show outlining how it works, with specific reference to the kinds of parameters, such as stemmers and stopwords, that one needs to consider, as well as an overview of the indexing scheme. Clovis and I thought this might be the easiest way to begin working with Lucene, since it is a Perl module with C components, so it is easy to install and get running. Given the performance and utility of KinoSearch, I suspect that we will be using it extensively for projects where ranked relevancy results are of interest. These might include structured texts, such as newspaper and encyclopedia articles, and possibly large collections of uncorrected OCR materials which may not be suitable for the text analysis applications supported by PhiloLogic. Also, on first review, the code base is very nicely designed and, since it has many of the same kinds of functions as PhiloLogic, strikes me as being a really fine model of how we might want to renovate PhiloLogic.

For this experiment, I took the articles as individual documents in TEI, which Clovis had prepared for other work. For each article, I grabbed the headword and PhiloLogic document id, which are loaded as fielded data. The rest of the article is stripped of all encoding and loaded in. It would be perfectly simple to read the data from our normal TEI files. We could simply add a script that would load source data from a PhiloLogic database build to provide a different kind of search, which would need its own search box/form.
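
The loading step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the loader actually used: it assumes each article is a small well-formed XML fragment whose headword sits in a `<head>` element, and the function name `article_to_record`, the field names, and the sample document id are hypothetical.

```python
import xml.etree.ElementTree as ET

def article_to_record(tei_xml, philo_id):
    """Turn one TEI article into a flat record ready for indexing.

    Assumption: the headword is the text of the first <head> element.
    The content field is the article text with all markup removed.
    """
    root = ET.fromstring(tei_xml)
    head = root.find(".//head")
    headword = ""
    if head is not None:
        headword = " ".join("".join(head.itertext()).split())
    # Join text nodes with spaces so words in adjacent elements do not fuse,
    # then collapse runs of whitespace. Punctuation may end up space-padded,
    # which is harmless for a word index.
    content = " ".join(" ".join(root.itertext()).split())
    return {"headword": headword, "philo_id": philo_id, "content": content}

# Invented sample article and document id.
sample = (
    '<div1 type="article"><head>MOINE</head>'
    "<p>s. m. (<hi>Hist. ecclés.</hi>) nom qui signifie "
    "proprement <i>solitaire</i>.</p></div1>"
)
record = article_to_record(sample, "12:345")
print(record["headword"])
print(record["content"])
```

Each record would then be handed to the indexer, with the headword and document id stored as fields and the content analyzed for full-text search.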

I have not played at all with parameters, and I can imagine that we would want to perform some functions on input, such as applying simple normalization rules, since it uses a stemmer package (also by M. Humphrey). Please email me, post comments, or add a blog entry here if you see problems, particularly search oddities, or have ideas about other use cases or more general interface notions. I will be writing a more generalized loader and query script (with paging, a configurable number of hits per page, and filtering by minimum relevancy score), looking at a version of the PhiloLogic object fetch which would try to highlight matching terms, and moving all of that over to our main servers.
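
As a rough idea of what the paging and minimum-score filtering mentioned above might look like, here is a small Python sketch. The function name `select_hits`, its parameters, and the sample scores are hypothetical; nothing here reflects the planned script itself.

```python
def select_hits(hits, page=1, per_page=25, min_score=0.0):
    """Return (hits on the requested page, total hits kept).

    hits is a list of (headword, score) pairs. Hits scoring below
    min_score are dropped before paging; pages are numbered from 1.
    """
    kept = [h for h in hits if h[1] >= min_score]
    kept.sort(key=lambda h: h[1], reverse=True)  # best score first
    start = (page - 1) * per_page
    return kept[start:start + per_page], len(kept)

# Invented scores for illustration.
hits = [("MOINE", 2.4), ("ABBAYE", 1.1), ("CLOÎTRE", 0.3), ("ORDRE", 1.7)]
first_page, total = select_hits(hits, page=1, per_page=2, min_score=1.0)
print(first_page, total)
```

Returning the total alongside the page lets the interface show how many pages remain without running the filter a second time.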

3 comments:

Frédéric Glorieux, October 30, 2009 at 4:19 AM
Hi, Mark

Everything I say here comes from some experience with Java Lucene and the excellent book “Lucene in Action” (http://www.manning.com/hatcher3/). I have never used KinoSearch; I hope the concepts are similar.

About Porter Stemmer.

I think it's not the best tool for the audience of the Encyclopédie, for example the ATILF folks. Most of us would probably want better control over our results, to avoid surprises like “jouer” finding “jouir”. Maybe the choice could be offered, with two full-text fields configured and indexed: one for exact matches, the other analysed by a stemmer. Lucene people seem to prefer KStem now (http://ciir.cs.umass.edu/pubfiles/ir-35.pdf); I have not tested it or verified whether there is something for French. I would be very happy to work with you on this topic.

Control the boost for better relevancy.

A little story will explain a lot. I took part in tuning Lucene for a site aggregating resources about XML, such as specs, articles, courses, blog entries, and mailing lists. The goal was to make the XML spec the first result for the query “XML”. Instead, the first result was a mail with the title « Re: XML Re: XML Re: XML Re: XML... ». The word XML was definitely very frequent in that document field. The workaround was to modify the document boost at indexing time, according to a scale based on document type (higher for specs than for mails). For the Encyclopédie, maybe scores could be adjusted by size and, why not, by author. There are messages on the Lucene list about relevancy algorithms that use dynamic data to get a popularity count for the relevant items. A fast first step could be to provide a kind of abstract for long articles (the first paragraph?), so that a word in the abstract automatically carries more weight.

Little bugs

* encoding

Tim, October 30, 2009 at 3:33 PM
I think that this is really valuable and I'm looking forward to running this up against PhiloLogic results in production.

A couple of search oddities:

-words appear to break on high-level characters. Thus, a search for "moine" returns quite a few results for "témoin", "témoins", etc. In search results for "gré", the third result was "grêle". In search results for "pièce", quite a few of the hits were for things like "pié: ces" and "l' épi. Ces".

-on a related note, accented characters are not highlighted in search results when they appear at a word boundary. I noticed this in a search for "volonté"...

Mark, November 24, 2009 at 4:22 PM
I was just backtracking and noticed that KStem appears to be available for English only. Too bad. It looks like a significant improvement.

ARTFL Project Research Blog
