Encyclopédie: Similar Article Identification

Mark Thursday, December 03, 2009 6 comments

The Vector Space Model (VSM) is a classic approach to information retrieval. We integrated this as a standard function in PhiloMine and have used it for a number of specific research projects, such as identifying borrowings from the Dictionnaire de Trévoux in the Encyclopédie, which is described in our forthcoming paper "Plundering Philosophers" and related talks[1]. While originally developed by Gerard Salton[2] in 1975 as a model for classic information retrieval, where a user submits a query and gets results in an ranked relevancy list, the algorithm is also very useful to identify similar blocks of text, such as encyclopedia articles or other delimited objects. Indeed, this kind of use of the VSM was proposed by Salton and Singhal[3] in a paper presented months before Salton's death. They demonstrated the use of VSM to produce links between parts of documents, forming a type of automatic hypertext:

The capability of generating weighted vectors for arbitrary texts also makes it possible to decompose individual documents into pieces and explore the relationships between these text pieces. [...] Such insights can be used for picking only the "good" parts of the document to be presented to the reader.

Salton and Singhal further argued that manual link creation would be impractical for huge amounts of text, but these conclusions may have had limited influence given the general interest at that time in human generated hypertext links on the WWW.

Based on earlier work using PhiloMine, we have seen a number of "interesting" -- and at times unexpected -- connections between articles in the Encyclopédie, often drawing connections between previously unrelated articles, if by unrelated we mean having different authors, classes of knowledge and few cross-references (renvois) between them. One might consider this kind of similarity measure between articles as a kind of intertextual discovery tool, where the system would propose articles possibly related to a specific article.

The Vector Space Model functions by comparing a query vector to all of the vectors in a corpus, making it an expensive calculation, not always suitable to real time use. In this experiment, I have recast the VSM implementation in PhiloMine to function as a batch job to generate a database of 27,753 Encyclopédie articles (those with 100 or more words) with the 20 most similar articles for each article. To do this, I pruned features (word stems) which more than 8,325 and less than 41 articles, resulting in a vector size of 10,431 features. I used a standard French word stemmer to reduce lexical variation and a Log Normalization function to handle variations in article sizes. The task took about 17 hours to run.

Update (December 7): I have replaced the VSM build above with the same on 39,200 articles -- all articles with 60 or more words -- which took about 29 hours to run. I pruned features found in more than 11,200 documents and less than 50, leaving 9,710 features. This may change some results by adding more small articles. Note, this is about as large a VSM task as can be performed in memory using perl hashes, since anything large runs out of memory. If we want to go larger, probably store vectors on disk and TIE them to perl hashes.

The results for a query shows the 20 most similar articles, ranked by the similarity score, where an exact match is equal to 1. For example, the article OUESSANT (Modern Geography) -- based on 27,000 articles -- is related to the articles VERTU [0.274], Luxe [0.267], ECONOMIE ou OECONOMIE [0.265], POPULATION [0.263], CHRISTIANISME [0.261], SOCIÉTÉ [0.256], AVERTISSEMENT DES ÉDITEURS (suite) [0.255], MANICHÉISME [0.254], CYNIQUE, secte de philosophes anciens [0.254], Gout [0.250], EDUCATION [0.248] and so on. This reflects the discussion of the moral conditions of the inhabitants of the small island off the coast of Brittany.

You can give it a try using this form (again now for 39,200 articles):

[Dec 9: I added word count info for each article. You can restrict searches to articles in ranges of size. Also, now storing 50 top matches, which you can limit. Showing matching articles which are smaller than source article. Dec 10: added function to display matching stems for any pairwise comparison for inspection.]

There are a number of other options that I might add to the VSM calculations, including using TF-IDF as an alternative normalization weighting scheme and use of virtual normalization to again reduce lexical variations and improve the performance of the stemming algorithm. I have also thought of using Latent Semantic Analysis as another way to handle similarity weighting, but given that we have many query terms, it is not clear that LSA would help all that much.

In a real production environment, I think we will add a "similar article link" from articles in the Encyclopédie. We have talked about having users rank the quality of the similarity performance. The scores assigned are somewhat helpful in ranking, but not in assessing an absolute number, since they can vary by the size of the input article. VSM is an unsupervised learning model. It is not clear to me that we could integrate user evaluations in any systematic fashion, but this is certainly an interesting subject of further consideration.

As always, please let me know what you think. I have a couple of general queries. I have used main and sub articles (as well plate legends, etc.) as units of similarity calculation. Should I use main entries only? I also limited this to articles with more than 100 words. At 50 words, we have some 43,000 articles. Should I do this for a full implementation?

References

[1] See Timothy Allen, Stéphane Douard, Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Plundering Philosophers: Identifying Sources of the Encyclopédie", Journal of the Association for History and Computing (forthcoming 2009). Also, see Ceglowski, Maxiej. 2003: "Building a Vector Space Search Engine in Perl", Perl.com [http://www.perl.com/pub/a/2003/02/19/engine.html].

[2] Salton, G., A. Wong, and C. S. Yang. 1975: "A Vector Space Model for Automatic Indexing," Communications of the ACM 18/11: 613-620.

[3] Singhal, A. and Salton, G. 1995: "Automatic Text Broswing Using Vector Space Model" in Proceedings of the Dual-Use Technologies and Applications Conference 318-324.

6 comments:

Frédéric GlorieuxDecember 3, 2009 at 2:38 PM
I can't resist to the pleasure of a first answer : really interesting !

I was for example searching for poetry, and I found that “oriental poetry” was related to Cartesianism ! No sense, but real pleasure to read. The key was probably in a word, “distinctement”, so specific to Descartes, strangely used by Voltaire to translate a “rose of Saadi” : « Il sait distinctement ce qui ne fut jamais. ». Please, don't try to be more accurate with folksonomy and all kind of collective intelligence, keep the wisdom of blind order. All moderns will “correct” this “error” of the machine, losing this essential fact : Voltaire is really an awful poet. More seriously, the problem may also coming from the difference of size.

About the interface, it could be nice to jump from a list to another, instead of a default access to the article (a link [go to article] could be added). Is there a way to highlight some of the features explaining the choice, like the most similar words in context ? This kind of snippet could be highly valuable to understand the reason of the algorithm, and give desire to jump to the article.

Mark, it could become a high success widget, for wikipedia, a blog, a frame called ”not so random links”.
MarkDecember 4, 2009 at 3:06 PM
Hi Frédéric,

Thank you. First, I added a "Get similar articles" for each of the articles in the list. Good suggestion. Now we can all get really lost. :-)

Most of the less obvious connections are on short articles, were just a few matching words will work. I set the limit for this run at 100 words, including function words, resulting in very sparse vectors. And I am currently doing a run on articles with at least 60 words. It may be the kind of thing that we would want to test on sizes of comparison articles. I can probably generate matching terms (stems actually), but this would probably be for internal experimentation, not for users.

I think you are right about needing to be careful about social tagging. In some ways, folks in the humanities may be less interested in following what other people have suggested. But I don't have a clear picture of just how one might leverage social tagging on this anyway. Maybe just interesting or related vs uninteresting or unrelated, and see if my general assumption that the huge differences is sizes of articles results in less obvious links.

M
Frédéric GlorieuxDecember 5, 2009 at 6:45 AM
« Now we can all get really lost. :-) »

Clovis, do you feel the same pleasure to be lost in our language ?

Little facts :

“FIEF” is linked to “ROMAIN empire”, because this article make an history of the medieval institution, going back to romans, but the “fief” article seems much more linked to law things.

“BISTRE” is a brown color used in painting, the list of similar articles seems quite relevant, but miss “Lavis”, a pictural technique using Bistre, considered as “fortification” topic by Encyclopédie because of maps ; and the list is giving “Eau-de-vie (fabrication de)”, because the two articles explain some chemistry.

Very interesting, but the user should be alerted that these “invisible connections” could not become topics. It seems to me the best approach in history, because no assumptions is made from our categories to classify past knowledge.

PS: if stems or other clues could be given on the calculation of similarity, I would be glad of that.
Robert MorrisseyDecember 6, 2009 at 6:17 PM
I did a search on "peuple" and got back in first position:
MONT AIGUILLE, & par le peuple, montagne inaccessible

This is a short article and it is flagged for obvious reasons. That said, it has nothing to do with "peuple". Would you loose too much if you filtered out this type of referral.

If we are going to have each "similar" article having a set of smilar articles attached to it, I believe it would be useful to document clearly what comes back on the results page.
TimDecember 6, 2009 at 10:53 PM
Would it be difficult to run up separate builds that used shorter articles or only main articles? I'd be interested in comparing the results...
MarkDecember 7, 2009 at 10:08 AM
As I noted in my update to this post (in blue), I have run this on 39,200 articles.

Robert's question appears to relate to the fact that "MONT AIGUILLE, & par le peuple, montagne inaccessible" is incorrectly identified. The articles reads "MONT AIGUILLE, (Géog.) & par le peuple, montagne inaccessible". So it should be head word MONT AIGUILLE only. The remainder probably got picked up as part of the head word because the "montaigne inaccessible" is italicized. Not sure filtering is required in this example.

In response to Tim's suggestion, for the next build, I will add size of article by number of words as a search criteria and filter criteria, so you can limit sizes to see what happens. I will also try to build a model to get matching stems for each comparison, probably as a related table.

Keep the comments coming. Thanks!

ARTFL Project Research Blog

Encyclopédie: Similar Article Identification

6 comments:

Labels

Popular Posts

Blog Archive

Developed by ARTFL