The capability of generating weighted vectors for arbitrary texts also makes it possible to decompose individual documents into pieces and explore the relationships between these text pieces. [...] Such insights can be used for picking only the "good" parts of the document to be presented to the reader. Salton and Singhal further argued that manual link creation would be impractical for huge amounts of text, but these conclusions may have had limited influence given the general interest at that time in human-generated hypertext links on the WWW.
Based on earlier work using PhiloMine, we have seen a number of "interesting" -- and at times unexpected -- connections between articles in the Encyclopédie, often between previously unrelated articles, if by unrelated we mean articles with different authors, different classes of knowledge, and few cross-references (renvois) between them. One might think of this kind of similarity measure as an intertextual discovery tool, in which the system would propose articles possibly related to the one being read.
The Vector Space Model works by comparing a query vector to every vector in a corpus, an expensive calculation that is not always suitable for real-time use. In this experiment, I recast the VSM implementation in PhiloMine as a batch job to generate a database of the 20 most similar articles for each of 27,753 Encyclopédie articles (those with 100 or more words). To do this, I pruned features (word stems) occurring in more than 8,325 or in fewer than 41 articles, resulting in a vector size of 10,431 features. I used a standard French word stemmer to reduce lexical variation and a log normalization function to handle variations in article size. The task took about 17 hours to run.
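At its core, the batch calculation is simply a cosine comparison of sparse stem vectors. Here is a minimal sketch in Perl of what one such step might look like; the hash layout and the helper names (log_weight, cosine, top_matches) are illustrative assumptions, not the actual PhiloMine code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Log-normalized term weight: 1 + log(tf) damps the effect of article length.
sub log_weight {
    my ($tf) = @_;
    return $tf > 0 ? 1 + log($tf) : 0;
}

# Cosine similarity between two sparse vectors (hashrefs of stem => weight).
sub cosine {
    my ($v1, $v2) = @_;
    my ($dot, $norm1, $norm2) = (0, 0, 0);
    $norm1 += $_ ** 2 for values %$v1;
    $norm2 += $_ ** 2 for values %$v2;
    for my $stem (keys %$v1) {
        $dot += $v1->{$stem} * $v2->{$stem} if exists $v2->{$stem};
    }
    return 0 unless $norm1 && $norm2;
    return $dot / (sqrt($norm1) * sqrt($norm2));
}

# Batch step: score one article against all others and keep the top N ids.
# $vectors maps article id => hashref of stem => log-normalized weight.
sub top_matches {
    my ($query_id, $vectors, $n) = @_;
    my %score;
    for my $id (keys %$vectors) {
        next if $id eq $query_id;
        $score{$id} = cosine($vectors->{$query_id}, $vectors->{$id});
    }
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    $n = @ranked if $n > @ranked;
    return @ranked[0 .. $n - 1];
}
```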
Update (December 7): I have replaced the VSM build above with the same build on 39,200 articles -- all articles with 60 or more words -- which took about 29 hours to run. I pruned features found in more than 11,200 documents or in fewer than 50, leaving 9,710 features. This may change some results by adding more small articles. Note that this is about as large a VSM task as can be performed in memory using Perl hashes, since anything larger runs out of memory. If we want to go larger, we would probably need to store the vectors on disk and tie them to Perl hashes.
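One way to do that would be Perl's tie mechanism over an on-disk Berkeley DB, serializing each vector before it is written. This is only a sketch under that assumption; the file name, article id, and stem weights are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use Storable qw(freeze thaw);

# Keep article vectors in an on-disk Berkeley DB file rather than an
# in-memory hash; each value is a serialized stem => weight hashref.
my %vectors;
tie %vectors, 'DB_File', 'encyclopedie_vectors.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie vector file: $!";

# Serialize a vector hashref before writing it to the tied hash.
sub store_vector {
    my ($article_id, $vector) = @_;
    $vectors{$article_id} = freeze($vector);
}

# Thaw the stored string back into a hashref on the way out.
sub fetch_vector {
    my ($article_id) = @_;
    return exists $vectors{$article_id} ? thaw($vectors{$article_id}) : undef;
}

# Illustrative usage only: article id and stem weights are made up.
store_vector('OUESSANT', { isle => 1.7, moeur => 2.1, habitan => 1.0 });
my $ouessant = fetch_vector('OUESSANT');

untie %vectors;    # flush and close the database when the batch run is done
```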
The results for a query show the 20 most similar articles, ranked by similarity score, where an exact match is equal to 1. For example, the article OUESSANT (Modern Geography) -- based on the 27,000-article build -- is related to the articles VERTU [0.274], Luxe [0.267], ECONOMIE ou OECONOMIE [0.265], POPULATION [0.263], CHRISTIANISME [0.261], SOCIÉTÉ [0.256], AVERTISSEMENT DES ÉDITEURS (suite) [0.255], MANICHÉISME [0.254], CYNIQUE, secte de philosophes anciens [0.254], Gout [0.250], EDUCATION [0.248] and so on. This reflects the discussion of the moral conditions of the inhabitants of the small island off the coast of Brittany.
You can give it a try using this form (again now for 39,200 articles):
[Dec 9: I added word count information for each article, so searches can be restricted to articles in a given size range. I am also now storing the top 50 matches, which you can limit, and showing matching articles that are smaller than the source article. Dec 10: added a function to display the matching stems for any pairwise comparison, for inspection; a sketch of that kind of function follows.]
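For what it is worth, here is a rough idea of what such a pairwise inspection function could look like in Perl; the name matching_stems and the ranking by contribution to the dot product are my assumptions, not necessarily how the display works.

```perl
use strict;
use warnings;

# Stems shared by two article vectors, ranked by their contribution to the
# dot product -- a quick way to see which terms drive a match.
sub matching_stems {
    my ($v1, $v2) = @_;
    my %shared;
    for my $stem (keys %$v1) {
        $shared{$stem} = $v1->{$stem} * $v2->{$stem} if exists $v2->{$stem};
    }
    return sort { $shared{$b} <=> $shared{$a} } keys %shared;
}
```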
There are a number of other options that I might add to the VSM calculations, including TF-IDF as an alternative weighting scheme and virtual normalization to further reduce lexical variation and improve the performance of the stemming algorithm. I have also thought of using Latent Semantic Analysis as another way to handle similarity weighting, but given that we have many query terms, it is not clear that LSA would help all that much.
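For reference, one common TF-IDF formulation that could stand in for the log-normalized weights is (1 + log tf) * log(N/df). A sketch in Perl, assuming document frequencies have already been counted; this is one variant among several, not a claim about what PhiloMine would use.

```perl
use strict;
use warnings;

# One common TF-IDF variant: weight = (1 + log(tf)) * log(N / df), where tf is
# the stem count in an article, df the number of articles containing the stem,
# and N the total number of articles in the corpus.
sub tfidf_vector {
    my ($counts, $df, $n_docs) = @_;
    my %weighted;
    while (my ($stem, $tf) = each %$counts) {
        next unless $tf && $df->{$stem};    # skip stems pruned from the feature set
        $weighted{$stem} = (1 + log($tf)) * log($n_docs / $df->{$stem});
    }
    return \%weighted;
}
```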
In a real production environment, I think we will add a "similar articles" link to articles in the Encyclopédie. We have talked about having users rate the quality of the similarity results. The scores assigned are somewhat helpful for ranking, but not as an absolute measure, since they vary with the size of the input article. VSM is an unsupervised learning model, so it is not clear to me that we could integrate user evaluations in any systematic fashion, but this is certainly an interesting subject for further consideration.
As always, please let me know what you think. I have a couple of general queries. I have used main and sub-articles (as well as plate legends, etc.) as the units of similarity calculation. Should I use main entries only? I also limited this to articles with more than 100 words. At 50 words, we have some 43,000 articles. Should I do this for a full implementation?
See Timothy Allen, Stéphane Douard, Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Plundering Philosophers: Identifying Sources of the Encyclopédie," Journal of the Association for History and Computing (forthcoming 2009). Also see Ceglowski, Maciej. 2003: "Building a Vector Space Search Engine in Perl," Perl.com [http://www.perl.com/pub/a/2003/02/19/engine.html].
 Salton, G., A. Wong, and C. S. Yang. 1975: "A Vector Space Model for Automatic Indexing," Communications of the ACM 18/11: 613-620.
Singhal, A. and Salton, G. 1995: "Automatic Text Browsing Using Vector Space Model," in Proceedings of the Dual-Use Technologies and Applications Conference, 318-324.