The number of matching terms for small articles can, of course, be very small. For example, the article "Tout-Bec" (62 words) is left with four stems [amer 1|oiseau 2|ornith 1|bec 3]. The most similar article, Rhinoceros (Hist. nat. Ornith.) -- remember, only the main article here -- matches on three stems:
word    frq1  frq2
bec        3     5
oiseau     2     2
ornith     1     1

Are these similar? Well, both are very small articles that refer to kinds of rare birds notable for their beaks, one with a very large beak and one that looks like it has two or more beaks. It is also important to note that "ornith" (the class of knowledge), which appears in both, is picked up in this example. The next article down (Pipeliene) matches on:
word    frq1  frq2
amer       1     1
bec        3     1
oiseau     2     2

The third most similar in this example is "Connoissance des Oiseaux par le bec & par les pattes.", a plate legend with, as you would expect, lots of beaks. This matches on two stems, bec and oiseau.
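The matching stems listed above are just the intersection of two stem-frequency tables. Here is a minimal sketch of that step in Python. The function name `matching_stems` and the dictionary layout are my own assumptions for illustration, not the code actually used. For the two candidate articles I include only the stems reported as matching, since their other stems are not given here.

```python
def matching_stems(query, candidate):
    """Return {stem: (freq_in_query, freq_in_candidate)} for the shared stems."""
    shared = sorted(set(query) & set(candidate))
    return {stem: (query[stem], candidate[stem]) for stem in shared}


# Stems left for "Tout-Bec" after pruning, as listed in the text.
tout_bec = {"amer": 1, "oiseau": 2, "ornith": 1, "bec": 3}

# Assumption: only the matching stems are known for these two articles;
# their remaining stems are not listed in the text and are left out.
rhinoceros = {"bec": 5, "oiseau": 2, "ornith": 1}
pipeliene = {"amer": 1, "bec": 1, "oiseau": 2}

for name, candidate in [("Rhinoceros", rhinoceros), ("Pipeliene", pipeliene)]:
    print(name)
    print(f"  {'word':<8}{'frq1':>4}{'frq2':>5}")
    for stem, (frq1, frq2) in matching_stems(tout_bec, candidate).items():
        print(f"  {stem:<8}{frq1:>4}{frq2:>5}")
```

Sorting the shared stems keeps the output stable from run to run, which makes it easier to compare match lists by eye.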
It seems that the size of the query article, now that I have eliminated many function words and other extraneous data, has a significant impact. The larger the article, the more possible matches you will get (Zipf's Law applies). Longer articles will tend to be most similar to other longer articles, and shorter articles will match better to shorter ones. So, similarity would appear to be a function of the relative frequencies of common features and the length of the articles. We saw this in our original examination of the Encyclopédie and the Dictionnaire de Trévoux, and had built in some restrictions in terms of size, as well as comparing articles with the same first letter rather than all to all. As far as I can tell, the kind of feature pruning shown here does not have a significant impact on larger articles.
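To make the length effect concrete, here is a small sketch that scores pairs with cosine similarity over raw stem counts. This is an assumption on my part: it is one common way to compute a VSM score, not necessarily the weighting used for these results. The "long" candidate is entirely hypothetical. It shares the same three stems with the query but is padded with 200 invented non-matching stems to stand in for a longer article.

```python
import math


def cosine(a, b):
    """Cosine similarity between two stem-frequency dictionaries."""
    dot = sum(freq * b[stem] for stem, freq in a.items() if stem in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


# Query: the four stems left for "Tout-Bec".
short_query = {"amer": 1, "oiseau": 2, "ornith": 1, "bec": 3}

# Hypothetical short candidate: only the three shared stems.
short_candidate = {"bec": 5, "oiseau": 2, "ornith": 1}

# Hypothetical long candidate: same shared stems plus 200 invented filler stems.
filler = {f"stem{i}": 1 for i in range(200)}
long_candidate = {**short_candidate, **filler}

print("short vs short:", round(cosine(short_query, short_candidate), 3))
print("short vs long: ", round(cosine(short_query, long_candidate), 3))
```

With identical overlap, the longer candidate scores lower because its vector norm grows with every extra stem. That is consistent with short articles matching best to other short articles.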
User feedback might be significant in determining just how many features and what kinds of features are required to get more interesting matches. For any pair, we could store the VSM score, the sizes, and the matching features along with the user rating of the match. That might generate some actionable data for future applications.
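As a rough idea of what such a stored record could look like, here is a sketch that writes one match per line as JSON. The field names, the `MatchRecord` class, and the 1-to-5 rating scale are all hypothetical. The score, candidate size, and rating in the example are placeholders, not measured values.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MatchRecord:
    query_article: str
    candidate_article: str
    vsm_score: float          # similarity score for the pair
    query_size: int           # size of the query article in words
    candidate_size: int       # size of the candidate article in words
    matching_features: dict   # stem -> [freq in query, freq in candidate]
    user_rating: int          # hypothetical scale: 1 (poor) to 5 (good)


def to_json_line(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_json_line(line):
    """Rebuild a record from a line written by to_json_line."""
    return MatchRecord(**json.loads(line))


record = MatchRecord(
    query_article="Tout-Bec",
    candidate_article="Rhinoceros",
    vsm_score=0.0,        # placeholder, not a measured score
    query_size=62,        # from the text
    candidate_size=0,     # placeholder, size not given in the text
    matching_features={"bec": [3, 5], "oiseau": [2, 2], "ornith": [1, 1]},
    user_rating=4,        # placeholder rating
)

line = to_json_line(record)
print(line)
```

One JSON line per rated pair keeps the log easy to append to and easy to load later for analysis.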
[Aside: In some cases, similar passages lead to possibly related plates and legends. Cadrature, for example, links to numerous plate legends dealing with clockmaking.]