Collocation Notes

Since we are planning yet another grant/project proposal that will use collocation as a main component, I thought I would give some background notes for future reference. One of the more popular reporting features in PhiloLogic is the collocation table. This is a very simple mechanism. It counts the words around a search term or list of terms (the user sets the span and can turn off function word filtering) and reports the frequencies of terms to the left, to the right, and in total in a table. Richard recently added the "collocation cloud" feature to the current production version at ARTFL. The following is the collocation table and cloud for "tradition" in the current release of ARTFL-Frantext:

Collocation is a well-established approach in Digital Humanities and other domains. Susan Hockey, for example, has a nice discussion of collocation in Electronic Texts in the Humanities (Oxford, 2000), pp. 90-91. She describes some work from the early 1970s and brings out the distinction between statistical calculations of collocation and very simple counts.

Berry-Rogghe (1973) discusses the relevance of collocations in lexical studies with reference to an investigation of the collocates of house, from which she is able to derive some notion of the semantic field of house. [...] Her program counts the total number of occurrences of the node, and the total number of occurrences of each collocate of the node within a certain span. It then attempts to indicate the probability of these collocates occurring if the words were distributed randomly throughout the text, and can thus estimate the expected number of collocates. It then compares the expected number with the observed number and generates a 'z-score', which indicates the significance of the collocate. The first table she presents shows the collocates of house based on a span of three words and in descending order of frequency. First is the, which co-occurs thirty-five times with house, but the total number of occurrences of the is 2,368. The is followed by this, a, of, I, in, it, my, is, have, and to, before the first significant collocate sold where six of the seven occurrences are within three words of house. Four words further on is commons, where all four occurrences collocate with house, obviously from the phrase House of Commons. When reordered by z-score, the list begins sold, commons, decorate, this, empty, buying, painting, opposite.

She goes on to suggest that "[f]or the non-mathematical or those who are suspicious of statistics, even simple counts of collocates can begin to show useful results, especially for comparative purposes." Which is, of course, precisely what PhiloLogic does now.
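To make the "simple counts" idea concrete, here is a minimal Python sketch of a collocation table. It is only an illustration of the mechanism described above, not PhiloLogic's actual code: the function name `collocation_table`, its parameters, and the tiny stopword set in the usage example are all my own assumptions.

```python
from collections import Counter


def collocation_table(tokens, node, span=5, stopwords=frozenset()):
    """Count words within `span` tokens to the left and right of `node`.

    Hypothetical helper for illustration only. Returns a list of
    (word, left_count, right_count, total_count), most frequent first.
    """
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words in the window before this occurrence of the node.
        for w in tokens[max(0, i - span):i]:
            if w not in stopwords:
                left[w] += 1
        # Words in the window after this occurrence of the node.
        for w in tokens[i + 1:i + 1 + span]:
            if w not in stopwords:
                right[w] += 1
    total = left + right
    return [(w, left[w], right[w], n) for w, n in total.most_common()]


# Usage with a toy sentence and a toy function-word list (both made up).
tokens = "the old house was sold and the house of commons sat".split()
table = collocation_table(tokens, "house", span=2,
                          stopwords={"the", "of", "was", "and"})
```

Passing an empty stopword set corresponds to turning function word filtering off.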

I have made extensive use of collocations over the years for my own work, using both the z-score calculation and the very simple collocation by counts (filtering function words). These studies include American and French political discourse for my dissertation and subsequent papers, gender-marked discourse, and comparisons of notions of tradition over time and in English and French. Breaking collocations down over time gives a pretty handy way to look at changing meanings of words. I have an ancient paper, "Quantitative Linguistics and histoire des mentalités: Gender Representation in the Trésor de la langue française, 1600-1950," in Contributions to Quantitative Linguistics: Proceedings of QUALICO 1991, First Quantitative Linguistics Conference (Amsterdam: Kluwer, 1993): 351-71, which gives a write-up of the method, some math :-), and references to some salient papers, including Berry-Rogghe (1973). In more recent work, I have used pretty much the same working model: build a database split into half-century chunks and do collocations by half-century periods, using the z-score calculation (outline the paper). Indeed, I have a hacked version of PhiloLogic that does this.

As Hockey indicates, the statistical measure gives a rather different flavor for the collocates, since it attempts to measure the degree of relatedness between the two words. For example, the top collocates of "Platon" in a subset of Frantext shift around significantly.


Word         Rank by z-score    Rank by frequency
Speusippe    1                  78
Aristote     5                  2

The reason for this is clear. Four of the 8 occurrences of Speusippe (50%) occur near Platon, while only 51 of the 793 occurrences of Aristote (about 6%) do. I think both techniques are valid, and have used them to illuminate various tendencies. The z-score measures the relatedness of two words, while the simple counts show how the keyword is typically used in general. There is, of course, some overlap between the two, but the z-score tends to privilege more distinctive constructions and associations.
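For the record, here is a rough Python sketch of the z-score as the Berry-Rogghe formula is usually given: estimate the chance p of meeting the collocate at any position outside the node, multiply by the number of positions examined to get the expected count, and divide the gap between observed and expected by the standard deviation. The co-occurrence figures (4 of 8, 51 of 793) come from the table above, but the corpus size, the frequency of Platon, and the window size are NOT in these notes. I made them up purely for illustration, and conventions for counting the window differ between implementations.

```python
import math


def collocation_z_score(k, f_node, f_coll, corpus_size, window):
    """Berry-Rogghe style z-score for one collocate (illustrative sketch).

    k           observed co-occurrences of collocate with the node
    f_node      total occurrences of the node in the corpus
    f_coll      total occurrences of the collocate in the corpus
    corpus_size total number of tokens in the corpus
    window      word positions examined around each node occurrence
    """
    # Chance of seeing the collocate at any non-node position.
    p = f_coll / (corpus_size - f_node)
    # Expected co-occurrences if words were distributed at random.
    expected = p * f_node * window
    std_dev = math.sqrt(expected * (1 - p))
    return (k - expected) / std_dev


# HYPOTHETICAL corpus parameters, chosen only to make the example run.
CORPUS_SIZE = 1_000_000
F_PLATON = 2_000
WINDOW = 10

z_speusippe = collocation_z_score(4, F_PLATON, 8, CORPUS_SIZE, WINDOW)
z_aristote = collocation_z_score(51, F_PLATON, 793, CORPUS_SIZE, WINDOW)
```

With these invented parameters Speusippe comes out ahead of Aristote despite having far fewer raw co-occurrences, which is the effect described above. The actual scores, and even the ordering, depend on the real corpus size and node frequency.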

Now, the obvious question is: "why don't we have the z-score calculation as an option in the standard collocation function in PhiloLogic?" And the answer is speed. The z-score (and other statistical models, which I will mention below) attempts to compare the expected frequencies of the word distribution against the observed frequencies, where the expected frequency assumes random distribution of words across a text, taking into account differences in frequencies. [Caveat: we know that "Language is never, ever, ever, random", but it is a useful heuristic, particularly for the kinds of simplistic comparisons I am doing.] The bottleneck for a real-time version of z-score collocations has been calculating baseline frequencies for any arbitrary range of documents. This may no longer be a significant problem. In a recent experiment, I built a script to sum the counts from arbitrary documents selected by bibliographic data (ARTFL Frantext word frequency report). While only a few users have expressed interest in having more global counts, it would appear that our latest servers have more than enough horsepower to do these kinds of additions very quickly, certainly fast enough to be bolted on to a collocation generator as an option. Certainly something to think about for a future revision of the old hopper.
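The baseline calculation itself is nothing more than adding up per-document word counts for whatever documents a bibliographic search selects. Here is a minimal Python sketch of that idea. It is not the script mentioned above: the document layout (a `meta` dict plus a `counts` dict) and the function name `baseline_frequencies` are assumptions of mine, and the sample documents are invented.

```python
from collections import Counter


def baseline_frequencies(documents, selector):
    """Sum word counts over the documents whose metadata pass `selector`.

    Hypothetical layout: each document is a dict with a "meta" dict
    (bibliographic fields) and a "counts" dict (word -> frequency).
    """
    totals = Counter()
    for doc in documents:
        if selector(doc["meta"]):
            totals.update(doc["counts"])  # adds counts, does not replace
    return totals


# Invented sample data, for illustration only.
documents = [
    {"meta": {"year": 1760}, "counts": {"tradition": 3, "loi": 10}},
    {"meta": {"year": 1785}, "counts": {"tradition": 1, "loi": 4}},
    {"meta": {"year": 1830}, "counts": {"tradition": 7, "loi": 2}},
]

# Baseline for the second half of the 18th century.
baseline = baseline_frequencies(documents,
                                lambda meta: 1750 <= meta["year"] < 1800)
```

The resulting totals (and their sum, as the corpus size) are what a z-score calculation would need as its expected-frequency baseline.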

There are, of course, a huge number of ways to calculate collocations. I suspect that there are two major areas: 1) how to identify spans and 2) how to measure the relationships between words. I had this notion that rather than simply looking at spans as N words to the right and left, one would count words in pre-identified constructions (such as noun phrases, verb phrases, or even clauses). Given the power of modern NLP tools, this is certainly an option to think about. Related is the notion that one would rather do collocations on either lemmas or even "stems" (the results of a stemmer, which basically strips various characters), which are not words but can be related to sets of words. The other area of work is the possibility of using other statistical measures of association, such as log-likelihood and mutual information.
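As a rough illustration of those two alternative measures, here is a Python sketch of pointwise mutual information and the log-likelihood ratio (G²) computed from a 2x2 contingency table. The function names and the way I count things (k co-occurrences out of n counting units, such as windows or word pairs) are my own assumptions; real packages differ in how they define the counting unit.

```python
import math


def pointwise_mutual_information(k, f_node, f_coll, n):
    """log2 of observed co-occurrence over what independence would predict."""
    return math.log2((k * n) / (f_node * f_coll))


def log_likelihood(k, f_node, f_coll, n):
    """G^2 statistic over the 2x2 table of node/collocate presence."""
    observed = [
        [k, f_node - k],                        # node with / without collocate
        [f_coll - k, n - f_node - f_coll + k],  # no node, with / without
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n  # expected if independent
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    return 2 * g2


# Made-up counts: 100 node units, 100 collocate units, 1,000 units total.
pmi_independent = pointwise_mutual_information(10, 100, 100, 1000)
pmi_attracted = pointwise_mutual_information(50, 100, 100, 1000)
g2_independent = log_likelihood(10, 100, 100, 1000)
g2_attracted = log_likelihood(50, 100, 100, 1000)
```

When the observed count equals what independence predicts (10 here), both measures are zero; the more the pair co-occurs beyond that, the higher both go. Mutual information is known to favor rare words, which is one reason to compare several measures.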

I'm pretty sure I've seen standalone packages that support more sophisticated statistical models. If we were going to do anything serious, the first place to start is reading. Reading? What? Yes, indeed. The chapter on collocations in Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999), is a great place to start. Other titles may include Sabine Bartsch, Structural and functional properties of collocations in English: a corpus study of lexical and pragmatic constraints on lexical co-occurrence (Gunter Narr Verlag, 2004). There is also software. Of course, Martin's WordHoard has an array of collocation measures (documentation), and we should not forget other goodies, such as Collocate (commercial) and the Cobuild Collocation Sampler.

ARTFL Project Research Blog
