Kristin has now implemented this for our Greek and Latin texts. If you wonder what's new about this: Word count for individual documents has always been there in PhiloLogic loads, but the difference here is that you can see frequencies over the entire corpus, or a subset of works/authors.

You can find the forms here:

http://perseus.uchicago.edu/LatinFrequency.html

http://perseus.uchicago.edu/GreekFrequency.html

Update: Forms moved to the 'production site', perseus.uchicago.edu. You can now specify genre as well. Stay tuned for further stats, meant to provide a friendly reminder of Zipf's Law.

Note: the counts are raw frequency counts, without lemmatization.

I have edited the search form a tiny bit - let me know if you encounter any problems.

I did this as a little experiment to see how quickly one could add individual document counts files together. There are a number of applications which require complete word counts for an arbitrary set of documents. For example, the Z-score calculation of collocations, which I have used for a number of research projects, compares an expected distribution of words against the actual distribution. I did not put this in the standard PhiloLogic releases because it needs raw counts for all of the documents to calculate the expected distributions, and I found that this was too slow to perform in real time. Another application would be something like differential relative rates, which we have in PhiloMine. Given the performance of modern machines, it appears that we could certainly implement both for relatively small corpora, say on the low order of thousands of documents, which is to say most of the datasets we are currently working with. One could imagine ways of speeding this up if need be. Another goodie for "Philo4"?
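As a rough sketch of the idea (the word counts below are invented for illustration, and the z-score formula is one common formulation, not necessarily the exact one used in the projects mentioned above), per-document counts can be merged into corpus totals, which then supply the expected distribution for a collocation z-score:

```python
import math
from collections import Counter

# Hypothetical per-document word counts; in practice these would be
# read from the counts files generated at load time.
doc_counts = [
    Counter({"et": 40, "in": 25, "est": 15}),
    Counter({"et": 30, "non": 20, "est": 10}),
]

# Adding Counters merges per-document tallies into totals for an
# arbitrary subset of documents.
corpus = sum(doc_counts, Counter())
total = sum(corpus.values())

def collocation_z(observed, span_tokens, word):
    """Z-score for a collocate: compare the observed occurrences of
    `word` inside the collocation span against the count expected
    from its overall corpus frequency."""
    p = corpus[word] / total        # expected rate per token
    expected = p * span_tokens
    return (observed - expected) / math.sqrt(expected * (1 - p))
```

Since the expensive part is only Counter addition over the selected documents, the cost grows with vocabulary size rather than corpus length, which is why this looks feasible in real time for corpora of a few thousand documents.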

That would be a lovely addition. At an even simpler level, we decided on the following additions to the script: sum the list of words that result from a particular query, along with their counts and frequencies, so you get a sense of how much of the corpus you're covering. Classicists could be more aware of typical distributions, and this would be one way to make them clearer. OK - I'll admit, I'd like to see this too. How many words do you need to get 75% of Plato, vs. 75% of Lysias? For Lysias, the magic number turns out to be 1272. Bear in mind that this is still non-lemmatized, so the number of distinct lemmas is actually smaller; and those lemmas will actually cover more ground.
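The "how many words cover 75%" question is just a cumulative sum down the rank-frequency list (the very shape Zipf's Law describes). A minimal sketch, with made-up counts standing in for a real corpus:

```python
from collections import Counter

# Hypothetical raw (non-lemmatized) frequency counts for a small corpus.
counts = Counter({"kai": 50, "de": 30, "ho": 10, "men": 6, "gar": 4})

def words_for_coverage(counts, target=0.75):
    """Return how many distinct forms, taken in descending frequency
    order, are needed to cover `target` of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= target:
            return rank
    return len(counts)
```

Run over real per-author counts, this is exactly the comparison above: `words_for_coverage(plato_counts)` versus `words_for_coverage(lysias_counts)`.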

Update: Lemma searching now available! With a mere 259 lemmas, 79% of Lysias is covered. 375 words will cover 80% of Lysias and Plato combined. 365 will do the same for Plato alone :-)

1015 words will get you over the 90% threshold in Plato.

A note of caution: accurate lemmatization and disambiguation are still a work in progress.