See UPDATE at the end.
This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skip to the end; the rest describes how I got there.
It is not so difficult to look up the lemma for each word surrounding a given search hit, as the structure to do so is already in place for these texts and the information is stored in SQL tables. The main difficulty is gathering this information efficiently. Looking up the lemma in the tables now in place can take a couple of different SQL queries, each of which takes a small chunk of time. For a few lookups, or even a few hundred, this is not a big problem. However, for a collocation table spanning five words on either side, we need at least 10 database lookups per hit, and that time adds up quickly.
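Roughly speaking, each of those per-word lookups amounts to something like the sketch below, written with Perl's DBI module. The table and column names here (word_occurrences, lemmas and so on) are invented for illustration and are not the actual Perseus schema; the point is just that every word in the window costs a round trip or two to the database.

use strict;
use warnings;
use DBI;

# Hypothetical connection and schema, for illustration only.
my $dbh = DBI->connect('DBI:mysql:database=perseus', 'user', 'password')
    or die $DBI::errstr;

sub lemma_for_word_id {
    my ($word_id) = @_;
    # Query 1: which lemma id is this word occurrence tagged with?
    my ($lemma_id) = $dbh->selectrow_array(
        'SELECT lemma_id FROM word_occurrences WHERE word_id = ?',
        undef, $word_id);
    return 'nolemma' unless defined $lemma_id;
    # Query 2: resolve that lemma id to its headword.
    my ($headword) = $dbh->selectrow_array(
        'SELECT headword FROM lemmas WHERE lemma_id = ?',
        undef, $lemma_id);
    return defined $headword ? $headword : 'nolemma';
}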
So, following a suggestion from Mark, I wrote a script that generated a file with a line for every word in the Perseus database. Basically, it takes each word id, starting at 1 and running up to roughly 5.5 million, and looks up its lemma. This produced a 5.5-million-line file with lines like this:
2334550 δέ
2334551 ὅς
2334552 ἕκαστος
2334553 ἵππος
2334554 nolemma
2334555 ὅς
2334556 δέ
2334557 πέτομαι
2334558 κονίω
2334559 πεδίον
Now, looking up the words on either side of a hit is much simpler: all you need to know is the "word id" of the hit, and you can look at the lines around it. The "nolemma" entries are mostly punctuation and the like that were given word tags at some point.
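The generation script itself is just a loop over every word id. Here is a minimal sketch, reusing the hypothetical lemma_for_word_id() helper from the sketch above; the upper bound and file name are placeholders:

# Write one "word_id lemma" line per word in the database.
open(my $out, '>', 'lemmafile') or die "cannot write lemmafile: $!";
for my $word_id (1 .. 5_500_000) {            # placeholder upper bound
    my $lemma = lemma_for_word_id($word_id);  # falls back to "nolemma"
    print $out "$word_id $lemma\n";
}
close($out);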
The size of this massive file, however, was another hindrance to generating collocation tables efficiently. Although we now know exactly which line holds the information we need, getting that line is still costly, since it could require reading in a couple million lines of the file to reach the one we want. After playing around with it a bit, a command-line grep search seemed to be the fastest way to go, but even a grep search run 10 times per hit adds up fast. So I tried combining the searches into one massive egrep command whose output is read into a Perl array. My searches looked something like:
egrep "23345[456][0-9]" lemmafile
This gives a window of 30 lines in the file, starting at 2334540 and ending at 2334569, and limits the searches to one per hit instead of 10. It still wasn't fast enough, though, so I combined all of the searches like so:
egrep "(2342[234]|33329[678]|...|829[567])[0-9]"
(A bit of accounting is needed for ids ending in 0, so that a window around 400 doesn't include entries in the 490s, but this is not too difficult.)
This looked nice and seemed to work until I tried running it on more hits. It then produced such massive regular expressions that grep complained they were too big. So I broke them up into chunks of roughly 350 patterns at a time. With fewer, the time goes up because of the added number of grep searches; with too many more, grep overflows again. I may not have hit on the exact time-minimizing value, but it is at least close.
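Here is a rough sketch of that chunking. The window patterns are built the same way as the single-hit example above (a prefix plus a bracket of tens digits), and for simplicity this version just clamps the bracket at the edges rather than handling the ids-ending-in-0 case properly:

use strict;
use warnings;

my @hit_ids    = (2334550, 2342230, 3332970);   # example word ids of hits
my $chunk_size = 350;                           # patterns per egrep call

my @lines;
while (my @chunk = splice(@hit_ids, 0, $chunk_size)) {
    my @patterns;
    for my $id (@chunk) {
        my $prefix = int($id / 100);                 # e.g. 23345 for 2334550
        my $tens   = int(($id % 100) / 10);          # tens digit of the hit
        my @window = grep { $_ >= 0 && $_ <= 9 } ($tens - 1, $tens, $tens + 1);
        push @patterns, $prefix . '[' . join('', @window) . '][0-9]';
    }
    my $regex = '(' . join('|', @patterns) . ')';
    # One egrep call per chunk; the matching lines land in @lines.
    push @lines, `egrep "$regex" lemmafile`;
}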
Finally, here are some example searches:
Search for logoi.
Search for lemma:logos. (The time estimate is higher than the actual load time.)
Or, here is the search form:
Search Form.
Make sure that you choose Collocation Table and check the lemma button under the "Refined Search Results" tab at the bottom.
It can handle most searches, except for very high-frequency words. If anyone has ideas on how to make it faster, we might be able to get results for all searches, though perhaps that is not possible without generating and saving those results somewhere ahead of time.
UPDATE:
After talking to Mark, I altered the way the data is read from the file, and things should now be running faster. All of this discussion and speed streamlining for lemmatized collocation tables is necessary because the texts on Perseus do not have the lemmas embedded in them. As Mark noted, many of the other databases would allow the same data to be generated much more simply and quickly, because they do have lemmas in the text. For the purposes of Perseus, however, the lemmas needed to be kept separate from the texts so that they can be updated, changed and maintained more dynamically.
As for the speed, it should now be faster thanks to a handy function in Perl. I had investigated methods for reading a particular line of a file, since I knew exactly which lines I needed, but finding none that did not read the whole contents of the file up to that line, I implemented the process described above instead. In doing so I overlooked Perl's seek, which I had dismissed because it starts from a byte offset rather than a line number. Nevertheless, we can harness its power simply by padding each line with spaces so that every line in the file is the same byte length. With this pointer from Mark and some padding on the lines, knowing the line number and the number of bytes per line is enough to start reading from the exact spot in the file that we want.
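As a sketch of what that looks like in practice, assume the padded file has fixed-width records of, say, 30 bytes each (that width and the file name are assumptions here; the width just has to match however the file was actually padded), with line N holding word id N:

use strict;
use warnings;
use constant RECORD_LEN => 30;   # bytes per padded line, newline included

open(my $fh, '<', 'lemmafile.padded') or die "cannot open padded file: $!";

sub lemma_line {
    my ($line_number) = @_;       # line N holds word id N
    # Jump straight to the start of the record instead of scanning the file.
    seek($fh, ($line_number - 1) * RECORD_LEN, 0) or die "seek failed: $!";
    read($fh, my $record, RECORD_LEN) or die "read failed: $!";
    $record =~ s/\s+$//;          # strip the space padding and newline
    return $record;               # e.g. "2334552 ἕκαστος"
}

print lemma_line(2334552), "\n";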