Introduction
As I'm sure you all know, the MONK Project (http://monkproject.org/), directed by Martin Mueller and John Unsworth, has generated a large collection of tagged data, some of which has been made public and some of which is limited to CIC or other institutions (http://monkproject.org/downloads/). Each word in this group of different collections is tagged for part of speech, lemma, and normalization. Martin has documented the encoding scheme in great detail at http://panini.northwestern.edu/mmueller/nupos.pdf.
The following is a long post describing in some detail one approach to integrating this kind of information. Some of this will be deeply geeky and you can feel free to skip over sections. There is, towards the bottom of this post, a link to a standard PhiloLogic search form, so you can play with this proof-of-concept build yourself.
Richard and Helma have developed a mechanism for part of speech and lemma searching under PhiloLogic for their Greek and Latin databases (link). This is based on some truly inspired hacking by Richard and forms one model of how to handle this kind of functionality. My understanding of this, and Richard please correct me if I am wrong, is that it uses an undocumented feature in the index/search3 subsystem that allows us to have multiple index entries for each word position in the main index. This works and is certainly an approach to be considered as we think about a new series of PhiloLogic.
Build Notes
I have been experimenting with a somewhat different mechanism to handle this kind of problem, which is based on previous examples of mapping multiple word attributes to an index entry using multiple-field "crapser" entries. You may recall that this is the mechanism by which we merged Martin's virtual normalization data with very large collections of early modern English data, and it is currently running at Northwestern (link). My approach is to index not words, but pairs of surface forms and part of speech tags, and to link these to an expanded (5 field) word database (called by crapser) containing the index form, surface form, part of speech, lemma, and normalized form. Here are some index entry forms (and frequencies):

24 conquer:vvb
445 conquer:vvi
143 conquered:vvd
414 conquered:vvn
These map to the word vector database, which looks like:

idx              surf       pos    lem      normal
conquered:j      conquered  j      conquer  conquered
conquered:j-vvn  conquered  j-vvn  conquer  conquered
conquered:n-vvn  conquered  n-vvn  conquer  conquered
conquered:vvd    conquered  vvd    conquer  conquered
conquered:vvn    conquered  vvn    conquer  conquered
To build this, I first reduced the fully verbose form of the data, in which each token is tagged:
<w eos="0" lem="country" pos="n1" reg="COUNTRY" spe="COUNTRY">COUNTRY</w>
I eliminated all encoding that is redundant, just to make things easier to work with since the files are huge:
<w pos="n1">COUNTRY</w>
Where there is some additional information, I keep it in the encoded document:
<w lem="illustration" pos="n2">ILLUSTRATIONS</w>
I then loaded this data into a very slightly modified PhiloLogic textloader. This simply builds an index representation of the surface form of the word and the part of speech, by getting the PoS from the encoding:

if ($thetag =~ /<w/) {
    $thepartofspeech = "";
    $thetag =~ m/pos="([^"]*)"/i;
    $thepartofspeech = $1;
}
and adding this to the index entry:
$theword = $theword . ":" . $thepartofspeech;
When loaded to this point, you have modified index entries. The next step is simply to build a multi-field word vector database (crapser). I did this by reading the input data and adding entries for lemmas or normalizations. This is simply an extension of what is already documented in the "virtual-normalize" directory under "goodies" in the PhiloLogic release.
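As a rough illustration of that step, here is a minimal sketch of how one might generate the five-field records from the reduced encoding. The field order follows the table above; the input handling, fallbacks, and lower-casing are my assumptions rather than the code actually used for the build:

use strict;
use warnings;

# Emit one tab-delimited record (idx, surf, pos, lem, normal) per tagged token.
while (my $line = <STDIN>) {
    while ($line =~ m{<w([^>]*)>([^<]+)</w>}g) {
        my ($attrs, $surface) = ($1, lc $2);
        my ($pos) = $attrs =~ m/pos="([^"]*)"/i;
        my ($lem) = $attrs =~ m/lem="([^"]*)"/i;
        my ($reg) = $attrs =~ m/reg="([^"]*)"/i;
        next unless defined $pos;
        $lem = $surface unless defined $lem;    # no lem attribute: treat the surface form as the lemma
        my $normal = defined $reg ? lc $reg : $surface;
        my $idx = "$surface:$pos";              # same shape as the index entries above
        print join("\t", $idx, $surface, $pos, $lem, $normal), "\n";
    }
}

The output would then need to be sorted and uniqued (with counts, if you want the frequencies) before being loaded as the actual word vector database.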
The next step was to slightly modify a version of Leonid's original "gimme". The "sequential" version of this function (in the standard PhiloLogic distribution) maps a multi-field (tab delimited) query onto regexp patterns for egrep. This is fast and simple. It allows naming of fields, so you can simply specify "lem=justice" and it will generate a regular expression pattern (where TAB = the tab character):
^[^TAB]*TAB[^TAB]*TAB[^TAB]*TABjusticeTAB[^TAB]*$
And you get, of course, full regular expressions. (Note: this may render with some odd spacing; there are no spaces in the actual pattern.) Swap in this version of crapser and it all appears to run without further modification.
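For the curious, here is a rough sketch of the kind of mapping involved. The field names and order mirror the word vector database above; the function name and details are hypothetical, not the actual "gimme" code:

use strict;
use warnings;

my @fields = qw(idx surf pos lem nrm);    # order of the tab-delimited word database

# Turn a single "field=value" query (e.g. "lem=justice" or "pos=po.*") into an
# anchored egrep pattern that matches the value only in the named field.
sub field_pattern {
    my ($query) = @_;
    my ($name, $value) = split /=/, $query, 2;
    my @parts = ("[^\t]*") x @fields;
    for my $i (0 .. $#fields) {
        $parts[$i] = $value if $fields[$i] eq $name;
    }
    return '^' . join("\t", @parts) . '$';
}

print field_pattern("lem=justice"), "\n";    # the pattern shown above, with real tab characters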
So, to summarize, the implementation does not require any modifications to core system components. It requires only slight modifications to a textloader, which we do all the time for specific databases, and a slightly modified "crapser" with a suitably built word vector database.
The Database
The database has 567 documents containing 38.5 million words and 273,600 index entries (types). Recall that these index entries are surface form words paired with part of speech tags, not normal types. The dataset has selections from various sources, including Documenting the American South as well as some British Early Modern texts. It should have full PhiloLogic search and reporting capabilities. You can query the words in the database as usual, simply by typing in words. To force searches on lemmas, normalizations, and parts of speech, use the following syntax (with examples):
lem=conquer
nrm=conquer
pos=pns32
And finally, if you want to get one surface form and part of speech, you can search the index entry directly, such as "conquered:vvd". Note that the part of speech is specified after a colon and you don't need to specify anything else. This is obviously not a real query interface, but it suggests how we can think about interfaces further along (e.g., pull-down menus, etc.). You can also use regular expressions, such as lem=conque.* Finally, you can combine these, such as "pos=po.* lem=enemy", which means find a possessive pronoun followed by forms of enemy within three words, such as "their most mortall enemies". You will need to consult Martin's discussion of the encoding to see all of the parts of speech. It is an extensive and well-reasoned scheme.
After all of that, here is the search form. (Reloaded 7/28/11)
Now, before running off to play with this, there are some important notes below which describe how to use it in more detail.
Discussion
This is a proof-of-concept build. In a full implementation, I would need to add some search syntax to allow the user to indicate a set of combined criteria for a single word. I was having some problems coming up with a use case, but I guess one could want to, say, search for a particular lemma AND part of speech. It would all work with a little massaging. Aside from that, this simple model should support all of the standard PhiloLogic searching and reporting features. Do let me know if you find something that does not work.
This model supports disambiguating searches, such as finding dog when it is used as a verb. Try "dog:vvi" for hits like "we can dog them" (thanks Russ for this example). It also appears to work properly for most other searches, such as lemmas, normalizations, etc. Performance on single part of speech searches looks reasonable.
My primary interest in this experiment, however, is to test performance on searches for sequences of parts of speech. For example, "pos=po3. pos=j pos=n1" will find sequences like "their strange confusion" and "his Princely wisedome". Chains of four also seem to work reasonably. E.g., "pos=vvn pos=po3. pos=j pos=n1" returns phrases like "neglected their even elevation", "stimulated their adventurous courage", and "aroused his little troop". You can always find a part of speech after a particular word (lemma): "after pos=po3.".
Now, this is all fine and dandy. Except that doing conjoined searches on parts of speech reveals a significant conceptual difficulty, which I believe also applies to Richard's implementation. Each part of speech generates thousands of surface form index entries. For example:
"pos=vvn pos=po3. pos=j pos=n1"
generates 81,000 unique terms (index entries) in 4 vectors. The evaluation then does a massive join at the level of index entry evaluation. So, it is SLOW and subject to possible memory buffer overflow or other problems. In fact, the system will begin to generate results of this type fairly quickly, due to PhiloLogic's lazy evaluation (start returning results as soon as you have a few), but it can take several minutes to complete the task. We would certainly not want to put this on a high-traffic machine, since many similar queries would bog it down. Obviously, we could simply test to make sure that users' search criteria would not drag the whole system down, or simply lock this database to one user at a time, or find some other workaround. If we got reasonable French NLP, this could be implemented quickly.
However, I believe we have bumped up against a conceptual problem. To find POS1 and POS2 and POS3, either in a row or within N words, requires evaluating word and/or part of speech positions in documents.
There are a couple of possible solutions, all of which would require consideration of distinct indexing structures. The first is simply to build another kind of NGRAM POS index, which would have sequences of parts of speech indexed and mapped to document regions. The second would be another kind of index which would look like a standard PhiloLogic index entry, except that it would be ONLY part of speech. This would reduce the size of the word vectors, but would not in itself improve the index evaluation needed to find those sequences that fit the query in the actual documents, since we still have to return to word positions in documents.
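To make the first option a bit more concrete, here is a toy sketch of what indexing part of speech n-grams might look like. The trigram size, the key format, and the made-up tag stream are illustrations of the idea only, not a design worked out in this post:

use strict;
use warnings;

# Toy example: index every trigram of part of speech tags, mapping it to the
# word positions where the trigram starts (the tag stream here is invented).
my @tags = qw(vvn po32 j n1 cc n1);
my %pos_ngram_index;
for my $i (0 .. $#tags - 2) {
    my $key = join(":", @tags[ $i .. $i + 2 ]);    # e.g. "po32:j:n1"
    push @{ $pos_ngram_index{$key} }, $i;
}
# A query like "pos=po3. pos=j pos=n1" could then be matched against trigram
# keys directly, rather than joining thousands of surface form index entries.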
We might call this "The Conjoined Part of Speech Problem" (CPSP). It is, in my opinion, a highly specialized type of search, and it is not clear just what the use cases might look like in relatively simple languages (English, French) as opposed to Greek, for which Helma makes a convincing case. So, there is a question of just how important this might be. In email communication, Martin makes the case that it would be important and that researchers who want this kind of query would be willing to wait a few minutes.
It would be a trivial and useful experiment to run a load where I would index ONLY part of speech information. This would be a good test to see if evaluation speed for conjoined part of speech searches would be reasonable. In fact, Richard and I did a few quick experiments that suggest this would work. The idea would be to distinguish between simple queries -- and run them as usual -- and multiple PoS queries, which would be run on a dedicated index build. So, build parallel indices. Oddly enough, in the current architecture, I suspect that one could simply have a switch to say WHICH database to consult dynamically, simply by evaluating the query and then setting the database argument. That would be another one of my famous, award-winning, hall of shame hacks. But it could be made to work.
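In its crudest form, the switch might look something like the sketch below. The variable names, the way the query string is obtained, and the database names are all hypothetical; the point is only that the query can be inspected before choosing which build to hand it to:

use strict;
use warnings;

# Route queries with more than one pos= criterion to a hypothetical part of
# speech only build; everything else goes to the regular surface:pos database.
my $query     = $ENV{PHILO_QUERY} || "";     # the word query, however it actually arrives
my @pos_terms = ($query =~ /pos=\S+/g);      # collect the pos= criteria
my $dbname    = (@pos_terms > 1) ? "monk-posonly" : "monk-surface-pos";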
Martin has also pointed out another issue, which is searching, sorting, and counting of PoS, lemma, and other data. Now, that makes a lot of sense. I want to search for "country" and find distributions of particular parts of speech. Or, I want to do a collocation table searching on a lemma and counting the lemmas around the word. I think all of this is certainly doable -- the latter is something I wrote about some 15 years ago -- with hacks to the various reporting subsystems (not in 3, which is just too much of a mess). In an SOA model of PhiloLogic, this would be quite reasonably handled, ideally by other teams using PhiloLogic if not here at Chicago.
I think these are important issues to raise, but not necessarily resolve at this time, if (when?) we consider the architecture of any future PhiloLogic4 development effort. For example, the current models of report generators would have to know about lemmas, etc. And we would need to at least leave hooks in any future model to support different indexing schemes for things like part of speech sequences.
Finally, watch this space. I believe Richard is doing a build of this data using his model as well.
Please do play around with all of this and let me know what you think. One consideration would be implementing this for selected French collections. We would obviously need real virtual normalizers, lemmatizers and PoS identifiers for a broader range of French than we have now.