Natural Language Morphology Queries in Perseus

Natural language queries are now possible on Perseus under Philologic. Previously, Richard had implemented searching for parts of speech by their morphological codes. For instance, as noted in the About page for Perseus, a search for 'pos:v*roa*' will return all the instances of perfect active optative verbs in the selected corpus. Now, a search for 'form:could-I-please-have-some-perfect-active-optatives?' will return the same results. In fact, searching for 'form:perf-act-opt', 'form:perfect-active-optative', 'form:perfection-of-action-optimizations', or 'form:perfact-actovy-opts-pretty-please' will all accomplish the same task. Note that the dashes between the words are necessary: without them, a search for plural nouns written as 'form:plural nouns' will actually look for any plural word followed by the word "nouns", which will fail.

I carefully chose shorter forms of all the keywords, such as "impf" and "ind" for "imperfect" and "indicative", so that any word starting with "ind" will match indicatives regardless of what follows the 'd'. Hopefully, there are no overlapping matches (using "im" to abbreviate "imperfect", for example, would also match "imperative"). If you do encounter any, please let me know. We could put a list of acceptable abbreviations somewhere, although they are fairly straightforward and typing out the full term is always a fail-safe method.
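To make the abbreviation idea concrete, here is a minimal sketch in Python (the actual script is not shown in this post, so this is not its code). The keyword table, the exact regular expressions, and the function names `recognise` and `overlapping` are all assumptions made for illustration. Each pattern is anchored at the start of a dash-separated word, so whatever follows the abbreviation is ignored.

```python
import re

# Hypothetical keyword table (NOT the script's real one): each pattern is
# anchored at the start of a word, so trailing letters do not matter.
KEYWORDS = {
    "perfect": re.compile(r"^perf"),
    "imperfect": re.compile(r"^imp(er)?f"),   # accepts "impf" and "imperfect"
    "imperative": re.compile(r"^imperat"),
    "indicative": re.compile(r"^ind"),
    "optative": re.compile(r"^opt"),
    "active": re.compile(r"^act"),
}


def recognise(query):
    """Return the keyword names matched by the dash-separated words of a query."""
    found = []
    for token in query.lower().split("-"):
        for name, pattern in KEYWORDS.items():
            if pattern.search(token):
                found.append(name)
    return found


def overlapping(abbreviation):
    """List every full keyword that a candidate abbreviation would match."""
    pattern = re.compile("^" + abbreviation)
    return sorted(name for name in KEYWORDS if pattern.search(name))


print(recognise("perfact-actovy-opts-pretty-please"))
print(overlapping("im"))   # an abbreviation that is too short matches two keywords
```

Words that match no pattern ("pretty", "please") are simply skipped, which is why the chattier queries still work. A check like `overlapping` is one possible way to catch an abbreviation such as "im" before it goes into the table.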

Basically, the modified crapser script translates searches beginning with "form:" into the corresponding "pos:" search. Using a hash of regular expressions and string searching, it builds the corresponding code. In the previous example, the search is actually looking for "pos:....roa..". Notice that it fills the empty slots of the code with dots, which allows them to be anything. I also implemented an alternative filler, the dash, so that when you search for something like "form:perf-act-opt-exact", you will actually be searching for "pos:----roa--" (and your search will fail, because no term is only and exactly perfect active optative without other specifications).
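The translation step can be sketched roughly as follows. This is an illustrative Python version, not the modified script itself: the function name `form_to_pos`, the `RULES` table, and the slot positions are assumptions, with the code width and the positions of 'r', 'o' and 'a' read off the single "pos:....roa.." example above.

```python
import re

# Assumption: a nine-slot code, as in the "pos:....roa.." example.
CODE_WIDTH = 9

# Hypothetical rules: (pattern for a word, slot index, code letter).
# Only the three keywords from the example are included here.
RULES = [
    (re.compile(r"^perf"), 4, "r"),  # perfect
    (re.compile(r"^opt"), 5, "o"),   # optative
    (re.compile(r"^act"), 6, "a"),   # active
]


def form_to_pos(search):
    """Translate a 'form:' search into a 'pos:' search; pass anything else through."""
    if not search.startswith("form:"):
        return search
    tokens = search[len("form:"):].lower().split("-")
    # "exact" switches the filler from dots (anything) to dashes (unspecified).
    filler = "-" if "exact" in tokens else "."
    slots = [filler] * CODE_WIDTH
    for token in tokens:
        for pattern, index, letter in RULES:
            if pattern.search(token):
                slots[index] = letter
    return "pos:" + "".join(slots)


print(form_to_pos("form:perf-act-opt"))
print(form_to_pos("form:perf-act-opt-exact"))
```

Because a single filler is chosen for the whole code, a sketch like this can only ever produce all dots or all dashes in the unfilled slots.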

One limitation of this method of natural language querying is that it cannot match the versatility of the "pos:" searches. Because it selects either dots or dashes as fillers, you cannot get a mixture of them in your search, so you cannot run a search such as "pos:v-.sroa---". However, this limitation will likely have little effect for the average user, and a user who needs such a search can still obtain it using the "pos:" method. An alternative method involving drop-down input boxes for each slot of the code would enable the full power of the pos searches, but it would be more tedious to implement and potentially tedious to use as well. Such an input form would also require the user to know more about the encoding than the "form:" searching I implemented does. For example, a user would need to know that "verb" is required in the first slot, even if "aorist optative" makes that the only possibility, whereas searching for 'form:aorist-optative' works without the user ever needing to know that a 'v' is required in the first slot.

1 comment:

  1. I am delighted that this is now possible. I hope it will make morphological searching accessible to more users. The one downside is that TreeTagger's automated tagging errors will now be more obvious to more people :-) But to that I say, please, vote early and vote often.
