ARTFL Project Research Blog

Frequencies in the Greek and Latin texts

Helma Friday, November 20, 2009

Earlier this year Mark built a frequency query for the French texts (affectionately named wordcount.pl).

Kristin has now implemented this for our Greek and Latin texts. If you wonder what's new here: word counts for individual documents have always been available in PhiloLogic loads, but now you can see frequencies over the entire corpus, or over a subset of works or authors.

You can find the forms here:
http://perseus.uchicago.edu/LatinFrequency.html
http://perseus.uchicago.edu/GreekFrequency.html

Update: Forms moved to the 'production site', perseus.uchicago.edu. You can now specify genre as well. Stay tuned for further stats, meant to provide a friendly reminder of Zipf's Law.

Note: the counts are raw frequency counts, without lemmatization.
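
To make "raw frequency counts, without lemmatization" concrete, here is a minimal sketch of counting surface forms over a whole corpus or a filtered subset. It is not the wordcount.pl code: the `corpus` array, its fields, and the `rawFrequencies` function are hypothetical names, and the three toy documents stand in for what the real forms read from PhiloLogic's indexes.

```javascript
// Hypothetical in-memory corpus; the real forms query PhiloLogic's indexes instead.
const corpus = [
  { author: "Caesar", genre: "prose", text: "Gallia est omnis divisa in partes tres" },
  { author: "Vergil", genre: "epic", text: "arma virumque cano Troiae qui primus ab oris" },
  { author: "Caesar", genre: "prose", text: "in partes tres quarum unam incolunt Belgae" },
];

// Count raw surface forms (no lemmatization) over the documents that pass `filter`.
function rawFrequencies(docs, filter = () => true) {
  const counts = new Map();
  for (const doc of docs.filter(filter)) {
    // Lowercase, then split on anything that is not a Unicode letter.
    const tokens = doc.text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
    for (const token of tokens) {
      counts.set(token, (counts.get(token) || 0) + 1);
    }
  }
  // Most frequent first; ties are broken alphabetically so the order is stable.
  return [...counts.entries()].sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}

const wholeCorpus = rawFrequencies(corpus);
const caesarOnly = rawFrequencies(corpus, (doc) => doc.author === "Caesar");

console.log(wholeCorpus.slice(0, 3));
console.log(caesarOnly.slice(0, 3));
```

Because nothing is lemmatized, inflected forms such as "partes" and "pars" would be counted as separate entries.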

I have edited the search form a tiny bit - let me know if you encounter any problems.

Do LDA generated topics match human identified topics?

Clovis Wednesday, November 18, 2009

I've been experimenting lately with how well LDA-generated topics match the Encyclopédie classes of knowledge. The experiment was conducted in the following way:

- I chose 100 classes of knowledge in the Encyclopédie, and picked 50 articles of each.

- I then ran a first pass of the LDA topic trainer, asking for 100 topics.

- I then proceeded to identify each generated topic and name it after the matching Encyclopédie class of knowledge.

- My plan was then to look at the topic proportions per article and see if the top topic would correspond to its class of knowledge. Would the computer manage to classify the articles in the same way the encyclopedists had?
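
The comparison planned in the last step can be sketched as follows. Everything here is an assumption for illustration: the `topicLabels` mapping, the `articles` records, and all proportions are invented numbers, not output from the actual run.

```javascript
// Hypothetical manual labelling of each topic with an Encyclopédie class of knowledge.
const topicLabels = { 0: "Artillerie", 1: "Peinture", 2: "Soie" };

// Hypothetical topic proportions per article; the numbers are made up.
const articles = [
  { headword: "CANON", classOfKnowledge: "Artillerie", proportions: [0.7, 0.1, 0.2] },
  { headword: "PINCEAU", classOfKnowledge: "Peinture", proportions: [0.05, 0.8, 0.15] },
  { headword: "BOUTON", classOfKnowledge: "Boutonnier", proportions: [0.1, 0.2, 0.7] },
];

// Index of the largest proportion, i.e. the article's top topic.
function topTopic(proportions) {
  return proportions.reduce((best, p, i) => (p > proportions[best] ? i : best), 0);
}

// Fraction of articles whose top topic carries the same label as their class.
function agreement(articleList, labels) {
  const hits = articleList.filter(
    (a) => labels[topTopic(a.proportions)] === a.classOfKnowledge
  );
  return hits.length / articleList.length;
}

console.log(agreement(articles, topicLabels));
```

In this toy data the third article counts as a miss: its top topic is labelled "Soie" while the encyclopedists filed it under "Boutonnier".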

I was not able to get that far when choosing 100 topics for my first LDA run. This is because LDA will always generate a couple of topics which aren't really topics, but just lists of very common words that happen to be used in the same documents. One should disregard these topics and focus on the others. This meant that I had to add a few more topics to my LDA run in order to get 100 identifiable topics, so I settled on 103. I found 3 distributions of words which were unidentifiable, and I dismissed them.

The results show that LDA topics and the Encyclopédie classes of knowledge do not consistently match (see the link to results below). Some do very well, like Artillerie, for which the corresponding distribution of words is:

canon piece poudre artillerie boulet fusil ligne calibre mortier bombe feu charge culasse livre met chambre pouce lumiere roue affut diametre coup batterie levier bouche ame flasque balle tourillon tire

Other distributions of words make sense in themselves but do not match any of the original classes of knowledge. For instance, there is no separate topic for 'teinture' or for 'peinture'. What we get instead is a mixture of both classes of knowledge, which could be identified as colors:

couleur rouge blanc bleu tableau jaune verd peinture ombre teinture noir toile tableaux nuance papier etoffe bien teint peintre pinceau trait teinturier melange veut figure teindre feuille beau sert colle

Now the topic modeler is not wrong here. It's telling us that these words tend to occur together, which is true. Another significant example is the one with 'Boutonnier', 'Soie', and 'Rubanier':

soie fil rouet corde brin tour main bouton gauche longueur boutonnier droite attache bout fils tourner sert molette noeud cordon doigt piece emerillon moule broche ouvrage ruban rochet branche aiguille

What we get here is a topic about the art of making clothes, which is more general than 'Boutonnier' or 'Rubanier'.

For this to actually work, the philosophes would have had to be extremely rigorous in their choice of vocabulary, because this is what LDA expects. Another problem is that LDA treats each document as a mixture of topics, not as an instance of a single topic. So if a document is exclusively focused on one topic, LDA will still try to extract several topics from it, and some of those will be mere subdivisions of that document's class of knowledge. Our experiment may therefore have broken down because the LDA topic trainer created new subdivisions for some classes of knowledge, or regrouped several classes of knowledge. These are all valid as topics, but they do not correspond to human-identified topics.
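
The "mixture of topics" assumption can be stated compactly with the standard LDA generative model. The notation below is the usual textbook one, not anything taken from the run described here: $\alpha$ and $\beta$ are Dirichlet hyperparameters, $K$ is the number of topics, $d$ indexes documents and $n$ indexes word positions.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K
  && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions of document } d \\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assigned to word position } n \\
w_{d,n} &\sim \mathrm{Multinomial}(\phi_{z_{d,n}})
  && \text{word drawn from that topic}
\end{align*}
```

Since every document gets its own $\theta_d$ over all $K$ topics, the model has no way to declare that an article belongs to exactly one class; it can only give one topic a large share.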

Link to results

Section Highlighting in Philologic

Kristin Friday, November 13, 2009

In many of the Perseus texts currently loaded under PhiloLogic, the section labels overlapped and were unreadable. These labels come from the milestone tags in the XML text and are placed along the edge of the text. One particularly problematic text in this regard was the New Testament: its sections are verses, so they are often very short.

In order to fix the overlapping issue, I wrote a little bit of JavaScript to hide any tag which would be placed in the same position as a previous tag. I also added a function to recalculate this if the window is resized. My main function is fairly simple:

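// Hide every milestone label that sits at the same vertical offset as the last one kept.
function killOverlap() {
    // Declared locally so it does not leak into the global scope.
    var lastOffset = 0;
    $(".mstonecustom").each(function () {
        if (this.offsetTop === lastOffset) {
            // Same position as the previous label: switch to the hidden class.
            this.className = "mstonen2";
        } else {
            lastOffset = this.offsetTop;
        }
    });
}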

I also added a function which highlights a section when you hover over its milestone label along the side of the text. This seems useful to me, as it is often helpful to know where a section starts and ends. This was a slightly more complex problem. I had to alter the citequery3.pl script to add a span tag and some ids in order to get the JavaScript to work. The JavaScript itself was then fairly simple.

For it to work, though, you have to alter the citequery3.pl script with this:

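# Build an element id from the last two numeric fields of the cite point.
my $spanid = $citepoints{$offsets[$offset]};
$spanid =~ s/.*\.([0-9]+)\.([0-9]+)$/a$1b$2/;
#...
# Insert the milestone label span right after the first opening tag.
$tempstring =~ s/(^<[^>]+>)/$1<span class="mstonecustom" id="$spanid">$citepoints{$offsets[$offset]}<\/span>/;
#... {
# Take the label span back out again.
$tempstring =~ s/<span class="mstonecustom" id="$spanid">$citepoints{$offsets[$offset]}<\/span>//;}

# Wrap the section text; the inner quotes are escaped so the string stays valid Perl.
$milesubstrings[$offset] = "<span class=" . $citeunits{$offsets[$offset]} . " id=\"text\">" . $tempstring . "<\/span>";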
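
To see what the $spanid substitution produces, here is the same regular expression applied in JavaScript. This is only an illustration: `spanIdFor` is a hypothetical helper, and the cite point strings are made-up examples rather than values from a real load.

```javascript
// JavaScript equivalent of the Perl substitution
//   s/.*\.([0-9]+)\.([0-9]+)$/a$1b$2/
// It keeps the last two dot-separated numbers and prefixes them with "a" and "b".
function spanIdFor(citePoint) {
  return citePoint.replace(/.*\.([0-9]+)\.([0-9]+)$/, "a$1b$2");
}

// Made-up cite points, for illustration only.
console.log(spanIdFor("1.5.3"));    // three fields: the last two are kept
console.log(spanIdFor("10.12.7"));  // multi-digit fields work the same way
console.log(spanIdFor("intro"));    // no match: the string is returned unchanged
```

For example, "1.5.3" becomes "a5b3", which gives each milestone label an id that a hover handler can look up.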

That's about it. It may come in useful again someday. For an example, take a look at this.

Towards PhiloLogic4

Mark Monday, November 02, 2009

Earlier this year I wrote a long discussion paper called "Renovating PhiloLogic" which provided an overview of the system architecture, a frank review of the strengths and (many) failings of the current implementation of the 3 series of PhiloLogic, and proposed a general design model for what would effectively be a complete reimplementation of the system, retaining only selected portions of the existing code base. While we are still discussing this, often in great detail, a few general objectives for any future renovation have emerged, including:

service oriented architecture;
release of new system in perl module libraries;
multiple database query support, and,
options for advanced or extended indexing models.

I will be putting together a public version of this discussion draft in the near future and will blog it when I have something ready.

Before sallying forth to start working on a PhiloLogic4, there are a number of preliminary steps that Richard and I agree are required in order to 1) support the existing PhiloLogic3 series, and 2) clear the existing (messy) code base of some of the most egregious sections of the system, most notably the loader. Some of these are simply housekeeping and updates, some are patches and bug fixes, and others are clean-ups which should streamline the current system and help in any redevelopment.

We will start by retasking one of our current machines, a 32 bit OS-X installation, to be the primary PhiloLogic development machine. We will also get the Linux branch on a 32 bit Linux machine (flavor to be determined). There is a known 64 bit installation problem which we will address at the end of this initial process. When we reach the right step, we will install it all on 64 bit machines and fix it then, hopefully with much less effort on a streamlined version, while releasing upgraded 32 bit versions on the way. The other element for our consideration is the degree to which we can merge the OS-X and Linux branches of the system. Right now, we have two completely distinct branches. It would be much better to have one, which we think may be accomplished in a couple of different ways.

We are currently thinking of 4 distinct steps, which should each result in new maintenance releases of PhiloLogic3.

Step One

Apply the most recent OS-X Leopard patch kit to both the OS-X and Linux branches as required and feasible. This is the patch kit that Richard and I assembled for the migration to our new servers and has some nifty little extensions. We will also be updating the PhiloLogic code release site (Google Code) and retooling the new PhiloLogic site, which will then be referred from the existing location (philologic.uchicago.edu). Maintenance release when done. [MVO]

Step Two

The PhiloLogic loader currently uses a GNU Makefile scheme to load databases. This made good sense many years ago, when loads could take many hours (or days), but is probably no longer needed. There are also many places where we use various utilities (sed, gawk, gzip, etc.) which add complications and make the entire scheme more brittle. Our current thinking is to fold all of the Makefile functions into a revised version of philoload, but we may determine a better way to proceed once we get into it. We're planning a maintenance release of this when done. [MVO]

Step Three

The current PhiloLogic loader performs a number of C compiles, many of which are no longer needed. For example, the system still compiles the search2 binaries. These were left in PhiloLogic3 for backwards compatibility. We need to keep the ability to generate the correct pack and unpack libraries which are used by search3. Once we have cleared out all unnecessary C compiles, we will investigate a couple of known bugs in search3 and attempt to resolve them. Again, once done, we would do a maintenance release. [RW and MVO]

Step Four

As noted above, some users have reported a 64 bit compile problem on either installation or load. Once we have the loader streamlined, eliminating as many of the old C compiles as possible, we will investigate this problem. We're hoping that this will be easily remedied and, even better, could be resolved in a combined release which would merge the current OS-X and Linux branches. This would be the terminal release of the PhiloLogic3 series. Any future releases would be only for bug fixes.

We are hoping that these steps will result in a stable terminal release of the PhiloLogic3 series, which will be easier to install and use. It will also result in significant streamlining, which will help in any future PhiloLogic renovation or a new PhiloLogic4 series.

This is an initial plan, so please do post your comments, suggestions, and complaints.

Developed by ARTFL