<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8901065416749663157</id><updated>2011-09-27T21:38:23.954-05:00</updated><category term='nlp'/><category term='Archives Parlementaires'/><category term='encyclopédie'/><category term='perseus'/><category term='Topic modeling'/><category term='sLDA'/><category term='software'/><category term='development'/><category term='nosql'/><category term='similarity'/><category term='alignment'/><category term='philologic'/><category term='vsm'/><category term='architecture'/><category term='philoline'/><category term='LDA'/><category term='monk'/><title type='text'>ARTFL Project Research Blog</title><subtitle type='html'>The ARTFL Team discusses current research, recent experiments, and ongoing software development.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://artfl.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>48</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-8209074478221799351</id><published>2011-03-30T10:39:00.007-05:00</published><updated>2011-03-30T11:05:25.305-05:00</updated><title type='text'>shlax and ElementTree</title><content type='html'>I've just pushed a few commits to the central &lt;a href="https://github.com/rwhaling/libphilo/"&gt;philo4 repository&lt;/a&gt;;&lt;div&gt;mostly small bugfixes to the makefile and the parser, but I added a convenience method to the shlax XML parser.  &lt;br /&gt;&lt;br /&gt;As you may know, Python has a really nice XML library called &lt;a href="http://effbot.org/zone/element-index.htm"&gt;ElementTree&lt;/a&gt;, but it has a few quirks:&lt;/div&gt;&lt;div&gt;1) it uses standard, "fussy" XML parsers that choke on the slightest flaw, and&lt;/div&gt;&lt;div&gt;2) it has a formally correct but incomprehensible approach to namespaces that is exceedingly impractical for day-to-day TEI hacking.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this update, I've added a &lt;a href="https://github.com/rwhaling/libphilo/blob/master/python/philologic/shlaxtree.py"&gt;shlaxtree&lt;/a&gt; module to the philo4 distribution that hooks our fault-tolerant, namespace-agnostic XML parser up to ElementTree's XPath evaluator and serialization facilities.  It generally prefers the 1.3 version of ElementTree, which is standard in Python 2.7 and is a simple install in 2.6 and 2.5.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Basically, the function philologic.shlaxtree.parse() takes a file object and returns the root node of the XML document in the file, assuming it finds one.  
You can use this to make a simple bibliographic extractor like &lt;a href="https://github.com/rwhaling/libphilo/blob/master/python/examples/elements.py"&gt;so&lt;/a&gt;: &lt;/div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;#!/usr/bin/env python&lt;br /&gt;import philologic.shlaxtree as st&lt;br /&gt;import sys&lt;br /&gt;import codecs&lt;br /&gt;&lt;br /&gt;for filename in sys.argv[1:]:&lt;br /&gt;    file = codecs.open(filename,"r","utf-8")&lt;br /&gt;    root = st.parse(file)&lt;br /&gt;    header = root.find("teiHeader")&lt;br /&gt;    print st.et.tostring(header)&lt;br /&gt;    print header.findtext(".//titleStmt/title")&lt;br /&gt;    print header.findtext(".//titleStmt/author")&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;Not bad for 10 lines, no?  What's really cool is that you can modify trees, nodes, and fragments before writing them out, with neat recursive functions and what not.  I've been using it for converting old SGML dictionaries to TEI--once you get the hang of it, it's much easier than regular expressions, and much easier to maintain and modify as well.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-8209074478221799351?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2011/03/shlax-and-elementtree.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8209074478221799351'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8209074478221799351'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2011/03/shlax-and-elementtree.html' title='shlax and 
ElementTree'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-2067878686084267419</id><published>2010-08-18T10:44:00.006-05:00</published><updated>2010-08-18T20:40:18.831-05:00</updated><title type='text'>shlax: a shallow, lazy XML parser in python</title><content type='html'>&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;Recently, I stumbled upon a paper from the dawn age of XML:&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman', serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:Times;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;"REX: XML Shallow Parsing with Regular Expressions", Robert D. 
Cameron&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:Times;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:Georgia, serif;"&gt;&lt;a href="http://www.cs.sfu.ca/~cameron/REX.html"&gt;http://www.cs.sfu.ca/~cameron/REX.html&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;It describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression.  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;We've all written short regexes to find some particular feature in an XML document, but we've also all seen those fail because of oddities of whitespace, quoting, linebreaks, etc., that are perfectly legal, but hard to account for in a short, line-by-line regular expression.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Standard XML parsers, like &lt;a href="http://expat.sourceforge.net/"&gt;expat&lt;/a&gt;, are fabulous, well maintained, and efficient.  However, they have a common achilles heel: the &lt;a href="http://www.w3.org/TR/2008/REC-xml-20081126/#proc-types"&gt;XML standard&lt;/a&gt;'s insistence that XML processors "MUST" report a fatal error if a document contains unbalanced tags.  
For working with HTML- or SGML-based documents, this is disastrous!&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;In contrast, Cameron's regex-based parser is extremely fault-tolerant--it extracts as much structure from the document as possible, and reports the rest as plain text.  Further, it supports "round-tripping": the ability to exactly re-generate a document from parser output, which standard parsers typically lack.  As a corollary of this property, it becomes possible to report absolute byte offsets, which is a "killer feature" for the purposes of indexing.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Because of all these benefits, I've opted to translate his source code from JavaScript to Python. I call my modified implementation "shlax" [pronounced like "shellacs", sort of], a shallow, lazy XML parser.  "Shallow" meaning that it doesn't check for well-formedness, and simply reports tokens, offsets, and attributes as best it can.  "Lazy" meaning that it iterates over the input, and yields one object at a time--so you don't have to write 8 asynchronous event handlers to use it, as in a typical SAX-style parser.  This is often called a "pull" parser, but "shpux" doesn't sound as good, does it?&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;If you're interested, you can look at the source at the &lt;a href="http://github.com/rwhaling/libphilo/blob/master/python/philologic/shlax.py"&gt;libphilo github repo&lt;/a&gt;.  
The regular expression itself is built up over the course of about 30 expressions, to allow for maintainability and readability.  I've made some further modifications to Cameron's code to fit our typical workflow.  I've buffered the text input, which allows us to iterate over a file-handle, rather than a string--this saves vast amounts of memory for processing large XML files, in particular.  And I return "node" objects, rather than strings, that contain several useful items of information: &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;the original text content of the node&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;the "type" of the node: text, StartTag, EndTag, or Markup [for DTDs, comments, etc.]&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;any attributes the node has&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;the absolute byte offset in the string or file&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;You don't need anything more than that to power PhiloLogic.  If you'd like to see an example of how to use it, take a look at my &lt;a href="http://github.com/rwhaling/libphilo/blob/master/python/philologic/DirtyParser.py"&gt;DirtyParser&lt;/a&gt; class, which takes as input a set of xpaths to recognize for containers and metadata, and outputs a set of objects suitable for the index builder I wrote about last time.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Oh, and about performance: shlax is noticeably slower than Mark's perl loader.  
I've tried to mitigate that in a variety of ways, but in general, Python's regex engine is not as fast as Perl's.  On the other hand, I've recently had a lot of success with running a load in parallel on an 8-core machine, which I'll write about when the code settles.  That said, if efficiency is a concern, our best option would be to use well-formed XML with a standard parser.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;So, my major development push now is to refactor the loader into a framework that can handle multiple parser backends, flexible metadata recognizers, and multiple simultaneous parser processes.  I'll be posting about that as soon as it's ready.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-2067878686084267419?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/08/shlax-shallow-lazy-xml-parser-in-python.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2067878686084267419'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2067878686084267419'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/08/shlax-shallow-lazy-xml-parser-in-python.html' title='shlax: a shallow, lazy XML parser in python'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-7362140742578767310</id><published>2010-05-19T10:47:00.008-05:00</published><updated>2010-05-19T12:34:56.167-05:00</updated><title type='text'>A Unified Index Construction Library</title><content type='html'>I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead, I worked backwards from the decompression routines I wrote last month, to write a new index construction library from scratch.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1)simple, efficient operation&lt;/div&gt;&lt;div&gt;2)flexible enough for various index formats&lt;/div&gt;&lt;div&gt;3)easy to bind to other languages.&lt;/div&gt;&lt;div&gt;4)fully compatible with 3-series PhiloLogic&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The main loop is below.  It's pretty clean.  
All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new',serif;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;div&gt;&lt;pre&gt;while(1) {&lt;br /&gt; // as long as we read lines from standard input.&lt;br /&gt; if (fgets(line,511,stdin) == NULL) {&lt;br /&gt;   hitbuffer_finish(hb);&lt;br /&gt;   break;&lt;br /&gt; }&lt;br /&gt; // scan for hits in standard Philo3 format.&lt;br /&gt; state = sscanf(line,&lt;br /&gt;           "%s %d %d %d %d %d %d %d %d %s\n",&lt;br /&gt;           word, &amp;amp;hit[0],...);&lt;br /&gt;&lt;br /&gt; if (state == 10) {&lt;br /&gt;   // if we read a valid hit&lt;br /&gt;   if ((strcmp(word,hb-&gt;word))) {&lt;br /&gt;     //if we have a new word...&lt;br /&gt;     hitbuffer_finish(hb); // write out the current buffer.&lt;br /&gt;     hitbuffer_init(hb, word); // and reinitialize&lt;br /&gt;     uniq_words += 1LLU; //LLU for a 64-bit unsigned int.&lt;br /&gt;   }&lt;br /&gt;   hitbuffer_inc(hb, hit); //add the hit to whichever word you're on.&lt;br /&gt;   totalhits += 1LLU;&lt;br /&gt; }&lt;br /&gt; else {&lt;br /&gt;   fprintf(stderr, "Couldn't understand hit.\n");&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style=";font-family:Georgia,serif;font-size:16;"  &gt;The code is publicly available on github, but I'm having some problems with their web interface.  
I'll post a link once it's sorted out.&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-7362140742578767310?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/05/unified-index-construction-library.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7362140742578767310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7362140742578767310'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/05/unified-index-construction-library.html' title='A Unified Index Construction Library'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-2148160125472244239</id><published>2010-05-06T11:06:00.002-05:00</published><updated>2010-05-06T11:24:32.228-05:00</updated><title type='text'>Vector Processing for OHCO</title><content type='html'>I've posted an expanded version of my &lt;a href="http://docs.google.com/present/view?id=dhdrzp66_52dsj2zmgg"&gt;CI Days talk&lt;/a&gt; on Google docs.  I'd recommend looking at the speaker notes (click "actions" on the bottom left) since I won't be narrating it in person.&lt;br /&gt;&lt;br /&gt;The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does.  
This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can.  It's also the intellectual framework that I'm using to plan new features, like search on line and meter position, metadata, joins, etc.  Hopefully, I can get someone who's better at math than I am to help me tighten up the formalities.&lt;br /&gt;&lt;br /&gt;Basically, I refer to the infamous OHCO thesis as a useful axiom for translating the features of a text into a set of numerical objects, and then compare the characteristics of this representation to XML or Relational approaches.  I'd love to know how interesting/useful/comprehensible others find the presentation, or the concept.  What needs more explanation?  What gets tedious?&lt;br /&gt;&lt;br /&gt;If you look at the speaker notes, you can see me derive a claim that PhiloLogic runs 866 times faster than a relational database for word search.  Math is fun!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-2148160125472244239?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/05/vector-processing-for-ohco.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2148160125472244239'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2148160125472244239'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/05/vector-processing-for-ohco.html' title='Vector Processing for OHCO'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-5908921046720478854</id><published>2010-04-13T15:27:00.007-05:00</published><updated>2010-04-13T16:18:44.503-05:00</updated><title type='text'>PhiloLogic proto-binding for Python</title><content type='html'>&lt;p&gt;In an earlier post, I mentioned that I'd try to call the philologic C routines via ctypes, a Python Foreign Function Interface library.  I did, and it worked awesomely well!  Ctypes lets you call C functions from Python without writing any glue at all in some cases, giving you access to high-performance C routines in a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;First, I had to compile the search engine as a shared library, rather than an executable:&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;gcc -dynamiclib -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o  -lgdbm -o libphilo.dylib&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;All that refactoring certainly paid off.  
The search4 executable will now happily link against the shared library with no modification,  and so can any other program that wants high-speed text object search:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;#!/usr/bin/python&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;import sys,os&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;from ctypes import *&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# First, we need to get the C standard library loaded in&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# so that we can pass python's input on to the search engine.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;stdlib=cdll.LoadLibrary("libc.dylib")&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;stdin = stdlib.fdopen(sys.stdin.fileno(),"r")&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Honestly, that's an architectural error.  &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# I'd prefer to pass in strings, not a file handle&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Now load in philologic from a shared library&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;libphilo = cdll.LoadLibrary("./libphilo.dylib")&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Give it a path to the database.  
The C routines parse the db definitions.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;db = libphilo.init_dbh_folder("/var/lib/philologic/databases/mvotest5/")&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# now initialize a new search object, with some reasonable defaults.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;s = libphilo.new_search(db,"phrase",None,1,100000,0,None)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Read words from standard input.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;libphilo.process_input(s,stdin)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Then dump the results to standard output.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;libphilo.search_pass(s,0)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;# Done.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;That was pretty easy, right?  Notice that there weren't any boilerplate classes.  I could hold pointers to arbitrary data in regular variables, and pass them directly into the C subroutines as void pointers.  Not safe, but very, very convenient.&lt;br /&gt;&lt;br /&gt;Of course, this opens us up for quite a bit more work: the C library really needs a lot more ways to get data in and out than a pair of input/output file descriptors, I would say.  In all likelihood, after some more experiments, we'll eventually settle on a set of standard interfaces, and generate lower-level bindings with SWIG, which would allow us to call philo natively from Perl or PHP or Ruby or Java or LISP or Lua or...anything, really.&lt;/p&gt;&lt;p&gt;Ctypes still has some advantages over automatically-generated wrappers, however.  
In particular, it lets you pass Python functions back into C, allowing us to write search operators in Python, rather than C--for example, a metadata join, or a custom optimizer for part-of-speech searching.  Neat!&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-5908921046720478854?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/04/philologic-proto-binding-for-python.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5908921046720478854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5908921046720478854'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/04/philologic-proto-binding-for-python.html' title='PhiloLogic proto-binding for Python'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-8577995859616695028</id><published>2010-04-08T15:14:00.004-05:00</published><updated>2010-04-08T15:41:41.864-05:00</updated><title type='text'>Unix Daemon Recipes</title><content type='html'>I was digging through some &lt;a href="http://www.faqs.org/faqs/unix-faq/programmer/faq/"&gt;older UNIX folkways&lt;/a&gt; when I stumbled upon an answer to a long-standing PhiloLogic design question:&lt;br /&gt;&lt;br /&gt;How do I create a long-running worker process that will neither:&lt;br /&gt;&lt;br /&gt;1) terminate when its parent terminates, such as a terminal session or a CGI script, or&lt;br 
/&gt;2) create the dreaded "zombie" processes that clog process tables and eventually crash the system.&lt;br /&gt;&lt;br /&gt;As it turns out, this is the same basic problem as any UNIX daemon program; this just happens to be one designed to, eventually, terminate.  PhiloLogic needs processes of this nature at various places: most prominently, to allow the CGI interface to return preliminary results.&lt;br /&gt;&lt;br /&gt;Currently, we use a lightweight Perl daemon process, called nserver.pl, to accept search requests from the CGI scripts, invoke the search engine, and then clean up the process after it terminates.  Effective, but there's a simpler way, with a tricky UNIX idiom.&lt;br /&gt;&lt;br /&gt;First, fork().  This allows you to return control to the terminal or CGI script.  If you aren't going to exit immediately you should ignore SIGCHLD as well, so that you don't get interrupted later.&lt;br /&gt;&lt;br /&gt;Second, have the child process call setsid() to gain a new session, and thus detach from the parent.  This prevents terminal hangups from killing the child process.&lt;br /&gt;&lt;br /&gt;Third, call fork() again, then immediately exit the (original) child.  
The new "grandchild" process is now an "orphan", and detached from a terminal, so it will run to completion, and then be reaped by the system, so you can do whatever long-term analytics you like.&lt;br /&gt;&lt;br /&gt;A command line example could go like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;#!/usr/bin/perl&lt;br /&gt;use POSIX qw(setsid);&lt;br /&gt;&lt;br /&gt;my $word = $ARGV[0] or die "Usage:searchwork.pl word outfile\n";&lt;br /&gt;my $outfile = $ARGV[1] or die "Usage:searchwork.pl word outfile\n";&lt;br /&gt;&lt;br /&gt;print STDERR "starting worker process.\n";&lt;br /&gt;&amp;daemonize;&lt;br /&gt;&lt;br /&gt;open(SEARCH, "| search4 --ascii --limit 1000000 /var/lib/philologic/somedb &gt; $outfile");&lt;br /&gt;&lt;br /&gt;print SEARCH "$word\n";&lt;br /&gt;close(SEARCH);&lt;br /&gt;&lt;br /&gt;exit;&lt;br /&gt;&lt;br /&gt;sub daemonize {&lt;br /&gt;    open STDIN, '/dev/null'   or die "Can't read /dev/null: $!";&lt;br /&gt;    open STDOUT, '&gt;&gt;/dev/null' or die "Can't write to /dev/null: $!";&lt;br /&gt;    open STDERR, '&gt;&gt;/dev/null' or die "Can't write to /dev/null: $!";&lt;br /&gt;    defined(my $childpid = fork)   or die "Can't fork: $!";&lt;br /&gt;    if ($childpid) {&lt;br /&gt;        print STDERR "[parent process exiting]\n";&lt;br /&gt;        exit;&lt;br /&gt;    }&lt;br /&gt;    setsid                    or die "Can't start a new session: $!";&lt;br /&gt;    print STDERR "Child detached from terminal\n";&lt;br /&gt;    defined(my $grandchildpid = fork) or die "Can't fork: $!";&lt;br /&gt;    if ($grandchildpid) {&lt;br /&gt;        print STDERR "[child process exiting]\n";&lt;br /&gt;        exit;&lt;br /&gt;    }&lt;br /&gt;    umask 0;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The benefit is that a similar &amp;daemonize subroutine could entirely replace nserver, and thus vastly simplify the installation process.  
There's clearly a lot more that could be done with routing and control, of course, but this is an exciting proof of concept, particularly for UNIX geeks like myself.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-8577995859616695028?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/04/unix-daemon-recipes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8577995859616695028'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8577995859616695028'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/04/unix-daemon-recipes.html' title='Unix Daemon Recipes'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-2269005523121717741</id><published>2010-03-31T11:23:00.007-05:00</published><updated>2010-03-31T12:27:21.753-05:00</updated><title type='text'>The Joy of Refactoring Legacy Code</title><content type='html'>I've spent the last few weeks rehabbing PhiloLogic's low-level search engine, and I thought I'd write up the process a bit.&lt;br /&gt;&lt;br /&gt;PhiloLogic is commonly known as being a rather large Perl/CGI project, but all of the actual database interactions are done by our custom search engine, which is in highly optimized C. 
The flow of control in a typical Philo install looks something like this:&lt;br /&gt;&lt;br /&gt;--CGI script &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;search3t&lt;/span&gt; accepts user requests, and parses them.&lt;br /&gt;--CGI passes requests off to a long-running Perl daemon process, called &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;nserver&lt;/span&gt;.&lt;br /&gt;--nserver spawns a long-running worker process &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;search3&lt;/span&gt; to evaluate the request&lt;br /&gt;--the worker process loads in a compiled decompression module, at runtime, specific to the database.&lt;br /&gt;--&lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;search3t&lt;/span&gt; watches the results of the worker process&lt;br /&gt;--when the worker is finished, or outputs more than 50 results, &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;search3t&lt;/span&gt; passes them off to a report generator.&lt;br /&gt;&lt;br /&gt;This architecture is extremely efficient, but as PhiloLogic has accrued features over the years it has started to grow less flexible, and parts of the code base have started to decay.  The command line arguments to &lt;span class="Apple-style-span" style="font-family: 'courier new';"&gt;search3&lt;/span&gt;, in particular, are arcane and undocumented.  
A typical example:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;export SYSTEM_DIR=/path/to/db&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;export LD_LIBRARY_PATH=/path/to/db/specific/decompression/lib/&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;search3 -P:binary -E:L=1000000 -S:phrase  -E:L=1000000  -C:1 /tmp/corpus.10866 &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The internals are quite a bit scarier.  Arguments are processed haphazardly in bizarre corners of the code, and many paths and filenames are hard-coded in.  And terrifying unsafe type-casts abound.  Casting a structure containing an array of ints into an array of ints?  Oh my.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;I've long been advocating a much, much simpler interface to the search engine.  The holy grail would be a single-point-of-entry that could be installed as a C library, and called from any scripting language with appropriate interfacing code.  There are several obstacles, particularly with respect to caching and memory management, but the main one is organizational.&lt;br /&gt;&lt;br /&gt;How do you take a 15-year-old C executable, in some state of disrepair, and reconfigure the "good parts" into a modern C library?  Slowly and carefully.  Modern debugging tools like &lt;a href="http://valgrind.org/"&gt;Valgrind&lt;/a&gt; help, as does the collective C wisdom preserved by Google.  A particular issue is imperative vs. object-oriented or functional style.  
Older C programs tend to use a few global variables to represent whatever global data structure they work upon--in effect, what modern OOP practices would call a &lt;a href="http://en.wikipedia.org/wiki/Singleton_pattern"&gt;"singleton" object&lt;/a&gt;, but in practice a real headache.&lt;br /&gt;&lt;br /&gt;For example, PhiloLogic typically chooses to represent the database being searched as a global variable, often set in the OS's environment.  But what if you want to search two databases at once?  What if you don't have a UNIX system?  An object-oriented representation of the large-scale constructs of a program allows the code to go above and beyond its original purpose.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Or maybe I'm just a neat freak--regardless, the [simplified] top-level architecture of 'search3.999' {an asymptotic approach to an as-yet unannounced product} should show the point of it all:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{&lt;br /&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    static struct option long_options[] = {&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;        &lt;/span&gt;{"ascii", no_argument, 0, 'a'},&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{"corpussize", required_argument, 0, 'c'},&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; 
       &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{"corpusfile", required_argument, 0, 'f'},&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{"debug", required_argument, 0, 'd'},&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{"limit", required_argument, 0, 'l'},&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;{0,0,0,0}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;    &lt;/span&gt;};&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;//&lt;/span&gt;...process options with GNU getopt_long...&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" white-space: pre;font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;    &lt;/span&gt;db = 
init_dbh_folder(dbname);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;    &lt;/span&gt;if (!method_set) {&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;        &lt;/span&gt;strncpy(method,"phrase",256);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;    &lt;/span&gt;s = new_search(db,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;                   method, &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;                   ascii_set,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;                   corpussize,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;                   corpusfile,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-family:'courier new';"&gt;                   debug,&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;                   limit);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;    &lt;/span&gt;status = process_input ( s, stdin );&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" white-space: pre;font-family:'courier new';"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre;"&gt;//&lt;/span&gt;...print output...&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;//...free memory...&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-family:'courier new';"&gt;&lt;span class="Apple-style-span" style="white-space: pre; "&gt;    &lt;/span&gt;return 0;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;An equivalent command-line call would be:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;search3999 --ascii --limit 1000000 --corpussize 1 --corpusfile /tmp/corpus.10866 dbname search_method&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;which is definitely an improvement.  It can also print a help message.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Beyond organizational issues, I also ended up rewriting large portions of the decompression routines.  The database can now fully configure itself at runtime, which adds about 4 ms to each request, but with the benefit that database builds no longer require compilation.  TODO: The overhead can be eliminated if we store the database parameters as integers, rather than as formatted text files.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think at this point the codebase is clean enough to try hooking up to Python, via &lt;a href="http://docs.python.org/library/ctypes.html"&gt;ctypes&lt;/a&gt;, and then experiment with other scripting language bindings.  
Once I clean up the makefiles I'll put it up on our repository.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-2269005523121717741?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/03/joy-of-refactoring-legacy-code.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2269005523121717741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2269005523121717741'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/03/joy-of-refactoring-legacy-code.html' title='The Joy of Refactoring Legacy Code'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-557705615386583631</id><published>2010-03-29T23:43:00.013-05:00</published><updated>2010-04-01T06:00:47.203-05:00</updated><title type='text'>Reclassifying the Encyclopédie</title><content type='html'>Diderot and D'Alembert's &lt;em&gt;Encyclopédie&lt;/em&gt; might almost have been designed as a document classification exercise. For starters, it comes complete with a branching, hierarchical ontology of classes of knowledge. Out of the 77,000+ articles contained therein, 60,000 are classified according to this system, while 17,000 were left unclassified, providing a ready-made training set and evaluation set, respectively. 
To make it challenging, within the classified articles, the editors have chosen to apply some classifications with obfuscatory intent or, at least, result, rendering topic boundaries somewhat fuzzy. The categories span the entire breadth of human knowledge, and the articles range from brief renvois ("see XXX") to protracted philosophical treatises. In short, it has everything to make a machine learner happy, and miserable, in one package.&lt;div&gt;&lt;br /&gt;&lt;br /&gt;At ARTFL we've been mining this rich vein for some time now. We presented &lt;a href="http://docs.google.com/present/view?id=dfddkspw_65dcrfj2"&gt;Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie&lt;/a&gt; at &lt;a href="http://digitalhumanities.org/dh2007/"&gt;Digital Humanities 2007&lt;/a&gt;, detailing our initial attempts at classification and the critical interpretation of machine learning results. We followed up at &lt;a href="http://www.ekl.oulu.fi/dh2008/"&gt;DH 2008&lt;/a&gt; with &lt;a href="http://docs.google.com/present/view?skipauth=true&amp;amp;id=dfddkspw_205fk8299hg"&gt;Twisted Roads and Hidden Paths&lt;/a&gt;, in which we expanded our toolkit to include k-nearest-neighbor vector space classifications, and a meta-classifying decision tree. Where we had previously achieved around 72% accuracy in categorizing medium-length and long articles using Naive Bayes alone, by combining multiple classifiers in this way we were able to get similar rates of accuracy over the entire encyclopedia, including the very short articles, which are quite difficult to classify due to their dearth of distinctive content. 
This post describes an effort to productionize the results of that latter paper, in order to insert our new, machine-generated classifications into our public edition of the &lt;em&gt;Encyclopédie&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;If you're impatient, jump ahead to the &lt;a href="http://encyclopedie.uchicago.edu/node/175"&gt;ARTFL Encyclopédie search form&lt;/a&gt; and start digging. The new machine-generated classifications can be searched just like any other item of PhiloLogic metadata, allowing very sophisticated queries to be constructed.&lt;br /&gt;&lt;br /&gt;For instance, we can ask questions like "Are there any articles originally classified under Géographie that are reclassified as Philosophie?" In fact there are &lt;a href="http://artflx.uchicago.edu/cgi-bin/philologic/search3t?dbname=encyclopedie0310reclass&amp;amp;word=&amp;amp;dgdivhead=&amp;amp;dgdivocauthor=&amp;amp;ExcludeDiderot3=on&amp;amp;dgdivocplacename=&amp;amp;dgdivocsalutation=G%C3%A9ographie&amp;amp;dgdivocclassification=&amp;amp;dgdivocdateline=philosophie&amp;amp;dgdivocpartofspeech=&amp;amp;dgdivtype=&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;PROXY=or+fewer&amp;amp;OUTPUT=conc&amp;amp;POLESPAN=5&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500"&gt;several&lt;/a&gt;, and it's interesting to peruse them and deduce why their original classifications and generated classifications fall as they do. The editors followed a policy of not including biographies in the &lt;em&gt;Encyclopédie&lt;/em&gt;, but evidently could not restrain themselves in many cases. Instead of creating a biography class, however, they categorized such entries under the headword corresponding to the region of the notable person's birth, and assigned it the class &lt;em&gt;Géographie&lt;/em&gt;. 
Thus the article JOPOLI contains a discussion of the philosopher Augustin Nyphus, born there in 1472, and hence is classified by our machine learner under Philosophie.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Our goals in re-classifying the &lt;i&gt;Encyclopédie&lt;/i&gt; are several: to provide better access for our users by adding class descriptions to previously unclassified articles; to identify articles that are re-classified differently from their original classes, allowing users to find them by their generated classes, which are often more indicative of the overall content of an article; and to identify interesting patterns in the authors' uses of their classification system, again primarily by seeing what classes tend to be re-classified differently.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;We initially undertook to examine a wide range of classifiers including Naive Bayes, SVM and KNN vector space, with a range of parameters for word count normalization and other settings. After examining hundreds of such runs, we found two that, combined, provided the greatest accuracy in correctly re-classifying articles to their previous classifications: Naive Bayes, using simple word counts, and KNN, using 50 neighbors and tf-idf values for the feature vectors.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Each classifier alone was right about 64% of the time -- but together, at least one of them was right 77% of the time. If we could only decide which one to trust when they differed on a given classification decision, we could reap a substantial gain in accuracy on the previously classified articles, and presumably get more useful classifications of the unclassified articles. We must note that the class labels for each article, which appear at the beginning of the text, were retained for these runs, giving our classifiers an unfair advantage in re-classifying the articles that had such labels present. 
The class labels get no more weight, however, than any other words in the article. We retained them because our primary objective is to accurately classify the unclassified articles, which do not contain these labels, but may well contain words from these labels in other contexts.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;It turned out that KNN was most accurate on smaller articles and smaller classes, whereas Naive Bayes worked best on longer articles that belonged to bigger classes, which gave us something to go on when deciding which classifier got to make the call when they were at odds with each other. By feeding the article and class meta-data into a simple decision tree classifier, along with the results of each classifier, we were able to learn some rules for deciding which classifier to prefer for a given decision where they disagreed on the class assignment. See the decision tree in our &lt;a href="http://www.ekl.oulu.fi/dh2008/"&gt;DH 2008&lt;/a&gt; &lt;a href="http://docs.google.com/present/view?skipauth=true&amp;amp;id=dfddkspw_205fk8299hg"&gt;paper&lt;/a&gt; for the details.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Of course, we couldn't make the perfect decision every time, but we were close enough to increase our accuracy on previously classified articles to 73%, 9 percentage points higher than the average of the individual classifiers. By using a meta-classifier to learn the relative strengths and weaknesses of the sub-classifiers, we were able to better exploit them to get more interesting data for our users, and peel back another layer of the great Encyclopédie. 
Additionally, we learned characteristics of the classifiers themselves that will enable us to target their applications more precisely in the future.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;P.S.: For all you Diderot-philes, here are the stats on the original and machine-learned classes of the articles he authored:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://spreadsheets.google.com/pub?key=tk7J_NqaNUxIPcNychj3ugw&amp;amp;output=html"&gt;Diderot articles, original classifications&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://spreadsheets.google.com/pub?key=tg8oKSs_jwx2vPKtluVpWcg&amp;amp;output=html"&gt;Diderot articles, generated classifications&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;If any of this piques your interest, please do get in touch with us (artfl dawt project at gee-male dawt com should work). We'd love to discuss our work and possible future directions. Or come check us out in Virginia next week!&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-557705615386583631?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/03/reclassifying-encyclopedie.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/557705615386583631'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/557705615386583631'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/03/reclassifying-encyclopedie.html' title='Reclassifying the Encyclopédie'/><author><name>Russ</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3208269732125593414</id><published>2010-03-08T17:16:00.004-06:00</published><updated>2010-03-31T17:53:41.252-05:00</updated><title type='text'>using the JSON perl mod</title><content type='html'>I just thought I'd make a quick blog post on how to use the JSON Perl module. As for why we would use JSON when we have XML, I'll leave that to Russ or Richard; to make a long story short, it gives us easier object handling for the projected JavaScript-driven DVLF.&lt;br /&gt;The Perl JSON module is actually very easy and pleasant to use: it will convert your Perl data structure into JSON without breaking a sweat.&lt;br /&gt;Here's a quick example which I hope will be useful:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;#!/usr/bin/perl&lt;/blockquote&gt;&lt;blockquote&gt;use strict;&lt;/blockquote&gt;&lt;blockquote&gt;use warnings;&lt;/blockquote&gt;&lt;blockquote&gt;use JSON;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote&gt;my @list_of_files = @ARGV; &amp;nbsp; &amp;nbsp;# assumption: the files to read are given on the command line&lt;/blockquote&gt;&lt;blockquote&gt;my %hash;&lt;/blockquote&gt;&lt;blockquote&gt;foreach my $file (@list_of_files) &amp;nbsp;{&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;open(FILE,"$file") or die "Can't open $file: $!";&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;my @array;&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;while (&amp;lt;FILE&amp;gt;) {&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;push(@array,$_);&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;close(FILE);&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;$hash{$file} =&amp;nbsp;[@array]; &amp;nbsp; &amp;nbsp;# store array reference in 
hash&lt;/blockquote&gt;&lt;blockquote&gt;}&lt;/blockquote&gt;&lt;blockquote&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;blockquote&gt;my $obj = \%hash;&amp;nbsp;&lt;/blockquote&gt;&lt;blockquote&gt;my $json = new JSON;&lt;/blockquote&gt;&lt;blockquote&gt;my $js = $json-&amp;gt;pretty-&amp;gt;encode($obj);&amp;nbsp;# convert Perl data structure to a pretty-printed JSON representation&lt;/blockquote&gt;&lt;blockquote&gt;my $output = "$js\n\n";&lt;/blockquote&gt;&lt;blockquote&gt;print $output;&lt;/blockquote&gt;&lt;br /&gt;And done!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3208269732125593414?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/03/using-json-perl-mod.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3208269732125593414'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3208269732125593414'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/03/using-json-perl-mod.html' title='using the JSON perl mod'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-700445055414277335</id><published>2010-02-01T09:07:00.007-06:00</published><updated>2010-02-03T16:45:34.107-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='development'/><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><category scheme='http://www.blogger.com/atom/ns#' 
term='perseus'/><title type='text'>Lemma Collocations on Perseus</title><content type='html'>&lt;span style="color: rgb(51, 51, 255);"&gt;See UPDATE at the end.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This post actually relates nicely to Mark's recent post. I have recently been working on lemmatized collocation tables for the Greek texts on Perseus. If you just want to see the results, skip to the end; the rest describes how I got there.&lt;br /&gt;&lt;br /&gt;It is not so difficult to look up the lemma for each word surrounding a certain search hit, as for these texts the structure to do so is already in place and the information is stored in SQL tables. Efficiency in gathering this information is the main difficulty. Looking up the lemma in the tables now in place can take a couple of different SQL queries, each of which takes up a small chunk of time. For a few lookups, or even a few hundred, this is not too big of a problem. However, for a collocation table spanning five words on either side, we need at least 10 lookups in the databases per hit. The time it takes to do that adds up quite quickly.&lt;br /&gt;&lt;br /&gt;So, following a suggestion from Mark, I wrote a script and generated a file with a line for every word in the Perseus database. Basically, it takes each word id starting with 1 on up to 5.5 million something and looks up its lemma. This generated a 5.5 million line file with lines like this:&lt;br /&gt;&lt;br /&gt;2334550 δέ&lt;br /&gt;2334551 ὅς&lt;br /&gt;2334552 ἕκαστος&lt;br /&gt;2334553 ἵππος&lt;br /&gt;2334554 nolemma&lt;br /&gt;2334555 ὅς&lt;br /&gt;2334556 δέ&lt;br /&gt;2334557 πέτομαι&lt;br /&gt;2334558 κονίω&lt;br /&gt;2334559 πεδίον&lt;br /&gt;&lt;br /&gt;Now, looking up the words on either side of a hit is much simpler - all you need to know is the "word id" of the hit and you can look at those around it. 
The "nolemma" entries are primarily punctuation and such which were given word tags at some point.&lt;br /&gt;&lt;br /&gt;The size of this massive file however was another hindrance to efficient generation of collocation tables. Although we now know exactly what line we need to look at for the information we need, getting that line is still costly as it could require reading in a couple million lines of a file to get the one we need. After playing around with it a bit, command line grep searching seemed to be the fastest way to go, but even a grep search 10 times per hit adds up fast. So, I tried combining the searches into one massive egrep command to be read into a perl array. My searches looked something like:&lt;br /&gt;&lt;br /&gt;egrep "23345[456][0-9]" lemmafile&lt;br /&gt;&lt;br /&gt;Giving a window of 30 lines in the file starting at 2334540 and ending at 2334569. This limited the searches to one per hit instead of 10, but it still wasn't fast enough. So, I combined all of the searches like so:&lt;br /&gt;&lt;br /&gt;egrep "(2342[234]|33329[678]|...|829[567])[0-9]"&lt;br /&gt;&lt;br /&gt;(A bit of accounting for numbers ending in 0 is needed so that a window around 400 doesn't include things in the 490's, but this is not too difficult.)&lt;br /&gt;&lt;br /&gt;This looked nice and seemed to work until I tried running it on more hits. It was then coming up with such massive regular expressions to grep for that grep was complaining that they were too big. So, I broke them up into chunks of roughly 350 at a time. Fewer, and the time would go up due to the added number of grep searches; too many more, and grep would overflow again. 
I may not have hit on the exact time minimizing value, but it is close at least.&lt;br /&gt;&lt;br /&gt;Finally, here are some example searches:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://grade-devel.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekDev&amp;amp;word=logoi&amp;amp;ORTHMODE=LEM&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;title=&amp;amp;author=&amp;amp;date=&amp;amp;DFPERIOD=1&amp;amp;OUTPUT=PF&amp;amp;POLESPAN=5&amp;amp;SEARCHBY=lemma&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;genre=&amp;amp;publisher=&amp;amp;pubplace=&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;filesize=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgdivlang=&amp;amp;dgdivn=&amp;amp;dgdivid=&amp;amp;dgdivocauthor=&amp;amp;dgdivocdateline=&amp;amp;dgdivocsalutation=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype=&amp;amp;dgsubdivn=&amp;amp;dgsubdivid=&amp;amp;dgsubdivwho="&gt;Search for logoi.&lt;/a&gt;&lt;br /&gt;&lt;a href="http://grade-devel.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekDev&amp;amp;word=lemma:logos&amp;amp;ORTHMODE=LEM&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;title=&amp;amp;author=&amp;amp;date=&amp;amp;DFPERIOD=1&amp;amp;OUTPUT=PF&amp;amp;POLESPAN=5&amp;amp;SEARCHBY=lemma&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;genre=&amp;amp;publisher=&amp;amp;pubplace=&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;filesize=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgdivlang=&amp;amp;dgdivn=&amp;amp;dgdivid=&amp;amp;dgdivocauthor=&amp;amp;dgdivocdateline=&amp;amp;dgdivocsalutation=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype=&amp;amp;dgsubdivn=&amp;amp;dgsubdivid=&amp;amp;dgsubdivwho="&gt;Search for lemma:logos.&lt;/a&gt; (The time estimate is 
higher than actual load time).&lt;br /&gt;&lt;br /&gt;Or, here is the search form:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://grade-devel.uchicago.edu/philologic/PerseusGreekDev.whizbang.form.html#"&gt;Search Form.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Make sure that you choose Collocation Table and check the lemma button under the "Refined Search Results" Tab at the bottom.&lt;br /&gt;&lt;br /&gt;It can handle most searches, except for very high frequency words. If anyone has ideas on how to make it faster, it could perhaps enable us to get results for all searches. Though perhaps this is not possible without somehow creating and saving those results somewhere.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;UPDATE:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;After talking to Mark, I altered the way the data is read from the file and now things should be running faster. The reason that all this discussion and speed streamlining for lemmatized collocation tables is necessary is the fact that the texts on Perseus do not have the lemmas embedded in the text. As Mark noted, many of the other databases would allow for much simpler and faster generation of the same data due to the fact that they do have lemmas in the text. However, for the purposes of Perseus, lemmas needed to be separated from the texts to allow them to be more dynamically updated, changed and maintained.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;As for the speed, it should now be faster thanks to a handy function in Perl. I had investigated methods for reading a certain line of a file, since I happened to know exactly what lines I needed. However, finding none that did not read the whole contents of the file up to that line, I instead implemented the process described above. I overlooked SEEK. I dismissed it because it starts from a certain byte offset and not a certain line. 
Nevertheless, we can harness its power by simply padding each line with spaces to ensure every line in our file is the same byte length. With this pointer from Mark and some padding on the lines, knowing the line number and the number of bytes per line is enough to start reading from the exact location in the file that we desire.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-700445055414277335?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/02/lemma-collocations-on-perseus.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/700445055414277335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/700445055414277335'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/02/lemma-collocations-on-perseus.html' title='Lemma Collocations on Perseus'/><author><name>Kristin</name><uri>http://www.blogger.com/profile/16706344780694707122</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-69090438454455359</id><published>2010-01-20T11:14:00.009-06:00</published><updated>2011-07-29T11:05:33.032-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='monk'/><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><title type='text'>MONK Data Under PhiloLogic: 1</title><content type='html'>&lt;span style="font-weight: bold;"&gt;Introduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As I'm sure you all know, 
the MONK Project (&lt;a href="http://monkproject.org/" target="_blank"&gt;http://monkproject.org/&lt;/a&gt;), directed by Martin Mueller and John Unsworth, has generated a large collection of tagged data, some of which has been made public and some of which is limited to CIC or other institutions (&lt;a href="http://monkproject.org/downloads/" target="_blank"&gt;http://monkproject.org/downloads/&lt;/a&gt;).    Each word in this group of different collections is tagged for part of speech, lemma, and normalization.  Martin has documented the encoding scheme in great detail at &lt;a href="http://panini.northwestern.edu/mmueller/nupos.pdf" target="_blank"&gt;http://panini.northwestern.edu/mmueller/nupos.pdf&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The following is a &lt;span style="font-weight: bold;"&gt;long&lt;/span&gt; post describing in some detail one approach to integrating this kind of information.  Some of this will be deeply geeky and you can feel free to skip over sections.  There is, towards the bottom of this post, a link to a standard PhiloLogic search form, so you can play with this proof-of-concept build yourself.&lt;br /&gt;&lt;br /&gt;Richard and Helma have developed a mechanism for part of speech and lemma searching under PhiloLogic for their Greek and Latin databases (&lt;a href="http://perseus.uchicago.edu/"&gt;link&lt;/a&gt;).  This is based on some truly inspired hacking by Richard and forms one model of how to handle this kind of functionality.  My understanding of this, and Richard please correct me if I am wrong, is that it uses an undocumented feature in the index/search3 subsystem that allows us to have multiple index entries for each word position in the main index.  
This works and is certainly an approach to be considered as we think about a new series of PhiloLogic.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Build Notes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I have been experimenting with a somewhat different mechanism to handle this kind of problem, which is based on previous examples of mapping multiple word attributes to an index entry, using multiple field "crapser" entries.  You may recall that this is the mechanism by which we merged Martin's virtual normalization data with very large collections of early modern English data, a build which is currently running at Northwestern (&lt;a href="http://philologic.northwestern.edu/philologic/"&gt;link&lt;/a&gt;).    My approach is to index not words, but pairs of surface forms and part of speech tags and to link these to an expanded (5 field) word database (called by crapser) containing the index form, surface form, part of speech, lemma and normalized form.   Here are some index entry forms (and frequencies):&lt;br /&gt;&lt;pre&gt;24 conquer:vvb&lt;br /&gt;445 conquer:vvi&lt;br /&gt;143 conquered:vvd&lt;br /&gt;414 conquered:vvn&lt;/pre&gt;&lt;br /&gt;These map to the word vector database which looks like:&lt;br /&gt;&lt;pre&gt;idx             surf         pos     lem         normal&lt;br /&gt;conquered:j     conquered    j       conquer     conquered&lt;br /&gt;conquered:j-vvn conquered    j-vvn   conquer     conquered&lt;br /&gt;conquered:n-vvn conquered    n-vvn   conquer     conquered&lt;br /&gt;conquered:vvd   conquered    vvd     conquer     conquered&lt;br /&gt;conquered:vvn   conquered    vvn     conquer     conquered&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To build this I first reduced the fully verbose form of the data in which each token is tagged:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 85%;"&gt;&amp;lt;w eos="0" lem="country" pos="n1" reg="COUNTRY" spe="COUNTRY" tok="country" id="hooper-001490" ord="103" part="N"&amp;gt;COUNTRY&amp;lt;/w&amp;gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;I eliminated all encoding that is redundant, just to make things easier to work with since the files are huge:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&amp;lt;w pos="n1"&amp;gt;COUNTRY&amp;lt;/w&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Where there is some additional information, I keep it in the encoded document:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&amp;lt;w lem="illustration" pos="n2"&amp;gt;ILLUSTRATIONS&amp;lt;/w&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I then loaded this data into a very slightly modified PhiloLogic textloader.  This simply builds an index representation of the surface form of the word and the part of speech, by getting the PoS from the encoding:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;if ($thetag =~ /&amp;lt;w/) {&lt;br /&gt;    # Reset first: $1 keeps its old value if the match below fails.&lt;br /&gt;    $thepartofspeech = "";&lt;br /&gt;    if ($thetag =~ m/pos="([^"]*)"/i) {&lt;br /&gt;        $thepartofspeech = $1;&lt;br /&gt;    }&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;and adding this to the index entry:&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;$theword = $theword . ":" . $thepartofspeech;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When loaded to this point, you have modified index entries.  The next step is simply to build a multi-field word vector database (crapser).  I did this by reading the input data and adding entries for lemmas or normalizations.   This is simply an extension of what is already documented in the "virtual-normalize" directory in the "goodies" in the PhiloLogic release.&lt;br /&gt;&lt;br /&gt;The next step was to slightly modify a version of Leonid's original "gimme".  The "sequential" version of this function (in the standard PhiloLogic distribution) maps a multi-field (tab delimited) query using regexp patterns in egrep.   This is fast and simple.  
It allows naming of fields, so you can simply specify "lem=justice" and it will generate a regular expression pattern (where TAB = the tab character):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;^[^TAB]*TAB[^TAB]*TAB[^TAB]*TABjusticeTAB[^TAB]*$&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;And you get, of course, full regular expressions.  (Note: this may render with some odd spacing; the pattern itself contains no spaces.)   Swap in this version of crapser and it all appears to run without further modification.&lt;br /&gt;&lt;br /&gt;So, to summarize, the implementation does not require any modifications to core system components.  It requires only slight modifications to a textloader, which we do all the time for specific databases, and a slightly modified "crapser" with a suitably built word vector database.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Database&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The database has 567 documents containing 38.5 million words (types) and 273,600 index entries.  Recall that these are surface form words and part of speech tags and not normal types.  The dataset has selections from various sources, including Documenting the American South as well as some British Early Modern texts.   It should have full PhiloLogic search and reporting capabilities.   You can query the words in the database as usual, simply by typing in words.  You can force searches on lemmas, normalizations, and parts of speech by specifying (with examples):&lt;br /&gt;&lt;br /&gt;lem=conquer&lt;br /&gt;nrm=conquer&lt;br /&gt;pos=pns32&lt;br /&gt;&lt;br /&gt;and finally, if you want to get one surface form and part of speech you can search the index entry directly, such as "conquered:vvd".  Note that the Part of Speech is specified after a colon and you don't need to specify anything else.  This is obviously not a real query interface, but it suggests how we can think about interfaces further along (eg, pull down menus, etc).  
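&lt;br /&gt;&lt;br /&gt;As a rough illustration of the field-to-pattern mapping described above, here is a minimal Python sketch. It is not the modified "gimme" itself: the names FIELDS, field_pattern and lookup are hypothetical, the field order is assumed from the five-column word vector table shown earlier, and the three sample rows are copied from that table.&lt;br /&gt;
&lt;pre&gt;
```python
import re

# Field order follows the five-column word vector table in this post;
# the short names are assumptions made for this sketch.
FIELDS = ["idx", "surf", "pos", "lem", "nrm"]
ANY = "[^\t]*"  # any run of non-tab characters, i.e. one whole field

def field_pattern(query):
    """Turn 'lem=justice' into an anchored, tab-delimited regular expression."""
    name, value = query.split("=", 1)
    slots = [ANY] * len(FIELDS)
    slots[FIELDS.index(name)] = value
    return "^" + "\t".join(slots) + "$"

# Sample rows copied from the word vector table above.
rows = [
    "conquered:j\tconquered\tj\tconquer\tconquered",
    "conquered:vvd\tconquered\tvvd\tconquer\tconquered",
    "conquered:vvn\tconquered\tvvn\tconquer\tconquered",
]

def lookup(query):
    """Return the index entries (first column) of the rows matching the query."""
    pattern = re.compile(field_pattern(query))
    return [row.split("\t")[0] for row in rows if pattern.match(row)]

print(lookup("pos=vv."))
print(lookup("lem=conquer"))
```
&lt;/pre&gt;
Because the value is dropped into the pattern unchanged, a query such as "pos=vv." behaves as a regular expression, which seems to be the property the egrep approach relies on.&lt;br /&gt;&lt;br /&gt;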
You can also use regular expressions, such as lem=conque.*  Finally, you can combine these, such as "pos=po.* lem=enemy", which means find possessive pronouns followed by forms of enemy within three words, such as:  "&lt;w pos="po32"&gt;&lt;span style="color: #cc3300;"&gt;&lt;b&gt;their&lt;/b&gt;&lt;/span&gt;&lt;/w&gt;  &lt;w pos="av-ds"&gt;most&lt;/w&gt;  &lt;w pos="vvi"&gt;mortall&lt;/w&gt;  &lt;w lem="enemy" pos="n2"&gt;&lt;span style="color: #cc3300;"&gt;&lt;b&gt;enemies&lt;/b&gt;&lt;/span&gt;&lt;/w&gt;".  You will need to consult Martin's discussion of the encoding to see all of the parts of speech.  It is an extensive and well-reasoned scheme.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 130%;"&gt;After all of that, here is the &lt;a href="http://pantagruel.ci.uchicago.edu/html/philologic/monkmvo2.whizbang.form.html"&gt;search form&lt;/a&gt;.&amp;nbsp; (Reloaded 7/28/11)&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Now, before running off to play with this, there are some important notes following which describe how to use this in more detail.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Discussion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is a proof-of-concept build.  In a full implementation, I would need to add some search syntax to allow the user to indicate a set of combined criteria for a single word.   I was having some problems coming up with a use case, but I guess one might want to search, say, for a particular lemma AND part of speech.   It would all work with a little massaging.  Aside from that, this simple model should support all of the standard PhiloLogic searching and reporting features.  Do let me know if you find something that does not work.&lt;br /&gt;&lt;br /&gt;This model supports disambiguating searches, such as to find dog when it is used as a verb.  
Try "dog:vvi" for hits like "&lt;w pos="pns12"&gt;we&lt;/w&gt;  &lt;w pos="vmb"&gt;can&lt;/w&gt;  &lt;w pos="vvi"&gt;&lt;span style="color: #cc3300;"&gt;&lt;b&gt;dog&lt;/b&gt;&lt;/span&gt;&lt;/w&gt;  &lt;w lem="they" pos="pno32"&gt;them&lt;/w&gt;" (thanks Russ for this example).   It also appears to work properly for most other searches, such as lemmas, normalizations, etc.   Part of speech searches for single entries look reasonable in terms of performance.&lt;br /&gt;&lt;br /&gt;My primary interest in this experiment, however, is to test performance on searches for sequences of parts of speech.  For example: "pos=po3. pos=j pos=n1" will find sequences like: "their strange confusion" and "his Princely wisedome".   Chains of four also seem to work reasonably.  Eg: "pos=vvn pos=po3. pos=j pos=n1" returns phrases like "neglected their even elevation",  "stimulated their adventurous courage", and "aroused his little troop".   You can always find a part of speech after a particular word (lemma): "after pos=po3.".&lt;br /&gt;&lt;br /&gt;Now, this is all fine and dandy.  Except that doing conjoined searches on parts of speech reveals a significant conceptual difficulty, which I believe also applies to Richard's implementation.  Each part of speech generates thousands of surface form index entries.  For example:&lt;br /&gt;"pos=vvn pos=po3. pos=j pos=n1"&lt;br /&gt;generates 81,000 unique terms (index entries) in 4 vectors.  The evaluation then does a massive join at the level of index entry evaluation.  So, it is SLOW and subject to possible memory buffer overflow or other problems.  In fact, the system will begin to generate results of this type fairly quickly, due to PhiloLogic's lazy evaluation (start returning results as soon as you have a few).  But it can take several minutes to complete the task.  We would certainly not want to put this on a high traffic machine, since many similar queries at once would bog it down.  
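&lt;br /&gt;&lt;br /&gt;To make the cost of that join concrete, here is a toy Python sketch of the evaluation just described: each part of speech pattern is expanded to every matching "surface:pos" index entry, the position lists of those entries are merged, and the merged sets are then joined on adjacent word positions. The seven-token corpus, the tags "cc" and "po31", the function names and the strict adjacency test are all assumptions made for illustration; PhiloLogic's own evaluator surely differs in detail.&lt;br /&gt;
&lt;pre&gt;
```python
import re
from collections import defaultdict

# Toy corpus of (surface, pos) tokens; a stand-in for the real index.
tokens = [("their", "po32"), ("strange", "j"), ("confusion", "n1"),
          ("and", "cc"),
          ("his", "po31"), ("princely", "j"), ("wisedome", "n1")]

# Positional index keyed on "surface:pos" entries, as in the build above.
index = defaultdict(list)
for position, (surface, pos) in enumerate(tokens):
    index[surface + ":" + pos].append(position)

def expand(pos_pattern):
    """Every index entry whose part of speech matches the pattern."""
    regex = re.compile(pos_pattern)
    return [entry for entry in index
            if regex.fullmatch(entry.split(":", 1)[1])]

def phrase_search(pos_patterns):
    """Start positions where the tag patterns match consecutive words."""
    slots = []
    for pattern in pos_patterns:
        positions = set()
        # One position list per expanded entry: this is the expensive merge.
        for entry in expand(pattern):
            positions.update(index[entry])
        slots.append(positions)
    return sorted(start for start in slots[0]
                  if all(start + i in slots[i] for i in range(1, len(slots))))

print(phrase_search(["po3.", "j", "n1"]))
```
&lt;/pre&gt;
On a real index the expand step returns thousands of entries per slot rather than two, so the merge inside phrase_search is presumably where most of the time goes.&lt;br /&gt;&lt;br /&gt;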
Obviously, we could simply test to make sure that users' search criteria would not drag the whole system down, or simply lock this database to one user at a time, or use some other workaround.  If we got reasonable French NLP, this could be implemented quickly.&lt;br /&gt;&lt;br /&gt;However, I believe we have bumped up against a conceptual problem.  To find POS1 and POS2 and POS3 either in a row or within N words requires an evaluation of word and/or part of speech positions in documents.&lt;br /&gt;&lt;br /&gt;There are a couple of possible solutions, both of which would require consideration of distinct indexing structures.  The first is simply to build another kind of NGRAM POS index which would have sequences of parts of speech indexed and mapped to document regions.  The second would be another kind of index which would look like a standard PhiloLogic index entry, except that it would be ONLY part of speech. This would reduce the size of the word vectors, but would not in itself improve the index evaluation to find those sequences that fit the query in the actual documents, since we still have to return to word positions in documents.&lt;br /&gt;&lt;br /&gt;We might call this "The Conjoined Part of Speech Problem (CPSP)".  It is, in my opinion, a highly specialized type of search and it is not clear just what the use cases might look like in relatively simple languages (English, French) as opposed to Greek, for which Helma makes a convincing case.  So, there is a question of just how important this might be.  In email communication, Martin makes the case that it would be important and that researchers who want this kind of query would be willing to wait a few minutes.&lt;br /&gt;&lt;br /&gt;It would be a trivial and useful experiment to run a load where I would index ONLY part of speech information.  This would be a good test to see if evaluation speed for conjoined part of speech searches would be reasonable.  
In fact, Richard and I did a few quick experiments that suggest this would work.  The idea would be to distinguish between simple queries -- and run them as usual -- and multiple PoS queries, which would be run on a dedicated index build.  So, build parallel indices.  Oddly enuff, in the current architecture, I suspect that one could simply have a switch to say WHICH database to consult dynamically, simply by evaluating the query and then setting the database argument.  That would be another one of my famous, award-winning, hall of shame hacks.  But it could be made to work.&lt;br /&gt;&lt;br /&gt;Martin has also pointed out another issue, which is searching, sorting, and counting of PoS, lemma, and other data.  Now, that makes a lot of sense.  I want to search for "country" and find distributions of particular parts of speech.  Or, I want to do a collocation table searching on a lemma and counting the lemmas around the word.  I think all of this is certainly doable -- the latter is something I wrote about some 15 years ago -- with hacks to the various reporting subsystems (not in 3, which is just too much of a mess).  In an SOA model of PhiloLogic, this would be quite reasonably handled, ideally by other teams using PhiloLogic if not here at Chicago.&lt;br /&gt;&lt;br /&gt;I think these are important issues to raise, but not necessarily resolve at this time, if (when?) we consider the architecture of any future PhiloLogic4 development effort.  For example, the current models of report generators would have to know about lemmas, etc.  And we would need to at least leave hooks in any future model to support different indexing schemes for things like this.&lt;br /&gt;&lt;br /&gt;Finally, watch this space.  I believe Richard is doing a build of this data using his model as well.&lt;br /&gt;&lt;br /&gt;Please do play around with all of this and let me know what you think.  One consideration would be implementing this for selected French collections.  
We would obviously need real virtual normalizers, lemmatizers and PoS identifiers for a broader range of French than we have now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-69090438454455359?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/01/monk-data-under-philologic-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/69090438454455359'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/69090438454455359'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/01/monk-data-under-philologic-1.html' title='MONK Data Under PhiloLogic: 1'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-651879735629543429</id><published>2010-01-14T13:01:00.007-06:00</published><updated>2010-01-15T16:48:52.283-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><category scheme='http://www.blogger.com/atom/ns#' term='perseus'/><title type='text'>KWIC Modifications</title><content type='html'>I have been working on getting a cleaner output format from KWIC for the Greek texts on Perseus. Helma was desirous of the KWIC output leaving in the word tags which occur in the Perseus texts in order that the word lookup function be usable directly from the KWIC results page. Since KWIC leaves as little formatting as possible, it strips out all tags, including the word tags, from the text. 
While I worked on that, I also added a few other modifications to KWIC to give a better look to the results page for the Greek texts.&lt;br /&gt;&lt;br /&gt;The problem with KWIC for Greek texts is that the Greek fonts are not single-width, and KWIC relies on a single-width font to align the results cleanly. In addition, the title lines, which give the bibliographic information and link, can be different lengths and this also causes problems aligning the search terms. See, for instance, &lt;a href="http://artflx.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekTexts&amp;amp;word=%E1%BC%94%CF%87%CE%B8%CE%BF%CF%82&amp;amp;OUTPUT=kwic&amp;amp;ORTHMODE=ORG&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;author=&amp;amp;title=&amp;amp;POLESPAN=5&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;genre=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgsubdivwho=&amp;amp;dgsubdivn=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype="&gt;this search&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To solve the word tag problem I just made a few modifications to the KwicFormat subroutine in philosubs. The main edit there was changing the line that stripped all tags into this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;    $bf=~ s#&lt;(?!(w |/w))[^&gt;]*&gt;# #gi;  #keep only word tags&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the alignment issue, things were more complicated. Keeping track of the length of the left side of the line does not give a consistent position on the page, because the letters have differing widths. In the end, I modified artfl_kwic to chop the left side of the hit to a size as close to a certain length as possible without breaking any words. 
Previously, both the right and the left were chopped to a certain length, tags included, regardless of whether words were broken, which often left very little visible content. Now, only the length of the display string is accounted for and in addition the length is adjusted for the length of the bibliographic title.&lt;br /&gt;&lt;br /&gt;I also added a span around the left and right sides of the hit to allow for positioning and alignment using Javascript (and CSS). Then, with the following lines added to the Results Header, the search terms all line up neatly:&lt;br /&gt;&lt;pre id="line1"&gt;span.left { right:46%; position:absolute; }&lt;br /&gt;span.right { left:54.5%; position:absolute;&lt;br /&gt;         height:18px; overflow:hidden;}&lt;br /&gt;&lt;/pre&gt;The numbers may look a little messy, but they give nice results. I found that without the decimal, the two sides were a bit too far apart, but there may be another way around that.&lt;br /&gt;&lt;br /&gt;The extra bits for formatting the right span are in place of trimming the content of the right side in perl as I did for the left side. I found that the overflow:hidden property is quite handy if you can get it to work (it is a bit temperamental). As long as it is found in an absolutely positioned object with restricted size, AND it is contained within an element with overflow set to auto, it should work. It simply hides any content that does not fit within the given boundaries. 
It gives a very clean look to the right side of the page and even adjusts to different window sizes so that the content never leaks to the next line.&lt;br /&gt;&lt;br /&gt;Take a look at &lt;a href="http://grade-devel.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekDev&amp;amp;word=logos&amp;amp;ORTHMODE=LEM&amp;amp;OUTPUT=kwic&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;title=&amp;amp;author=Aeschines&amp;amp;date=&amp;amp;DFPERIOD=1&amp;amp;POLESPAN=5&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;genre=&amp;amp;publisher=&amp;amp;pubplace=&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;filesize=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgdivlang=&amp;amp;dgdivn=&amp;amp;dgdivid=&amp;amp;dgdivocauthor=&amp;amp;dgdivocdateline=&amp;amp;dgdivocsalutation=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype=&amp;amp;dgsubdivn=&amp;amp;dgsubdivid=&amp;amp;dgsubdivwho="&gt;this page&lt;/a&gt; and play with the window size to see what I mean. Unfortunately, there is no such nice property for trimming the overflow off of the left side instead of the right side. That is why I did it in perl instead of Javascript. There is a CSS property called clip (which can be set from Javascript) that is designed to clip an image, but again the way it works makes it much easier to trim from the right side than the left side. One could probably twist the clip property enough to make something similar happen for the left side (and make everything nicely adjustable and lovely), but for now, it is happening in perl. (I tried for a while, but my concoctions just seemed to slow things down and not add any exciting results).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;UPDATE&lt;/span&gt;: I couldn't resist playing a bit more with the javascript, and now it works like I wanted it to! 
Now, if you click on the link above, it won't illustrate what I said it would, because the javascript has been improved. I added this function:&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:courier new;font-size:85%;left:20px" &gt;function trimKwicLines(){&lt;br /&gt;       var contentwidth = $(".content").width();&lt;br /&gt;       $(".left").each(function (i) {&lt;br /&gt;               var width = $(this).width();&lt;br /&gt;               var leftoffset = contentwidth*.4 - width*1 - 2;&lt;br /&gt;               $(this).css("left", leftoffset);&lt;br /&gt;       });&lt;br /&gt;}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;And changed things here and there in the CSS.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-651879735629543429?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2010/01/kwic-modifications.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/651879735629543429'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/651879735629543429'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2010/01/kwic-modifications.html' title='KWIC Modifications'/><author><name>Kristin</name><uri>http://www.blogger.com/profile/16706344780694707122</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-828093640707778565</id><published>2009-12-14T13:28:00.007-06:00</published><updated>2009-12-14T16:46:56.795-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><category 
scheme='http://www.blogger.com/atom/ns#' term='perseus'/><title type='text'>Natural Language Morphology Queries in Perseus</title><content type='html'>Natural language queries are now possible on Perseus under PhiloLogic. Previously, Richard had implemented searching for various parts of speech in various forms. For instance, as noted in the &lt;a href="http://perseus.uchicago.edu/about.html"&gt;About&lt;/a&gt; page for Perseus, a &lt;a href="http://artflx.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekTexts&amp;amp;word=pos%3Av*roa*&amp;amp;OUTPUT=conc&amp;amp;ORTHMODE=LEM&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;author=&amp;amp;title=&amp;amp;POLESPAN=5&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;genre=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgsubdivwho=&amp;amp;dgsubdivn=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype="&gt;search for 'pos:v*roa*'&lt;/a&gt; will return all the instances of perfect active optative verbs in the selected corpus. 
Now, a &lt;a href="http://artflx.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekTexts&amp;amp;word=form%3Acould-I-please-have-some-perfect-active-optatives%3F&amp;amp;OUTPUT=conc&amp;amp;ORTHMODE=ORG&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;author=&amp;amp;title=&amp;amp;POLESPAN=5&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;genre=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgsubdivwho=&amp;amp;dgsubdivn=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype="&gt;search for 'form:could-I-please-have-some-perfect-active-optatives?&lt;/a&gt;&lt;a href="http://artflx.uchicago.edu/cgi-bin/philologic/search3torth?dbname=PerseusGreekTexts&amp;amp;word=form%3Acould-I-please-have-some-perfect-active-optatives%3F&amp;amp;OUTPUT=conc&amp;amp;ORTHMODE=ORG&amp;amp;CONJUNCT=PHRASE&amp;amp;DISTANCE=3&amp;amp;author=&amp;amp;title=&amp;amp;POLESPAN=5&amp;amp;THMPRTLIMIT=1&amp;amp;KWSS=1&amp;amp;KWSSPRLIM=500&amp;amp;trsortorder=author%2C+title&amp;amp;editor=&amp;amp;pubdate=&amp;amp;language=&amp;amp;shrtcite=&amp;amp;filename=&amp;amp;genre=&amp;amp;sortorder=author%2C+title&amp;amp;dgdivhead=&amp;amp;dgdivtype=&amp;amp;dgsubdivwho=&amp;amp;dgsubdivn=&amp;amp;dgsubdivtag=&amp;amp;dgsubdivtype="&gt;'&lt;/a&gt; will return the same results. In fact, searching for 'form:perf-act-opt', 'form:perfect-active-optative', 'form:perfection-of-action-optimizations', or 'form:perfact-actovy-opts-pretty-please' will all accomplish this same task. Note that the dashes are necessary between the words, otherwise a search for plural nouns written as 'form:plural nouns' will actually be searching for any plural word followed by the word "nouns", which will fail. 
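&lt;br /&gt;&lt;br /&gt;The translation from a "form:" query to a "pos:" query could be sketched in Python roughly as follows. This is a guess at the logic, not the modified crapser code: the three-entry KEYWORDS table, the slot numbers and the nine-character code length are assumptions read off the "pos:....roa.." example given further down, and the real table would cover many more keywords.&lt;br /&gt;
&lt;pre&gt;
```python
# Hypothetical keyword table: (prefix, slot position, code letter).
# The slots for perfect, optative and active are assumptions inferred
# from the "pos:....roa.." example in this post.
KEYWORDS = [
    ("perf", 4, "r"),
    ("opt", 5, "o"),
    ("act", 6, "a"),
]
CODE_LENGTH = 9  # assumed length of the morphology code

def form_to_pos(query):
    """Translate a 'form:' query into the corresponding 'pos:' query."""
    words = query[len("form:"):].lower().split("-")
    # "exact" switches the filler from the wildcard dot to a literal dash.
    filler = "-" if "exact" in words else "."
    code = [filler] * CODE_LENGTH
    for word in words:
        for prefix, slot, letter in KEYWORDS:
            # Prefix matching lets "perfection" or "perfact" count as "perf".
            if word.startswith(prefix):
                code[slot] = letter
    return "pos:" + "".join(code)

print(form_to_pos("form:perf-act-opt"))
print(form_to_pos("form:perf-act-opt-exact"))
```
&lt;/pre&gt;
Words that match no prefix are simply ignored in this sketch, which is why padding such as "could-I-please-have-some" does no harm.&lt;br /&gt;&lt;br /&gt;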
I carefully chose shorter forms of all the keywords, such as "impf" and "ind" for "imperfect" and "indicative" so that a search including any word starting with "ind" will match indicatives regardless of what follows the 'd'. Hopefully, there are no overlapping matches (such as using "im" to abbreviate "imperfect" which would also match "imperative"). If you do encounter any, please let me know. Potentially, we could put a list of acceptable abbreviations somewhere, although they are fairly straightforward and typing the full term out is always a fail-safe method.&lt;br /&gt;&lt;br /&gt;Basically, the modified crapser script simply translates searches beginning with "form:" into the corresponding "pos:" search. Using a hash of regular expressions and string searching, it simply returns the corresponding code. In the previous example, the search is actually looking for "pos:....roa..". Notice that it fills in the empty space of the code with dots, allowing them to be anything. I implemented an alternative filler, the dash, so that when you search for something like "form:perf-act-opt-exact", you will actually be searching for "pos:----roa--" (and your search will fail because there are no terms that are only and exactly perfect active optative without other specifications).&lt;br /&gt;&lt;br /&gt;One limitation that this method of natural language querying has is that it cannot match the versatility of the "pos:" searches. That is, because it selects either dots or dashes as fillers, you cannot get a mixture of them in your search. You cannot run a search such as "pos:v-.sroa---". However, this limitation will likely have little effect for the average user and the user needing such a search can still obtain it using the "pos:" method. An alternative method involving drop down input boxes for each slot of the code would enable the full power of the pos searches, but it would also be potentially more tedious to implement and potentially tedious to use as well. 
Such an input form would require the user to know more about the encoding than the "form:" searching I implemented does. For example, a user would need to know that "verb" is required in the first slot, even if "aorist optative" makes that the only possibility. By contrast, searching for 'form:aorist-optative' works without the user ever needing to know that a 'v' is required in the first slot.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-828093640707778565?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/12/natural-language-queries-in-perseus.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/828093640707778565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/828093640707778565'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/12/natural-language-queries-in-perseus.html' title='Natural Language Morphology Queries in Perseus'/><author><name>Kristin</name><uri>http://www.blogger.com/profile/16706344780694707122</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-7436467004949015337</id><published>2009-12-14T12:09:00.005-06:00</published><updated>2009-12-15T16:51:08.540-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='similarity'/><category scheme='http://www.blogger.com/atom/ns#' term='encyclopédie'/><category scheme='http://www.blogger.com/atom/ns#' term='vsm'/><title type='text'>Encyclopédie: Similar Article Identification II</title><content type='html'>After 
doing a series of revisions as part of my last post on this subject (&lt;a href="http://artfl.blogspot.com/2009/12/encyclopedie-similar-article.html"&gt;link&lt;/a&gt;), I thought it might be helpful to provide an update.  We have been interested in teasing out how the VSM handles small vs large articles and in getting some sense of why various similar articles are selected.  Over the weekend, I reran the vector space similarity function on 39,218 articles, taking some 29 hours.   I excluded some 150 surface forms of words in a stopword list, all sequences of numbers (and Roman numerals), as well as features (in this case word stems) found in more than 1,568 or fewer than 35 articles.  This last step removed features like &lt;span style="font-style: italic;"&gt;blanch, entend, mort&lt;/span&gt;, and so on.   Thus, I removed some 600 features, leaving 10,157 features used for the calculation.  Here is the search form:&lt;br /&gt;&lt;br /&gt;&lt;form action="http://artflx.uchicago.edu/cgi-bin/extras/encvectspace.pl"&gt;Headword: &lt;input name="headword" size="25"&gt; (e.g. tradition)&lt;br /&gt;Author:    &lt;input name="author" size="25"&gt; (e.g. Holbach)&lt;br /&gt;Classification:&lt;input name="normclass" size="25"&gt; (e.g. Horlogerie)&lt;br /&gt;English Class:&lt;input name="englishclass" size="25"&gt; (e.g. Clockmaking)&lt;br /&gt;Size (words): &lt;input name="wordcount" size="15"&gt; (e.g. 250- or 250-1000)&lt;br /&gt;Show Top: &lt;input name="shownumtop" size="3" value="25"&gt; articles (e.g. 10 or 50) &lt;p&gt;&lt;input value="SEARCH" type="submit"&gt; &lt;input value="CLEAR" type="reset"&gt;&lt;/p&gt; &lt;/form&gt;The number of matching terms for small articles can be, of course, very small.  For example, the article "&lt;a href="http://artfl.uchicago.edu/cgi-bin/philologic31/getobject.pl?c.124:26.encyclopedie1108"&gt;Tout-Bec&lt;/a&gt;" (62 words) is left with four stems [amer 1|oiseau 2|ornith 1|bec 3].   
The first of the most &lt;a href="http://artflx.uchicago.edu/cgi-bin/extras/encvectspace.pl?objectid=124:26"&gt;similar articles&lt;/a&gt;, &lt;a href="http://artfl.uchicago.edu/cgi-bin/philologic31/getobject.pl?c.106:102.encyclopedie1108"&gt;Rhinoceros&lt;/a&gt; (&lt;i&gt;Hist. nat. Ornith&lt;/i&gt;.) -- remember, only the main article here -- matches on three stems:&lt;br /&gt;&lt;pre&gt;word               frq1     frq2&lt;br /&gt;bec                 3        5&lt;br /&gt;oiseau              2        2&lt;br /&gt;ornith              1        1&lt;br /&gt;&lt;/pre&gt;Are these similar?   Well, both very small articles refer to kinds of rare birds that are notable for their beaks, one with a very large beak and one that looks like it has two or more beaks.  It is also important to note that "ornith" (the class of knowledge) in both is picked up by this example.   The next article down (Pipeliene) matches on:&lt;br /&gt;&lt;pre&gt;amer                1        1&lt;br /&gt;bec                 3        1&lt;br /&gt;oiseau              2        2&lt;br /&gt;&lt;/pre&gt;The third most similar in this example is "&lt;i&gt;Connoissance des Oiseaux par le bec &amp;amp; par les pattes&lt;/i&gt;.", a plate legend, with, as you would expect, lots of beaks.  This matches on two stems, &lt;span style="font-style: italic;"&gt;bec&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;oiseau&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;It seems that the size of the query article, now that I have eliminated many function words and other extraneous data, has a significant impact.   The larger the article, the more possible matches you will get (Zipf's Law applies).   Longer articles will tend to be most similar to other longer articles, and shorter articles will match better to shorter ones.  So, similarity would appear to be a function of relative frequencies of common features and the length of the articles.   
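&lt;br /&gt;&lt;br /&gt;As a rough illustration of how article length enters into the score, here is a small Python sketch of a cosine similarity over stem counts. It is not the code behind the search form: the weighting 1 + log(count) is only an assumption standing in for the Log Normalization used in the actual build, the function names are made up, and the second vector holds just the three matching stems listed above, so the printed score is an upper bound rather than the score the system reports.
&lt;pre&gt;
```python
import math


def log_weight(counts):
    """Dampen raw stem counts with 1 + log(count) (assumed weighting)."""
    return {stem: 1.0 + math.log(n) for stem, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a).intersection(b)
    dot = sum(a[stem] * b[stem] for stem in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Stem counts for Tout-Bec, taken from the example above.
tout_bec = {"amer": 1, "oiseau": 2, "ornith": 1, "bec": 3}
# Only the matching stems of Rhinoceros are known here; its other
# stems are left out, which can only raise the score.
rhinoceros = {"bec": 5, "oiseau": 2, "ornith": 1}

score = cosine(log_weight(tout_bec), log_weight(rhinoceros))
print(round(score, 3))
```
&lt;/pre&gt;
Every stem that occurs in only one of the two articles enlarges a norm without adding anything to the dot product, which is one way to see why long and short articles rarely score well against each other.&lt;br /&gt;&lt;br /&gt;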
We saw this length effect in our original examination of the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; and the &lt;span style="font-style: italic;"&gt;Dictionnaire de Trévoux&lt;/span&gt;, where we built in some restrictions on size and compared only articles with the same first letter rather than all to all.   As far as I can tell, the kind of feature pruning shown here does not have a significant impact on larger articles.&lt;br /&gt;&lt;br /&gt;User feedback might be significant in determining just how many features and what kinds of features are required to get more interesting matches.  For any pair, we could store the VSM score, the sizes, and the matching features along with the user rating of the match.  That might generate some actionable data for future applications.&lt;br /&gt;&lt;br /&gt;[&lt;span style="color: rgb(51, 51, 255);"&gt;Aside:  In some cases, similar passages lead to possibly related plates and legends.  &lt;/span&gt;&lt;a style="color: rgb(51, 51, 255);" href="http://artflx.uchicago.edu/cgi-bin/extras/encvectspace.pl?objectid=12:482"&gt;Cadrature&lt;/a&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;, for example, links to numerous plate legends dealing with clockmaking&lt;/span&gt;.]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-7436467004949015337?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/12/encyclopedie-similar-article_14.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7436467004949015337'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7436467004949015337'/><link rel='alternate' type='text/html' 
href='http://artfl.blogspot.com/2009/12/encyclopedie-similar-article_14.html' title='Encyclopédie: Similar Article Identification II'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-236460639160808635</id><published>2009-12-13T17:06:00.000-06:00</published><updated>2010-03-25T11:22:00.519-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><category scheme='http://www.blogger.com/atom/ns#' term='encyclopédie'/><title type='text'>Mapping Encyclopédie classes of knowledge to LDA generated topics</title><content type='html'>&lt;div&gt;As was described in my &lt;a href="http://artfl.blogspot.com/2009/11/do-lda-generated-topic-match-human.html"&gt;previous blog entry&lt;/a&gt;, I've been working on comparing the results given by LDA generated topics with the classes of knowledge identified by the &lt;i&gt;philosophes&lt;/i&gt; in the Encyclopédie. My initial experiment was to try to see if out of 5000 articles belonging to 100 classes of knowledge, with 50 articles per class, I would find those 100 topics using an LDA topic modeler. 
My conclusion was that it didn't find all of them, but still found quite a few.&amp;nbsp;Since then, I have played a bit more with this dataset and have come up with better results.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Since a topic modeler will give you the topic proportion per article (I just use the top three), what I tried to do this time was to &lt;a href="http://spreadsheets.google.com/pub?key=tBuEA4WyAUfTopkOXPR3h8w&amp;amp;output=html"&gt;draw up a table&lt;/a&gt; with each class of knowledge, and what the topic modeler identified in terms of topics for each class of knowledge. Before looking at this, it's important to keep in mind that in the sample of articles I used, there are 50 articles per class of knowledge. Therefore, the closer the number of the dominant topic in a class of knowledge gets to 50, the better the topic modeler will have done in identifying the class of knowledge and in reproducing the human classification.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, the classification of articles in the Encyclopédie can be at times a little puzzling. The articles were written by a large number of people and therefore the classification is not always consistent. With that in mind, one should not expect to get perfect matches using a topic modeler. Moreover, since the topic modeler will assume that each article is about N number of topics, the calculation might be further off.&lt;br /&gt;For my experiment, I settled on 107 topics, of which I eliminated 7, which were identified as stopwords lists.&amp;nbsp;When looking at the results of this experiment, there are 41 classes of knowledge in which we find 40 or more articles grouped within the same LDA topic.  This means that 41% of the classes of knowledge were identified with a great level of accuracy. 
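&lt;br /&gt;The tally behind these figures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each article is reduced to its single top topic (rather than the top three), the function names are made up, and the toy data is invented rather than taken from the spreadsheet.
&lt;pre&gt;
```python
from collections import Counter, defaultdict


def dominant_topics(assignments):
    """Map each class to (most frequent top topic, article count)."""
    per_class = defaultdict(Counter)
    for knowledge_class, topic in assignments:
        per_class[knowledge_class][topic] += 1
    return {cls: counts.most_common(1)[0]
            for cls, counts in per_class.items()}


def classes_at_least(dominant, threshold):
    """Classes whose dominant topic covers at least threshold articles."""
    return sorted(cls for cls, (_, n) in dominant.items() if n >= threshold)


# Toy data, invented for illustration: (class of knowledge, top topic),
# with 50 articles per class as in the experiment.
toy = ([("artillerie", 12)] * 45 + [("artillerie", 7)] * 5
       + [("physique", 3)] * 20 + [("physique", 9)] * 18
       + [("physique", 40)] * 12)

dominant = dominant_topics(toy)
print(dominant)
print(classes_at_least(dominant, 40))
```
&lt;/pre&gt;
In this toy sample only the first class clears the threshold of 40; the second is spread over three topics, which is the pattern a class that fails to be identified would show.&lt;br /&gt;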
If we look at topics that have more than 25 articles matching the same class of knowledge we get up to 83 classes (or 83%).&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Looking at those results, there are strange gaps: classes such as &lt;i&gt;physique&lt;/i&gt; and &lt;i&gt;divination&lt;/i&gt; don't seem to be identified. This might be due to a miscalculation, but I have yet to figure out what it could be. Highly specialized classes, such as &lt;i&gt;corroyerie&lt;/i&gt;, &lt;i&gt;poésie&lt;/i&gt;, or &lt;i&gt;astronomie&lt;/i&gt;, get excellent matches, which is to be expected.&lt;br /&gt;This experiment also gave us an idea of what percentage of LDA topics should be considered stopword lists. Between 5 and 10% of the topics should be discarded when using an LDA classifier.&lt;br /&gt;Finally, we should consider that LDA generated topics do not systematically match human identified topics. An unsupervised model is bound to give different results; it would be interesting to see how well supervised LDA (sLDA) would do in our particular test case.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-236460639160808635?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/12/mapping-encyclopedie-classes-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/236460639160808635'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/236460639160808635'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/12/mapping-encyclopedie-classes-of.html' title='Mapping Encyclopédie classes of knowledge to LDA generated 
topics'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-5922987674475128806</id><published>2009-12-07T10:09:00.005-06:00</published><updated>2009-12-07T10:21:05.311-06:00</updated><title type='text'>Index Design Notes 1: PhiloLogic Index Overview</title><content type='html'>I've been playing around with some Perl code in response to several questions about the structure of PhiloLogic's main word index--I'll post it soon, but in the meantime, I thought I'd try to give a conceptual overview of how the index works.  As you may know, PhiloLogic's main index data structure is a hash table supporting O(1) lookup of any given keyword.  You may also know that PhiloLogic only stores integers in the index: all text objects are represented as hierarchical addresses, something like a normalized, fixed-width XPointer.  &lt;br /&gt;&lt;br /&gt;Let's say we can represent the position of some occurrence of the word "cat" as&lt;br /&gt;0 1 2 -1 1 12 7 135556 56&lt;br /&gt;which could be interpreted as &lt;br /&gt;document 0, &lt;br /&gt;book 1, &lt;br /&gt;chapter 2, &lt;br /&gt;section &amp;lt;undefined&amp;gt;, &lt;br /&gt;paragraph 1, &lt;br /&gt;sentence 12, &lt;br /&gt;word 7, &lt;br /&gt;byte 135556, &lt;br /&gt;page 56, for example.  &lt;br /&gt;&lt;br /&gt;A structured, positional index allows us to evaluate phrase queries, positional queries, or metadata queries very efficiently.  Unfortunately, storing each of these 9 numbers as 32-bit integers would take 36 bytes (288 bits) of disk space for every occurrence of the word.  In contrast, it's actually possible to encode all 9 of the above numbers in just 38 bits, if we store them efficiently--that's a saving of roughly 87%.  
The document field has the value 0, which we can store in a single bit, whereas byte position, our most expensive, can be stored in just 18 bits.  The difficulty is that the simple array of integers becomes a single long bit string stored in a hash.  First we encode each number in binary (each field is written least-significant bit first), like so&lt;br /&gt;0 1 01 11 1 0011 111 001000011000100001 000111&lt;br /&gt;&lt;br /&gt;but this is only 38 bits, so we have to pad it off with 2 extra bits to get an even byte alignment, and then we can store it in our hash table under "cat".&lt;br /&gt;&lt;br /&gt;Now, suppose that we use something like this format to index a set of small documents with 10,000 words total.  We can expect, among other things, a handful of occurrences of "cat", and probably somewhere around a few hundred occurrences of the word "the".  In a GDBM table, duplicate keywords aren't permitted--there can be only one record for "cat".  For a database this size, it would be feasible to append every occurrence into a single long bit string.  Let's say our text structures require 50 bits to encode, and that we have 5 occurrences of cat.  We look up "cat" in GDBM, and get a packed bit string 32 bytes, or 256 bits long.  We can divide that by the size of a single occurrence, so we know that we have 5 occurrences and 6 bits of padding.  &lt;br /&gt;&lt;br /&gt;"The", on the other hand, would be at least on the order of a few kilobytes, maybe more.  1 or 2 K of memory is quite cheap on a modern machine, but as your database scales into the millions of words, you could have hundreds of thousands, even millions of occurrences of the most frequent words.  At some point, you will certainly not want to have to load megabytes of data into memory at once for each key-word lookup.  Indeed, in a search for "the cat", you'd prefer not to read every occurrence of "the" in the first place.  
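&lt;br /&gt;&lt;br /&gt;To make the packing concrete, here is a minimal Python sketch of the scheme described above. It is not PhiloLogic's actual code: the names pack_hit and count_hits and the list of field widths are made up for illustration, and writing each field least-significant bit first, with -1 stored as all ones, is only inferred from the example bit string.
&lt;pre&gt;
```python
# Hypothetical per-database field widths in bits, read off the example:
# document, book, chapter, section, paragraph, sentence, word, byte, page.
WIDTHS = [1, 1, 2, 2, 1, 4, 3, 18, 6]


def pack_hit(values, widths):
    """Encode one occurrence as a bit string padded to a whole byte."""
    fields = []
    for value, width in zip(values, widths):
        # Wrap into the field so that -1 (undefined) becomes all ones.
        wrapped = value % (2 ** width)
        # Write the field least-significant bit first, as in the example.
        fields.append(format(wrapped, "b").zfill(width)[::-1])
    bits = "".join(fields)
    padding = -len(bits) % 8
    return bits + "0" * padding


def count_hits(record_bytes, hit_bits):
    """Return (occurrences, padding bits) for a packed record."""
    return divmod(record_bytes * 8, hit_bits)


hit = [0, 1, 2, -1, 1, 12, 7, 135556, 56]
packed = pack_hit(hit, WIDTHS)
print(len(packed) // 8, "bytes:", packed)
print(count_hits(32, 50))
```
&lt;/pre&gt;
Run on the example address, the nine fields come to 38 bits, padded to a 5-byte string; the second call reproduces the 5 occurrences and 6 bits of padding worked out above for a 32-byte record.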
&lt;br /&gt;&lt;br /&gt;Since PhiloLogic currently doesn't support updating a live database, and all word occurrences are kept in sorted order, it's relatively easy for us to devise an on-disk, cache-friendly data structure that can meet our requirements.  Let's divide up the word occurrences into 2-kilobyte blocks, and keep track of the first position in each block.  Then, we can rapidly skip hundreds of occurrences of a frequent word, like "the", when we know that the next occurrence of "cat" isn't in the same document!  &lt;br /&gt;&lt;br /&gt;Of course, to perform this optimization, we would need to know the frequency of all terms in a query before we scan through them, so we'll have to add that information to the main hash table.  Finally, we'd prefer not to pay the overhead of an additional disk seek for low-frequency words, so we'll need a flag in each key-word entry to signal whether we have:&lt;br /&gt;1) a low-frequency word, with all occurrences stored inline&lt;br /&gt;or&lt;br /&gt;2) a high-frequency word, stored in the block tree.&lt;br /&gt;&lt;br /&gt;Just like the actual positional parameters, the frequencies and tree headers can also be compressed to an optimal size on a per-database level.  
In philologic, this is stored in databasedir/src/dbspecs.h, a c header file that is generated at the same time as the index, then compiled into a custom compression/decompression module for each loaded database, which the search engine can dynamically load and unload at run time.&lt;br /&gt;&lt;br /&gt;In a later post, I'll provide some perl code to unpack the indices, and try to think about what a clean search API would look like.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-5922987674475128806?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/12/index-design-notes-1-philologic-index.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5922987674475128806'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5922987674475128806'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/12/index-design-notes-1-philologic-index.html' title='Index Design Notes 1: PhiloLogic Index Overview'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-8417929062078146872</id><published>2009-12-03T11:22:00.015-06:00</published><updated>2009-12-14T13:45:46.694-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='similarity'/><category scheme='http://www.blogger.com/atom/ns#' term='encyclopédie'/><category scheme='http://www.blogger.com/atom/ns#' term='vsm'/><title type='text'>Encyclopédie: Similar Article 
Identification</title><content type='html'>The &lt;a href="http://en.wikipedia.org/wiki/Vector_space_model"&gt;Vector Space Model&lt;/a&gt; (VSM) is a classic approach to information retrieval.   We integrated this as a standard function in &lt;a href="http://code.google.com/p/philomine/"&gt;PhiloMine&lt;/a&gt; and have used it for a number of specific research projects, such as identifying borrowings from the &lt;span style="font-style: italic;"&gt;Dictionnaire de Trévoux&lt;/span&gt; in the &lt;a href="http://encyclopedie.uchicago.edu/"&gt;&lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt;&lt;/a&gt;, which is described in our forthcoming paper "Plundering Philosophers" and related talks[1].  While originally developed by Gerard Salton[2] in 1975 as a model for classic information retrieval, where a user submits a query and gets results in an ranked relevancy list, the algorithm is also very useful to identify similar blocks of text, such as encyclopedia articles or other delimited objects.    Indeed, this kind of use of the VSM was proposed by Salton and Singhal[3] in a paper presented months before Salton's death. They demonstrated the use of VSM to produce links between parts of documents, forming a type of automatic  hypertext:&lt;br /&gt;&lt;blockquote&gt;The capability of generating weighted vectors for arbitrary texts also makes it possible to decompose individual documents into pieces and explore the relationships between these text pieces. [...] 
Such insights can be used for picking only the "good" parts of the document to be presented to the reader.&lt;/blockquote&gt;Salton and Singhal further argued that manual link creation would be impractical for huge amounts of text, but these conclusions may have had limited influence given the general interest at that time in human-generated hypertext links on the WWW.&lt;br /&gt;&lt;br /&gt;Based on earlier work using PhiloMine, we have seen a number of "interesting" -- and at times unexpected -- connections between articles in the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt;, often drawing connections between previously unrelated articles, if by unrelated we mean having different authors, classes of knowledge and few cross-references (renvois) between them.    One might consider this kind of similarity measure between articles as a kind of intertextual discovery tool, where the system would propose articles possibly related to a specific article.&lt;br /&gt;&lt;br /&gt;The Vector Space Model functions by comparing a query vector to all of the vectors in a corpus, making it an expensive calculation, not always suitable for real-time use.  In this experiment, I have recast the VSM implementation in PhiloMine to function as a batch job to generate a database of 27,753 &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; articles (those with 100 or more words) with the 20 most similar articles for each article.   To do this, I pruned features (word stems) which occurred in more than 8,325 or fewer than 41 articles, resulting in a vector size of 10,431 features.  I used a standard French word &lt;a href="http://search.cpan.org/%7Ecreamyg/Lingua-Stem-Snowball-0.952/lib/Lingua/Stem/Snowball.pm"&gt;stemmer&lt;/a&gt; to reduce lexical variation and a Log Normalization function to handle variations in article sizes.     
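&lt;br /&gt;&lt;br /&gt;The pruning step can be sketched roughly as follows in Python. This is an illustration only, not the PhiloMine code, which is written in Perl: the function name and the toy articles are invented, and the thresholds are shrunk to fit four tiny documents.
&lt;pre&gt;
```python
from collections import Counter


def prune_by_document_frequency(articles, min_df, max_df):
    """Keep stems found in at least min_df and at most max_df articles."""
    df = Counter()
    for stems in articles.values():
        df.update(set(stems))  # count each stem once per article
    keep = {stem for stem, n in df.items() if max_df >= n >= min_df}
    return {name: [s for s in stems if s in keep]
            for name, stems in articles.items()}


# Toy articles, invented for illustration, already reduced to stems.
toy = {
    "A": ["bec", "oiseau", "etre"],
    "B": ["bec", "etre", "horloge"],
    "C": ["etre", "oiseau", "bec"],
    "D": ["etre", "canon"],
}
pruned = prune_by_document_frequency(toy, min_df=2, max_df=3)
print(pruned)
```
&lt;/pre&gt;
Here the stem found in all four articles and the two stems found in only one article are dropped, which mirrors, on a much smaller scale, the cutoffs of 8,325 and 41 articles used in the real build.&lt;br /&gt;&lt;br /&gt;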
The batch job took about 17 hours to run.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;Update (December 7): I have replaced the VSM build above with the same build on 39,200 articles -- all articles with 60 or more words -- which took about 29 hours to run.  I pruned features found in more than 11,200 documents or fewer than 50, leaving 9,710 features.  This may change some results by adding more small articles.  Note, this is about as large a VSM task as can be performed in memory using Perl hashes, since anything larger runs out of memory.  If we want to go larger, we would probably store vectors on disk and tie them to Perl hashes.           &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The results for a query show the 20 most similar articles, ranked by the similarity score, where an exact match is equal to 1.  For example, the article &lt;a href="http://artflx.uchicago.edu/cgi-bin/extras/encvectspace.pl?headword=OUESSANT"&gt;OUESSANT&lt;/a&gt; (Modern Geography) -- &lt;span style="color: rgb(51, 51, 255);"&gt;based on 27,000 articles&lt;/span&gt; -- is related to the articles VERTU [0.274], Luxe [0.267], ECONOMIE ou OECONOMIE [0.265], POPULATION [0.263], CHRISTIANISME [0.261], SOCIÉTÉ [0.256], AVERTISSEMENT DES ÉDITEURS (suite) [0.255], MANICHÉISME [0.254], CYNIQUE, secte de philosophes anciens [0.254], Gout [0.250], EDUCATION [0.248] and so on.   This reflects the discussion of the moral conditions of the inhabitants of the small island off the coast of Brittany.&lt;br /&gt;&lt;br /&gt;You can give it a try using this form (&lt;span style="color: rgb(51, 51, 255);"&gt;again now for 39,200 articles&lt;/span&gt;):&lt;br /&gt;&lt;br /&gt;&lt;form action="http://artflx.uchicago.edu/cgi-bin/extras/encvectspace.pl"&gt;Headword: &lt;input name="headword" size="25"&gt; (e.g. tradition)&lt;br /&gt;Author:    &lt;input name="author" size="25"&gt; (e.g. Holbach)&lt;br /&gt;Classification:&lt;input name="normclass" size="25"&gt; (e.g. 
Horlogerie)&lt;br /&gt;English Class:&lt;input name="englishclass" size="25"&gt; (e.g. Clockmaking)&lt;br /&gt;Size (words): &lt;input name="wordcount" size="15"&gt; (e.g. 250- or 250-1000)&lt;br /&gt;Show Top: &lt;input name="shownumtop" size="3" value="25"&gt; articles (e.g. 10 or 50) &lt;p&gt;&lt;input value="SEARCH" type="submit"&gt; &lt;input value="CLEAR" type="reset"&gt;&lt;/p&gt; &lt;/form&gt;[&lt;span style="color: rgb(51, 51, 255);"&gt;Dec 9: I added word count info for each article.  You can restrict searches to articles in ranges of size.  Also, now storing 50 top matches, which you can limit.  Showing matching articles which are smaller than source article.  Dec 10: added function to display matching stems for any pairwise comparison for inspection&lt;/span&gt;.]&lt;br /&gt;&lt;br /&gt;There are a number of other options that I might add to the VSM calculations, including using TF-IDF as an alternative normalization weighting scheme and use of virtual normalization to again reduce lexical variations and improve the performance of the stemming algorithm.   I have also thought of using Latent Semantic Analysis as another way to handle similarity weighting, but given that we have many query terms, it is not clear that LSA would help all that much.&lt;br /&gt;&lt;br /&gt;In a real production environment, I think we will add a "similar article link" from articles in the Encyclopédie.  We have talked about having users rank the quality of the similarity performance.  The scores assigned are somewhat helpful in ranking, but not in assessing an absolute number, since they can vary by the size of the input article.   VSM is an unsupervised learning model.  It is not clear to me that we could integrate user evaluations in any systematic fashion, but this is certainly an interesting subject of further consideration.&lt;br /&gt;&lt;br /&gt;As always, please let me know what you think.  I have a couple of general queries.  
I have used main and sub articles (as well as plate legends, etc.) as units of similarity calculation.  Should I use main entries only?  I also limited this to articles with more than 100 words.  At 50 words, we have some 43,000 articles.  Should I do this for a full implementation?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;[1] See Timothy Allen, Stéphane Douard, Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Plundering Philosophers: Identifying Sources of the Encyclopédie", &lt;i&gt;Journal of the Association for History and Computing&lt;/i&gt; (forthcoming 2009).    Also, see Ceglowski, Maciej. 2003: "Building a Vector Space Search Engine in Perl", Perl.com [http://www.perl.com/pub/a/2003/02/19/engine.html].&lt;br /&gt;&lt;br /&gt;[2] Salton, G., A. Wong, and C. S. Yang. 1975: "A Vector Space Model for Automatic Indexing," &lt;span style="font-style: italic;"&gt;Communications of the ACM&lt;/span&gt; 18/11: 613-620.&lt;br /&gt;&lt;br /&gt;[3] Singhal, A. and Salton, G. 
1995: "Automatic Text Browsing Using Vector Space Model" in  &lt;span style="font-style: italic;"&gt;Proceedings of the Dual-Use Technologies and Applications Conference &lt;/span&gt;318-324.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-8417929062078146872?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/12/encyclopedie-similar-article.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8417929062078146872'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8417929062078146872'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/12/encyclopedie-similar-article.html' title='Encyclopédie: Similar Article Identification'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-2160016817131546026</id><published>2009-11-20T15:39:00.007-06:00</published><updated>2009-11-25T14:45:11.090-06:00</updated><title type='text'>Frequencies in the Greek and Latin texts</title><content type='html'>Earlier this year Mark built a frequency query for the French texts (affectionately named wordcount.pl).&lt;div&gt;Kristin has now implemented this for our Greek and Latin texts. 
If you wonder what's new about this: Word count for individual documents has always been there in PhiloLogic loads, but the difference here is that you can see frequencies over the entire corpus, or a subset of works/authors.&lt;br /&gt;&lt;br /&gt;You can find the forms here:&lt;br /&gt;&lt;a href="http://perseus.uchicago.edu/LatinFrequency.html"&gt;http://perseus.uchicago.edu/LatinFrequency.html&lt;/a&gt;&lt;br /&gt;&lt;a href="http://perseus.uchicago.edu/GreekFrequency.html"&gt;http://perseus.uchicago.edu/GreekFrequency.html&lt;/a&gt;&lt;span class="Apple-style-span" style="text-decoration: line-through;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Update: Forms moved to the 'production site', perseus.uchicago.edu. You can now specify genre as well. Stay tuned for further stats, meant to provide a friendly reminder of &lt;a href="http://en.wikipedia.org/wiki/Zipf's_law"&gt;Zipf's Law&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Note: the counts are raw frequency counts, without lemmatization.&lt;div&gt;I have edited the search form a tiny bit - let me know if you encounter any problems. 
&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-2160016817131546026?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/11/frequencies-in-greek-and-latin-texts.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2160016817131546026'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2160016817131546026'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/11/frequencies-in-greek-and-latin-texts.html' title='Frequencies in the Greek and Latin texts'/><author><name>Helma</name><uri>http://www.blogger.com/profile/09370867366875949424</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-7897751395675468726</id><published>2009-11-18T14:41:00.009-06:00</published><updated>2010-03-25T11:11:26.301-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Do LDA generated topics match human identified topics?</title><content type='html'>I've been experimenting lately on how LDA generated topics and the Encyclopédie classes of knowledge match. The experiment was conducted in the following way:&lt;div&gt;- I chose 100 classes of knowledge in the Encyclopédie, and picked 50 articles of each.&lt;/div&gt;&lt;div&gt;- I then ran a first LDA topic trainer choosing 100 topics. 
&lt;/div&gt;&lt;div&gt;- I then proceeded to identify each generated topic and name it after the Encyclopédie classes of knowledge. &lt;/div&gt;&lt;div&gt;- My plan was then to look at the topic proportions per article and see if the top topic would correspond to its class of knowledge. Would the computer manage to classify the articles in the same way the encyclopedists had?&lt;/div&gt;&lt;div&gt;I was not able to get that far when choosing 100 topics for my first LDA run. This is because LDA will always generate a couple of topics which aren't really topics, but are just lists of very common words that happen to be used in the same documents. Therefore, one should always disregard these topics and focus on the others. What this means is that I had to add a couple more topics to my LDA run in order to get 100 identifiable topics. So I settled on 103 topics. I found 3 distributions of words which were unidentifiable, so I dismissed them.  &lt;/div&gt;&lt;div&gt;The results show that LDA topics and the Encyclopédie classes of knowledge do not consistently match (see links to results below). Some do very well, like Artillerie, for which the corresponding distribution of words is:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 255);"&gt;canon piece poudre artillerie boulet fusil ligne calibre mortier bombe feu charge culasse livre met chambre pouce lumiere roue affut diametre coup batterie levier bouche ame flasque balle tourillon tire&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Other distributions of words make sense in themselves but do not match any of the original classes of knowledge. For instance, there is no topic on 'teinture' or 'peinture'. 
What we get instead is a mixture of both classes of knowledge which could be identified as colors :&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 255);"&gt;couleur rouge blanc bleu tableau jaune verd peinture ombre teinture noir toile tableaux nuance papier etoffe bien teint peintre pinceau trait teinturier melange veut figure teindre feuille beau sert colle&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Now the topic modeler is not wrong here. It's telling us that these words tend to occur together, which is true. Another significant example is the one with 'Boutonnier', 'Soie', and 'Rubanier' :&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: rgb(51, 51, 255);"&gt;soie fil rouet corde brin tour main bouton gauche longueur boutonnier droite attache bout fils tourner sert molette noeud cordon doigt piece emerillon moule broche ouvrage ruban rochet branche aiguille&lt;/span&gt;&lt;/div&gt;&lt;div&gt;What we get here is a topic about the art of making clothes, which is more general than 'Boutonnier' or 'Rubanier'. &lt;/div&gt;&lt;div&gt;For this to actually work, the philosophes would have had to have been extremely rigorous in their choice of vocabulary, because this is what LDA expects. Also, another problem is that LDA considers that each document is a mixture of topics, and not made out of one topic. So if one document is exclusively focused on one topic, LDA will still try to extract a certain number of topics out of it. If this is the case, then you are going to get some topics which are mere subdivisions of the class of knowledge in this document. The reason why our experiment broke down could be that the LDA topic trainer created new subdivisions for some classes of knowledge, or regrouped several classes of knowledge. 
These are all valid as topics, but do not correspond to human identified topics.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_90hpbj4xhb"&gt;Link to results&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-7897751395675468726?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/11/do-lda-generated-topic-match-human.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7897751395675468726'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7897751395675468726'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/11/do-lda-generated-topic-match-human.html' title='Do LDA generated topics match human identified topics?'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-2390221144656172292</id><published>2009-11-13T10:12:00.002-06:00</published><updated>2009-11-13T12:33:56.011-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><title type='text'>Section Highlighting in Philologic</title><content type='html'>In many of the Perseus texts currently loaded under philologic, the section labels would overlap and be unreadable. These labels come from the milestone tags in the xml text and are placed along the edge of the text. 
One particularly problematic text in this regard was the New Testament, since its sections are verses, which are often very short.&lt;br /&gt;&lt;br /&gt;In order to fix the overlapping issue, I wrote a little bit of JavaScript to hide any label that would be placed at the same position as the previous one. I also added a function to recalculate this if the window is resized. My main function is fairly simple:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left; color: rgb(51, 51, 255); margin-left: 40px;"&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;function killOverlap (){&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;var $lastOffset = 0;&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;$(".mstonecustom").each(function (i) {&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;        if (this.offsetTop == $lastOffset){&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;                this.className = "mstonen2";&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;        }&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;        else {&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;                $lastOffset = this.offsetTop;&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;        }&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;});&lt;/span&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:78%;"  &gt;}&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;I also added a function which highlights a section when you hover over its milestone label along the side of the text. This seems useful to me, as it is often helpful to know where a section starts and ends. This was a slightly more complex problem. 
I had to alter the citequery3.pl script in order to add a span tag and some ids in order to get the javascript to work. The javascript was then fairly simple:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left; color: rgb(51, 51, 255); margin-left: 40px;"&gt;&lt;span style="color: rgb(51, 51, 255);font-family:arial;font-size:78%;"  &gt;function highlight(){&lt;br /&gt;$(".mstonecustom").hover(&lt;br /&gt;function () {&lt;br /&gt;myid = jq("text" + $(this).attr('id'));&lt;br /&gt;           $("w", myid).css({"font-weight" : "bolder"});},&lt;br /&gt;function () {&lt;br /&gt;myid = jq("text" + $(this).attr('id'));&lt;br /&gt;            $("w", myid).css({"font-weight" : "normal"});})}&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;In order for it to work though, you have to alter the citequery3.pl script with this:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left; color: rgb(51, 51, 255); margin-left: 40px;"&gt;&lt;span style="color: rgb(51, 51, 255);font-size:78%;" &gt;my $spanid = $citepoints{$offsets[$offset]};&lt;br /&gt;                         $spanid =~ s/.*\.([0-9]+)\.([0-9]+)$/a$1b$2/;&lt;br /&gt;#...&lt;br /&gt;                         $tempstring =~ s/(^&lt;[^&gt;]+&gt;)/$1&amp;lt;span class="mstonecustom" id="$spanid"&amp;gt;$citepoints{$offsets[$offset]}&lt;\/span&gt;/;&lt;br /&gt;                         #... {&lt;br /&gt;                                         $tempstring =~ s/&amp;lt;span class="mstonecustom" id="$spanid"&amp;gt;$citepoints{$offsets[$offset]}&lt;\/span&gt;//;}&lt;br /&gt;&lt;br /&gt;                         $milesubstrings[$offset] = "&amp;lt;span class=" . $citeunits{$offsets[$offset]} . " id="text"&amp;gt;" . $tempstring . "&lt;\/span&gt;";&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;That's about it. It may come in useful again someday. 
For an example, take a look at &lt;a href="http://grade-devel.uchicago.edu/cgi-bin/citequery3.pl?dbname=PerseusGreekDev&amp;amp;getid=0&amp;amp;query=NT%20I%20Corinthians.13"&gt;this&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-2390221144656172292?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/11/section-highlighting-in-philologic.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2390221144656172292'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/2390221144656172292'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/11/section-highlighting-in-philologic.html' title='Section Highlighting in Philologic'/><author><name>Kristin</name><uri>http://www.blogger.com/profile/16706344780694707122</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3592844232513039966</id><published>2009-11-02T10:55:00.010-06:00</published><updated>2009-11-02T12:50:43.665-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='development'/><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><title type='text'>Towards PhiloLogic4</title><content type='html'>Earlier this year I wrote a long discussion paper called "Renovating PhiloLogic" which provided an overview of the system architecture, a frank review of the strengths and (many) failings of the current implementation of the 3 series of PhiloLogic, and proposed a general design model for what would effectively 
be a complete reimplementation of the system, retaining only selected portions of the existing code base.   While we are still discussing this, often in great detail, a few general objectives for any future renovation have emerged, including:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;service oriented architecture;&lt;/li&gt;&lt;li&gt;release of the new system as perl module libraries;&lt;/li&gt;&lt;li&gt;multiple database query support; and&lt;br /&gt;&lt;/li&gt;&lt;li&gt;options for advanced or extended indexing models.&lt;/li&gt;&lt;/ul&gt;I will be putting together a public version of this discussion draft in the near future and will blog it when I have something ready. &lt;br /&gt;&lt;br /&gt;Before sallying forth to start work on PhiloLogic4, there are a number of preliminary steps that Richard and I agree are required in order to 1) support the existing PhiloLogic3 series, and 2) clear the existing (messy) code base of some of the most egregious sections of the system, most notably the loader.   Some of these are simply housekeeping and updates, some are patches and bug fixes, and others are clean-ups which should streamline the current system and help in any redevelopment. &lt;br /&gt;&lt;br /&gt;We will start by retasking one of our current machines, a 32 bit OS-X installation, to be the primary PhiloLogic development machine.  We will also set up the Linux branch on a 32 bit Linux machine (flavor to be determined).  There is a known &lt;a href="http://artfl.blogspot.com/2009/06/philologic-ubuntu-64-bit-compilation.html"&gt;64 bit installation problem&lt;/a&gt; which we will address at the end of this initial process.   When we reach the right step, we will install it all on 64 bit machines and fix it then, hopefully with much less effort on a streamlined version, while releasing upgraded 32 bit versions on the way.   The other element for our consideration is the degree to which we can merge the OS-X and Linux branches of the system.  
Right now, we have two completely distinct branches.  It would be much better to have one, which we think may be accomplished in a couple of different ways.&lt;br /&gt;&lt;br /&gt;We are currently thinking of 4 distinct steps, which should each result in new maintenance releases of PhiloLogic3. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Step One&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Apply the most recent &lt;a href="http://philologic.googlecode.com/files/philologic.osx.patches.tar.Z"&gt;OS-X Leopard patch kit&lt;/a&gt; to both the OS-X and Linux branches as required and feasible.  This is the patch kit that Richard and I assembled for the migration to our new servers and has some nifty little extensions.   We will also be updating the PhiloLogic &lt;a href="http://code.google.com/p/philologic/"&gt;code release site&lt;/a&gt; (Google Code) and retooling the new &lt;a href="https://sites.google.com/site/philologicartfl/"&gt;PhiloLogic&lt;/a&gt; site, which will then be referred from the existing location (&lt;a href="http://philologic.uchicago.edu/"&gt;philologic.uchicago.edu&lt;/a&gt;).   Maintenance release when done.  [MVO]&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Step Two&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The PhiloLogic loader currently uses a GNU Makefile scheme to load databases.  This made good sense many years ago, when loads could take many hours (or days), but is probably no longer needed.  There are also many places where we use various utilities (sed, gawk, gzip, etc.) which add complications and make the entire scheme more brittle.  Our current thinking is to fold all of the Makefile functions into a revised version of philoload, but we may determine a better way to proceed once we get into it.  We're planning a maintenance release of this when done.  
[MVO]&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Step Three&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The current PhiloLogic loader performs a number of C compiles, many of which are no longer needed.  For example, the system still compiles the search2 binaries.  These were left in PhiloLogic3 in order to have backwards compatibility.  We need to keep the ability to generate the correct pack and unpack libraries which are used by search3.  Once we have cleared out all unnecessary C compiles, we will investigate a couple of known bugs in search3, and attempt to resolve these.  Again, once done, we would do a maintenance release.  [RW and MVO]&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Step Four&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As noted above, some users have reported a 64 bit compile problem on either installation or load.  Once we have the loader streamlined, eliminating as many of the old C compiles as possible,  we will investigate this problem.   We're hoping that this will be easily remedied and, even better, could be resolved in a combined release which would merge the current OS-X and Linux branches.   This would be the &lt;span style="font-weight: bold;"&gt;terminal&lt;/span&gt; release of the PhiloLogic3 series.  Any future releases would be only for bug fixes. &lt;br /&gt;&lt;br /&gt;We are hoping that these steps will result in a stable terminal release of the PhiloLogic3 series, which will be easier to install and use.  It will also result in significant streamlining which will help in any future PhiloLogic renovation or a new PhiloLogic4 series. 
&lt;br /&gt;&lt;br /&gt;This is an initial plan, so please do post your comments, suggestions, and complaints.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3592844232513039966?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/11/towards-philologic4.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3592844232513039966'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3592844232513039966'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/11/towards-philologic4.html' title='Towards PhiloLogic4'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-82826688758059461</id><published>2009-10-29T10:54:00.022-05:00</published><updated>2009-12-08T15:49:10.787-06:00</updated><title type='text'>Encyclopédie under KinoSearch</title><content type='html'>&lt;form action="http://robespierre.uchicago.edu/cgi-bin/kstest.pl"&gt;One of the things that I have wanted to do for a while is to examine implementations of &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt;, both as a search tool to complement PhiloLogic and possibly as a model for future PhiloLogic renovations.   Late this summer, Clovis identified a particularly nice open source, perl implementation of Lucene called &lt;a href="http://www.rectangular.com/kinosearch/"&gt;KinoSearch&lt;/a&gt;.   This looks like it will fit both bills very nicely indeed.  
As a little experiment, I loaded 73,000 articles (and other objects) from the Encyclopédie, and cooked up a super simple query script.  This allows you to type in query words and get links to articles sorted by their &lt;a href="http://en.wikipedia.org/wiki/Relevance_%28information_retrieval%29"&gt;relevancy&lt;/a&gt; to your query (the italicized number next to the headword).   At this time, I am limiting results to the top 100 "hits".    Words should be lower case,  accents are required, and words should be separated by spaces.  Try it:&lt;br /&gt;&lt;br /&gt;Query Words: &lt;input name="words" size="30"&gt;&lt;input value="Go" type="submit"&gt; or &lt;input id="resetbutton" value="Clear" type="reset"&gt;&lt;br /&gt;Require all words&lt;input name="boolop" type="checkbox" value="AND"&gt;&lt;/form&gt;&lt;br /&gt;Here are a couple of examples which you can block copy in: &lt;tt&gt;&lt;br /&gt;artisan laboureur ouvrier paysan&lt;br /&gt;&lt;/tt&gt; &lt;tt&gt;malade symptome douleur estomac&lt;/tt&gt;&lt;br /&gt;&lt;tt&gt;peuple pays nation ancien république décadence&lt;/tt&gt;&lt;br /&gt;&lt;br /&gt;The first thing to notice is search speed.  Lucene is known to be robust, massively scalable, and fast.  The KinoSearch implementation is certainly very fast.  A six term search returns in .35 seconds of real time and less than 1/10 of a second of system time, as measured with time on the command line.  I did not time the indexing run, but I think it took 10 minutes or so.   
[Addition: by reading 147 TEI files rather than 77,000 split files, the indexing time for the Encyclopédie falls to (using time) &lt;span style="font-family:courier new;"&gt;real 2m45.9s, user 2m33.8s&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; sys 0m11.1s&lt;/span&gt;.]&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The KinoSearch developer, Marvin Humphrey, has a &lt;a href="http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf"&gt;splendid slide show&lt;/a&gt;, outlining how it works, with specific reference to the kind of parameters, such as stemmers and stopwords, that one needs to consider as well as an overview of the indexing scheme.   Clovis and I thought this might be the easiest way to begin working with Lucene, since it is a perl module with C components, so it is easy to install and get running.   Given the performance and utility of KinoSearch, I suspect that we will be using it extensively for projects where ranked relevancy results are of interest.  These might include structured texts, such as newspaper and encyclopedia articles, and possibly large collections of uncorrected OCR materials which may not be suitable for text analysis applications supported by PhiloLogic.   Also, on first review, the code base is very nicely designed and, since it has many of the same kinds of functions as PhiloLogic, strikes me as being a really fine model of how we might want to renovate PhiloLogic.&lt;br /&gt;&lt;br /&gt;For this experiment, I took the articles as individual documents in TEI, which Clovis had prepared for other work.  For each article, I grabbed the headword and PhiloLogic document id, which are loaded as fielded data.  The rest of the article is stripped of all encoding and loaded in.  It would be perfectly simple to read the data from our normal TEI files.  
We could simply add a script that would load source data from a PhiloLogic database build, to add a different kind of search, which would need to have a different search box/form.&lt;br /&gt;&lt;br /&gt;I have not played at all with parameters and I can imagine that we would want to perform some functions, such as using simple rules for normalization, on input, since it uses a &lt;a href="http://search.cpan.org/%7Ecreamyg/Lingua-Stem-Snowball-0.952/lib/Lingua/Stem/Snowball.pm"&gt;stemmer package&lt;/a&gt; also by M Humphrey.   Please email me, post comments, or add a blog entry here if you see problems, particularly search oddities, have ideas about other use cases, or more general interface notions.  I will be writing a more generalized loader and query script -- with paging, numbers of hits per page, filtering by minimum relevancy scores and looking at a version of the PhiloLogic object fetch which would try to highlight matching terms -- and moving that over to our main servers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-82826688758059461?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/encyclopedie-under-kinosearch.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/82826688758059461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/82826688758059461'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/encyclopedie-under-kinosearch.html' title='Encyclopédie under KinoSearch'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3453832048214306120</id><published>2009-10-26T14:35:00.002-05:00</published><updated>2009-10-26T15:22:38.166-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>back to comparing similar documents</title><content type='html'>I mentioned a &lt;a href="http://artfl.blogspot.com/2009/08/finding-related-articles-using-topic.html"&gt;little while ago&lt;/a&gt; some work I did on comparing one document with the rest of the corpus it belongs to (the examples I used in that blog post will not give the same results anymore, and the results might not be as good, since I haven't optimized the new code for the Encyclopédie yet).  The idea behind it was to use the topic proportions for each article generated from LDA, and come up with a set of calculations to decide which document(s) was closest to the original document.  The reason I'm mentioning it here once more is that I've been through that code again,  cleaned it up quite a bit, improved its performance, and tweaked the calculations. Basically, I made it usable for people other than myself. Last time I built a basic search form to use with Encyclopédie articles. This time I'm going to show the command line version, which has a couple more options than the web version.&lt;br /&gt;In the web version, I was using both the top three topics in each document, and their individual proportion within that document.  For instance, Document A would have topic 1, 2 and 3 as its main topics. Topic1 would have a proportion of 0.36, Topic2 0.12, Topic3 0.09. In the command line version, there's the option of only using the topics, without the proportion. The order of importance of each topic is of course still respected. 
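&lt;br /&gt;&lt;br /&gt;To make the two variants concrete, here is a minimal Python sketch. It is not the code in compare_to_all.pl: the function names and both scoring rules (adding up the smaller proportion of each shared top topic, and counting top topics that sit at the same rank) are assumptions chosen purely for illustration. Only Document A's proportions come from the example above; Document B is made up.&lt;br /&gt;&lt;pre&gt;
```python
# Illustrative sketch only: both scoring rules are assumptions, not the
# calculations used in compare_to_all.pl.

def top_topics(proportions, n=3):
    """Return the n (topic, proportion) pairs with the largest proportions."""
    ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

def score_with_proportions(doc_a, doc_b):
    """Refined model: add up the smaller proportion of every shared top topic."""
    a = dict(top_topics(doc_a))
    b = dict(top_topics(doc_b))
    return sum(min(a[t], b[t]) for t in a if t in b)

def score_by_rank(doc_a, doc_b):
    """Simple model: count top topics that hold the same rank in both documents."""
    ranks_a = [topic for topic, _ in top_topics(doc_a)]
    ranks_b = [topic for topic, _ in top_topics(doc_b)]
    return sum(1 for x, y in zip(ranks_a, ranks_b) if x == y)

# Document A uses the proportions quoted above; Document B is hypothetical.
doc_a = {1: 0.36, 2: 0.12, 3: 0.09, 4: 0.02}
doc_b = {1: 0.30, 3: 0.20, 2: 0.10, 5: 0.05}

print(score_with_proportions(doc_a, doc_b))
print(score_by_rank(doc_a, doc_b))
```
&lt;/pre&gt;Here the two documents share all three top topics, so the proportion-based score is fairly high, while the rank-based score only credits topic 1, the one topic that holds the same position in both.&lt;br /&gt;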
Depending on the corpus you're looking at, you might want to use one model rather than the other. The two give different results. One could of course tweak this some more and decide to only take the proportion of the prominent topic, therefore giving it more importance. There is definitely room for improvement.&lt;br /&gt;There was also another option that was left out of the web version. By default, I set a tolerance level, that is the score needed by each document in order to be given as a result of the query. In the command line version, I made it possible to define this tolerance in order to get more or fewer results. This option is currently only available with the refined model (the one with topic proportions). The code is currently living in&lt;br /&gt;robespierre:/Users/clovis/LDA_scripts/&lt;br /&gt;It's called compare_to_all.pl. There's some documentation in the header to explain how to use it. It's fairly simple. I might do some more work on it, and will update the script accordingly.&lt;br /&gt;There are other applications of this script besides using it on a corpus made of well-defined documents. One could very well imagine applying this to a corpus subdivided into chunks of text using a text segmentation algorithm. One could then try to find passages on the same topic(s) using a combination of LDA and this script. The Archives parlementaires could be a good test case.&lt;br /&gt;Another option would be to run every document of a corpus against the whole corpus and store all the results in a SQL database. 
This would allow having a corpus where each document can be linked to various others according to the mixture of topics they are made of.&lt;br /&gt;I will try to give more concrete results some time soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3453832048214306120?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/back-to-comparing-similar-documents.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3453832048214306120'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3453832048214306120'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/back-to-comparing-similar-documents.html' title='back to comparing similar documents'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-5826131627373545331</id><published>2009-10-26T11:30:00.002-05:00</published><updated>2009-10-26T11:32:10.607-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><category scheme='http://www.blogger.com/atom/ns#' term='sLDA'/><title type='text'>Supervised LDA: Preliminary Results on Homer</title><content type='html'>&lt;div&gt;While Clovis has been running&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; &lt;a 
href="http://artfl.blogspot.com/2009/08/preliminary-results-on-topic-modeling.html#comments"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;LDA tests on &lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 20px;"&gt;&lt;a href="http://artfl.blogspot.com/2009/08/preliminary-results-on-topic-modeling.html#comments"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;Encyclopédie&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt; text&lt;/span&gt;s using the &lt;a href="http://mallet.cs.umass.edu/index.php"&gt;Mallet&lt;/a&gt; code, I have been running some tests using the sLDA algorithm. After a few minor glitches, Richard and I managed to get the &lt;a href="http://www.cs.princeton.edu/%7Echongw/slda/"&gt;sLDA code&lt;/a&gt;, written by Chong Wang and David Blei,  from &lt;a href="http://www.cs.princeton.edu/%7Eblei/"&gt;Blei's website&lt;/a&gt; up and running. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unlike LDA, &lt;a href="http://www.cs.princeton.edu/%7Eblei/papers/BleiMcAuliffe2007.pdf"&gt;sLDA&lt;/a&gt; (Supervised Latent Dirichlet Allocation), requires a training set of documents paired with corresponding class labels or responses. As Blei suggests, these can be categories, responses, ratings, counts or many other things. In my experiments on Homeric texts, I used only two classes, corresponding to Homer's two major works: the Iliad and the Odyssey. Akin to LDA, topics are inferred from the given texts and a model is made of the data. This model, having seen the class labels of the texts it was trained on, can then be used to infer the class labels of previously unseen data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For my experiments, I modified the xml versions of the Homer texts that we have on hand using a few simple perl scripts. 
Getting the xml transformed into an acceptable format for Wang's code required a bit of finagling, but was not too terrible. My scripts first took the xml and split it into books (the 24 books of the Iliad and likewise for the Odyssey), then stripped the xml tags from the text. Saving out four books from each text for applying the inference step, I took the rest of the books and output the corresponding data file necessary for input into the algorithm (&lt;a href="http://www.cs.princeton.edu/%7Echongw/slda/readme.txt"&gt;data format here&lt;/a&gt;). &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I played around a bit with leaving out words that occurred extremely frequently or extremely rarely. For the results I am posting here, the English vocabulary was vast and I cut it down to words that occurred between 10 and 60 times. This probably cuts it down too much though, so it would be good to try some variations. Richard has suggested also cutting out the proper nouns before running sLDA in order to focus more on the semantic topics. For the Greek vocabulary, I used the words occurring between 3 and 100 times, after stripping out the accents.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Running the inference part of sLDA on the 8 books that I had saved out seemed to work quite well. It got all 8 correctly labeled as to whether they belonged to the Iliad or to the Odyssey. In a reverse run, the inference was able to again achieve 100 percent accuracy on labeling the 40 books after having been trained on only the 8 remaining books. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The raw results of the trials give a matrix of betas with a column for each word, and a row for each topic. These betas thus give a log based weighting of each word in each topic. Following this are the etas, with a column for each topic and a row for each class. These etas give the weightings of each topic in each class, as far as I understand it. 
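&lt;br /&gt;&lt;br /&gt;As a rough illustration of how matrices shaped like this can be read, here is a minimal Python sketch. The vocabulary, the beta and eta values, and the function name are all made up for the example and are not taken from the sLDA output; it simply assumes one row of log weights per topic and one row of topic weights per class, as described above.&lt;br /&gt;&lt;pre&gt;
```python
# Toy data, invented for illustration: real values would be read from the
# model file written by the sLDA code.
vocab = ["ship", "spear", "sea", "shield"]
betas = [
    [-0.5, -3.0, -0.9, -4.0],   # topic 0: log weight of each word
    [-3.5, -0.4, -4.2, -0.8],   # topic 1
]
etas = [
    [1.2, -0.7],                # class 0: weight of each topic
    [-0.9, 1.5],                # class 1
]

def top_indices(weights, n):
    """Return the positions of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]

# Top words in each topic, read off the rows of the beta matrix.
for topic_id, row in enumerate(betas):
    print("topic", topic_id, [vocab[i] for i in top_indices(row, 2)])

# Top topic in each class, read off the rows of the eta matrix.
for class_id, row in enumerate(etas):
    print("class", class_id, top_indices(row, 1))
```
&lt;/pre&gt;Because the betas are log weights, the largest (least negative) value in a row marks the word that counts most in that topic.&lt;br /&gt;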
Richard and I slightly altered the sLDA code to output an eta for each class, rather than one less than the number of classes as it was giving us. As far as we understand the algorithm as presented in Blei's paper, it should be giving us an eta for each class. Our modification didn't seem to break anything, so we are assuming that it worked, as the results seem to be looking nice. Using the final model data, I have a perl script that outputs the top words in each topic along with the top topics in each class. These are the results that I am giving below.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Results of my sLDA Experiments on Homer:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;English Text:&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords10.txt"&gt;10 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;Greek Text:&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords10.txt"&gt;10 Topics&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;               &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords25.txt"&gt;25 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                     &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords25.txt"&gt;25 Topics&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;              &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords50.txt"&gt;50 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                   &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords50.txt" charset="UTF-8"&gt;50 Topics&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" 
style="white-space: pre;"&gt;              &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords75.txt"&gt;75 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                     &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords75.txt"&gt;75 Topics&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords100.txt"&gt;100 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                    &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords100.txt"&gt;100 Topics&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;               &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords150.txt"&gt;150 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                    &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords150.txt"&gt;150 Topics&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;               &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords200.txt"&gt;200 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                       &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords200.txt"&gt;200 Topics&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;               &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/TopWords250.txt"&gt;250 Topics&lt;/a&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;                    &lt;/span&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/GTopWords250.txt"&gt;250 Topics&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Also, samples of the output from Blei and Wang's code, corresponding to the 
English Text with 100 topics:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/final.model.text"&gt;Final Model&lt;/a&gt;: gives the betas and the etas which I used to output my results&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/likelihood.dat"&gt;Likelihood&lt;/a&gt;: the likelihood of these documents, given the model&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/final.gamma"&gt;Gammas&lt;/a&gt;&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/word-assignments.dat"&gt;Word-assignments&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/inf-labels.dat"&gt;Inferred Labels&lt;/a&gt;: Iliad has label '0', Odyssey has label '1'.&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/inf-likelihood.dat"&gt;Inferred Likelihood&lt;/a&gt;: the likelihood of the previously unseen texts&lt;br /&gt;&lt;a href="http://nettest625.uchicago.edu/sLDA/inf-gamma.dat"&gt;Inferred Gammas&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I have not played around much with the gammas, but they seem to give a weighting of each topic in each document. Thus you could figure out in which book of the Iliad or the Odyssey a specific topic is most prevalent. 
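&lt;br /&gt;&lt;br /&gt;As a minimal sketch of that idea, the following Python finds the document in which a given topic has the largest share. It assumes the gamma file holds one row per document with one whitespace-separated weight per topic; that layout is an assumption, not something checked against the sLDA code. The sample numbers and the function names parse_gammas and top_document_for_topic are made up for the example.&lt;br /&gt;
&lt;pre&gt;
```python
def parse_gammas(text):
    """Parse one row of whitespace-separated topic weights per document."""
    return [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]


def top_document_for_topic(gammas, topic):
    """Return (document index, share) where the topic's share is largest."""
    # Normalize each row so documents of different lengths are comparable.
    shares = [row[topic] / sum(row) for row in gammas]
    best = max(range(len(shares)), key=shares.__getitem__)
    return best, shares[best]


# Made-up gammas for three documents and two topics, for illustration only.
sample = """
8.0 2.0
1.0 9.0
5.0 5.0
"""
gammas = parse_gammas(sample)
print(top_document_for_topic(gammas, topic=1))
```
&lt;/pre&gt;
With one row per book, the returned index would be the position of the book in the input file, so the mapping back to book numbers depends on the order in which the books were written out.&lt;br /&gt;&lt;br /&gt;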
It would be interesting to see if this correctly pinpoints which book the cyclops comes in for instance, as this is a fairly easily identifiable topic in most of the trials.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-5826131627373545331?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/supervised-lda-preliminary-results-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5826131627373545331'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5826131627373545331'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/supervised-lda-preliminary-results-on.html' title='Supervised LDA: Preliminary Results on Homer'/><author><name>Kristin</name><uri>http://www.blogger.com/profile/16706344780694707122</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-73937886701055528</id><published>2009-10-22T14:08:00.002-05:00</published><updated>2009-10-22T16:17:42.083-05:00</updated><title type='text'>Encyclopédie Renvois Search/Linker</title><content type='html'>During the summer (2009), a user (UofC PhD, tenured elsewhere) wrote to ask if there was any way to search the &lt;a href="http://encyclopedie.uchicago.edu/"&gt;Encyclopédie&lt;/a&gt; and "generate a list of all articles that cross-reference a given article".   
We went back and forth a bit, and I slapped a little toy together and let him play with it, to which his reply was "Oh, this is cool!  Five minutes of playing with the search engine and I can tell you it shows fun stuff...".   This is, of course, an excellent suggestion which we have talked about in the past, usually in the context of visualizing relationships of articles in various ways.   At the highest level, visualizing the relationships of the &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; is what Gilles and I attempted to do in our general "&lt;a href="http://rde.revues.org/index122.html"&gt;cartography paper&lt;/a&gt;"[1] and, more recently, Robert and Glenn (et al.) tried, in a radically different way, to do in their work on "&lt;a href="http://docs.google.com/present/view?id=dfddkspw_179ckcrtbcd&amp;amp;skipauth=true"&gt;centroids&lt;/a&gt;"[2].&lt;br /&gt;&lt;br /&gt;The current implementation of the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; under PhiloLogic will allow users to follow &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; links (within operational limits to be outlined below), but does not support searching and navigating the &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; in any kind of systematic fashion.   Since this is something I think warrants further consideration, I thought it might be helpful to document this toy, give some examples, let folks play with it, outline some of the current issues, and conclude with some ideas about what might be done going forward.&lt;br /&gt;&lt;br /&gt;To construct this toy, I wrote a recognizer to extract metadata for each article in the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; which has one or more &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt;.  
As part of the original development of the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt;, each cross reference was automatically detected from certain typographic and lexical clues.  This resulted in roughly 61,000 cross-references.  Accordingly, the extracted database has 61,000 records.  I loaded these into a simple MySQL database and used a standard script to support searching and reporting.  The search parameters may include article headwords, authors, normalized and English classes of knowledge as well as the term(s) being cross referenced.   For example, there are 39 cross-referenced article pairs for the headword &lt;a href="http://artflx.uchicago.edu/cgi-bin/extras/encarts2renvois.pl?headword=estomac"&gt;estomac&lt;/a&gt;.   As you can see from the output, I'm listing the headword, author, classes of knowledge, and the cross referenced term.  You can get the article of the cross referenced term or the cross-references in that article.  Thus, the second example shows the link to Digestion:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==&gt; Digestion || renvois&lt;br /&gt;   [The renvois of Digestion find 56 article pairs, including one to intestins]&lt;br /&gt;DIGESTION (Venel: Economie animale, Animal economy ) ==&gt; Intestins || renvois&lt;br /&gt;Intestins (unknown: Anatomie, Anatomy ) ==&gt; Chyle || renvois&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;and so on ==&gt;lymphe==&gt;sang==&gt;&lt;span style="font-style: italic;"&gt;ad nauseam&lt;/span&gt;.   
No, there is no &lt;span style="font-style: italic;"&gt;ad nauseam&lt;/span&gt;, just how you might feel after going round and round.&lt;br /&gt;&lt;br /&gt;Now, there are problems, but please go ahead and play with this now using the &lt;a href="http://encyclopedie.uchicago.edu/node/173"&gt;submit form&lt;/a&gt;, as long as you promise to come back and read through the rest of this and let me know about any other problems.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Problems&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As noted above, the renvois were identified automatically.  And as with most of these things, it worked reasonably well.  But you will see link errors and other things which indicate problems.  Glenn reported these to me and I was going to eliminate them.  On second thought, this little toy lets us consider the &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; rather more systematically.  Where you see a link error, it is (probably) a recognizer error: the recognizer either failed to get a string to link or got confused by some typography.  The linking mechanism itself is based on string searches.  In other words, whenever you click on a &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt;, you are in fact performing a search on the headwords.   This simple heuristic works reasonably well, returning string matched headwords.  In some cases, you get nothing because there is no headword that has the &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; word(s), and at other times you will get quite a list of articles, which may or may not include what the authors/editors intended.  It is, of course, well known that many renvois simply don't correspond to an article and many others differ in various ways from the article headwords.  I am also applying a few rules to renvois searching to try to improve recall and reduce noise.  
So, this also adds another level of indirection.&lt;br /&gt;&lt;br /&gt;Now, ideally, one would go through the entire database, examine each &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; and build a direct link to the &lt;span style="font-weight: bold;"&gt;one&lt;/span&gt; article that the authors/editors intended.  But we're talking 60,000+ &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; against 72,000 (or so) articles and it is not clear that humans could resolve this in many instances.  When Gilles and I worked on this, we used a series of (long forgotten) heuristics to filter out noise and errors.  So, this simple toy works within operational limits and gives us a way to more systematically identify possible errors and ways to improve it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Future Work&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Aside from being a quick and dirty way to get some notion of errors in the&lt;span style="font-style: italic;"&gt; renvois&lt;/span&gt;, we might be able to make this more presentable.  Please feel free to play with this and suggest ways to think about it.  In the long haul, I would &lt;span style="font-weight: bold;"&gt;love&lt;/span&gt; a totally cool visualization.  A clickable directed graph, so you could click on a node and re-center it on another article, or class of knowledge or author.  Maybe something like &lt;a href="http://www.visualcomplexity.com/vc/project.cfm?id=288"&gt;Tricot's &lt;/a&gt;representation of the classes of knowledge.  Or maybe something like &lt;a href="http://www.cs.utoronto.ca/%7Eccollins/research/docuburst/index.html"&gt;DocuBurst&lt;/a&gt;.   
Marti Hearst's chapter on &lt;a href="http://searchuserinterfaces.com/book/sui_ch11_text_analysis_visualization.html"&gt;visualizing text analysis&lt;/a&gt; is a treasure trove of great ideas.&lt;br /&gt;&lt;br /&gt;For the immediate term, I would like to recast this simple model to allow the user to specify the number of steps to follow, so you would get something like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&lt;span style="font-size:60%;"&gt;ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==&gt; Digestion || renvois&lt;br /&gt;  DIGESTION (Venel: Economie animale, Animal economy ) ==&gt; Intestins || renvois&lt;br /&gt;        Intestins (unknown: Anatomie, Anatomy ) ==&gt; Viscere || renvois&lt;br /&gt;ESTOMAC, ventriculus (Tarin: Anatomie, Anatomy ) ==&gt; Chyle || renvois&lt;br /&gt;  CHYLE (Tarin: Anatomie | Physiologie, Anatomy. Physiology ) ==&gt; Sanguification || renvois&lt;br /&gt;        SANGUIFICATION (unknown: Physiologie, Physiology ) ==&gt; Respiration || renvois&lt;br /&gt;              RESPIRATION (unknown: Anatomie | Physiologie, Anatomy | Physiology ) ==&gt; Air || renvois&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This would follow chains of &lt;span style="font-style: italic;"&gt;renvois&lt;/span&gt; until you either run out or hit an iteration limit.  
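&lt;br /&gt;&lt;br /&gt;The multi-step idea above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code behind the toy: the cross-reference table is a hypothetical in-memory dictionary rather than the MySQL database, headwords are matched by a simple lowercase lookup rather than the string search described earlier, and the names RENVOIS and follow_renvois are invented for the example.&lt;br /&gt;
&lt;pre&gt;
```python
# Hypothetical cross-reference table: lowercase headword -> renvois targets.
# The entries mirror the examples in this post, not the full database.
RENVOIS = {
    "estomac": ["Digestion", "Chyle"],
    "digestion": ["Intestins"],
    "intestins": ["Viscere"],
    "chyle": ["Sanguification"],
    "sanguification": ["Respiration"],
    "respiration": ["Air"],
}


def follow_renvois(headword, max_steps, table=RENVOIS):
    """Return every chain of renvois that starts at headword.

    A chain stops when the current article has no unvisited renvois
    or when max_steps links have been followed.
    """
    chains = []

    def walk(chain):
        seen = {c.lower() for c in chain}
        targets = table.get(chain[-1].lower(), [])
        # Skip targets already in the chain so cycles cannot loop forever.
        targets = [t for t in targets if t.lower() not in seen]
        if not targets or len(chain) - 1 == max_steps:
            chains.append(chain)
            return
        for target in targets:
            walk(chain + [target])

    walk([headword])
    return chains


for chain in follow_renvois("Estomac", max_steps=3):
    print(" ==> ".join(chain))
```
&lt;/pre&gt;
The cycle check matters because cross-references can point back at an article already in the chain; without it, only the step limit would stop the walk.&lt;br /&gt;&lt;br /&gt;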
I will try to follow this up with the multi-iteration model and see if I can recover some of what Liz tried to do using &lt;a href="http://www.graphviz.org/"&gt;GraphViz&lt;/a&gt; to generate clickable directed graphs.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;References&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;[1] Gilles &lt;span class="smallcaps"&gt;Blanchard&lt;/span&gt; et Mark &lt;span class="smallcaps"&gt;Olsen&lt;/span&gt;, « Le système de renvoi dans l’&lt;em&gt;Encyclopédie&lt;/em&gt;: Une cartographie des structures de connaissances au &lt;span style="font-variant: small-caps;"&gt;XVIII&lt;/span&gt;&lt;sup&gt;e&lt;/sup&gt; siècle », &lt;em&gt;Recherches sur Diderot et sur l'Encyclopédie&lt;/em&gt;, numéro 31-32 &lt;em&gt;L'Encyclopédie en ses nouveaux atours électroniques: vices et vertus du virtuel&lt;/em&gt;, (2002) [En ligne], mis en ligne le 16 mars 2008.&lt;br /&gt;&lt;br /&gt;[2] Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer, "Re-engineering the tree of knowledge: Vector space analysis and centroid-based clustering in the &lt;i&gt;Encyclopédie&lt;/i&gt;", Digital Humanities 2008, University of Oulu, Oulu, Finland, June 25-29, 2008&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-73937886701055528?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/encyclopedie-renvois-searchlinker.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/73937886701055528'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/73937886701055528'/><link rel='alternate' type='text/html' 
href='http://artfl.blogspot.com/2009/10/encyclopedie-renvois-searchlinker.html' title='Encyclopédie Renvois Search/Linker'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-4975997298926567867</id><published>2009-10-06T14:48:00.007-05:00</published><updated>2009-10-07T12:25:52.653-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Archives Parlementaires'/><title type='text'>Archives Parlementaires: lèse (more)</title><content type='html'>As I mentioned in my last in this thread, I was a bit surprised to see just how prevalent the construction &lt;span style="font-style: italic;"&gt;lèse nation&lt;/span&gt; had become early in the Revolution.   The following is a sorted KWIC of lEse in the AP, with the object type restricted to "&lt;a href="http://en.wikipedia.org/wiki/Cahiers_de_dol%C3%A9ances"&gt;cahiers&lt;/a&gt;", resulting in 38 occurrences.  These are, of course, the complaints sent to the King, reflecting relatively early developments of Revolutionary discourse.   Keeping in mind all of the caveats regarding this data, we can see some interesting and possibly contradictory uses:&lt;br /&gt;&lt;span style="font-size:70%;"&gt;&lt;pre&gt;CAHIER: (p.319)sent être, comme criminels de lèse-humanité au premier chef, et ils se joindront au&lt;br /&gt;CAHIER GÉN...: (p.77)manière de juger, qui   lèse les droits les plus sacrés des citoyens, doit av&lt;br /&gt;CAHIER: (p.697)r individus, cette concession lèse les et avoir eu d'autre mo dre une r {La partie d&lt;br /&gt;CAHIER: (p.108)e, excepté dans les crimes de lèse-majesté au premier chef. Art. 33. 
Qu'aucun jugem&lt;br /&gt;CAHIER: (p.791) si ce n'est pour le crime de lèse-majesté au premier chef, et réduite aux seuls c&lt;br /&gt;CAHIER: (p.448)té seulement pour le crime de lèse-majesté au premier chef ou pour celui de haute t&lt;br /&gt;CAHIER: (p.409)s choses saintes, et crime de lèse-majesté, dans tous les cas spécifiés par l'ord&lt;br /&gt;CAHIER: (p.260)istériels, sauf pour crime de lêse-majesté, de haute trahison et autres cas, qui se&lt;br /&gt;CAHIER: (p.42)e, à l'exception des crimes de lèse-majesté, de péculat et de concussion; mais, dan&lt;br /&gt;CAHIER: (p.780), si ce n'était pour crime de lèse-majesté divine et humaine. Art. 9. Qu'ii soit as&lt;br /&gt;CAHIER: (p.476)ée, si ce n'est pour crime de lèse-majesté divine et humaine. Art. 8. Qu'il soit as&lt;br /&gt;CAHIER: (p.584)our le meurtre et le crime de lèse-majesté divine ou humaine, et que hors de ce cas&lt;br /&gt;CAHIER: (p.378)ont seuls juges des crimes de lèse-majesté et de lèse-nation. Art. 8. Le compte de&lt;br /&gt;CAHIER: (p.42)re précise ce qui est crime de lèse-majesté. Et que l'on établisse quels sont les c&lt;br /&gt;CAHIER.: (p.117)déclaré coupable du crime de lèse-majesté etnation. 
et comme tel, puni des peines&lt;br /&gt;CAHIER GÉN...: (p.671) excepté le crime de   lèse-majesté, le poison, l'incendie et assassinat sur&lt;br /&gt;CAHIER: (p.660) les cas, excepté le crime de lèse majesté, le poison, l'incendie et assassinat sur&lt;br /&gt;CAHIER: (p.532)hommes coupables elu crime de lèse-majesté nationale; l'exemple elu passé nous a m&lt;br /&gt;CAHIER: (p.645)poursuivis comme criminels de lèse-majesté nationale; que visite soit faite dans le&lt;br /&gt;CAHIER: (p.383)s par elle comme criminels de lèse-majesté, quand ils tromperont la confiance du so&lt;br /&gt;CAHIER: (p.286)s crimes de lèse-nation ou de lèse-majesté seulement; et que, dans ce cas, l'accus&lt;br /&gt;CAHIER GÉN...: (p.210)ni comme criminel de   lèse-majesté; 4° Cette loi protectrice de la libert&lt;br /&gt;CAHIER: (p.35)rrémissibles comme le crime de lese-majesté. 13° 'Qu'en matière civile comme en mat&lt;br /&gt;CAHIER: (p.378) crimes de lèse-majesté et de lèse-nation. Art. 8. Le compte des finances imprimé a&lt;br /&gt;CAHIER: (p.359) crimes de lèsemajesté, et de lèse-nation, ce qui comprend les crimes d'Etat. 7° En&lt;br /&gt;CAHIER: (p.301)ort infâme, comme coupable de lèse-nation, celui qui sera convaincu d'avoir violé c&lt;br /&gt;CAHIER.: (p.536) et punis comme coupables de lèse nation. 17" De demander 1 aliénation irrévocabl&lt;br /&gt;CAHIER: (p.82)x, sera déclarée criminelle de lèse-nation et poursuivie comme telle, soit par les Et&lt;br /&gt;CAHIER: (p.402)tte règle seront coupables de lèse-nation et poursuivis comme tels dès qu'ils auron&lt;br /&gt;CAHIER: (p.285) patrie, coupable du crime de lèse-nation, et puniecomme telle par le tribunal qu'é&lt;br /&gt;CAHIER: (p.544) coupables de rébellion et de lèse-nation, favoriser la violation de la constitution&lt;br /&gt;CAHIER: (p.42)lisse quels sont les crimes de lèse-nation. 
Le vœu des bailliages est que les ressor&lt;br /&gt;CAHIER: (p.285)n user que pour {es crimes de lèse-nation ou de lèse-majesté seulement; et que, da&lt;br /&gt;CAHIER: (p.402)s généraux, comme coupable de lèse-nation; que les impositions seront réparties dan&lt;br /&gt;CAHIER: (p.320)e défendre, c'est un crime de lèse-nation. Qui pourrait nier que dans la génératio&lt;br /&gt;CAHIER: (p.388)-mêmes; déclarant criminel de lèse-nation tous ceux qui pourraient entreprendre dire&lt;br /&gt;CAHIER.: (p.249)sions. Ce serait vu crime de lèse-patrie de ne pas correspondre à sa confiance pat&lt;br /&gt;CAHIER GÉN...: (p.221)i serait un crime de   lèse-patrie. 2° De demander l'abolition de la gabelle&lt;br /&gt;&lt;/pre&gt;&lt;/span&gt;These include "&lt;span style="font-style: italic;"&gt;lèse-majesté nationale&lt;/span&gt;", "&lt;span style="font-style: italic;"&gt;lèse-majesté et nation&lt;/span&gt;" (OCR error fixed), "&lt;span style="font-style: italic;"&gt;crimes de lèse-majesté et de lèse-nation&lt;/span&gt;", and (my favorite) "&lt;span style="font-style: italic;"&gt;crime de lèse-majesté divine et humaine&lt;/span&gt;".   Kelly suggests that notions of royal authority had been trimmed over the 18th century, and that with this reduction came a restriction of just what would constitute &lt;span style="font-style: italic;"&gt;lèse-majesté&lt;/span&gt; and to what kinds of crimes it would apply.  He argues that it was only in 1787, with the Assembly of Notables, that the idea of the nation "begins to take shape in a public glare" and further suggests that the decrees of September 1789 setting out the punishments for &lt;span style="font-style: italic;"&gt;lèse-nation&lt;/span&gt; (and subsequent events) show the "confused and arbitrary genesis of &lt;span style="font-style: italic;"&gt;lèse-nation&lt;/span&gt;".   
&lt;br /&gt;&lt;br /&gt;See also the 11 entries in our &lt;a href="http://artfl-project.uchicago.edu/node/17"&gt;Dictionnaires d'autrefois&lt;/a&gt; for &lt;a href="http://artflx.uchicago.edu/cgi-bin/dicos/pubdico1look.pl?strippedhw=lese"&gt;lese&lt;/a&gt;  which stress &lt;span style="font-style: italic;"&gt;lèse-majesté &lt;/span&gt;through the entire period with &lt;span style="font-style: italic;"&gt;lèse-nation &lt;/span&gt;being left as an after-thought, such as in the DAF (8th edition):  "Il se joint quelquefois, par analogie, à d'autres noms féminins. &lt;i&gt;Crime de lèse-humanité, de lèse-nation, de lèse-patrie."   &lt;/i&gt;One should not construe this as excessively conservative, however, since &lt;span style="font-style: italic;"&gt;lèse-majesté &lt;/span&gt;is, by far, the most common construction in the 19th and 20th centuries (at least as represented in ARTFL-Frantext).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-4975997298926567867?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-lese-more.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4975997298926567867'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4975997298926567867'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-lese-more.html' title='Archives Parlementaires: lèse (more)'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-6185805571109761711</id><published>2009-10-04T09:41:00.010-05:00</published><updated>2009-10-13T17:47:54.581-05:00</updated><title type='text'>Topic Based Text Segmentation Goodies</title><content type='html'>As you may recall, Clovis ran some experiments this summer (2009) applying a perl implementation of Marti Hearst's TextTiling algorithm to perform topic based text segmentation on different French documents (see his &lt;a href="http://artfl.blogspot.com/2009/07/experiment-on-text-segmentation.html"&gt;blog post&lt;/a&gt; and related files).   Clovis reasonably suggests that some types of literary documents, such as epistolary novels, may be more suitable candidates than other types, because they do not have the same degree of structural cohesion.   Now, as I mentioned in my first discussion of the &lt;a href="http://artfl.blogspot.com/2009/10/archives-parlementaires-i.html"&gt;Archives Parlementaires&lt;/a&gt;, I suspect that this collection may be particularly well suited to topic based segmentation.  At the end of his post, Clovis also suggests that we might be able to test how well a particular segmentation approach is working by using a clustering algorithm, such as LDA Topic Modeling, to see if the segments can be shown to be reasonably cohesive.  
Both topic segmentation and modeling are difficult to assess because human readers/evaluators can have rather different opinions, leading to problems in "&lt;a href="http://en.wikipedia.org/wiki/Inter-rater_reliability"&gt;inter-rater reliability&lt;/a&gt;", which is probably a more vexing problem in the humanities and related areas of textual studies than in other domains.&lt;br /&gt;&lt;br /&gt;Earlier this year (and a bit last year), I also ran some experiments on some 18th century English materials, such as Hume's &lt;span style="font-style: italic;"&gt;History of England&lt;/span&gt; and the &lt;span style="font-style: italic;"&gt;Federalist Papers&lt;/span&gt;.    Encouraged by these results, particularly on the &lt;span style="font-style: italic;"&gt;Federalist Papers&lt;/span&gt;, I have accumulated a number of newer algorithms, packages, and papers which may be useful for future work in this area.   These are on my machine (for ARTFL folks, let me know if you want to know where), but I will not redistribute them here as a couple of packages require non-redistribution or other limitations.   I am putting links to some of the source files, when I have them.&lt;br /&gt;&lt;br /&gt;Since Hearst's original work, there have been a number of different approaches to topic based text segmentation.  Clovis and I have tried to make note of much of this work on our CiteULike references (&lt;a href="http://www.citeulike.org/group/2914/tag/segmentation"&gt;segmentation&lt;/a&gt;).   There is some overlap with Shlomo's &lt;a href="http://www.citeulike.org/user/argamon/tag/segmentation"&gt;list&lt;/a&gt;.   In no particular order of preference or chronology, here is what I have so far.   I will also try to provide some details on using these when I have a chance to run them up.&lt;br /&gt;&lt;br /&gt;From the Columbia NLP group (http://www1.cs.columbia.edu/nlp/tools.cgi),  we have both Min-Yen Kan's Segmenter and Michael Galley's LCSeg.  
These required signing a use agreement, which I have in my office.  The release archives for both have papers and some test data.&lt;br /&gt;&lt;br /&gt;I spent some time trying to track down Freddy Choi's C99 algorithm and implementation described in some &lt;a href="http://www.citeulike.org/group/2914/article/937000"&gt;papers &lt;/a&gt;in the early part of this decade.  I finally tracked it all down on the WayBack Machine at Internet Archive (&lt;a href="http://web.archive.org/web/20040810103924/http://www.cs.man.ac.uk/%7Emary/choif/software.html"&gt;link&lt;/a&gt;, thank you!!), which also has some papers, software, data and implementations of TextTiling and other approaches from that period. It appears several of the packages below use C99 and some of the code from this.&lt;br /&gt;&lt;br /&gt;I was going to reference Utiyama and Isahara's implementation (TextSeg), but in the few months since I assembled this list, the link has (also) gone dead:&lt;br /&gt;http://www2.nict.go.jp/x/x161/members/mutiyama/software.html#textseg&lt;br /&gt;This appears to be a combination of approaches.&lt;br /&gt;&lt;br /&gt;Igor Malioutov's MinCut code (2006) is available from his page:&lt;br /&gt;http://people.csail.mit.edu/igorm/acl06code.html&lt;br /&gt;&lt;br /&gt;There appears to be some info on TextTiling in Simon Cozens (2006), "Advanced Perl Programming".&lt;br /&gt;&lt;br /&gt;We also want to check out Beeferman et al. (&lt;a href="http://www.citeulike.org/user/markymaypo/article/939641"&gt;link&lt;/a&gt;) since I recall that this group had done some interesting work.   I have Beeferman's implementation of TextTiling in C, but don't think I have run across anything else.&lt;br /&gt;&lt;br /&gt;If you run across anything useful, please blog it here or let me know.  Papers should be noted on our CiteULike.  
Thanks!!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-6185805571109761711?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/topic-based-text-segmentation-goodies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6185805571109761711'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6185805571109761711'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/topic-based-text-segmentation-goodies.html' title='Topic Based Text Segmentation Goodies'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-815002766376696123</id><published>2009-10-03T11:21:00.010-05:00</published><updated>2009-10-06T14:34:30.048-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Archives Parlementaires'/><title type='text'>Archives Parlementaires: lèse collocations</title><content type='html'>The collocation table function of PhiloLogic is a quick way to look at changes in word use.   &lt;a href="http://en.wikipedia.org/wiki/L%C3%A8se_majest%C3%A9"&gt;&lt;span style="font-size:100%;"&gt;Lèse majesté&lt;/span&gt;&lt;/a&gt;, treason or injuries against the dignity of the sovereign or state, is a common expression.  
The collocation table below shows terms around "lese  | leze  | lèse  | lèze  | lése  | léze" in ARTFL-Frantext (550 documents, 1700-1787) with &lt;span style="font-style: italic;"&gt;majesté&lt;/span&gt; being by far the most common.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_SNpwD2mXiMo/Ssd7taeuH6I/AAAAAAAAAk0/UiWpYwMkvEg/s1600-h/lese.frantext.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 301px;" src="http://1.bp.blogspot.com/_SNpwD2mXiMo/Ssd7taeuH6I/AAAAAAAAAk0/UiWpYwMkvEg/s400/lese.frantext.gif" alt="" id="BLOGGER_PHOTO_ID_5388411499304591266" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It is interesting to note that the construction "&lt;span style="font-style: italic;"&gt;lèse nation&lt;/span&gt;" does not appear once in this report.   Searching for "&lt;span style="font-style: italic;"&gt;lèse nation&lt;/span&gt;" before the Revolution in ARTFL-Frantext finds a single occurrence, in &lt;a href="http://en.wikipedia.org/wiki/Honor%C3%A9_Gabriel_Riqueti,_comte_de_Mirabeau"&gt;Mirabeau&lt;/a&gt;'s [1780] &lt;span style="font-style: italic;"&gt;Lettres écrites du donjon de Vincennes&lt;/span&gt;, where he complains that "toute invocation de lettre-de-cachet me paraît un crime de &lt;span style="color: rgb(204, 51, 0);"&gt;&lt;b&gt;lèse&lt;/b&gt;&lt;/span&gt;-&lt;span style="color: rgb(204, 51, 0);"&gt;&lt;b&gt;nation&lt;/b&gt;&lt;/span&gt;".  
The collocation table for lEse in the current sample of the Archives Parlementaires (there are no instances of the lEze in this dataset), shows the &lt;span style="font-style: italic;"&gt;lèse nation &lt;/span&gt;construction to be far more frequent.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_SNpwD2mXiMo/SseAIy3dSiI/AAAAAAAAAk8/dyqyZNZTd-o/s1600-h/lese.ap.all.gif"&gt;&lt;img style="cursor: pointer; width: 400px; height: 305px;" src="http://2.bp.blogspot.com/_SNpwD2mXiMo/SseAIy3dSiI/AAAAAAAAAk8/dyqyZNZTd-o/s400/lese.ap.all.gif" alt="" id="BLOGGER_PHOTO_ID_5388416367753775650" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;There have been discussions* of the transition from &lt;span style="font-style: italic;"&gt;lèse majesté&lt;/span&gt; to &lt;span style="font-style: italic;"&gt;lèse nation&lt;/span&gt;, which is clearly shown here.  Now, a reasonable objection to this is that this report includes the entire (as much as we have at the moment) revolutionary period.   But we see roughly the same rates and ranking for &lt;span style="font-style: italic;"&gt;lèse &lt;/span&gt;in 1789.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_SNpwD2mXiMo/SseCecNVzMI/AAAAAAAAAlE/Ldav8HtXEgg/s1600-h/lese.ap.89.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 308px;" src="http://4.bp.blogspot.com/_SNpwD2mXiMo/SseCecNVzMI/AAAAAAAAAlE/Ldav8HtXEgg/s400/lese.ap.89.gif" alt="" id="BLOGGER_PHOTO_ID_5388418938651921602" border="0" /&gt;&lt;/a&gt;It would appear -- I would not put too much credit in these numbers -- that the shift from majesty to nation, and all that this implies in terms of the way state is envisaged, was well under way by 1789.  
This either happened very quickly in the years leading up to the Revolution, since the construction appears just once in ARTFL-Frantext before then, or was a development that took place in types of documents not found in the rather more literary/canonical sample in ARTFL-Frantext, such as journals, pamphlets, and other more ephemeral materials.  I guess data entry projects will never end.&lt;br /&gt;&lt;br /&gt;One other observation.  I like the collocation cloud as a graphic.  But if you examine the table, you may notice that the cloud does not really represent the frequency differences all that well.  The second table -- all of the AP -- shows that nation occurs more than 6 times as frequently as majesté, but differences of that magnitude tend to be rather difficult to show in a cloud.  So, the compromise of providing both is probably the best approach.&lt;br /&gt;&lt;br /&gt;* G. A. Kelly, "From &lt;span style="font-style: italic;"&gt;L&lt;/span&gt;&lt;span style="font-style: italic;"&gt;èse Majesté&lt;/span&gt;&lt;span&gt; to &lt;/span&gt;&lt;span style="font-style: italic;"&gt;L&lt;/span&gt;&lt;span style="font-style: italic;"&gt;èse nation&lt;/span&gt;&lt;span&gt;: Treason in 18th century France", &lt;span style="font-style: italic;"&gt;Journal of the History of Ideas&lt;/span&gt;, 42 (1981): 269-286 (&lt;a href="http://www.jstor.org/stable/2709320"&gt;JStor&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-815002766376696123?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-lese.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/815002766376696123'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8901065416749663157/posts/default/815002766376696123'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-lese.html' title='Archives Parlementaires: lèse collocations'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_SNpwD2mXiMo/Ssd7taeuH6I/AAAAAAAAAk0/UiWpYwMkvEg/s72-c/lese.frantext.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-6071314656142964794</id><published>2009-10-02T11:24:00.006-05:00</published><updated>2009-10-03T10:57:42.403-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Archives Parlementaires'/><title type='text'>Archives Parlementaires (I)</title><content type='html'>A couple of weeks ago, some ARTFL folks discussed the notion of outlining some research and/or development projects that we will be, or would like to be, working on the coming months.   We discussed a wide range of possibilities that could involve substantive work, using some of the systems we have already developed or are working on, or more purely technical work.  Everyone came up with some pretty interesting projects and proposals, and we decided that it might be entertaining and useful for each of us to outline a specific project or two and write periodic entries here as things move forward.  In the cold light of sobriety, this sounds like a pretty good idea.  So, let me be the first to give this a whirl.&lt;br /&gt;&lt;br /&gt;Our colleagues at the Stanford University Library have been digitizing the Archives Parlementaires using the DocWorks system.   
During a recent visit, Dan Edelstein was kind enough to deliver 43 volumes of OCRed text, which represents about half of the entire collection.  Dan and I very hastily assembled an alpha test build of this sample under PhiloLogic.  I converted the source data into a light TEI notation and attempted to identify probable sections in the data, such as "cahiers", "séances", and other plausible divisions using an incredibly simple approach.  Dan built a table to identify volumes and years, which we used to load the dataset in (hopefully) coherent order.  This is a very alpha test build.  It is uncorrected OCR (much of which is surprisingly good) without links to page images.   The volumes are being scanned in no particular order, so we have volumes from a large swath of the collection.  We are hoping to get the rest of the volumes from Stanford in the relatively near future and will be working up a more coherent and user friendly site, with page images and the like.  So, with these caveats, here is the PhiloLogic &lt;a href="http://artfl-project.uchicago.edu/node/94"&gt;search form&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://ihrf.univ-paris1.fr/spip.php?article93"&gt;Archives Parlementaires&lt;/a&gt; are the official, printed record of French legislative assemblies from the beginning of the Revolution (1787) through 1860.   We are interested in the first part of the first series (82 volumes), out of copyright, ending in January 1794, which contains records pertaining to the Constituent Assembly, Legislative Assembly, and the Convention.  The first seven volumes of the AP are the General &lt;a href="http://en.wikipedia.org/wiki/Cahiers_de_dol%C3%A9ances"&gt;Cahiers de doléances&lt;/a&gt;, which are organized by locality and estate (clergy, nobility, and third).  
The rest contain debates, speeches, draft legislation, reports, and many other kinds of materials typically organized by legislative session, often twice daily (morning and evening).&lt;br /&gt;&lt;br /&gt;There will be some general housekeeping required to start.   Some of this will involve writing a better division recognizer, particularly for the Cahiers, whose divisions currently do not include the place name and estate.  I will also need to decide how to handle annexes, editorial materials, notes, etc.  I suspect that it may also be worth some effort to try to correct some of the errors automatically, by simple replacement rules and identification of impossible sequences.  I am also thinking of using proximity measures to try to correct some proper names, such as Bobespierre, Kobespierre, etc.   I would also want to concentrate some effort on terms that may reflect structural divisions.  Dan has suggested identification of speakers, where possible, so one could search the speeches (full and in debates) of specific individuals like Robespierre, but this appears to be fairly problematic, since it is not clear how to identify just where these might stop.&lt;br /&gt;&lt;br /&gt;Loading this data, particularly the complete (or at least out of copyright) dataset, will probably be of general utility to Revolutionary historians, particularly when linked to page images and given some other enhancements.   This will be done in conjunction with our colleagues at Stanford and other researchers.&lt;br /&gt;&lt;br /&gt;I have several rather distinct research efforts in mind.  
There is a series of technical enhancements which I think fit the nature of the data fairly well:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;sequence alignment to identify borrowed passages from earlier works, such as Rousseau and Montesquieu,&lt;/li&gt;&lt;li&gt;topic based text segmentation, to split individual sessions into parts, and,&lt;/li&gt;&lt;li&gt;topic modeling or clustering to attempt to identify the topics of parts identified by topic based segmentation.&lt;/li&gt;&lt;/ul&gt;We have already run experiments using &lt;a href="http://code.google.com/p/text-pair/"&gt;PhiloLine&lt;/a&gt;, the many to many sequence aligner which we are using for various other applications.   As we have found, this works relatively well for uncorrected OCR.  For example, Condorcet in the &lt;i&gt;Séance du vendredi 3 septembre 1790 &lt;/i&gt;[note the OCR error below] &lt;i&gt; &lt;/i&gt;borrows a passage from Voltaire's &lt;span style="font-style: italic;"&gt;Épitres&lt;/span&gt; in his&lt;br /&gt;&lt;p&gt; &lt;span style="font-size:85%;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;blockquote&gt; &lt;p&gt;&lt;span style="font-size:85%;"&gt;Nouvelles réflexions sur le projet de payer la dette exigible en papier forcé, par M. GoNDORCET.  &lt;/span&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-size:85%;"&gt;Un maudit&lt;/span&gt;&lt;span style="font-size:85%;"&gt; Écossais, chassé de son pays, Vint changer tout en France et gâter nos esprits. 
L'espoir trompeur et vain, l'avarice au teint blême, Sous l'abbé Terrasson calculaient son système, Répandaient à grands flols les papiers imposteurs, Vidaient nos coffres-forts et corrompaient no s mœurs.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;span style="font-size:85%;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;span style="font-size:85%;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-size:85%;"&gt;Un maudit écossais, chassé de son pays,&lt;br /&gt;vint changer tout en &lt;pn&gt;France&lt;/pn&gt;, et gâta nos esprits.&lt;br /&gt;L'espoir trompeur et vain, l'avarice au teint blême,&lt;br /&gt;sous l'abbé &lt;pn&gt;Terrasson&lt;/pn&gt; calculant son système,&lt;br /&gt;répandaient à grands flots leurs papiers imposteurs,&lt;br /&gt;vidaient nos coffres-forts, et corrompaient nos&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:85%;"&gt; moeurs;&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;without specific reference to Voltaire (that I could find).  This is generally pretty decent OCR.  The alignments work for poorer quality and where there are significant insertions or deletions.  For example:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;Rousseau, Jean-Jacques, [1758], &lt;i&gt;Lettre à Mr. d'Alembert sur les spectacles&lt;/i&gt;:&lt;br /&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-size:85%;"&gt;autrui des accusations qu'elles croient fausses; tandis qu'en d'autres pays les femmes, également coupables par leur silence et par leurs discours, cachent, de peur de représailles, le mal qu'elles savent, et publient par vengeance celui qu'elles ont inventé. Combien de scandales &lt;/span&gt;&lt;span style="color: rgb(204, 51, 0);font-size:85%;" &gt; publics ne retient pas la crainte de ces sévères observatrices? Elles font presque dans notre ville la fonction de censeurs. 
C'est ainsi que dans les beaux tems de Rome , les citoyens, surveillans les uns des autres, s'accusoient publiquement par zele pour la justice; mais quand Rome fut corrompue et qu'il ne resta plus rien à faire pour les bonnes moeurs que de cacher les mauvaises, la haine des vices qui les démasque en devint un. Aux citoyens zélés succéderent des délateurs infames; et au lieu qu'autrefois les bons accusoient les méchans, ils en furent accusés à leur tour&lt;/span&gt;&lt;span style="font-size:85%;"&gt; . Grâce au ciel, nous sommes loin d'un terme si funeste. Nous ne sommes point réduits à nous cacher à nos propres yeux, de peur de nous faire horreur. Pour moi, je n'en aurai pas meilleure opinion des femmes, quand elles seront plus circonspectes: on se ménagera davantage, quand on&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-size:85%;"&gt; &lt;i&gt;Séance publique du 30 avril 1793, l'an II de la&lt;/i&gt;:&lt;br /&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-size:85%;"&gt;son tribunal n'exerce pas, d'ailleurs, une autorité aussi 1 mu soire qu'on pourrait le croire ; il se fait J"_ tice d'une partie de la violation des lois «j ciales ; ses vengeances sont terribles p l'homme libre, puisque la censure o lst "°" la honte et le mépris : et combien cle st* § dales &lt;/span&gt;&lt;span style="color: rgb(204, 51, 0);font-size:85%;" &gt; publics ne retient pas la crainte m. châtiments ? Dans les beaux temps cle n°*ji les citoyens, surveillants nés les uns a es» s'accusaient publiquement par zèle p % justice. Mais quand Rome fut corrompu^ citoyens zélés succédèrent des oeiai •„ t fâmes; au lieu qu'autrefois les bons accu- -^ les méchants, ils en furent accuses tour &lt;/span&gt;&lt;span style="font-size:85%;"&gt;. -, rla méEn Egypte, la censure ssu_ v moire des morts ; la comédie eut o*" B^^ des un pouvoir plus étendu sur la rep vivants. 
„ •* i„ t-Ole niani^ 1 * L'esprit de l'homme est fait ae te ut rtr-c, encore plus du ridicule que d'un ,»ïl u &lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;/span&gt;The Rousseau passage is found in a speech titled &lt;span style="font-style: italic;"&gt;Nécessité d'établir une censure publique&lt;/span&gt; par J.-P. Picqué, which does not appear to mention the title and possibly not Rousseau at all (as far as I can tell).   As you can see, this is fairly messy OCR and is significantly truncated.  We have a preliminary database running and will probably release this once we have the entire set and experiment further with alignment parameters. &lt;br /&gt;&lt;br /&gt;Preliminary work that I have done on topic based text segmentation, which Clovis followed up on in more detail (&lt;a href="http://artfl.blogspot.com/2009/07/experiment-on-text-segmentation.html"&gt;link&lt;/a&gt;), suggests that the individual séances may be particularly good candidates for topic segmentation, since the topics can shift around radically.   Running text tends not to segment as well as text with clear shifts in topic.   There are a number of newer approaches than the Hearst TextTiling implementation (which I will blog about when I run them) that may be more effective.&lt;br /&gt;&lt;br /&gt;Finally, on the technical side, I want to experiment with LDA topic modeling.  Again, Clovis' initial work on topic identification for the articles of &lt;a href="http://artfl.blogspot.com/2009/09/classifying-echo-de-la-fabrique.html"&gt;Echo de la fabrique&lt;/a&gt; indicates that, if one can get good topic segments, the modeling algorithm may be fairly effective.  Oddly enough, I cannot recall anyone doing the "topic two-step", where one would apply topic modeling to parts of documents split up by a topic based segmentation algorithm.   Or, I may have missed some important papers.   
The idea behind all of this is an attempt to build the ability to search for relatively coherent topics, either for browsing or searching.&lt;br /&gt;&lt;br /&gt;So far, I have been talking about some more technical experimentation to see if certain algorithms, or general approaches, might be effective on a large and fairly complex document space.  While I used the AP for significant work when I was doing Revolutionary studies, my initial systematic interest was in the General Cahiers de doléances.  For my dissertation, and some later articles ("The Language of Enlightened Politics: The &lt;i id="dkz1104"&gt;Société de 1789&lt;/i&gt; in the French Revolution" in &lt;i id="dkz1105"&gt;Computers and the Humanities&lt;/i&gt; 23 (1989): 357-64), I keyboarded a small sample of the Cahiers (don't ever, ever do that as a poor graduate student :-) to serve as a baseline corpus to look at changes in Revolutionary discourse over time, with specific reference to the materials published by the Société de 1789.   I suspect that a statistical analysis of the language in the cahiers may bring to light interesting differences between the Estates, urban/rural, and north/south.  For this set of tasks, I am planning to use the comparative functions of &lt;a href="http://code.google.com/p/philomine/"&gt;PhiloMine&lt;/a&gt; to examine the degree to which these divisions can be identified using machine learning approaches and, if so, what kinds of lexical differences can be identified.  It would be equally interesting to compare a more linguistic analysis to the content analysis results found in Gilbert Shapiro et al., &lt;span style="font-style: italic;font-size:100%;" &gt;Revolutionary demands: a content analysis of the Cahiers de doléances of 1789&lt;/span&gt;. &lt;br /&gt;&lt;br /&gt;I will, as promised (or threatened) above, try to blog good results and failures -- remember, Edison is credited with saying, while trying to invent the lightbulb, “I have not failed. 
I've just found 10,000 ways that won't work.”  -- of these efforts here so we can all consider them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-6071314656142964794?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-i.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6071314656142964794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6071314656142964794'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/10/archives-parlementaires-i.html' title='Archives Parlementaires (I)'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-1269723334553136458</id><published>2009-09-25T13:12:00.005-05:00</published><updated>2009-09-28T12:58:33.201-05:00</updated><title type='text'>Epub to tei lite converter</title><content type='html'>This is just to let you know that we now have an epub to tei converter. It can be found here:&lt;br /&gt;&lt;a href="http://artfl.googlecode.com/files/epub_parser.tar"&gt;http://artfl.googlecode.com/files/epub_parser.tar&lt;/a&gt;&lt;br /&gt;As you'll notice, there are three files in this archive. The first one is epub_parser.sh. It's the only one you need to edit. Specify the paths (where the epub files are and where you want your tei files to be in) without slashes and just execute epub_parser.sh. 
The second one is parser.pl, which is called by epub_parser.sh. The third one is entities.pl, which handles html entities and is also called by epub_parser.sh. Before running it, make sure all three scripts are in the same directory.&lt;br /&gt;A sample philologic load can be found here:&lt;br /&gt;&lt;a href="http://artflx.uchicago.edu/philologic/epubtest.whizbang.form.html"&gt;&lt;span style="text-decoration: underline;"&gt;http://artflx.uchicago.edu/philologic/epubtest.whizbang.form.html&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;Of course, this is just a proof of concept and will be used only for text search and machine learning purposes. Some things will have to be tuned up. Note that I put a div1 every ten pages since there is no way to recognize chapters in the original epub files.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-1269723334553136458?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/09/epub-to-tei-lite-converter.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/1269723334553136458'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/1269723334553136458'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/09/epub-to-tei-lite-converter.html' title='Epub to tei lite converter'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3635528280251060735</id><published>2009-09-25T13:03:00.004-05:00</published><updated>2009-11-14T11:36:43.329-06:00</updated><title type='text'>Text segmentation code and usage</title><content type='html'>&lt;p&gt;Here's a quick explanation of how to use the text segmentation perl module called Lingua-FR-Segmenter. You can find it here: &lt;a href="http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz" class="external" rel="nofollow"&gt;http://artfl.googlecode.com/files/Lingua-FR-Segmenter-0.1.tar.gz&lt;/a&gt; It's not available on CPAN as it's just a hacked version of Lingua::EN::Segmenter::TextTiling made to work with French. The first thing to do before installing it is to install Lingua::EN::Segmenter::TextTiling, which will get you all the required dependencies (cpan -i Lingua::EN::Segmenter::TextTiling). When you install the French segmenter, make test will fail, so don't run it. That's normal since I haven't changed the example, which is for the English version of the module. 
An example of how it can be used:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;#!/usr/bin/perl&lt;br /&gt;use strict;&lt;br /&gt;use warnings;&lt;br /&gt;use lib '.'; # must come before the module is loaded if it lives in the current directory&lt;br /&gt;use Lingua::FR::Segmenter::TextTiling qw(segments);&lt;br /&gt;&lt;br /&gt;my $text;&lt;br /&gt;my $count;&lt;br /&gt;# read the whole input (files named on the command line, or STDIN)&lt;br /&gt;while (&lt;&gt;) {&lt;br /&gt; $text .= $_;&lt;br /&gt;}&lt;br /&gt;my $num_segment_breaks = 100000; # safe number so that we don't run out of segment breaks&lt;br /&gt;my @segments = segments($num_segment_breaks,$text);&lt;br /&gt;foreach (@segments) {&lt;br /&gt;    $count++;&lt;br /&gt;    print;&lt;br /&gt;    # print a separator after every segment except the last one&lt;br /&gt;    print "\n----------SEGMENT_BREAK----------\n" if exists $segments[$count];&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt; &lt;/p&gt;&lt;p&gt;There are other possibilities, but this is the basic one, which will segment the text whenever there's a topic shift. Some massaging is necessary in order to get good results, and the changes needed are different from one text to the next. Basically, separate paragraphs with a newline. 
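&lt;/p&gt;&lt;p&gt;For anyone who would rather poke at the general idea in Python, here is a very rough sketch. It is not a port of Lingua::FR::Segmenter and it is not the full TextTiling algorithm: it simply assumes paragraphs are separated by blank lines, compares the word counts of neighbouring paragraphs, and cuts at the weakest links. The function names (tokenize, cosine, segment) and the num_breaks argument are made up for this illustration.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def tokenize(paragraph):
    # Lowercased word tokens; \w is Unicode-aware in Python 3,
    # so accented French letters stay inside their words.
    return re.findall(r"\w+", paragraph.lower())


def cosine(a, b):
    # Cosine similarity between two word-count Counters (0.0 if either is empty).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(text, num_breaks):
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    vectors = [Counter(tokenize(p)) for p in paragraphs]
    # gaps[i] is the similarity between paragraph i and paragraph i + 1.
    gaps = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    # Break at the num_breaks gaps with the lowest similarity.
    weakest = sorted(range(len(gaps)), key=lambda i: gaps[i])[:max(num_breaks, 0)]
    break_after = set(weakest)
    segments, current = [], []
    for i, paragraph in enumerate(paragraphs):
        current.append(paragraph)
        if i in break_after:
            segments.append("\n\n".join(current))
            current = []
    if current:
        segments.append("\n\n".join(current))
    return segments
```

&lt;p&gt;Calling segment(text, 1) on a text whose first two paragraphs share vocabulary and whose last two share a different vocabulary should give back two segments. Unlike the Perl module, this sketch always uses up to num_breaks breaks whether or not there is a real topic shift, so a large "safe" number would just split every paragraph.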
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3635528280251060735?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/09/text-segmentation-code-and-usage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3635528280251060735'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3635528280251060735'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/09/text-segmentation-code-and-usage.html' title='Text segmentation code and usage'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-4034207245151159739</id><published>2009-09-18T15:06:00.016-05:00</published><updated>2009-11-14T11:37:01.078-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Classifying the Echo de la Fabrique</title><content type='html'>I've been working lately on trying to classify the Echo de la Fabrique, a 19th century newspaper, using LDA. The official website is located at &lt;a href="http://echo-fabrique.ens-lsh.fr/"&gt;http://echo-fabrique.ens-lsh.fr/&lt;/a&gt;. The installation I used is strictly meant for experimentation on topic modeling.&lt;br /&gt;The dataset I used is significantly smaller than the Encyclopédie, which means that the algorithm has fewer articles with which to generate topics. 
This makes the whole process trickier since choosing the right number of topics suddenly becomes more important. I suspect that adding more articles to this dataset will yield better results. I settled on 55 topics, and found a name corresponding to the general idea conveyed by each distribution of words.  I then proceeded to add those topics to each tei file and loaded it into philologic. I chose to include 4 topics per article, or fewer if topics didn't reach the mark of 0.1.&lt;br /&gt;The work I've done so far on LDA has already shown several things about its accuracy in generating meaningful topics and in properly classifying text. It tends to work really well with topics that are concept driven. For instance, in the Echo de la Fabrique, the topic 'justice' works really well. The same goes for 'Hygiène', associated with words like 'choléra' or 'eau'. On the other hand, there are some distributions of words which were not identifiable as topics. Those topics have been marked as 'Undetermined' with a number such as 'Undetermined1' to distinguish each undetermined topic. And then, there are also topics like 'Petites annonces' or 'Misère ouvrière' which are not as concept driven, and therefore are subject to more inaccuracies. Once again, I believe that having more articles from the same source would partially alleviate this problem: more documents, more training for the topic modeler, reduced dependency on concepts.&lt;br /&gt;Each topic has a number attached to it. This number represents the importance of the topic for each article. To get the most prominent topic, search for e.g. 'justice 1', 'justice 2' for the second topic, 'justice 3' for the third topic, and 'justice 4' for the fourth topic. If you want a search for all four, just type 'justice'. 
Note that the classification tends to be more accurate with the first topic than with the other three, but that's not always the case.&lt;br /&gt;Anyway, without further ado, here is the search form:&lt;br /&gt;&lt;a href="http://artfl-project.uchicago.edu/node/95"&gt;http://artfl-project.uchicago.edu/node/95&lt;/a&gt;&lt;br /&gt;Please let me know if you have any comments or suggestions. Any feedback is much appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-4034207245151159739?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/09/classifying-echo-de-la-fabrique.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4034207245151159739'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4034207245151159739'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/09/classifying-echo-de-la-fabrique.html' title='Classifying the Echo de la Fabrique'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-304543002258089638</id><published>2009-08-28T12:15:00.011-05:00</published><updated>2009-08-31T08:40:47.485-05:00</updated><title type='text'>Some Classification Experiments</title><content type='html'>Since Clovis has been running some experiments to see how well Topic Modeling using LDA might be used to predict topics on unseen instances, I thought I would backtrack a bit and write about some experiments I ran 
last year, which may be salient for future comparative experimentation or even to begin thinking about putting some of our classification work into some level of production.  I am presuming that you are basically familiar with some of the classifiers and problems with the Encyclopédie ontology.  These are described in varying levels of detail in some of our recent &lt;a href="http://artfl-project.uchicago.edu/node/42"&gt;papers/talks&lt;/a&gt; and on the &lt;a href="http://code.google.com/p/philomine/"&gt;PhiloMine&lt;/a&gt; site.&lt;br /&gt;&lt;br /&gt;The first set was a series of experiments classifying a number of 18th century documents using a stand-alone &lt;a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier"&gt;Bayesian classifier&lt;/a&gt;, learning the ontology of the &lt;a style="font-style: italic;" href="http://encyclopedie.uchicago.edu/"&gt;Encyclopédie&lt;/a&gt;, and predicting the classes on chapters (divs) of selected documents.   I have selected three for discussion here, since they are interesting and are segmented nicely into reasonably sized chunks.  I ran these using the English classifications and did not exclude the particularly problematic classes, such as Modern Geography (which tends to consist of biographies of important folks, filed under where they were from) or Literature.  Each document shows the Chapter or Article, which is linked to the text of the chapter, followed by one or more classifications, assigned using the Multinomial Bayesian classifier.  If I rerun these, I will simply pop the classification data right in each segment, for easier consultation.  
Right now, you will need to juggle between two windows:&lt;br /&gt;&lt;br /&gt;Montesquieu, &lt;a href="http://docs.google.com/View?id=ddj2s2rb_33cdjdx3hf"&gt;&lt;span style="font-style: italic;"&gt;Esprit des Loix&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;Selected articles from Voltaire, &lt;a style="font-style: italic;" href="http://docs.google.com/View?id=ddj2s2rb_349t48gpf3"&gt;Dictionnaire philosophique&lt;/a&gt;&lt;br /&gt;Diderot,  &lt;a style="font-style: italic;" href="http://docs.google.com/View?id=ddj2s2rb_35g5dw3x54"&gt;Elements de physiologie&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 153); font-weight: bold;"&gt;PENDING: Discussion of some interesting examples and notable failures.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The second set of experiments compared a &lt;a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm"&gt;K-Nearest Neighbor&lt;/a&gt; (KNN) classifier to the Multinomial Bayesian classifier in two tests, the first being cross classification of the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; and the second being multiple classifications, again using the Encyclopédie ontology, to predict classes of knowledge in Montesquieu's &lt;span style="font-style: italic;"&gt;Esprit des Loix&lt;/span&gt;.    The reason for these experiments is to examine the performance of linear (Bayesian) and non-linear (KNN) classifications in the rather noisy information space that is the &lt;span style="font-style: italic;"&gt;Encyclopédie&lt;/span&gt; ontology.  By "noisy" I mean to suggest that it is not at all uniform in terms of size of categories (which can range from several instances to several thousand), size of articles processed, degree of "abstractness," where some categories are very general and some are very specific, and a range of other considerations.  
We have debated, on and off, whether KNN or Bayesian classifiers (or other linear classifiers such as &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machine"&gt;Support Vector Machines&lt;/a&gt;) are better suited to the kinds of noisy information spaces that one encounters in retro-fitting historical resources such as the Encyclopédie.   The distinction is not rigid.  In fact, in a paper last year, on which Russ was the lead author, we argued that one could reasonably combine KNN and Bayesian classifiers by using a "meta-classifier" to determine which should be used to perform a classification task on a particular article in cases of a dispute  (Cooney, et al. "Hidden Roads and Twisted Paths: Intertextual Discovery using Clusters, Classifications, and Similarities", Digital Humanities 2008, University of Oulu, Oulu, Finland, June 25-29, 2008  [&lt;a href="http://docs.google.com/present/view?skipauth=true&amp;amp;id=dfddkspw_205fk8299hg"&gt;link&lt;/a&gt;]).   We concluded that, for example, "KNN is most accurate when it classifies smaller articles into classes of knowledge with smaller membership".&lt;br /&gt;&lt;br /&gt;Cross classification of the classified articles in the Encyclopédie using MNB and KNN.  I did a number of runs, varying the size of the training set and the set to be classified.  The result files for each of these runs, on an article by article basis, are quite large (and I'm happy to send them along).   So, I compiled the results into a &lt;a href="http://docs.google.com/View?id=ddj2s2rb_36hrjjc3tb"&gt;summary table&lt;/a&gt;.  I took 16,462 classified articles, excluding Modern Geography, and "trained" the classifiers on between 10% and 50% of the instances.   I put "trained" in scare quotes because a KNN classifier is a lazy, instance-based learner that builds no model in advance, so what you are really doing is selecting a subset of comparison vectors with their classes.   The selection process resulted in 276 and 708 classes of knowledge in the information space.   
As is shown in the table, KNN significantly outperforms MNB in this task.   We know from previous work, and general background, that MNB tends to flatten out distinctions among smaller classes, but has the advantage of being fast.&lt;br /&gt;&lt;br /&gt;The distinctions are at times fairly particular and many times the classifiers come up with quite reasonable predictions, even when they are wrong.  A few examples (red shows a mis-classification):&lt;br /&gt;&lt;br /&gt;Abaissé, Coat of arms (&lt;i&gt;en terme de Blason&lt;/i&gt;)&lt;p&gt;&lt;/p&gt;&lt;blockquote&gt; KNN Best category = CoatOfArms&lt;br /&gt;KNN All categories = CoatOfArms, ModernHistory&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB Best category = ModernHistory&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB All categories = ModernHistory, Geography&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;AGRÉMENS, Rufflemaker (Passement.)&lt;blockquote&gt; &lt;span style="color: rgb(204, 51, 0);"&gt;KNN Best category = Ribbonmaker&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;KNN All categories = Ribbonmaker&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB Best category = Geography&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB All categories = Geography&lt;/span&gt;&lt;/blockquote&gt;TYPHON, Jaucourt: General physics (&lt;i&gt;Physiq. 
générale)&lt;/i&gt;&lt;blockquote&gt; &lt;span style="color: rgb(204, 51, 0);"&gt;KNN Best category = Geography&lt;/span&gt;&lt;br /&gt;KNN All categories = Geography, GeneralPhysics, Navy, AncientGeography&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB Best category = Geography&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 51, 0);"&gt;MNB All categories = Geography, AncientGeography&lt;/span&gt;&lt;/blockquote&gt;I applied the comparative classifiers in a number of runs using different parameters for Montesquieu, &lt;span style="font-style: italic;"&gt;Esprit des Loix&lt;/span&gt;.  All of the runs tended to give fairly similar results, so here is the &lt;a href="http://docs.google.com/View?id=ddj2s2rb_37g3vxt2d5"&gt;last of the result sets&lt;/a&gt;.   The results are all rather reasonable, within limits, given the significant variations in size of chapters/sections in the EdL.   The entire "section" 1:5:13 is&lt;br /&gt;&lt;blockquote&gt;Idée du despotisme. Quand les sauvages de la Louisiane veulent avoir du fruit, ils coupent l'arbre au pied, et cueillent le fruit. Voilà le gouvernement despotique. &lt;/blockquote&gt;which gets classified as&lt;br /&gt;&lt;br /&gt;KNN Best category = NaturalHistoryBotany&lt;br /&gt;KNN All categories = NaturalHistoryBotany&lt;br /&gt;MNB Best category = NaturalHistoryBotany&lt;br /&gt;MNB All categories = NaturalHistoryBotany, Geography, Botany, ModernHistory&lt;br /&gt;&lt;br /&gt;In certain other instances, KNN will pick classes like "Natural Law" or "Political Law" while the MNB will return the more general "Jurisprudence".   
I am particularly entertained by&lt;br /&gt;&lt;br /&gt;PARTIE 2 LIVRE 12 CHAPITRE 5:&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;" &gt;De certaines accusations qui ont particulièrement besoin de modération et de prudence&lt;/span&gt;&lt;br /&gt;KNN Best category = Magic&lt;br /&gt;KNN All categories =&lt;br /&gt;MNB Best category = Jurisprudence&lt;br /&gt;MNB All categories = Jurisprudence&lt;br /&gt;&lt;br /&gt;Consulting the chapter, one finds a "&lt;span style="font-style: italic;"&gt;Maxime importante: il faut être très circonspect dans la poursuite de la magie et de l'hérésie&lt;/span&gt;" and that the rest of the chapter is indeed a discussion of magic.   While the differences are fun, and sometimes puzzling, one should also note the degree of agreement between the different classifiers, particularly if one discounts certain hard-to-determine differences between classes, such as Physiology and Medicine.  The chapter "&lt;i&gt;Combien les hommes sont différens dans les divers climats&lt;/i&gt;" (3:14:2) is classified by KNN as "Physiology" and by MNB as "Medicine".  Both clearly distinguish this chapter from others on Jurisprudence or Law.&lt;br /&gt;&lt;br /&gt;I have tended to find KNN classifications to be rather more interesting than MNB.  But I think the jury is still out on that, and one can always perform the kinds of tests that Russ described in the Hidden Roads talk.&lt;br /&gt;&lt;br /&gt;All of these experiments were run using Ken Williams' incredibly handy Perl module &lt;a href="http://search.cpan.org/dist/AI-Categorizer/"&gt;AI::Categorizer&lt;/a&gt; rather than &lt;a href="http://code.google.com/p/philomine/"&gt;PhiloMine&lt;/a&gt; (which also uses a number of Williams' modules) just because it was easier to construct and tinker with the modules. 
&lt;span style="color: rgb(0, 0, 153);"&gt;I will post some of these shortly, for future reference.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-304543002258089638?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/some-classification-experiments.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/304543002258089638'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/304543002258089638'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/some-classification-experiments.html' title='Some Classification Experiments'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-4435010927805385037</id><published>2009-08-27T13:40:00.008-05:00</published><updated>2009-08-27T17:08:47.895-05:00</updated><title type='text'>Collocation Notes</title><content type='html'>Since we are planning yet another grant/project proposal that will use collocation as a main component, I thought I would give some background notes for future reference.  One of the more popular reporting features in PhiloLogic is the collocation table.  This is a very simple mechanism.  It counts the words around a search term or list of terms (the user sets the span and can turn off function word filtering) and reports the frequencies of terms to the left, right and total in a table.  
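&lt;br /&gt;&lt;br /&gt;For reference, here is a minimal sketch of that kind of counting, in Python rather than PhiloLogic's own code.  The function name, the whitespace tokenizer, and the tiny function word list are all assumptions made for illustration, and the sample sentence is invented.&lt;br /&gt;&lt;pre&gt;
```python
from collections import Counter

# Hypothetical function-word list; a real filter would be much longer.
FUNCTION_WORDS = {"the", "of", "a", "and", "in", "to", "is"}


def collocation_table(tokens, node, span=5, filter_function_words=True):
    """Count words within `span` tokens to the left and right of `node`."""
    left, right = Counter(), Counter()
    for i, word in enumerate(tokens):
        if word != node:
            continue
        # Slices clip at the ends of the text, so hits near an edge are fine.
        left.update(tokens[max(0, i - span):i])
        right.update(tokens[i + 1:i + 1 + span])
    for counts in (left, right):
        counts.pop(node, None)  # do not report the node as its own collocate
        if filter_function_words:
            for w in FUNCTION_WORDS:
                counts.pop(w, None)
    total = left + right
    # One row per collocate: (word, left count, right count, total count).
    return [(w, left[w], right[w], n) for w, n in total.most_common()]


# Invented sample text, split on whitespace.
text = "the old tradition of the church and the oral tradition of the people"
for row in collocation_table(text.split(), "tradition", span=3):
    print(row)
```
&lt;/pre&gt;Passing filter_function_words=False keeps words such as "the" in the table, which is the raw count that the filtered report hides.&lt;br /&gt;&lt;br /&gt;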
Richard recently added the "collocation cloud" feature to the current production version at ARTFL.   The following is the collocation table and cloud for "tradition" in the current release of ARTFL-Frantext:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_SNpwD2mXiMo/SpbUGlvvGsI/AAAAAAAAAj0/0RkQfMcINrk/s1600-h/c1.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 369px;" src="http://2.bp.blogspot.com/_SNpwD2mXiMo/SpbUGlvvGsI/AAAAAAAAAj0/0RkQfMcINrk/s400/c1.gif" alt="" id="BLOGGER_PHOTO_ID_5374716414989900482" border="0" /&gt;&lt;/a&gt;Collocation is a well established approach in Digital Humanities and other domains.  Susan Hockey, for example, has a nice discussion of collocation in &lt;a href="http://books.google.com/books?id=uuH4-diVFRwC"&gt;&lt;i&gt;Electronic Texts in the Humanities&lt;/i&gt;&lt;/a&gt;, (Oxford, 2000), pp 90-91.   She describes some work from the early 1970s and brings out the distinction between statistical calculations of collocation and very simple counts.&lt;br /&gt;&lt;blockquote&gt;Berry-Rogghe (1973) discusses the relevance of collocations in lexical  studies with reference to an investigation of the collocates of &lt;i&gt;house,&lt;/i&gt; from which she is able to derive some notion of the semantic field of  &lt;i&gt;house.&lt;/i&gt;  [...] Her program counts the total number of occurrences of the node, and the total number of occurrences of each collocate of the node within a certain span. It then attempts to indicate the probability of these collocates occurring if the words were distributed randomly throughout the text, and can thus estimate the expected number of collocates. It then compares the expected number with the observed number and generates a 'z-score', which indicates the significance of the collocate.  
The first table she presents shows the collocates of &lt;i&gt;house &lt;/i&gt;based  on a span of three words and in descending order of frequency. First is  &lt;i&gt;the, &lt;/i&gt;which co-occurs thirty-five times with &lt;i&gt;house, &lt;/i&gt;but  the total number of occurrences of &lt;i&gt;the&lt;/i&gt; is 2,368.  &lt;i&gt;The&lt;/i&gt; is followed by &lt;i&gt;this, a, of, I, in, it, my, is, have, &lt;/i&gt;and &lt;i&gt;to, &lt;/i&gt;before the first significant collocate  &lt;i&gt;sold &lt;/i&gt;where six of the seven occurrences are within three words  of &lt;i&gt;house&lt;/i&gt;. Four words further on is &lt;i&gt;commons, &lt;/i&gt;where all four  occurrences collocate with &lt;i&gt;house, &lt;/i&gt;obviously from the phrase  &lt;i&gt;House of Commons. &lt;/i&gt;When reordered by z-score, the list begins  &lt;i&gt;sold, commons, decorate, this, empty, buying, painting, opposite.&lt;/i&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;She goes on to suggest that "[f]or the non-mathematical or those who are suspicious of statistics, even simple counts of collocates can begin to show useful results, especially for comparative purposes."  Which is, of course, precisely what PhiloLogic does now.&lt;br /&gt;&lt;br /&gt;I have made extensive use of collocations over the years for my own work, both the z-score calculation and the very simple collocation by counts (filtering function words).   These studies include American and French political discourse for my dissertation and subsequent papers, gender marked discourse, and comparisons of notions of tradition over time and in English and French.  Breaking collocations down over time gives a pretty handy way to look at changing meanings of words.   
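&lt;br /&gt;&lt;br /&gt;To make the arithmetic concrete, here is a small Python sketch of the z-score as it is commonly attributed to Berry-Rogghe.  Treat the formula as an assumption rather than a transcription of her program, since the passage quoted above does not spell it out.  The co-occurrence counts for &lt;i&gt;the&lt;/i&gt;, &lt;i&gt;sold&lt;/i&gt; and &lt;i&gt;commons&lt;/i&gt; come from the quoted passage, while the text length, the frequency of &lt;i&gt;house&lt;/i&gt;, and the span size of six (three words on each side) are invented for illustration.&lt;br /&gt;&lt;pre&gt;
```python
import math


def z_score(observed, collocate_freq, node_freq, text_length, span_size):
    """z-score of a collocate: observed vs. expected co-occurrences."""
    # Chance of meeting the collocate at any position outside the node itself.
    p = collocate_freq / (text_length - node_freq)
    # Expected co-occurrences if words were distributed randomly.
    expected = p * node_freq * span_size
    return (observed - expected) / math.sqrt(expected * (1 - p))


# ASSUMED totals (not given in the quoted passage): a 100,000-word text
# in which "house" occurs 50 times, with 3 words counted on each side.
TEXT_LENGTH, NODE_FREQ, SPAN_SIZE = 100_000, 50, 6

# (co-occurrences with "house", total occurrences), from the quoted passage.
collocates = {"the": (35, 2368), "sold": (6, 7), "commons": (4, 4)}

scores = {
    word: z_score(k, freq, NODE_FREQ, TEXT_LENGTH, SPAN_SIZE)
    for word, (k, freq) in collocates.items()
}
by_count = sorted(collocates, key=lambda w: collocates[w][0], reverse=True)
by_z = sorted(scores, key=scores.get, reverse=True)
print("by count:  ", by_count)
print("by z-score:", by_z)
```
&lt;/pre&gt;With these invented totals, ordering by raw count puts &lt;i&gt;the&lt;/i&gt; first, while ordering by z-score puts &lt;i&gt;sold&lt;/i&gt; and &lt;i&gt;commons&lt;/i&gt; ahead of it, which is the reshuffling the quoted passage describes.  The scores themselves depend entirely on the assumed totals.&lt;br /&gt;&lt;br /&gt;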
I have an ancient paper "&lt;a href="http://docs.google.com/View?id=ddj2s2rb_30cc9s6chq"&gt;Quantitative Linguistics and&lt;i id="dkz186"&gt; histoire des mentalités&lt;/i&gt;: Gender Representation in the &lt;i id="dkz187"&gt;Trésor de la langue française&lt;/i&gt;, 1600-1950&lt;/a&gt;" in the &lt;i id="dkz188"&gt;Contributions to Quantitative Linguistics: Proceedings of QUALICO 1991, First Quantitative Linguistics Conference&lt;/i&gt; (Amsterdam: Kluwer 1993): 351-71, which gives a write-up on the method, some math :-), and references to some salient papers, including Berry-Rogghe (1973).   In more recent work, I have used pretty much the same working model:  build a database split into half-century chunks and do collocations by half-century periods, using the z-score calculation (outlined in the paper).  Indeed, I have a hacked version of PhiloLogic that does this.&lt;br /&gt;&lt;br /&gt;As Hockey indicates, the statistical measure gives a rather different flavor for the collocates, since it attempts to measure the degree of relatedness between the two words.  For example, the top collocates of "Platon" in a subset of Frantext shift around significantly.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Word   Rank -&gt;  by &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;zscore&lt;/span&gt;    by freq&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;Speusippe&lt;/span&gt;:          1             78&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Aristote&lt;/span&gt; :          5              2&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The reason for this is clear.  
4/8 occurrences of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Speusippe&lt;/span&gt; occur near &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Platon&lt;/span&gt; while 51/793 occurrences of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Aristote&lt;/span&gt; are near &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Platon&lt;/span&gt;.   I think both techniques are valid, and have used them to illuminate various tendencies.    The z-score measures the relatedness of two words while the simple counts show how the keyword is typically used in general.  There is, of course, some overlap between the two, but the z-score tends to privilege more distinctive constructions and associations.&lt;br /&gt;&lt;br /&gt;Now, the obvious question is: "why don't we have the z-score calculation as an option in the standard collocation function in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;PhiloLogic&lt;/span&gt;?"   And the answer is speed.  The &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;z-score&lt;/span&gt; (and other statistical models which I will mention below) attempts to compare expected frequencies of the word distribution against the observed frequencies, where the expected frequency assumes a random distribution of words across a text, taking into account differences in frequencies.  [Caveat: we know that "&lt;a href="http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf"&gt;Language is never, ever, ever, random&lt;/a&gt;", but it is a useful heuristic, particularly for the kinds of simplistic comparisons I am doing.]   The bottleneck for a real-time version of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;z-score&lt;/span&gt; collocations has been calculating baseline frequencies for any arbitrary range of documents.   This may no longer be a significant problem.   
In a recent experiment, I built a script to sum the counts from arbitrary documents selected by bibliographic data (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;ARTFL&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;Frantext&lt;/span&gt; &lt;a href="http://artfl-project.uchicago.edu/node/90"&gt;word frequency report&lt;/a&gt;).  While we have had a few users express interest in having more global counts, it would appear that our latest servers have more than enough horsepower to do these kinds of additions very quickly, certainly fast enough to be bolted on to a collocation generator as an option.   Certainly something to think about for a future revision of the old hopper.&lt;br /&gt;&lt;br /&gt;There are, of course, a huge number of ways to calculate collocations.   I suspect that there are two major areas:  1) how to identify spans and 2) how to measure the relationships between words.   I had this notion that rather than simply look at spans as N words to the right and left, one would count words in &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;pre&lt;/span&gt;-identified constructions (such as noun phrases, verb phrases, or even clauses).  Given the power of modern &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;NLP&lt;/span&gt; tools, this is certainly an option to think about.  Related is the notion that one would rather do collocations on either lemmas or even "stems" (the results of a &lt;a href="http://en.wikipedia.org/wiki/Stemming"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;stemmer&lt;/span&gt;&lt;/a&gt; which basically strips various characters) which are not words, but can be related to sets of words.    
The other area of work is the possibility of using other statistical measures of association, such as log-likelihood and mutual information.&lt;br /&gt;&lt;br /&gt;I'm pretty sure I've seen standalone packages that support more sophisticated statistical models.  If we were going to do anything serious, the first place to start is reading.  Reading?  What?  Yes, indeed.  The chapter on Collocation in Chris Manning and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_15"&gt;Hinrich&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;Schütze&lt;/span&gt;, &lt;a href="http://nlp.stanford.edu/fsnlp/"&gt;&lt;i&gt;Foundations of Statistical Natural Language Processing&lt;/i&gt;&lt;/a&gt;, MIT Press.  Cambridge, MA: May 1999 is a great place to start.  Other titles may include &lt;span dir="ltr"&gt;Sabine &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;Bartsch&lt;/span&gt;, &lt;a style="font-style: italic;" href="http://books.google.com/books?id=CMyPT-nDm4sC"&gt;Structural and functional properties of collocations in English: a corpus study of lexical and pragmatic constraints on lexical co-occurrence&lt;/a&gt;&lt;/span&gt; (&lt;span dir="ltr"&gt;Gunter &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;Narr&lt;/span&gt; &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;Verlag&lt;/span&gt;, 2004)&lt;/span&gt;.  There is also software.  
Of course, Martin's &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;WordHoard&lt;/span&gt; has an array of collocation measures (&lt;a href="http://wordhoard.northwestern.edu/userman/analysis-collocates.html"&gt;documentation&lt;/a&gt;) and we should not forget other goodies, such as &lt;a href="http://www.athel.com/colloc.html"&gt;Collocate&lt;/a&gt; (commercial) and the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;Cobuild&lt;/span&gt; &lt;a href="http://www.collins.co.uk/corpus/CorpusSearch.aspx#democoll"&gt;Collocation Sampler&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-4435010927805385037?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/collocation-notes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4435010927805385037'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4435010927805385037'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/collocation-notes.html' title='Collocation Notes'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_SNpwD2mXiMo/SpbUGlvvGsI/AAAAAAAAAj0/0RkQfMcINrk/s72-c/c1.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-182506691105519816</id><published>2009-08-26T17:15:00.000-05:00</published><updated>2009-11-14T11:37:30.911-06:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Finding related articles using topic modeling</title><content type='html'>While still working on the topic inferencer, I started hacking at another possibility opened up by topic modeling: finding closely related texts within a corpus. There are several ways of doing this. What I chose to do was to consider the top three topics in each article and their respective proportions, and weigh them against the whole corpus. Here's a link to a search form where you can search for similar articles in the Encyclopédie:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/search.form.html"&gt;http://robespierre.uchicago.edu/topic_modeling/search.form.html&lt;/a&gt;&lt;br /&gt;In order to use it, you should paste the URL of the article you're looking at. You'll then get a list of links to various articles that should be similar in content to the one you selected. A lot of tinkering can be done with the calculation of similarity, so I may very well have made some bad judgments here and there. This is work in progress, and you might get strange results. But if you go through the whole list of results you might see some interesting things.&lt;br /&gt;I would like to give you two examples I've tried that work really well. The first one is the article Economie by Rousseau (which gives very good results), and if you look at link 24, which is according to my (flawed) calculation the 24th closest article, you'll see an example of an article that would have been hard to find and link to Rousseau. The second example is Question by Jaucourt. Among the top 20, a lot concern various methods of torture, spread out in different classes of knowledge. 
Let me know what you think.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-182506691105519816?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/finding-related-articles-using-topic.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/182506691105519816'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/182506691105519816'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/finding-related-articles-using-topic.html' title='Finding related articles using topic modeling'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-1927384288556542042</id><published>2009-08-26T13:46:00.011-05:00</published><updated>2009-08-27T13:13:00.088-05:00</updated><title type='text'>Some Notes on Theme-Rheme in PhiloLogic</title><content type='html'>One of the more arcane, and probably rarely used, functions in &lt;a href="http://philologic.uchicago.edu/"&gt;PhiloLogic&lt;/a&gt; is an experimental reporting scheme that I rather tentatively named "word in clause position analysis" or "theme-rheme," which is briefly described in the PhiloLogic &lt;a href="http://philologic.uchicago.edu/manual.php#5.5"&gt;user manual&lt;/a&gt;.   
I proposed this in a talk titled "Making Space: Women's Writing in France, 1600-1950," which I gave at the ACH-ALLC and COCH/COSH conferences in 2004 (and drafted a good chunk of a paper about), and implemented in PhiloLogic around that time.  Since we are now thinking of using this kind of analysis as a possible way to identify "interesting" or "illustrative" uses of words as part of another project, I thought it might be helpful to back-track a bit, give a bit more overview of how it works, outline some of the theoretical background as I understand it, and provide some useful links and papers.&lt;br /&gt;&lt;br /&gt;As noted in the user manual entry, the "theme-rheme" function generates a standard concordance which it then attempts to sort out by where your search term occurs in a clause, where a clause is defined by punctuation.   It segregates the occurrences by front of clause, back of clause, middle of clause, and instances where the clause is too short.  By default, it displays only those occurrences that are clause initial.  In the current implementation of &lt;a href="http://artfl-project.uchicago.edu/node/23"&gt;ARTFL-Frantext&lt;/a&gt; a search for "tradition" results in 4,692 occurrences, which roughly break down as follows:&lt;br /&gt;&lt;br /&gt;Front of Clause: 571 out of 4692 [12.16%] Avg. Clause length: 9.58&lt;br /&gt;Last of Clause: 1056 out of 4692 [22.50%]  Avg. Clause length: 8.68&lt;br /&gt;Middle of Clause: 2348 out of 4692 [50.04%]  Avg. Clause length: 9.56&lt;br /&gt;Too Short: 717 out of 4692 [15.28%]  Avg. 
Clause length: 2.40&lt;br /&gt;&lt;br /&gt;The system further identifies specific documents in which your search term exceeds, by a certain percentage, the front of clause rate (in this case 12.16%),  such as&lt;br /&gt;&lt;br /&gt;&lt;b&gt;55.55% &lt;/b&gt; (10/18): &lt;a name="HiSEl"&gt;&lt;/a&gt;Montalembert, Charles Forbes, [&lt;b&gt;1836&lt;/b&gt;], &lt;i&gt;Histoire de Sainte Elisabeth de Hongrie, duchese de Thuringe...&lt;br /&gt;&lt;/i&gt;&lt;b&gt;28.20% &lt;/b&gt; (11/39): &lt;a name="HisUn"&gt;&lt;/a&gt;Bossuet, Jacques Bénigne, 1627-1704. [&lt;b&gt;1681&lt;/b&gt;], &lt;i&gt;Discours sur l'histoire universelle&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;and it, of course, displays these in different colors, such as:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;L'Europe ainsi déracinée s'est plus tard déracinée davantage en se séparant, dans une large mesure, de la &lt;span style="color: rgb(0, 102, 0);"&gt;&lt;b&gt;tradition&lt;/b&gt;&lt;/span&gt; chrétienne elle-même sans pouvoir renouer aucun lien spirituel avec l'Antiquité.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Oui, sans doute, si cette &lt;span style="color: rgb(0, 102, 0);"&gt;&lt;b&gt;tradition&lt;/b&gt;&lt;/span&gt; était tout entière dans Aristote et dans l'enseignement péripatéticien de la scolastique.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;La &lt;span style="color: rgb(0, 102, 0);"&gt;&lt;b&gt;tradition&lt;/b&gt;&lt;/span&gt; attribue à Pythagore un séjour à Babylone. &lt;/li&gt;&lt;/ul&gt;The basic notion is that clause initial instances of words are probably more important, since they tend to be the "subject"of the rest of the clause.  And authors who tend to use your favorite word in more clause initial positions than is average, might be doing something of particular note.  In other words, can we use the machine to try to isolate, from the thousands of hits, those that might be particularly noteworthy.  
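&lt;br /&gt;&lt;br /&gt;As a rough illustration of the sorting described above, here is a short Python sketch.  It is not PhiloLogic's code: the punctuation set, the three-word window for the front and back of a clause, and the cut-off for "too short" are all assumptions chosen for the example.  The first sample sentence is one of the hits shown above, and the other two are invented.&lt;br /&gt;&lt;pre&gt;
```python
import re

# ASSUMED thresholds; the values PhiloLogic actually uses are not given here.
SHORT_LENGTHS = {1, 2, 3}  # clause lengths (in words) reported as "too short"
EDGE_WORDS = 3             # words that count as the front or back of a clause


def clause_positions(text, term):
    """Yield (position, clause) for each clause that contains `term`."""
    # A clause is simply a run of text between punctuation marks.
    for clause in re.split(r"[.,;:!?()]+", text):
        words = clause.lower().split()
        if term not in words:
            continue
        if len(words) in SHORT_LENGTHS:
            position = "too short"
        elif term in words[:EDGE_WORDS]:
            position = "front"
        elif term in words[-EDGE_WORDS:]:
            position = "back"
        else:
            position = "middle"
        yield position, clause.strip()


sample = (
    "La tradition attribue à Pythagore un séjour à Babylone. "  # from the post
    "Il respecte en toute chose la tradition. "                  # invented
    "Voilà la tradition."                                         # invented
)
for position, clause in clause_positions(sample, "tradition"):
    print(position, "|", clause)
```
&lt;/pre&gt;Tallying the positions per document would then give the kind of front-of-clause rates reported above, though the exact figures depend on the thresholds chosen.&lt;br /&gt;&lt;br /&gt;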
In this case, we have isolated a small subset (12%) of the occurrences of "tradition" in a clause initial position and some authors/documents who tend to privilege this word.  I also identified clause ending uses, since (I suspect) end of clause words provide a bridge to the next clause (or sentence).&lt;br /&gt;&lt;br /&gt;I set two "&lt;a href="http://en.wikipedia.org/wiki/Intertwingularity"&gt;intertwingled&lt;/a&gt;" problems in the paper, women's writing and, more salient to this post, the increasing need to arrive at high orders of generalization to make sense of the results coming from ever increasing datasets.  Obviously, one solution to this is work we have been doing over the last few years in the areas of machine learning, document summarization, and text data mining (see &lt;a href="http://code.google.com/p/philomine/"&gt;PhiloMine&lt;/a&gt; and related papers).   What I proposed in this paper was a move away from traditional text analysis techniques towards analytical notions based on &lt;a href="http://en.wikipedia.org/wiki/Systemic_functional_grammar"&gt;functional linguistics&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Functional_grammar"&gt;functional grammar&lt;/a&gt;, which are related in various ways to &lt;a href="http://www.beaugrande.com/introduction_to_text_linguistics.htm"&gt;text linguistics&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Discourse_analysis"&gt;discourse analysis&lt;/a&gt;.   This is a huge area of work and I would not begin to characterize it.  Helma, of course, is a functional linguist and proposes that this is a branch of "linguistics that takes the communicative functions of language as primary as opposed to seeing form as primary."  And as you might imagine, there are schools and competing views.  I have to admit I like the name "West Coast Functionalists."  :-)&lt;br /&gt;&lt;br /&gt;My take on this is that meaning arises from choices, or chains of choices, with sets of goals and objectives.  
I also suspect that many "functionalists" would agree on a few other basic notions, such as lexis and grammar are inseparable in meaning creation, and indeed the term "‘lexico-grammar’ is now often used in recognition of the    fact that lexis and grammar are not separate and discrete, but form a continuum." (&lt;a href="http://www.philseflsupport.com/grammarnlexis.htm"&gt;cite&lt;/a&gt;)  It also appears that many functionalists would agree with the notion that the clause is the building block unit.   There are probably other points of general agreement about just how different layers might work or be defined.  For example, Simon Dik (not related to Helma) identified three layers in his Functional Grammar:&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;SEMANTIC FUNCTIONS (Agent, Patient, Recipient, etc.) which define  the roles that participants play in states of affairs, as designated  by predications.&lt;/li&gt;&lt;li&gt;SYNTACTIC FUNCTIONS (Subject and Object) which define  different perspectives through which states of affairs are presented  in linguistic expressions.&lt;/li&gt;&lt;li&gt;PRAGMATIC FUNCTIONS (Theme and Tail, Topic and Focus) which define  the informational status of constituents of linguistic expressions.  They relate to the embedding of the expression in the ongoing discourse,  that is, are determined by the status of the pragmatic information of  Speaker and Addressee as it developes in verbal interaction. 
&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;p&gt;&lt;br /&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_SNpwD2mXiMo/SpWe7qt64uI/AAAAAAAAAjk/svORBE_knoU/s1600-h/image006.gif"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 236px; height: 319px;" src="http://4.bp.blogspot.com/_SNpwD2mXiMo/SpWe7qt64uI/AAAAAAAAAjk/svORBE_knoU/s320/image006.gif" alt="" id="BLOGGER_PHOTO_ID_5374376478252917474" border="0" /&gt;&lt;/a&gt;Of course, other folks will carve these things up differently.   Robert de Beaugrande, whose extensive web site and papers are well worth the visit, represents the various levels of functional linguistics from nerves to text, as outline in the image, taken from his "&lt;a href="http://www.beaugrande.com/Functionalism%20and%20Corpus%20Linguistics.htm"&gt;Functionalism and Corpus Linguistics in the ‘Next Generation&lt;/a&gt;."    In another paper, he argues "&lt;span style="font-size:85%;"&gt;Corpus data are so eminently suited to informing us about 'networks'  because they offer concrete displays of the constraints upon how  sets of choices can interact. In the 'lexicon' part of the  'lexicogrammar' of English, these constraints constitute the  collocability  in the virtual system, and the textual actualisations  are the lexical  collocations. 
In the 'grammar' part of the  'lexicogrammar', these constraints constitute the colligability  in  the virtual system, and the textual actualisations are the grammatical   colligations&lt;/span&gt;" and goes on to represent, in the following image, the series of "dialectics" running between text and language.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_SNpwD2mXiMo/SpWi8KsD6vI/AAAAAAAAAjs/LzHgdFbT8BM/s1600-h/TexttM1.gif"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 320px; height: 91px;" src="http://3.bp.blogspot.com/_SNpwD2mXiMo/SpWi8KsD6vI/AAAAAAAAAjs/LzHgdFbT8BM/s320/TexttM1.gif" alt="" id="BLOGGER_PHOTO_ID_5374380884881566450" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Ok, they are fun images ... now back to work... and I wanted to see how embedding images would work...&lt;/p&gt;&lt;p&gt;It is the level of pragmatics that I suspect interests us in this particular case.  As I noted above, I borrowed the "theme-rheme" nomenclature from M.A.K. Halliday's &lt;span style="font-style: italic;"&gt;An Introduction to Functional Grammar&lt;/span&gt;.   Again:&lt;/p&gt;&lt;p&gt;Theme: "starting point of the message, what the clause is  going to be about".&lt;br /&gt;Rheme:  everything that is not the Theme: new information/material&lt;/p&gt;Theme contains given information, i.e. information which has already been mentioned somewhere in the text, or is familiar from the context.  There is an accessible description of this, with some nice examples, in &lt;a href="http://www.asian-efl-journal.com/March_07_lw.php"&gt;Theme and Rheme in the Thematic Organization of Text&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In English (and French), identification of the Theme is based primarily on  word order.    Thus, the theme is the element which comes first in the clause.  
(Eggins, &lt;span style="font-style: italic;"&gt;An Introduction to Systemic Functional Linguistics&lt;/span&gt;, p. 275)   There are plenty of problems in identifying the exact boundaries of different kinds of themes.&lt;br /&gt;&lt;br /&gt;The takeaway point, from all of this, is that the theme/rheme distinction is important because it is the way you get thematic development across a longer span of text.  Obviously, the Rheme in one clause can become Theme in the next. &lt;br /&gt;&lt;br /&gt;One other takeaway:  Halliday makes the argument that one can use punctuation in written texts to identify clauses, which is not possible for spoken texts.&lt;br /&gt;&lt;br /&gt;More later?  I can track down a few more bibliographic entries....&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-1927384288556542042?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/some-notes-on-theme-rheme-in-philologic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/1927384288556542042'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/1927384288556542042'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/some-notes-on-theme-rheme-in-philologic.html' title='Some Notes on Theme-Rheme in PhiloLogic'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/_SNpwD2mXiMo/SpWe7qt64uI/AAAAAAAAAjk/svORBE_knoU/s72-c/image006.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3505347734779741481</id><published>2009-08-18T13:46:00.000-05:00</published><updated>2009-11-14T11:37:30.911-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Topic inference using the Encyclopédie trained model</title><content type='html'>While trying to use the Encyclopédie-trained topic model on the Mémoires de Trévoux, something quite unexpected happened: the topic modeler had a hard time finding topics that matched the Trévoux articles. You can see those results here:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/inference/encyclo2trevoux.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/inference/encyclo2trevoux.txt&lt;/a&gt;&lt;br /&gt;Since the topic inference feature in Mallet is relatively new, I thought of building a model from the Trévoux itself, and then comparing the topic proportions generated by the topic trainer with those generated by running inference with that model. So basically, I tested the model against the corpus of articles from which it originated. In principle, the results should have been excellent. Well, they weren't, which suggests that the topic inferencer is not yet fully reliable (it is a new feature after all). On the other hand, if you compare the results, you'll notice that (mostly) the same topics are prominent in both; only the proportion measure is off, approximately divided by ten when using topic inference. 
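&lt;br /&gt;&lt;br /&gt;One way to check that impression is to compare, for a single article, which topics rank highest in each run and by what factor the proportions differ. The sketch below does this in Python. The function name compare_proportions and the numbers are made up for illustration, and it assumes each run has already been parsed into a mapping from topic number to proportion.

```python
from statistics import median


def compare_proportions(trained, inferred, top_n=3):
    """Compare two {topic: proportion} mappings for one article.

    Returns the topics found in both top-N lists and the median ratio
    trained/inferred over the topics that appear in both mappings.
    """
    def top(props):
        ranked = sorted(props, key=props.get, reverse=True)
        return set(ranked[:top_n])

    shared = top(trained).intersection(top(inferred))
    ratios = [trained[t] / inferred[t] for t in trained
              if inferred.get(t, 0.0) != 0.0]
    return shared, (median(ratios) if ratios else None)


# Invented numbers for illustration only, not taken from the Trévoux runs.
trained = {12: 0.40, 7: 0.30, 3: 0.20, 25: 0.10}
inferred = {12: 0.04, 7: 0.03, 3: 0.02, 25: 0.01}
shared, ratio = compare_proportions(trained, inferred)
```

If the top topics agree and the ratio stays roughly constant from article to article, that would suggest the inferencer ranks topics sensibly and only scales its proportions differently.&lt;br /&gt;&lt;br /&gt;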
Here are those results:&lt;br /&gt;When using topic training:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/inference/proportions.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/inference/proportions.txt&lt;/a&gt;&lt;br /&gt;When using topic inference:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/inference/proportions_itself.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/inference/proportions_itself.txt&lt;/a&gt;&lt;br /&gt;The question is: can I trust those results? My initial analysis tends to show that it does work, but it's definitely not as accurate as the first experiments I did with topic modeling. Some more digging is needed, possibly including getting in touch with the Mallet developers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3505347734779741481?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/topic-inference-using-encyclopedie.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3505347734779741481'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3505347734779741481'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/topic-inference-using-encyclopedie.html' title='Topic inference using the Encyclopédie trained model'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-5769785754531165914</id><published>2009-08-14T11:55:00.005-05:00</published><updated>2009-11-14T11:37:30.911-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Proportions of topics in Encyclopédie articles</title><content type='html'>This is a follow-up to my previous blog entry about &lt;a style="color: rgb(51, 102, 255);" href="http://artfl.blogspot.com/2009/08/preliminary-results-on-topic-modeling.html"&gt;topic modeling in the Encyclopédie&lt;/a&gt;. As the title of this post suggests, I will be showing here the proportions of topics per article. Instead of just posting those results without any further comment, I would like to focus on 12 random articles to see what kind of results one could get. My feeling about this is that the best results are in the 300 topic model. What do you think? 
Note that there is still a lot of room for some refinement.&lt;br /&gt;&lt;br /&gt;Examples from the 42 topic model :&lt;br /&gt;&lt;a id="publishedDocumentUrl" class="tabcontent" target="_blank" href="http://docs.google.com/View?id=dgrbcw9z_69gk9w5tgc"&gt;http://docs.google.com/View?id=dgrbcw9z_69gk9w5tgc&lt;/a&gt;&lt;br /&gt;Examples from the 100 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_70c2n79kgv"&gt;http://docs.google.com/View?id=dgrbcw9z_70c2n79kgv&lt;/a&gt;&lt;br /&gt;Examples from the 150 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_71cx73tsch"&gt;http://docs.google.com/View?id=dgrbcw9z_71cx73tsch&lt;/a&gt;&lt;br /&gt;Examples from the 200 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_724t5x9mfm"&gt;http://docs.google.com/View?id=dgrbcw9z_724t5x9mfm&lt;/a&gt;&lt;br /&gt;Examples from the 250 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_73fvznkb7j"&gt;http://docs.google.com/View?id=dgrbcw9z_73fvznkb7j&lt;/a&gt;&lt;br /&gt;Examples from the 300 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_74chqfgsct"&gt;http://docs.google.com/View?id=dgrbcw9z_74chqfgsct&lt;/a&gt;&lt;br /&gt;Examples from the 350 topic model:&lt;br /&gt;&lt;a href="http://docs.google.com/View?id=dgrbcw9z_75chsw8gcp"&gt;http://docs.google.com/View?id=dgrbcw9z_75chsw8gcp&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If you wish to look yourself at the results, here they are, the first number is the topic with the proportion measure in parentheses. 
The article number is the div number of the article :&lt;br /&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_42.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_42.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_100.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_100.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_150.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_150.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_200.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_200.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_250.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_250.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_300.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_300.txt&lt;/a&gt;&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_350.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/topics_in_articles_350.txt&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-5769785754531165914?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/proportions-of-topics-in-encyclopedie.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5769785754531165914'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5769785754531165914'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/proportions-of-topics-in-encyclopedie.html' title='Proportions of topics in Encyclopédie articles'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-8844659665590918553</id><published>2009-08-11T09:51:00.014-05:00</published><updated>2009-08-11T12:49:50.061-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='philologic'/><category scheme='http://www.blogger.com/atom/ns#' term='nosql'/><category scheme='http://www.blogger.com/atom/ns#' term='architecture'/><title type='text'>the PhiloLogic Data Architecture</title><content type='html'>For the last year or so, I've been arguing that it's time for a round of maintenance work on PhiloLogic's various retrieval sub-systems.  In a later post, I'll examine some of the newer data store components out there in the open-source world.  First, however, I'd like to enumerate what PhiloLogic's main storage components are, where they live, and how they work, for clarity and economy of reference.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Main Word Index:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;PhiloLogic's central data store is a &lt;a href="http://www.gnu.org/software/gdbm/"&gt;GDBM&lt;/a&gt; hashtable called &lt;span style="font-weight: bold;"&gt;index&lt;/span&gt; that functions, basically, the same way as a Perl hash, but on disk, rather than in memory.  It has a set of keys, in this case each unique word in the database.  
Each key corresponds to a short-ish byte-string value, which can come in two different formats:&lt;br /&gt;&lt;br /&gt;For low-frequency words, each key word corresponds to a packed binary data object that contains three components:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A short header that says, "I'm a low-frequency word!"&lt;/li&gt;&lt;li&gt;Total frequency for this word.  This is used by the query optimizer.&lt;/li&gt;&lt;li&gt;A compressed binary hitlist for the word, containing the byte offset and object address of every occurrence of the word.&lt;/li&gt;&lt;/ol&gt;For high-frequency words, the structure is similar.  A type header is followed by the total frequency, which is followed by an address into the raw block index, called &lt;span style="font-weight: bold;"&gt;index.1&lt;/span&gt;.  If you've ever looked at a database directory, you may have noticed that this &lt;span style="font-weight: bold;"&gt;index.1&lt;/span&gt; file is typically two or three times the size of the main index.  That's because it contains the binary hitlists for all the high-frequency words in the database, &lt;span style="font-style: italic;"&gt;divided into 2-kilobyte chunks&lt;/span&gt;.  That's important, because, as Zipf's law will tell us, the most frequent words in a language can be very, very frequent, and thus the hits for a single word could go on for tens or hundreds of megabytes.  By dividing large hitlists into chunks, we can put a limit on memory usage.  In a modern system, we could set a higher ceiling; 64k might be reasonable.  But architecturally, the chunking algorithms are vital for frequent words or large databases.&lt;br /&gt;&lt;br /&gt;The upside of this admittedly complex architecture is PhiloLogic's raison d'être: its blindingly fast search performance.  
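&lt;br /&gt;&lt;br /&gt;To make the two value formats concrete, here is a rough Python sketch of the lookup path described above. It is only an illustration: the byte layout, the type flags, the stored block count, and the names pack_low, pack_high and lookup are all invented for this example and are not PhiloLogic's actual packing. A plain dict stands in for the GDBM index and a bytes object for index.1.

```python
import struct

# Hypothetical layout, invented for this sketch (not PhiloLogic's real format).
LOW_FREQ, HIGH_FREQ = 0, 1           # one-byte type header
HEADER = struct.Struct("!BI")        # type header + total frequency
POINTER = struct.Struct("!QI")       # offset into the block file + block count
BLOCK_SIZE = 2048                    # the 2-kilobyte chunks from the post


def pack_low(freq, hitlist):
    """Low-frequency value: header, frequency, then the hitlist itself."""
    return HEADER.pack(LOW_FREQ, freq) + hitlist


def pack_high(freq, offset, nblocks):
    """High-frequency value: header, frequency, then a pointer into the blocks."""
    return HEADER.pack(HIGH_FREQ, freq) + POINTER.pack(offset, nblocks)


def lookup(index, blocks, word):
    """Yield a word's hitlist in pieces of at most BLOCK_SIZE bytes."""
    value = index[word]
    kind, _freq = HEADER.unpack_from(value)
    if kind == LOW_FREQ:
        # The whole hitlist lives inline, right after the header.
        yield value[HEADER.size:]
        return
    offset, nblocks = POINTER.unpack_from(value, HEADER.size)
    for i in range(nblocks):
        start = offset + i * BLOCK_SIZE
        yield blocks[start:start + BLOCK_SIZE]


# Stand-ins: a dict for the GDBM "index", a bytes object for "index.1".
blocks = bytes(5000)
index = {
    "encyclopedie": pack_low(2, b"\x01\x02"),
    "de": pack_high(100000, 0, 3),
}
sizes = [len(chunk) for chunk in lookup(index, blocks, "de")]
```

With these stand-ins, a frequent word such as "de" comes back as a series of blocks of at most 2 kilobytes each, so memory use stays bounded however long the full hitlist is.&lt;br /&gt;&lt;br /&gt;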
The downside is that GDBM doesn't support some of the features that we expect, particularly ones that involve more complicated searches than simple keywords.&lt;br /&gt;&lt;br /&gt;Thus, we added a plain-text token table, &lt;span style="font-weight: bold;"&gt;words.R&lt;/span&gt;, that we can grep through quite quickly to get a list of all valid keys that match a specified pattern, and a tab-delimited token table, &lt;span style="font-weight: bold;"&gt;words.R.wom&lt;/span&gt;, that we can grep through for various secondary attributes and normalizations of the indexed tokens.&lt;br /&gt;&lt;br /&gt;Both of these functions are very fast, due to the high throughput of GNU grep. The only downside is the opacity of the index construction process, which can make modifications to this structure very difficult.  That said, it's capable of handling unexpectedly rich data if you understand where everything goes.  I'll go into this in more depth in my Perseus whitepaper.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Document Metadata:&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Traditional 3-series PhiloLogic keeps the most important information about its XML document store in a file called &lt;span style="font-weight: bold;"&gt;docinfo&lt;/span&gt;, which contains the filename, size, date, and a few other book-keeping tidbits.  The reporting systems use this file for the basic tasks of opening an XML file and reading a section out of it, whether for search results context, or for browsing with getobject.&lt;br /&gt;&lt;br /&gt;All other data go in a file called &lt;span style="font-weight: bold;"&gt;bibliography&lt;/span&gt;, which has about 20 fields for author, title, publisher, language, etc., and which the &lt;span style="font-weight: bold;"&gt;gimme&lt;/span&gt; utilities search for bibliographic queries.  
Traditionally, this is done with GNU egrep, but more recent releases have preferred MySQL for its more sophisticated query language.  The result of any query, regardless, is a list of binary document id's to pass through to the search engine as a corpus file.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Text Object Metadata:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;PhiloLogic tracks all objects below the document level as either division or paragraph objects, and stores them in two different tables: &lt;span style="font-weight: bold;"&gt;divindex.raw&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;subdivindex.raw&lt;/span&gt; respectively, and uses the &lt;span style="font-weight: bold;"&gt;subdocgimme&lt;/span&gt; utilities to query the metadata.  As before, query results, via SQL or egrep, are pushed off to search3 as a packed binary corpus file.  And again, the reporting and retrieval subsystems have their own data structure, in this case the &lt;span style="font-weight: bold;"&gt;toms&lt;/span&gt;, for contextualizing hits, or for retrieval with &lt;span style="font-weight: bold;"&gt;getobject&lt;/span&gt;. Finally, the loader builds several derived data structures called &lt;span style="font-weight: bold;"&gt;navigation&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;pagemarks&lt;/span&gt;, &lt;span style="font-weight: bold;"&gt;references&lt;/span&gt;, and &lt;span style="font-weight: bold;"&gt;dividxchild.tab&lt;/span&gt; for various internal functions.&lt;br /&gt;&lt;br /&gt;As the reader may have noticed, the document and text objects are not as clean or as optimized as the main word index, and even harder to hack coherently.  I'll detail my first attempt at a more dynamic object structure for the Perseus corpus in a later post.  
For now, though, I'll pose a question:&lt;br /&gt;&lt;br /&gt;Is it possible to devise a single data structure that can handle &lt;span style="font-weight: bold;"&gt;all&lt;/span&gt; of the functions that our current gang of tables and packed binaries does?  In short, these are:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Query objects for arbitrary combinations of properties&lt;/li&gt;&lt;li&gt;Efficiently retrieve file paths and byte offsets for retrieval&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Maintain the logical relationships of all these objects&lt;/li&gt;&lt;li&gt;Resolve internal and external references to objects at any depth&lt;/li&gt;&lt;/ol&gt;My current prototypes fulfill about half of these requirements, #2 &amp;amp; 3.  But a new object architecture would ideally be a single data structure that does it all.  Can anyone think of needed features that I've missed?   What about the kind of metadata that ASP or IWW use?  Can anyone think of a component that's totally unnecessary and redundant, or notoriously buggy?  
Please, let me know.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-8844659665590918553?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/philologic-data-architecture.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8844659665590918553'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/8844659665590918553'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/philologic-data-architecture.html' title='the PhiloLogic Data Architecture'/><author><name>Richard</name><uri>http://www.blogger.com/profile/06345844875619851744</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-511983332739990759</id><published>2009-08-07T14:29:00.019-05:00</published><updated>2009-11-14T11:37:30.911-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Preliminary results on topic modeling in the Encyclopédie</title><content type='html'>Following up on Mark's comments on topic modeling using Latent Dirichlet Allocation, or &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;LDA&lt;/span&gt;, I went on to explore some implementations of this &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;algorithm&lt;/span&gt; to see what type of results we would get on some of the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;data sets&lt;/span&gt; we have. 
I first started using &lt;a style="color: rgb(51, 102, 255);" href="http://www.cs.princeton.edu/%7Eblei/lda-c/index.html"&gt;David &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Blei's&lt;/span&gt; code&lt;/a&gt;, but it ended up being too complex to use, and the documentation was very elusive. So I started &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;to&lt;/span&gt; look at another tool, &lt;a style="color: rgb(51, 102, 255);" href="http://mallet.cs.umass.edu/index.php"&gt;Mallet&lt;/a&gt;, which also includes an implementation of &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;LDA&lt;/span&gt;.&lt;br /&gt;Here are the first results I've come up with when running it against the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;Encyclopédie&lt;/span&gt;. The main issue when using topic modeling is, as described in &lt;a style="color: rgb(51, 102, 255);" href="http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf"&gt;this article&lt;/a&gt;, coming up with the right number of topics, as the results differ quite a bit depending on this number. I haven't quite settled on a particular number yet. Below are the topics I've come up with. Let me know which version(s) you think seem the most accurate. I would argue that the question comes down to how focused we want each topic to be, or how broad we want those topics to be without losing any accuracy.  Please let me know if there are some words you think I could eliminate (less noise, more accuracy). Several kinds of hints would be useful, such as pinpointing a topic that doesn't make sense, or a word that seems out of place somewhere (probably some noise to be eliminated during another run). 
Note that the list of words that I delete from the articles (so far a little over 300) could very well be used for other 18&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;th&lt;/span&gt; century French texts, if not for different periods from 1650 to today with some tweaks here and there. Thanks.&lt;br /&gt;&lt;br /&gt;Version with 42 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/42topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/42topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 100 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/100topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/100topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 150 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/150topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/150topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 200 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/200topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/200topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 250 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/250topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/250topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 300 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/300topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/300topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;Version with 350 topics:&lt;br /&gt;&lt;a href="http://robespierre.uchicago.edu/topic_modeling/350topics-encyclo.txt"&gt;http://robespierre.uchicago.edu/topic_modeling/350topics-encyclo.txt&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;These results are just the preliminary step. The interesting part is the topics proportions per document. 
I'll show some results in another post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-511983332739990759?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/08/preliminary-results-on-topic-modeling.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/511983332739990759'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/511983332739990759'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/08/preliminary-results-on-topic-modeling.html' title='Preliminary results on topic modeling in the Encyclopédie'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-4001039962089664878</id><published>2009-07-27T15:29:00.003-05:00</published><updated>2009-07-27T15:35:21.736-05:00</updated><title type='text'>Looking at different implementations of fuzzy matching</title><content type='html'>While thinking of maybe renovating philologic, one of the possibilities we would look into would be fuzzy matching. A couple of implementations exist. I looked at what each one had to offer. Please let me know if some things are unclear. 
&lt;a href="http://docs.google.com/View?id=dgrbcw9z_61d7zpvs48"&gt;Here&lt;/a&gt; are the results of this investigation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-4001039962089664878?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/07/looking-at-different-implementations-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4001039962089664878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/4001039962089664878'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/07/looking-at-different-implementations-of.html' title='Looking at different implementations of fuzzy matching'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-7389527743935173998</id><published>2009-07-27T15:13:00.005-05:00</published><updated>2009-07-28T12:53:38.690-05:00</updated><title type='text'>An experiment on text segmentation</title><content type='html'>&lt;span style=";font-family:verdana;font-size:100%;"  &gt;What is text segmentation?&lt;br /&gt;The whole point of text segmentation is to be able to divide texts into meaningful segments by using an algorithm that will analyze the text and automatically subdivide it by identifying topic shifts. 
This is really the first step towards a larger goal, that is being able to run a classifier on each identified segment and therefore be able to determine automatically what topic each segment is about. I therefore started investigating the possibilities of one implementation of text segmentation to see if the results were encouraging.&lt;br /&gt;The results of this experimentation can be found &lt;a href="http://docs.google.com/View?id=dgrbcw9z_596q66hpg2"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style=";font-family:verdana;font-size:100%;"  &gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-7389527743935173998?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/07/experiment-on-text-segmentation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7389527743935173998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7389527743935173998'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/07/experiment-on-text-segmentation.html' title='An experiment on text segmentation'/><author><name>Clovis</name><uri>http://www.blogger.com/profile/09949897464324648883</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-627504178237862252</id><published>2009-07-20T14:40:00.003-05:00</published><updated>2009-11-14T11:37:30.912-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' 
term='LDA'/><title type='text'>Fast Latent Dirichlet Allocation</title><content type='html'>Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth,   and Max Welling.  "Fast collapsed gibbs sampling for latent dirichlet allocation."   &lt;em&gt;KDD '08: Proceeding of the 14th ACM SIGKDD international   conference on Knowledge discovery and data mining&lt;/em&gt;.  New York, NY, USA: ACM, 2008,   569-577.  (&lt;a href="http://www.citeulike.org/group/2914/article/5210241"&gt;Link&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;This describes Fast LDA and suggests that this may be helpful in "real time" topic modeling of a few thousand documents returned by a search engine.  The introduction to section 3 gives a nice "intuitive" description of LDA, helpful for those, like me, who are significantly math challenged, as well as some algorithm descriptions.   The paper has links to &lt;a href="http://www.ics.uci.edu/%7Eiporteou/fastlda/"&gt;code&lt;/a&gt; and David Newman has posted links to some earlier &lt;a href="http://www.ics.uci.edu/%7Enewman/code/topicmodel/"&gt;code&lt;/a&gt; which may be of considerable interest.   
Newman has done some interesting work on topic modeling of 18th-century American newspapers (&lt;a href="http://www.citeulike.org/group/2914/article/3394394"&gt;link&lt;/a&gt; and &lt;a href="http://www.historycooperative.org/journals/cp/vol-06/no-02/tales/"&gt;link&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-627504178237862252?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/07/fast-latent-dirichlet-allocation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/627504178237862252'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/627504178237862252'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/07/fast-latent-dirichlet-allocation.html' title='Fast Latent Dirichlet Allocation'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-7407124198351419146</id><published>2009-07-08T16:46:00.006-05:00</published><updated>2009-11-14T11:37:30.912-06:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Topic modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='LDA'/><title type='text'>Dynamic Topic Models</title><content type='html'>I just had a look at David Hall, Daniel Jurafsky, and Christopher Manning.  "Studying the History of Ideas Using Topic Models."   
&lt;em&gt;Proceedings from the EMNLP 2008: Conference on Empirical   Methods in Natural Language Processing&lt;/em&gt;.  October 2008. [&lt;a href="http://www.citeulike.org/user/markymaypo/article/5094138"&gt;link&lt;/a&gt;]   This is a very interesting article that uses Latent Dirichlet Allocation  [&lt;a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation"&gt;link&lt;/a&gt; wikipedia] and some extensions to examine changing publication trends in computational linguistics.   As noted in the Wikipedia entry, this approach [LDA] is described in David Blei, Andrew Y. Ng, and Michael I. Jordan.  "Latent Dirichlet Allocation."  &lt;em&gt;Journal of Machine Learning Research&lt;/em&gt;  3 (January 2003)  [&lt;a href="http://www.citeulike.org/user/markymaypo/article/1939239"&gt;link&lt;/a&gt;].  David Blei has released code [&lt;a href="http://www.cs.princeton.edu/%7Eblei/topicmodeling.html"&gt;link&lt;/a&gt;] and has a number of samples, a listserv, etc. on his site.  He also gave a great presentation of his work as a Google talk "&lt;span id="details-title"&gt;Modeling Science: Dynamic Topic Models of Scholarly Research" in May 2007&lt;/span&gt;  [&lt;a href="http://video.google.com/videoplay?docid=3077213787166426672"&gt;link&lt;/a&gt; video and &lt;a href="http://www.citeulike.org/user/markymaypo/article/813963"&gt;paper&lt;/a&gt;].   This appears to be a powerful technique, able to handle changing vocabularies over a century of scientific writing.&lt;br /&gt;&lt;br /&gt;Running it on OS X, I can currently get topics for the sample AP collection provided by Blei, but inference fails with malloc errors.  I'm looking at the mailing list to see if there are any hints about OS X.&lt;br /&gt;&lt;br /&gt;Blei lists several implementations on his site, including one that is part of Mallet, which I think we installed here at one point.  
See also &lt;a href="http://gibbslda.sourceforge.net/"&gt;http://gibbslda.sourceforge.net/&lt;/a&gt;&lt;br /&gt;for another implementation and some samples run on large Wikipedia and Medline (abstract) collections.&lt;br /&gt;&lt;br /&gt;Also noticed a Ruby module described at&lt;br /&gt;&lt;a href="http://mendicantbug.com/2008/11/17/lda-in-ruby/"&gt;http://mendicantbug.com/2008/11/17/lda-in-ruby/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-7407124198351419146?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/07/dynamic-topic-models.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7407124198351419146'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/7407124198351419146'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/07/dynamic-topic-models.html' title='Dynamic Topic Models'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-9030421667523393928</id><published>2009-07-07T15:54:00.004-05:00</published><updated>2009-07-07T17:00:13.054-05:00</updated><title type='text'>Scribal Publication and Undiscovered Public Knowledge</title><content type='html'>In thinking about another project, I ran across Harold Love's &lt;span style="font-style: italic;"&gt;Scribal Publication in Seventeenth-Century England &lt;/span&gt;(&lt;span dir="ltr"&gt;Oxford: Clarendon Press, 1993. Pp. xi+379).  
&lt;/span&gt;[&lt;a href="http://books.google.com/books?id=H7jsgiR-xVMC"&gt;Google Books&lt;/a&gt;]&lt;br /&gt;&lt;br /&gt;It includes an interesting discussion of scribal publication as a "perfect example" of Don Swanson's notion of "Undiscovered Public Knowledge".  "By this he [Swanson] means knowledge that exists 'like scattered pieces of a puzzle' in scholarly books and articles, but remains unknown because its 'logically related parts ... have never become known to one person'."  The reference is to Don R. Swanson, 'Undiscovered public knowledge', &lt;span style="font-style: italic;"&gt;Library Quarterly&lt;/span&gt; 56 (1986).   Professor Swanson's work is aimed primarily at bio-medical research using a system that he and his colleagues call Arrowsmith, which is available at &lt;a href="http://kiwi.uchicago.edu/"&gt;http://kiwi.uchicago.edu/&lt;/a&gt; (currently in Charlie's office), which has links to recent papers and more references.&lt;br /&gt;&lt;br /&gt;It may be interesting to think about how this might be applied to research in the humanities.  Other work in the same area suggests that latent semantic indexing, a variation on the general vector space model, may be of use.&lt;br /&gt;&lt;br /&gt;A few more papers to think about:&lt;br /&gt;&lt;br /&gt;Xiaohua Hu, et al. 
"Mining undiscovered public knowledge from complementary and non-interactive biomedical literature through semantic pruning", &lt;span style="font-style: italic;"&gt;Proceedings of the 14th ACM international conference on Information and knowledge management&lt;/span&gt; (2005) [&lt;a href="http://portal.acm.org/citation.cfm?id=1099611"&gt;Link&lt;/a&gt;]&lt;br /&gt;and "Supercomputing Approach to Undiscovered Public Knowledge" [&lt;a href="http://www.isrl.illinois.edu/upk/"&gt;Link&lt;/a&gt;] from UIUC (of course).&lt;br /&gt;&lt;br /&gt;I will post more related articles on the ARTFL CiteULike and, if I remember, use the tag UDPK to cluster the papers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-9030421667523393928?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/07/scribal-publication-and-undiscovered.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/9030421667523393928'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/9030421667523393928'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/07/scribal-publication-and-undiscovered.html' title='Scribal Publication and Undiscovered Public Knowledge'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-3120295624039830008</id><published>2009-06-25T11:20:00.004-05:00</published><updated>2009-06-25T11:28:59.087-05:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='alignment'/><category scheme='http://www.blogger.com/atom/ns#' term='philoline'/><title type='text'>Textual Re-use of Ancient Greek Texts</title><content type='html'>&lt;h3 style="font-weight: normal;" class="post-title"&gt;Textual Re-use of Ancient Greek Texts: A case study on Plato’s works&lt;/h3&gt; &lt;p&gt;Marco Büchler &amp;amp; Annette Loos (eAqua Project, Leipzig)&lt;/p&gt;&lt;p&gt;Digital Classicist/ICS Work in Progress Seminar, Summer 2009  &lt;a href="http://www.digitalclassicist.org/wip/wip2009-04mbal.html"&gt;Link&lt;/a&gt;&lt;/p&gt;&lt;p&gt;See the abstract of the workshop presentation.  It appears to use n-grams with a mechanism to "relax word order" and a kind of semantic association.   Russ and I have talked a bit about both as future extensions to PhiloLine/PAIR to improve recall, though at the risk of lowering precision.&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-3120295624039830008?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/06/textual-re-use-of-ancient-greek-texts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3120295624039830008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/3120295624039830008'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/06/textual-re-use-of-ancient-greek-texts.html' title='Textual Re-use of Ancient Greek Texts'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-6593595162177273214</id><published>2009-06-25T10:02:00.001-05:00</published><updated>2009-06-25T10:04:12.865-05:00</updated><title type='text'>PhiloLogic: Ubuntu 64 bit compilation failure</title><content type='html'>Damir Cavar reports:&lt;br /&gt;&lt;br /&gt;After evaluations with various Linux distributions we came to the conclusion: Philologic index generation (the C-code) breaks on 64-bit (various versions) with a segmentation fault. We didn't manage to let it run in a 32-bit changeroot environment on Ubuntu and Debian.&lt;br /&gt;&lt;br /&gt;It works perfectly well on the newest release of the 32-bit Ubuntu server, and also on 32-bit Debian Lenny. On a 32-bit system the default is most likely that one has a memory limitation, i.e. max. 3.5 GB RAM, even though there might be more RAM available physically. If you install the Ubuntu "server kernel" on a 32-bit system, you get large memory support (i.e. more than 3.5 or 4 GB RAM), i.e. you need a PAE enabled kernel. On Debian it is the bigmem kernel you need to install. A 32-bit system is somewhat slower, there are various other disadvantages (if one uses other code or software that makes use of advanced 64-bit CPU features), but, well, we seem to have no other choice now for a solution with Philologic right now.&lt;br /&gt;&lt;br /&gt;We have a version running, now on Debian Lenny with the bigmem kernel, and we're putting the bits and pieces together, i.e. our Croatian localization, some scripts for statistics etc. 
Once this is up, I'll place some more docu, scripts, localizations and adaptations at the Croatian Language Corpus site: &lt;a href="http://riznica.ihjj.hr/" target="_blank"&gt;http://riznica.ihjj.hr/&lt;/a&gt; (this is still the old system, we are just migrating the infrastructure to new servers, using Lenny)&lt;br /&gt;&lt;br /&gt;More can soon be found on the pages of the Linguistics dept. at the University of Zadar: &lt;a href="http://ling.unizd.hr/" target="_blank"&gt;http://ling.unizd.hr/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Should somebody have a fix for a 64-bit Linux environment, hints would be very much appreciated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-6593595162177273214?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/06/philologic-ubuntu-64-bit-compilation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6593595162177273214'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/6593595162177273214'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/06/philologic-ubuntu-64-bit-compilation.html' title='PhiloLogic: Ubuntu 64 bit compilation failure'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8901065416749663157.post-5981355597153839540</id><published>2009-06-25T09:37:00.003-05:00</published><updated>2009-06-25T09:49:31.350-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='software'/><title type='text'>ASV Toolbox project</title><content type='html'>&lt;a href="http://wortschatz.uni-leipzig.de/%7Ecbiemann/software/toolbox/"&gt;http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;ASV Toolbox is a modular collection of tools for the exploration of written language data. They work either on word lists or text and solve several linguistic classification and clustering tasks. The topics covered contain language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern based and statistical approaches. The collection can be used to work on large real world data sets as well as for studying the underlying algorithms. The ASV Toolbox can work on plain text files and connect to a MySQL database. While it is especially designed to work with corpora of the &lt;a href="http://corpora.uni-leipzig.de/"&gt;Leipzig Corpora Collection&lt;/a&gt;, it can easily be adapted to other sources.&lt;br /&gt;&lt;br /&gt;Many of these tools appear to be described in &lt;a href="http://lips.informatik.uni-leipzig.de/browse/results/field_authors:%22Biemann,%20C.%22"&gt;recent papers&lt;/a&gt; by Biemann and his collaborators.&lt;br /&gt;&lt;br /&gt;Thanks to Alain Guerreau for the pointer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8901065416749663157-5981355597153839540?l=artfl.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='text/html' href='http://artfl.blogspot.com/2009/06/asv-toolbox-project.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5981355597153839540'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8901065416749663157/posts/default/5981355597153839540'/><link rel='alternate' type='text/html' href='http://artfl.blogspot.com/2009/06/asv-toolbox-project.html' title='ASV Toolbox project'/><author><name>Mark</name><uri>http://www.blogger.com/profile/01834980565423639300</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
