ARTFL Project Research Blog

Looking at different implementations of fuzzy matching

Clovis, Monday, July 27, 2009

As we consider renovating PhiloLogic, one feature we would like to look into is fuzzy matching. Several implementations exist, and I looked at what each one has to offer. Please let me know if anything is unclear. Here are the results of this investigation.
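For readers new to the term, one common basis for fuzzy matching is edit distance: two words match if only a few character edits separate them. The sketch below is an illustration under that assumption, not one of the implementations surveyed; the function names `levenshtein` and `fuzzy_matches` and the sample word list are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, vocabulary, max_distance=1):
    """Return the vocabulary words within max_distance edits of query."""
    return [word for word in vocabulary
            if levenshtein(query, word) <= max_distance]


if __name__ == "__main__":
    words = ["philosophe", "philosophie", "philologie"]  # made-up word list
    print(fuzzy_matches("philosophe", words))
```

Scanning a whole vocabulary this way is linear in its size, so a real index would likely need something smarter than this brute-force loop.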

An experiment on text segmentation

Clovis, Monday, July 27, 2009

What is text segmentation?
The point of text segmentation is to divide a text into meaningful segments automatically, using an algorithm that identifies topic shifts. This is a first step towards a larger goal: running a classifier on each identified segment to determine automatically what topic it is about. I therefore started experimenting with one implementation of text segmentation to see whether the results were encouraging.
The results of this experimentation can be found here.
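To make the idea of "identifying topic shifts" concrete, here is a minimal sketch loosely in the spirit of TextTiling: it compares the vocabulary on either side of each sentence gap and cuts where the two sides share almost nothing. It is not the implementation tested in the experiment; the function names, the window size and the fixed threshold are all assumptions made for illustration (TextTiling proper uses depth scores rather than a fixed cutoff).

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Word counts for a block of sentences."""
    return Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Split sentences wherever the blocks before and after a gap
    have a cosine similarity below threshold."""
    segments, start = [], 0
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        if cosine(left, right) < threshold:
            segments.append(sentences[start:gap])
            start = gap
    segments.append(sentences[start:])
    return segments


if __name__ == "__main__":
    text = [  # made-up example sentences
        "the cat chased the mouse",
        "the mouse fled the cat",
        "stock markets fell sharply",
        "markets and stock prices dropped",
    ]
    for seg in segment(text):
        print(seg)
```

Each resulting segment could then be handed to a classifier, which is the larger goal described above.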

Fast Latent Dirichlet Allocation

Mark, Monday, July 20, 2009

Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. "Fast collapsed gibbs sampling for latent dirichlet allocation." KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2008, 569-577. (Link)

This describes Fast LDA and suggests that it may be helpful in "real time" topic modeling of a few thousand documents returned by a search engine. The introduction to section 3 gives a nice "intuitive" description of LDA, helpful for those, like me, who are significantly math challenged, as well as some algorithm descriptions. The paper has links to code, and David Newman has posted links to some earlier code which may be of considerable interest. Newman has done some interesting work on topic modeling of 18th century American newspapers (link and link).
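For those who prefer code to equations, the following is a minimal sketch of plain collapsed Gibbs sampling for LDA, the baseline procedure that the paper sets out to accelerate. It does not reproduce the paper's speed-up, and it is not the authors' code: the function name `lda_gibbs`, the hyperparameter defaults and the toy documents are assumptions for illustration. Documents are assumed to be lists of integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Plain collapsed Gibbs sampling for LDA.

    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # made-up word ids
    doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
    print(doc_topic)
    print(topic_word)
```

Note that the inner loop computes a weight for every topic at every token, so the cost per token grows with the number of topics; that per-token cost is what makes the sampler slow on large collections.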

Dynamic Topic Models

Mark, Wednesday, July 08, 2009

I just had a look at David Hall, Daniel Jurafsky, and Christopher Manning. "Studying the History of Ideas Using Topic Models." Proceedings from the EMNLP 2008: Conference on Empirical Methods in Natural Language Processing. October 2008. [link] This is a very interesting article, using Latent Dirichlet Allocation [link wikipedia] and some extensions, examining changing publication trends in computational linguistics. As noted on the Wikipedia entry, this approach [LDA] is described in David Blei, Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (January 2003) [link]. David Blei has released code [link] and has a number of samples, a listserv, etc. on his site. He also gave a great presentation of his work as a Google talk "Modeling Science: Dynamic Topic Models of Scholarly Research" in May 2007 [link video and paper]. This appears to be a powerful technique, which has the ability to handle changing vocabularies over a century of scientific writing.

Running it on OS X, I can currently get topics for the sample AP collection provided by Blei, but I cannot get inferences, as it throws malloc errors. I'm looking at the mailing list to see if there are any hints about OS X.

Blei lists several implementations on his site, including one that is part of Mallet, which I think we installed here at one point. See also http://gibbslda.sourceforge.net/ for another implementation and some samples run on large Wikipedia and Medline (abstract) collections.

I also noticed a Ruby module described at http://mendicantbug.com/2008/11/17/lda-in-ruby/

Scribal Publication and Undiscovered Public Knowledge

Mark, Tuesday, July 07, 2009

In thinking about another project, I ran across Harold Love's Scribal Publication in Seventeenth-Century England (Oxford: Clarendon Press, 1993. Pp. xi+379). [Google Books]

This has an interesting discussion of scribal publication as a "perfect example" of Don Swanson's notion of "Undiscovered Public Knowledge". "By this he [Swanson] means knowledge that exists 'like scattered pieces of a puzzle' in scholarly books and articles, but remains unknown because its 'logically related parts ... have never become known to one person'." The reference is to Don R. Swanson, 'Undiscovered public knowledge', Library Quarterly 56 (1986). Professor Swanson's work is aimed primarily at bio-medical research, using a system that he and his colleagues call Arrowsmith. It is available at http://kiwi.uchicago.edu/ (currently in Charlie's office), which has links to recent papers and more references.

It may be interesting to think about how this might be applied to research in the humanities. Other work in the same area suggests that latent semantic indexing, a variation on the general vector space model, may be of use.

A few more papers to think about:

Xiaohua Hu, et al. "Mining undiscovered public knowledge from complementary and non-interactive biomedical literature through semantic pruning", Proceedings of the 14th ACM international conference on Information and knowledge management (2005) [Link] and "Supercomputing Approach to Undiscovered Public Knowledge" [Link], from UIUC (of course).

I will post more related articles on the ARTFL CiteULike and, if I remember, use the tag UDPK to cluster the papers.

Textual Re-use of Ancient Greek Texts

Mark, Thursday, June 25, 2009

Textual Re-use of Ancient Greek Texts: A case study on Plato’s works

Marco Büchler & Annette Loos (eAqua Project, Leipzig)

Digital Classicist/ICS Work in Progress Seminar, Summer 2009 Link

See the abstract of the workshop presentation. It appears to use n-grams with a mechanism to "relax word order" and a kind of semantic association. Russ and I have talked a bit about both as future extensions to PhiloLine/PAIR to improve recall, though at the risk of reducing precision.
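One possible reading of "relaxing word order" is to treat each n-gram as an unordered bag of words, so that two passages match even when the words inside a window are permuted. The sketch below illustrates that reading only; it is an assumption, not the eAqua method and not how PhiloLine/PAIR works, and the function names and sample sentences are hypothetical.

```python
import re


def unordered_ngrams(text, n=3):
    """Return the set of n-grams in text, each sorted alphabetically
    so that word order inside the n-gram no longer matters."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(sorted(words[i:i + n])) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a, text_b, n=3):
    """Order-insensitive n-grams that occur in both texts."""
    return unordered_ngrams(text_a, n) & unordered_ngrams(text_b, n)


if __name__ == "__main__":
    source = "the unexamined life is not worth living"  # made-up example
    reuse = "a life unexamined is hardly worth the trouble"
    print(shared_ngrams(source, reuse))
```

A strict, order-preserving trigram match would miss "unexamined life is" against "life unexamined is"; the unordered version finds it, which is the recall gain, while also admitting accidental matches, which is the precision cost mentioned above.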

PhiloLogic: Ubuntu 64 bit compilation failure

Mark, Thursday, June 25, 2009

Damir Cavar reports:

After evaluations with various Linux distributions we came to the conclusion: Philologic index generation (the C-code) breaks on 64-bit (various versions) with a segmentation fault. We didn't manage to let it run in a 32-bit changeroot environment on Ubuntu and Debian.

It works perfectly well on the newest release of the 32-bit Ubuntu server, and also on 32-bit Debian Lenny. On a 32-bit system the default is most likely that one has a memory limitation, i.e. max. 3.5 GB RAM, even though there might be more RAM available physically. If you install the Ubuntu "server kernel" on a 32-bit system, you get large memory support (i.e. more than 3.5 or 4 GB RAM), i.e. you need a PAE enabled kernel. On Debian it is the bigmem kernel you need to install. A 32-bit system is somewhat slower, there are various other disadvantages (if one uses other code or software that makes use of advanced 64-bit CPU features), but, well, we seem to have no other choice now for a solution with Philologic right now.

We have a version running, now on Debian Lenny with the bigmem kernel, and we're putting the bits and pieces together, i.e. our Croatian localization, some scripts for statistics etc. Once this is up, I'll place some more docu, scripts, localizations and adaptations at the Croatian Language Corpus site: http://riznica.ihjj.hr/ (this is still the old system, we are just migrating the infrastructure to new servers, using Lenny)

More can soon be found on the pages of the Linguistics dept. at the University of Zadar: http://ling.unizd.hr/

Should somebody have a fix for a 64-bit Linux environment, hints would be very much appreciated.

ASV Toolbox project

Mark, Thursday, June 25, 2009

http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/

ASV Toolbox is a modular collection of tools for the exploration of written language data. They work either on word lists or text and solve several linguistic classification and clustering tasks. The topics covered contain language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern based and statistical approaches. The collection can be used to work on large real world data sets as well as for studying the underlying algorithms. The ASV Toolbox can work on plain text files and connect to a MySQL database. While it is especially designed to work with corpora of the Leipzig Corpora Collection, it can easily be adapted to other sources.

Many of these appear to be described in recent papers by Biemann and his collaborators.

Thanks to Alain Guerreau for the pointer.
