ARTFL Project Research Blog

Looking at different implementations of fuzzy matching

Clovis Monday, July 27, 2009

While thinking about a possible overhaul of PhiloLogic, one feature we would want to look into is fuzzy matching. Several implementations exist, so I looked at what each one has to offer. Here are the results of this investigation. Please let me know if anything is unclear.
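For readers new to the idea, here is a minimal sketch of one common form of fuzzy matching: ranking vocabulary words by their edit (Levenshtein) distance from a query. It is not taken from any of the implementations reviewed, and the function names, the sample word list, and the default threshold of 2 are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(query, vocabulary, max_distance=2):
    """Return (distance, word) pairs within max_distance, closest first.
    Hypothetical helper: a real index would avoid scanning every word."""
    hits = [(levenshtein(query, word), word) for word in vocabulary]
    return sorted((d, w) for d, w in hits if d <= max_distance)


# Made-up word list with spelling variants.
words = ["philosophe", "philosofe", "philosophie", "physique"]
print(fuzzy_lookup("philosophe", words))
# [(0, 'philosophe'), (1, 'philosophie'), (2, 'philosofe')]
```

Scanning the whole vocabulary like this is fine for a toy list but would be slow on a large corpus, which is one reason to compare existing implementations.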

An experiment on text segmentation

Clovis Monday, July 27, 2009

What is text segmentation?
The point of text segmentation is to divide a text into meaningful segments automatically, using an algorithm that analyzes the text and identifies topic shifts. This is really the first step towards a larger goal: running a classifier on each identified segment to determine automatically what topic that segment is about. I therefore started investigating one implementation of text segmentation to see if the results were encouraging.
The results of this experimentation can be found here.
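
To make the idea of "identifying topic shifts" concrete, here is a simplified sketch in the spirit of lexical-cohesion methods: compare the vocabulary on either side of each sentence gap and place a boundary where the similarity drops. This is not the implementation tested in the experiment. The stopword list, the window size, and the threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"a", "an", "and", "of", "the", "with"}


def bag_of_words(sentences):
    """Count the non-stopword tokens in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_scores(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        scores.append(cosine(left, right))
    return scores


def segment(sentences, window=2, threshold=0.1):
    """Split wherever the similarity across a gap falls below threshold."""
    scores = gap_scores(sentences, window)
    segments, start = [], 0
    for i, score in enumerate(scores):
        if score < threshold:
            segments.append(sentences[start:i + 1])
            start = i + 1
    segments.append(sentences[start:])
    return segments
```

Given two sentences about a treaty followed by two about plants, with no shared content words between the pairs, `segment` returns two segments of two sentences each. A fixed threshold is the crudest possible boundary rule; looking for local dips in the similarity curve is a more robust alternative.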

Fast Latent Dirichlet Allocation

Mark Monday, July 20, 2009

Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. "Fast collapsed gibbs sampling for latent dirichlet allocation." KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2008, 569-577. (Link)

This describes Fast LDA and suggests that it may be helpful in "real time" topic modeling of a few thousand documents returned by a search engine. The introduction to section 3 gives a nice "intuitive" description of LDA, helpful for those who, like me, are significantly math-challenged, as well as some algorithm descriptions. The paper has links to code, and David Newman has posted links to some earlier code which may be of considerable interest. Newman has done some interesting work on topic modeling of 18th-century American newspapers (link and link).
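
For those who find code easier to read than equations, here is a minimal sketch of plain collapsed Gibbs sampling for LDA, the baseline that the paper sets out to speed up. It is not the Fast LDA algorithm and is not based on the authors' code. The function name, the hyperparameter defaults (alpha, beta), and the iteration count are assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Plain collapsed Gibbs sampling for LDA.
    docs is a list of token lists. Returns (phi, doc_topic), where phi[k]
    maps each word to its probability under topic k and doc_topic[d][k]
    counts the tokens of document d assigned to topic k."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove this token from the counts...
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # ...then weigh each topic by how much the document likes
                # the topic and how much the topic likes the word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    phi = [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + V * beta)
         for w in vocab}
        for k in range(num_topics)
    ]
    return phi, doc_topic
```

Note that the inner loop computes a weight for every topic for every token, which is the cost that grows with the number of topics. As I understand it, the paper's contribution is a way to avoid much of that work; that optimization is not shown here.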

Dynamic Topic Models

Mark Wednesday, July 08, 2009

I just had a look at David Hall, Daniel Jurafsky, and Christopher Manning, "Studying the History of Ideas Using Topic Models," Proceedings from the EMNLP 2008: Conference on Empirical Methods in Natural Language Processing, October 2008 [link]. This is a very interesting article that uses Latent Dirichlet Allocation [link wikipedia] and some extensions to examine changing publication trends in computational linguistics. As noted in the Wikipedia entry, the LDA approach is described in David Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research 3 (January 2003) [link]. David Blei has released code [link] and has a number of samples, a listserv, etc. on his site. He also gave a great presentation of his work as a Google talk, "Modeling Science: Dynamic Topic Models of Scholarly Research," in May 2007 [link video and paper]. This appears to be a powerful technique that can handle changing vocabularies over a century of scientific writing.

In trying to run it on OS X, I can currently get topics for the sample AP collection provided by Blei, but I cannot get inferences because it throws malloc errors. I'm looking at the mailing list to see if there are any hints about OS X.

Blei lists several implementations on his site, including one that is part of Mallet, which I think we installed here at one point. See also http://gibbslda.sourceforge.net/ for another implementation and some samples run on large Wikipedia and Medline (abstract) collections.

I also noticed a Ruby module described at http://mendicantbug.com/2008/11/17/lda-in-ruby/

Scribal Publication and Undiscovered Public Knowledge

Mark Tuesday, July 07, 2009

In thinking about another project, I ran across Harold Love's Scribal Publication in Seventeenth-Century England (Oxford: Clarendon Press, 1993. Pp. xi+379). [Google Books]

This has an interesting discussion of scribal publication as a "perfect example" of Don Swanson's notion of "Undiscovered Public Knowledge". "By this he [Swanson] means knowledge that exists 'like scattered pieces of a puzzle' in scholarly books and articles, but remains unknown because its 'logically related parts ... have never become known to one person'." The reference is to Don R. Swanson, 'Undiscovered public knowledge', Library Quarterly 56 (1986). Professor Swanson's work is aimed primarily at bio-medical research and uses a system that he and his colleagues call Arrowsmith. It is available at http://kiwi.uchicago.edu/ (currently in Charlie's office), which has links to recent papers and more references.

It may be interesting to think about how this might be applied to research in the humanities. Other work in the same area suggests that latent semantic indexing, a variation on the general vector space model, may be of use.
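
As a way of thinking about what this could look like in practice, here is a toy sketch of the "scattered pieces of a puzzle" idea: take two bodies of literature that do not refer to each other and list the terms they share, since those are candidate links between them. This is only an illustration and not how Arrowsmith actually works. The titles are invented, loosely modeled on Swanson's well-known fish oil and Raynaud's syndrome example, and the function names and stopword list are assumptions.

```python
import re

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "on", "the", "with"}


def terms(titles):
    """Set of non-stopword terms appearing in a list of titles."""
    found = set()
    for title in titles:
        tokens = re.findall(r"[a-z]+", title.lower())
        found.update(t for t in tokens if t not in STOPWORDS)
    return found


def bridging_terms(literature_a, literature_c, topic_a, topic_c):
    """Terms found in both literatures, excluding the topic names themselves.
    Hypothetical helper for illustration only."""
    shared = terms(literature_a) & terms(literature_c)
    return sorted(shared - terms([topic_a, topic_c]))


# Invented titles standing in for two literatures that never cite each other.
fish_oil = [
    "Dietary fish oil reduces blood viscosity",
    "Fish oil and platelet aggregation in healthy volunteers",
]
raynaud = [
    "Blood viscosity in Raynaud syndrome",
    "Platelet aggregation is increased in Raynaud syndrome",
]
print(bridging_terms(fish_oil, raynaud, "fish oil", "Raynaud syndrome"))
# ['aggregation', 'blood', 'platelet', 'viscosity']
```

For humanities texts the raw word overlap would be far noisier than in this toy case, which is where something like latent semantic indexing might earn its keep.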

A few more papers to think about:

Xiaohua Hu, et al. "Mining undiscovered public knowledge from complementary and non-interactive biomedical literature through semantic pruning", Proceedings of the 14th ACM international conference on Information and knowledge management (2005) [Link] and "Supercomputing Approach to Undiscovered Public Knowledge" [Link], from UIUC (of course).

I will post more related articles on the ARTFL CiteULike and, if I remember, use the tag UDPK to cluster the papers.


Developed by ARTFL