While thinking about a possible overhaul of PhiloLogic, one feature we would look into is fuzzy matching. A couple of implementations exist, and I looked at what each one has to offer. Please let me know if anything is unclear. Here are the results of this investigatio...
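To make "fuzzy matching" concrete, here is a minimal sketch based on Levenshtein edit distance. It is not taken from any of the implementations surveyed in the post; the function names `edit_distance` and `fuzzy_lookup` and the `max_distance` parameter are hypothetical, chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(query, vocabulary, max_distance=1):
    """Return (distance, word) pairs within max_distance, closest first."""
    hits = ((edit_distance(query, word), word) for word in vocabulary)
    return sorted((d, w) for d, w in hits if d <= max_distance)


words = ["philosophie", "philosophe", "philologie"]
matches = fuzzy_lookup("philosophie", words, max_distance=2)
```

Scanning the whole vocabulary like this is only practical for small word lists; a real search engine would likely need an index that prunes candidates first.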
An experiment on text segmentation
What is text segmentation? The point of text segmentation is to divide a text into meaningful segments, using an algorithm that analyzes the text and identifies topic shifts. This is really the first step towards a larger goal: running a classifier on each identified segment to determine automatically what topic it is about. I therefore started investigating the possibilities of one implementation of text segmentation to see...
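As a rough illustration of identifying topic shifts, the sketch below compares the vocabulary on either side of each sentence gap and places a boundary where the two sides share almost no words. This is a simplified lexical-cohesion approach, not the implementation investigated in the post; the names `segment`, `window`, and `threshold` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lowercase word occurrences across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def segment(sentences, window=2, threshold=0.1):
    """Split where similarity between adjacent windows drops below threshold."""
    segments, start = [], 0
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        if cosine(left, right) < threshold:
            segments.append(sentences[start:gap])
            start = gap
    segments.append(sentences[start:])
    return segments


text = [
    "the cat chased the mouse",
    "the mouse fled the cat",
    "stock prices rose sharply",
    "investors bought stock",
]
parts = segment(text)
```

The sketch does no stopword removal or stemming, so common function words inflate the similarity scores; the threshold would need tuning on real text.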
Fast Latent Dirichlet Allocation
Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. "Fast collapsed gibbs sampling for latent dirichlet allocation." KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2008, 569-577. (Link) This describes Fast LDA and suggests that this may be helpful in "real time" topic modeling of a few thousand documents returned by a search engine. The introduction to section 3 gives a nice "intuitive"...
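For readers unfamiliar with the baseline that the paper speeds up, here is a minimal sketch of the standard collapsed Gibbs sampler for LDA. It is not the Fast LDA algorithm itself: it computes the weight of every topic for every token, which is the per-token cost the paper sets out to reduce. The function name `lda_gibbs`, its parameters, and the toy corpus are assumptions for illustration; documents are lists of integer word ids.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=50, seed=0):
    """Standard collapsed Gibbs sampling for LDA (not the Fast LDA variant)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            k = rng.randrange(num_topics)
            z.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
```

The inner list comprehension is linear in the number of topics for each token, so the cost per sweep grows with the topic count.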
Dynamic Topic Models
I just had a look at David Hall, Daniel Jurafsky, and Christopher Manning. "Studying the History of Ideas Using Topic Models." Proceedings from the EMNLP 2008: Conference on Empirical Methods in Natural Language Processing. October 2008. [link] This is a very interesting article, using Latent Dirichlet Allocation [link wikipedia] and some extensions, examining changing publication trends in computational linguistics. As noted on the Wikipedia entry, this approach [LDA] is described in David Blei, Andrew Y. Ng, and...
Scribal Publication and Undiscovered Public Knowledge
In thinking about another project, I ran across Harold Love's Scribal Publication in Seventeenth-Century England (Oxford: Clarendon Press, 1993. Pp. xi+379). [Google Books] This has an interesting discussion of scribal publication as a "perfect example" of Don Swanson's notion of "Undiscovered Public Knowledge". "By this he [Swanson] means knowledge that exists 'like scattered pieces of a puzzle' in scholarly books and articles, but remains unknown because its 'logically related parts ... have never become known...