TextPAIR: a new high-performance sequence aligner

We are happy to announce the release of TextPAIR, a new sequence aligner focused on detecting text reuse in large bodies of text. In many ways, TextPAIR is a successor to the old TextPAIR and PhiloLine released in 2009. But it also differs in important ways, which we will highlight here.

The ARTFL-Project has long worked on intertextuality (see our papers section on the ARTFL site), and on finding ways to detect similar passages in running text. Although we found great success with PhiloLine, particularly in the context of the Commonplace Cultures project, we also faced certain limitations which we wanted to address, particularly in the case of our recent project to explore the legacies of the Enlightenment in 19th century print culture.

Higher performance

The first issue that we wanted to address was that of performance. PhiloLine certainly wasn't slow, but it also wasn't designed to run on very large-scale datasets, and remains to this day an experimental implementation meant to be replaced by a more optimized version. It served us well during the Commonplace Cultures project, where we ran the aligner against 200,000 texts. But the task also took 3 weeks to run, and needed to be broken up into several batches to run to completion. The results were certainly fruitful (over 40 million shared passages were detected!), but rerunning the task with a different set of parameters was out of the question given the deadline for the completion of the project.

As a result, when we started designing the new generation of our sequence aligner, we decided to focus on performance. We also wanted to leverage the rich Python ecosystem of NLP tools, so we decided that we would write this new package in Python (PhiloLine was written in Perl). After a redesign of the matching algorithm, the initial Python version was able to run about 1.5 to 2 times faster, but also with much higher RAM usage, about 4-6 times more than PhiloLine. Certainly not a ground-breaking difference... Accelerating the alignment by parallelizing the task was out of the question given the memory cost of using multiprocessing in Python.

While we could have at that point decided to use Cython to gain C speed and parallelize the code, we decided to take a look at Go, a relatively new language developed at Google, which excels at running concurrent tasks, and runs significantly faster than Python. After a proof of concept rewrite in Go showed that we could run an alignment of all ARTFL-Frantext's 3,500 texts in under 4 hours on a single core, a task that took about 10 hours with the Python version, we decided to go for a pure Go implementation of the core aligner code. While the RAM usage was a bit lower in Go than in Python, it was still somewhat high for our purposes, so we decided to use only 32 bit integers for all integer values (instead of the 64 bit default), effectively halving our memory usage. Our highest potential integer values are in the byte positions of passages within documents, and given that we are unlikely to encounter a text file of 2,147,483,647 bytes (the maximum value for 32 bit signed integers) anytime soon, there was no risk in switching to 32 bit integers.
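The arithmetic behind this choice is easy to check. The sketch below is only an illustration in Python using the standard struct module (the aligner itself is written in Go), and the figure of one million byte positions is an arbitrary assumption, not a measurement from our runs.

```python
import struct

# Size in bytes of fixed-width signed integers ("<" forces standard sizes).
INT32_SIZE = struct.calcsize("<i")  # 4 bytes
INT64_SIZE = struct.calcsize("<q")  # 8 bytes

# Largest value a signed 32 bit integer can hold.
MAX_INT32 = 2**31 - 1


def bytes_needed(n_values: int, item_size: int) -> int:
    """Memory needed to store n_values integers of the given width."""
    return n_values * item_size


# Hypothetical workload: one million byte positions of passages.
n_positions = 1_000_000
as_int64 = bytes_needed(n_positions, INT64_SIZE)
as_int32 = bytes_needed(n_positions, INT32_SIZE)

print(MAX_INT32)            # the largest byte offset we can represent
print(as_int32 / as_int64)  # fraction of the 64 bit memory footprint
```

A byte offset only overflows a signed 32 bit integer in a file of roughly 2 GiB, which is far beyond the size of any single text we expect to process.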

After a number of optimizations to the code, we were able to bring down the runtime of our ARTFL-Frantext alignment to a mere 11 minutes (!!!), leveraging all 16 cores (and 32 threads) of our server. With the Python preprocessing included (which combines various normalization steps and the ngram creation), as well as the database loading and web application building, it took a total of 20 minutes to go from the PhiloLogic parsed output to a fully functioning web application capable of searching through the 60,000 alignments. As a result of these optimizations, we were able to complete the alignment of our Enlightenment legacies project, which compared 1,300 texts from before the 19th century to 115,000 files from the TGB collection, in about 4 hours, most of which was spent preprocessing and filtering the OCR files from the TGB. We were able to run this alignment multiple times using different parameters in order to obtain the best set of results.

A revamped preprocessing stage

A big aspect of the aligner rewrite was our decision to rely as much as possible on the fledgling Python ecosystem for all of our text preprocessing. There are many libraries available for preprocessing, but we decided to leverage spaCy, a well-documented and ever improving NLP library, for part-of-speech tagging and lemmatization. We do plan on using more of its features in the future.

We have also worked on building a virtual modernization pipeline for both English (relying on Martin Mueller's work on TCP resources) and French (using the work Marine Riguet did on modernizing old forms in ARTFL-Frantext). This is an important feature to have when comparing texts from different periods. The typical example in French would be converting old forms of the imperfect ending in -ois/-oit/-oient to -ais/-ait/-aient.
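To give an idea of what such a conversion involves, here is a deliberately naive sketch of the imperfect example. It is not the rule set used in our pipeline: the function name, the regular expression, and the short exception list are all assumptions made for illustration, and a purely suffix-based rule like this one would still wrongly rewrite modern words such as "droit" unless they are listed as exceptions.

```python
import re

# Hypothetical exception list: modern words that legitimately end in
# -ois/-oit/-oient and must not be rewritten.
EXCEPTIONS = {"bois", "fois", "mois", "trois", "doit", "voit", "soient", "voient"}

# Old imperfect endings: -ois, -oit, -oient at the end of a word.
OLD_IMPERFECT = re.compile(r"o(is|it|ient)$")


def modernize(word: str) -> str:
    """Lowercase a word and rewrite an old imperfect ending, if any."""
    lowered = word.lower()
    if lowered in EXCEPTIONS:
        return lowered
    # Replace the leading "o" of the ending with "a", keep the rest.
    return OLD_IMPERFECT.sub(r"a\1", lowered)


for old_form in ["avoit", "étoient", "faisois", "trois"]:
    print(old_form, "->", modernize(old_form))
```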

As we were working on this code, we realized it would be more useful to split this preprocessing step off from the TextPAIR code so we could reuse it for other text analysis work. We've therefore created a separate library, called text-preprocessing, which is available on GitHub, and which we are constantly working to improve separately from the TextPAIR code.

A much improved Web Application

The original PhiloLine had a web application associated with it, which could be used to search through alignments. But the feature set was restricted to searching shared passages using various metadata filtering options. Following in PhiloLogic4's footsteps, TextPAIR's Web interface has added faceted browsing to aggregate reuses in a way that gives a better overall perspective on the reuses present in the database. We also offer a Time Series view of shared passages to better understand how any given author/work has been reused across time. And finally, we've worked on getting it integrated into PhiloLogic4 (when alignments were built from PhiloLogic4 output) by providing contextual links that take you straight to a PhiloLogic instance.

Future work

We are looking to improve the current version of TextPAIR on different fronts:
  • Provide an alternate matching algorithm that can link together more loosely related passages
  • Allow for easier configuration of various components of the aligner and Web Application
  • Include a contextualization feature within TextPAIR (as an alternative to linking to a PhiloLogic instance)
  • Provide visualizations of alignments showing clusters of reuses, as well as of document to document shared passages

Some examples of currently running alignment databases

TextPAIR is fully open-source, and we gladly welcome any comments and/or contributions.


Evaluating the Practices and Legacy of the Enlightenment on 19th Century Print Culture

ARTFL is proud to announce the release of two large-scale sequence alignment databases built within the context of a collaborative project with l'Observatoire de la Vie Littéraire (OBVIL). The goal of this project was to investigate the legacy of the French Enlightenment on 19th century print culture. Thanks to the release by the BNF of the "Très Grande Bibliothèque" (TGB), a collection of 128,000 texts from their digital archive, we attempted to evaluate the presence of Enlightenment discourse within the French 19th century, relying on well-known text-reuse detection techniques. This project represented a natural outgrowth of previous research into sequence alignment in large collections, and resulted in the open-source release of TextPAIR, a high performance sequence aligner capable of comparing hundreds of thousands of documents in a mere 4 or 5 hours.

We used two well-curated datasets from the ARTFL Project holdings to form the test samples to identify Enlightenment discourse. The first are the 1,367 documents that comprise the pre-19th century holdings in ARTFL Frantext. This dataset contains a significant, though by no means complete, sample of major and minor French Enlightenment published works. We decided to retain Frantext’s 17th century holdings as part of this study. Thus, the most frequent authors with more than 10 works in this collection are shown in Table One (see bottom of post). The second sample is the complete text of the Encyclopédie of Diderot and d’Alembert as found in the ARTFL edition of this famous work. As mentioned, the ARTFL Frantext corpus and the Encyclopédie are both curated collections that have been largely corrected of input and other errors, as well as being reasonably close transcriptions of the original documents with most later editorial interventions having been removed.

The TGB collection, which was meant to be a representative sample of French 19th century print culture, comprises 128,441 documents which were digitized using Optical Character Recognition. As expected, the quality of the raw data varies widely depending on a whole range of factors, including age, preservation status and print quality, though it was overall of good quality. On the other hand, the document-level metadata was quite inconsistent, and sometimes incorrect, so our collaborators at the Observatoire de la Vie Littéraire had to perform some extensive preliminary work in order to get the data ready for our alignment experiments. This included resolving a number of authorship attribution issues, as well as normalizing the spelling of each author's name found in the corpus. Additionally, while the vast majority of the texts in the TGB were published during the 19th century, the collection has a significant number of documents which were originally published before 1800. Most of these documents were reprints of earlier texts in complete or selected works or, less commonly, as individual reprints. We used a series of heuristics based on the metadata provided by the BNF to eliminate duplicates and texts originally published before 1800. We removed 17,063 documents from the TGB sample, with the top authors removed listed in Table Two (see bottom of post). This left 112,907 documents in the TGB sample. There are, of course, some titles that should have been retained in the sample and others that should have been removed, since the criteria for removal were based on fairly simple heuristics, such as removing most titles identified as complete works and looking at author year of birth or death, where available, as another criterion.
Given that our goal was to draw a picture of the legacy of the Enlightenment using a representative sample of works published in the 19th century, this was a worthwhile tradeoff given the potential for many false positive reuses that would have been detected had we left in texts originally written in previous centuries.
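The kind of metadata heuristic described above can be sketched in a few lines. This is an illustrative approximation only: the function name, the title test, and the use of the author's year of death alone are assumptions, and the actual filtering relied on the BNF metadata fields and a broader set of rules.

```python
from typing import Optional

CUTOFF_YEAR = 1800


def should_remove(title: str, author_death: Optional[int]) -> bool:
    """Return True if a document looks like a reprint or a pre-1800 text."""
    # Heuristic 1: titles identified as complete works are mostly reprints.
    normalized = title.lower().replace("œ", "oe")
    if "oeuvres complètes" in normalized:
        return True
    # Heuristic 2: an author who died before the cutoff cannot have
    # written an original 19th century text.
    if author_death is not None and author_death < CUTOFF_YEAR:
        return True
    # Without usable metadata we keep the document.
    return False


print(should_remove("Œuvres complètes de Voltaire", 1778))
print(should_remove("Le Père Goriot", 1850))
```

As the paragraph above notes, simple rules like these inevitably misclassify some titles, which is the tradeoff we accepted.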

Since the primary task of this project is the identification of reused passages, we used the combined word lists of the Frantext sample and the Encyclopédie as the list of words to index in the TGB for both search and alignment applications. This was done in order to reduce the number of unique words (types) to a manageable level and to ignore all the potential OCR errors by relying on the well-attested word list drawn from our well-curated texts. It did not have an impact on the alignment tasks: since we use exact n-gram matching, a word in the target text that is absent from the source word list could never be part of a match anyway. We retained 193,908 types, amounting to a total of 2.1 billion words (tokens).
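In code, restricting the index to an attested word list amounts to a simple membership test. The sketch below uses an invented three-word vocabulary and invented OCR tokens; the function name is hypothetical.

```python
def restrict_to_vocabulary(tokens, vocabulary):
    """Keep only the tokens attested in the curated word list."""
    return [t for t in tokens if t in vocabulary]


# Hypothetical curated word list and OCR output with two misread words.
vocabulary = {"esprit", "philosophie", "siècle"}
ocr_tokens = ["esprit", "phiiosophie", "siècle", "s1ècle"]

indexed = restrict_to_vocabulary(ocr_tokens, vocabulary)
print(indexed)
```

Storing the vocabulary as a set keeps each lookup at constant time on average, which matters when the test is applied to billions of tokens.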

TextPAIR (Pairwise Alignment of Intertextual Relations)
While the ARTFL Project had built text alignment packages in the past, notably PhiloLine, these were not built for very large-scale comparisons in the range of 100,000+ documents. As such, we wanted to create a new software package that could retain the strengths of PhiloLine while addressing the problem of scalability. Speed and scalability are important since data-mining projects often make progress through multiple runs testing various parameters and settings. Thus it was necessary for us to build a tool that we could rerun multiple times without having to wait for weeks for results to come in, as had been the case with the original implementation of PhiloLine.

The TextPAIR package was written over the course of many months during which the team at the ARTFL Project was in regular contact with the team at OBVIL in order to gather as much feedback as possible during the development phase. Its algorithm is based on the same principle used in PhiloLine, combining an n-gram representation of text with an alignment logic inspired by research in DNA sequencing. The alignment software comes with a web application designed to facilitate the exploration of the text-reuses found during the detection phase. This application includes both a faceted browser and a time series feature.

Detecting identical or similar passages requires a one-to-one document comparison of every text in the dataset. Our new program, called TextPAIR, generates a list of similar passages (based on a set of flexible matching parameters) shared between any two texts. This simple approach allows us to find borrowings and other instances of text reuse, from quotations to uncited passages and paraphrases, over large heterogeneous corpora. In order for TextPAIR to find shared passages, we apply a number of transformations to the texts. For instance, we remove all stopwords, common function words, and short words which tend to be ubiquitous and, thus, are not reliable markers of textual similarity. We also reduce the number of orthographic variants by normalizing spelling where possible, and eliminate all words that occur only once in the dataset. The remaining words are then grouped into units of n-number of words – or n-grams – where each unit overlaps with the preceding and following group. These n-grams form a representation of the text that privileges word rareness over ubiquity, unlike textual representations that retain every single word.
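These transformations can be illustrated with a small sketch. It is a simplified stand-in for the actual preprocessing code: the function names, the minimum word length of 3, the trigram size, and the two toy documents are all assumptions, and spelling normalization is left out.

```python
from collections import Counter


def preprocess(docs, stopwords, min_length=3):
    """Filter the tokens of every document in a small corpus.

    docs maps a document name to its list of tokens.
    """
    # Lowercase, then drop stopwords and short words.
    kept = {
        name: [t.lower() for t in tokens
               if t.lower() not in stopwords and len(t) >= min_length]
        for name, tokens in docs.items()
    }
    # Drop words that occur only once in the whole dataset.
    counts = Counter(t for tokens in kept.values() for t in tokens)
    return {name: [t for t in tokens if counts[t] > 1]
            for name, tokens in kept.items()}


def ngrams(tokens, n=3):
    """Overlapping n-grams: each one shares n-1 words with its neighbor."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus of two invented "documents".
docs = {
    "source": ["Le", "droit", "naturel", "est", "sacré"],
    "target": ["Ce", "droit", "naturel", "reste", "sacré", "partout"],
}
filtered = preprocess(docs, stopwords={"le", "est", "ce"})
print(ngrams(filtered["source"]))
print(ngrams(filtered["target"]))
```

In this toy case both documents reduce to the same trigram once the stopwords and the words that occur only once are gone, even though the original sentences differ.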

Only once we have performed these textual transformations can we start comparing documents to one another. Because it is designed to run on many thousands of texts, TextPAIR’s matching algorithm is relatively simple and straightforward. Any more complex alignment algorithm, such as the Smith-Waterman algorithm, would significantly increase processing time. The basic principle of our text aligner is to compare sequences of n-grams between two documents. Whenever TextPAIR finds matching n-grams, a relatively rare occurrence, it continues comparing until it no longer finds sufficient matching n-grams. It then determines whether the number of contiguous matching n-grams is large enough to constitute a meaningful shared passage.
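The general idea can be sketched as follows. This is a much simplified illustration and not the Go implementation: it advances through both documents in lockstep, so it tolerates substituted n-grams but not insertions or deletions, and the function names and the min_matching and max_gap parameters are assumptions. Single letters stand in for n-grams in the example.

```python
def _extend(source, target, s, t, max_gap):
    """Extend a match starting at source[s] == target[t].

    Stops after more than max_gap consecutive non-matching n-grams.
    Returns (number of matches, source start, source end, target end).
    """
    matches, s_end, t_end, gap = 1, s, t, 0
    i, j = s + 1, t + 1
    while i < len(source) and j < len(target) and gap <= max_gap:
        if source[i] == target[j]:
            matches += 1
            s_end, t_end = i, j
            gap = 0
        else:
            gap += 1
        i += 1
        j += 1
    return matches, s, s_end, t_end


def find_shared_passages(source, target, min_matching=3, max_gap=1):
    """Return (source_start, source_end, target_start, target_end) spans."""
    # Map every source n-gram to the positions where it occurs.
    index = {}
    for pos, gram in enumerate(source):
        index.setdefault(gram, []).append(pos)

    passages = []
    t = 0
    while t < len(target):
        # Try every source position of this n-gram, keep the longest match.
        best = None
        for s in index.get(target[t], []):
            span = _extend(source, target, s, t, max_gap)
            if best is None or span[0] > best[0]:
                best = span
        if best is not None and best[0] >= min_matching:
            _, s_start, s_end, t_end = best
            passages.append((s_start, s_end, t, t_end))
            t = t_end + 1  # resume after the passage just found
        else:
            t += 1
    return passages


source = ["a", "b", "c", "d", "e", "f"]
target = ["x", "c", "d", "e", "y"]
print(find_shared_passages(source, target))
```

Because most n-grams in the target have no counterpart in the source, the dictionary lookup fails quickly for them, and the more expensive extension step only runs on the rare matches.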

The TextPAIR package was built using cutting-edge technologies. Installed as a Python package, it includes a text preprocessing component written in Python, a sequence aligner written in Go to maximize speed and scalability, and a single-page web application written with the VueJS framework to guarantee maximum interactivity when text alignments are deployed in the browser. The package is available as open-source on GitHub, with accompanying documentation meant to assist other research groups in installing and running their own text-reuse experiments.

TextPAIR: General Results and Usage overview
The sequence alignments of the pre-19th century sample of Frantext and the Encyclopédie against the 112,000 documents of the TGB produced a large number of resulting passage pairs, the basic unit of analysis. Figure One shows a typical alignment pair, in this case a passage from the famous Discours Préliminaire reused with some indication of the source in Peignot’s Dictionnaire raisonné de bibliologie. It is important to note that TextPAIR can detect similar passages with considerable variations which can arise from textual insertions, deletions or modifications along with data capture errors, differences in spellings and word order changes. The figure below uses the “Show differences” feature to highlight the variations between the passage pair.
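A word-level comparison of this kind can be approximated with Python's standard difflib module. The sketch below is not how the web application implements “Show differences”; the function name and the two short example passages are invented for illustration.

```python
import difflib


def word_differences(source_passage, target_passage):
    """List the word-level edits that turn one passage into the other."""
    a, b = source_passage.split(), target_passage.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tags are "equal", "replace", "delete" or "insert"; skip "equal".
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


print(word_differences("les sciences et les arts",
                       "les sciences et les beaux arts"))
print(word_differences("il avoit raison", "il avait raison"))
```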

Each record of the result database stores metadata for each document of the pair from the TEI headers, byte locations and offsets in the corresponding text data files, the passages in question, the size of the alignments, and whether or not the alignment is considered banal. In other instances, we have added further data describing the passage pair, including whether or not it was from the Bible and whether it was related to other passages in the set (commonplace tracking). The databases are loaded into a PostgreSQL relational database with a dedicated interface to allow users to query the document pairs, get summary results and navigate to the original documents at will.

The alignment between the Encyclopédie and the TGB resulted in almost 117,000 records. This number is somewhat deceptive since it contains a number of banal alignments, such as the title of the Encyclopédie and other uninteresting similar passages. Similarly, the alignment between the pre-19th century holdings of ARTFL Frantext and the TGB resulted in just under 295,000 passages, which is reduced to over 201,000 passages when removing short and banal passages. Such filtering is among the many features of the alignment result database implementation. The figure below shows the query form of the Encyclopédie to TGB alignment database, which supports metadata queries to allow the user to focus on specific questions, in this case a search for all aligned passages from articles written by Rousseau.

The query returns 611 passages, as shown in the figure below, where the first reused passage in this query is his article Accolade, which is found pretty much verbatim in a dictionary of music from 1825. 

The query interface makes extensive use of facets, allowing the user to get frequencies broken down by different criteria. Breaking down the reuses of Rousseau’s contributions to the Encyclopédie by article, it is interesting to note that while most of Rousseau’s entries in the Encyclopédie were about music, it is his political philosophy article “ECONOMIE” that is most reused in the 19th century. The interface supports the generation of time series graphs of the results. Figure Four shows that reuse of the article “ECONOMIE” was fairly consistent through the 19th century.
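Conceptually, a facet is a frequency count over one metadata field, and a time series is a count per period. The sketch below shows both on three invented records; the field names source_title and target_year are assumptions, and the web application computes these aggregations against the database rather than in memory.

```python
from collections import Counter


def facet_counts(records, field):
    """Number of aligned passages per value of a metadata field."""
    return Counter(r[field] for r in records).most_common()


def time_series(records, interval=10):
    """Number of aligned passages per period of `interval` years."""
    counts = Counter((r["target_year"] // interval) * interval
                     for r in records)
    return sorted(counts.items())


# Invented toy records, not actual results from the alignment databases.
records = [
    {"source_title": "Article A", "target_year": 1805},
    {"source_title": "Article B", "target_year": 1812},
    {"source_title": "Article A", "target_year": 1818},
]
print(facet_counts(records, "source_title"))
print(time_series(records))
```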

The Baron d’Holbach is another interesting case. As one of the philosophes with the most notorious reputation as a free-thinking materialist, he contributed some of the most controversial articles to the Encyclopédie, such as “Représentants” or “Prêtres”. As shown in the figure on the left, however, it is his work on chemistry, mineralogy, and German history that is most reused in the 19th century. Instead of his scandalous article on “prêtres” being cited, you get the rather vanilla article “EVEQUE” which outlines the historical background of elector Bishops under the Holy Roman Empire. In fact, not one reuse of d’Holbach’s controversial material was found in the TGB, which sheds new light on our vision of d’Holbach as not simply an atheist propagandist, but as a man of science whose articles in various domains continued to be cited and used well into the 19th century. This is an image of d’Holbach that rarely, if ever, occurs in modern intellectual and literary histories.

Algorithms and experiments
We believe that we can begin to use these techniques and these sorts of large-scale databases to refashion literary history and to give a more expansive vision of literary culture, by identifying various forms of intertextual activity, from reuse to referencing, in a broadened set of 18th-century corpora, and by making use of various visualisation tools to navigate the output. In the context of this grant, we decided to concentrate on reuses of the Encyclopédie in the 19th century. While our interpretive work on this set of reuses is still in its initial phases, we have already been able to identify significant findings that change our understanding of the impact of this great collective work on the 19th century.

We went into this project with the hypothesis that the engin de guerre of the Enlightenment had little to no impact in the 19th century. This was based on the long-held general opinion on the subject, but it was also backed up by our initial experiments on the ARTFL Frantext corpus of works. However, when we moved from this limited corpus to the large-scale TGB corpus, we moved from an exploration of what might be considered as a representative canon of “great works” of the 19th century to what in its vastness might be considered as something coming closer to a representation of a general cultural system.

This change in scale led us immediately to note the huge reuse of the Encyclopédie in the genre of dictionaries and encyclopedias published in the nineteenth century. In this area, the Encyclopédie was used as both a model and a source of information. But, more generally, the reuse of the Encyclopédie was more widespread across a broader range of publications than we had expected. So, from this point of view, in spite of the great developments in the sciences in the 19th century, the Encyclopédie remains an important source of information.

On the other hand, the articles that are most often cited in today’s discussions of the Encyclopédie, those heavily ideological articles laying out the aims and goals, those that make us see the Encyclopédie as an engin de guerre for the philosophes, are cited less often than we expected. Thus an author like d’Holbach is rarely reprised in the context of his specifically materialistic articles and more for articles he wrote on mineralogy and chemistry. All of this is to say that the Encyclopédie did have a significant impact in the 19th century, but it was not that which we had expected.

This work is just beginning and we will soon begin to look more closely at the bigger picture – not just the Encyclopédie in the TGB, but all of our various 18th century holdings (including the 18th century texts contained in the TGB corpus itself) – to broaden our understanding of the reuse of 18th century texts in the post-Revolutionary era of the 19th century.

Direct Outcomes of this project
This project resulted in a number of related deliverables. Most important is the open source distribution of TextPAIR, as this provides a new model for handling very large scale alignment tasks.

The importance of this new software is underlined by the ARTFL Project release of a build of the Newberry French Revolution Collection which includes an open release of an alignment database of ARTFL's pre-Revolutionary collection and the more than 26,000 Revolutionary documents. This allows scholars to look directly at the long-standing question of the relationship between the Enlightenment and the Revolution.

The second and equally important deliverable from this collaborative work is the publication at ARTFL of both alignment databases as described above. These are complete installations of the alignment databases except that we have disabled links to the full texts of underlying datasets owing to agreements with various collaborators.

Home page of our alignment databases: http://artfl-project.uchicago.edu/legacy_eighteenth

ARTFL Encyclopédie to TGB alignment database: https://artflsrv03.uchicago.edu/text-align/encyc_vs_TGB_0803/

The ARTFL-Frantext to TGB alignment database: https://artflsrv03.uchicago.edu/text-align/frantext_vs_TGB_0803/


Table One: Frequency of authors (shown with dates) in the Frantext Sample

Voltaire, 1694-1778.                                              85
Diderot, Denis, 1713-1784.                                        45
Corneille, Pierre, 1606-1684.                                     37
Molière, 1622-1673.                                               34
Aulnoy, Madame d'(Marie-Catherine), 1650 or 51-1705.              31
Fontenelle, M. de (Bernard Le Bovier), 1657-1757.                 23
Marivaux, Pierre Carlet de Chamblain de, 1688-1763.               22
Bossuet, Jacques Bénigne, 1627-1704.                              21
Saint-Simon, Louis de Rouvroy, duc de, 1675-1755                  20
Rousseau, Jean-Jacques, 1712-1778.                                17
Mersenne, Marin, 1588-1648.                                       16
Charrière, Isabelle de, 1740-1805.                                14
Fénelon, François de Salignac de La Mothe-, 1651-1715.            13
Montesquieu, Charles de Secondat, baron de, 1689-1755.            13
Prévost, abbé, 1697-1763.                                         13
Racine, Jean, 1639-1699.                                          13
La Fontaine, Jean de, 1621-1695.                                  11
Marot, Clément                                                    11
Balzac, Jean-Louis Guez, seigneur de, 1597-1654.                  10
Du Bellay, Joachim                                                10
Scudéry, M. de (Georges), 1601-1667.                              10

Table Two: Top Authors removed from TGB

Voltaire (1694-1778)                                             249
Molière (1622-1673)                                              243
Racine, Jean (1639-1699)                                         139
Corneille, Pierre (1606-1684)                                    132
La Fontaine, Jean de (1621-1695)                                 129
Chateaubriand, François-René de (1768-1848)                      112
Scott, Walter (1771-1832)                                        105
Boileau, Nicolas (1636-1711)                                     100
Fénelon, François de (1651-1715)                                  96
Scribe, Eugène (1791-1861)                                        84
Rousseau, Jean-Jacques (1712-1778)                                72
Rollin, Charles (1661-1741)                                       69
Diderot, Denis (1713-1784)                                        64
Louis (1755-1824)                                                 63
Florian, Jean-Pierre Claris de (1755-1794)                        60
Marmontel, Jean-François (1723-1799)                              58
Prévost, Antoine François (1697-1763)                             57
Sévigné, Marie de Rabutin-Chantal (1626-1696)                     56
Bachaumont, Louis Petit de (1690-1771)                            55
Cicéron (0106-0043 av. J.-C.)                                     55

PhiloLogic4: The Big Picture

While Clovis and I continue to document various aspects of PhiloLogic4's architecture and design, it may be helpful to keep in mind a sort of top-level "bird's-eye view" of the system as a whole.  PhiloLogic does a huge number of different things at different times, and it can be very difficult to keep them all organized. My best attempt to convey it in a single diagram is below:

As with PhiloLogic3, the foundation of all PhiloLogic services is a set of C functions, which are now collected together in a library called "libphilo", contained in the main PhiloLogic4 GitHub repository.  These provide the high-performance compression, indexing, and search algorithms that distinguish PhiloLogic from most other XML and database technologies.

This C library is the building block upon which all of PhiloLogic4's Python library classes are built.  The two most important are
  1. the Loader class, which controls parsing and indexing TEI XML files, and 
  2. the DB class, which governs all access to a PhiloLogic database.  

These classes themselves make use of other classes, most of which appear in the diagram above; it's extremely important to note that the Loader and the DB share almost no behaviors or components.  

This separation is a point of departure from most other database systems: in PhiloLogic4, the set of components that produce a database is distinct from the set of components that query an existing database.  We refer to the time when XML documents are ingested and indexed as load-time, and the time when a user queries the database as run-time or query-time.

Although one of the original design goals of PhiloLogic4 was to focus on the development of a more generalized library for TEI processing, it became clear at some point that a set of general behaviors was not enough, and that pragmatic development required two additional components:
  1. a general-purpose document-ingesting script, capable of handling errors and ambiguity, and
  2. a readymade web application suitable for most purposes, and customizable for others

These components were built as applications making use of the standard library components, and allow a PhiloLogic developer to specify all text- and language-specific features without modification of any shared functions.

The load_script has been described already in a previous post, but it is worth revisiting in this broader context.  The load script is responsible for three fundamental tasks:
  1. taking command-line arguments from the user, and passing all the supplied files into the loader class, along with additional parameters
  2. storing all system-specific configuration parameters: hostname, filesystem locations, etc.
  3. storing all text-specific configuration parameters: XPaths, tokenization regexes, special filters, etc.
When the load script has finished running, it moves the loaded database into an appropriate path in the web server's document tree, and creates a web application around it.  This is the very same web application described in Clovis's recent post.  It is created by copying a set of files stored elsewhere, typically in the PhiloLogic4 install directory, although specifying another set of files to "clone" from is possible.  It is important to note that, by convention, we refer to the web application together with the database that it accesses as a "database", as one almost never exists without the other, and this is reflected in the diagram above.  

The behavior of such a database/application is just as Clovis described it: all queries go to one of several "report generators", which interpret query parameters and access the database accordingly.  They produce a result object, a Python object that maps very closely to a JSON object--that is, a single dictionary literal consisting of other literals, without functions, tuples, lambdas, objects, and other such structures that cannot be expressed in JSON.  This result object is then passed on to a Mako template file, which can transform the result into HTML viewable by a web browser, which is finally returned to the user--"finally" usually meaning under 100 milliseconds, of course.
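The flow from report generator to result object to rendered HTML can be sketched with the standard library alone. Nothing here is PhiloLogic4 code: the function names and the shape of the result object are assumptions, the hits are passed in rather than fetched from a database, and string.Template stands in for Mako only to keep the example dependency-free.

```python
import html
import json
from string import Template


def concordance_report(query, hits):
    """Hypothetical report generator building a JSON-compatible result."""
    return {
        "query": query,
        "total": len(hits),
        # Only dictionaries, lists, strings and numbers: no tuples,
        # functions or custom objects.
        "results": [{"author": a, "context": c} for a, c in hits],
    }


# Stand-in for a Mako template file.
PAGE = Template("<h1>Results for $query ($total)</h1><ul>$items</ul>")


def render(result):
    """Turn a result object into an HTML fragment."""
    items = "".join(
        f"<li>{html.escape(r['author'])}: {html.escape(r['context'])}</li>"
        for r in result["results"]
    )
    return PAGE.substitute(query=html.escape(result["query"]),
                           total=result["total"], items=items)


result = concordance_report("esprit", [("Author A", "un esprit libre")])
# The result object survives a round trip through JSON unchanged.
round_trip = json.loads(json.dumps(result))
page = render(result)
print(page)
```

Keeping the result object free of anything that JSON cannot express means the same report can feed either an HTML template or a client that consumes JSON directly.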

Over the coming months, Clovis and I will be describing many of these components in detail, and this post may be updated as this larger documentation project proceeds; but for now, I hope it serves as a helpful overview of PhiloLogic4.

General Overview of PhiloLogic4's Web Architecture

Very early in the development of PhiloLogic4, we decided to separate the core library (the C core and Python bindings) from the actual Web interface. While there is still a clear separation between the Web environment and the library code, the two are nevertheless interdependent, which is to say that one cannot function independently of the other (unless you intend to use the library functions on the command line...)

As such, the Web component of PhiloLogic4 was designed as a Web Application, and each database functions as its own individual Web App. This allows for greater flexibility and customization. With PhiloLogic4, we wanted the Web layer to be the only part of the code a database developer has to deal with. We even went so far as to offer configuration options that drastically change the behavior of our various utilities. Before I start diving into each individual component (in later posts), I wanted to give a general picture of the Web app, as well as an idea of its features and flexibility.

The application is at its core a Python WSGI app which handles (most) requests through a dispatcher.py script that interprets queries and reroutes them to the relevant parts of the application. The results of requests are rendered in HTML thanks to the use of Mako, a powerful and easy-to-use template library. A description of the general layout of the Web App will give a better idea of how the PhiloLogic4 Web App functions.

There are four distinct sections (besides CSS and JS resources) inside the application:
  • The reports directory, which contains the major search reports which fetch data from the database by interfacing with the core library, and then return a specialized results report. These reports include concordance, KWIC (Key Word In Context), collocation, and time series. 
  • The functions directory, which contains all of the generic functions used by individual reports. These functions include parsing the query string, loading web configuration options, access control, etc. 
  • The templates directory, which contains the Mako templates that render the results of each report as HTML.
  • The scripts directory, which contains standalone CGI scripts that are called directly from Javascript code on the client side. These functions bypass the dispatcher and have a very specialized purpose, such as returning the total number of hits for any given query, or switching from a concordance display to a KWIC display.
The first three directories contain all that is necessary to return initial results to the client. The CGI scripts contained in /scripts provide additional functionality made possible by the use of Javascript in our Web Client. Significant work has been done to provide a dynamic and interactive Web interface, and this was made possible via heavy use of Javascript throughout the application, something which I'll describe in greater detail in another post.
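
As a rough illustration of the dispatcher idea described above, here is a minimal WSGI sketch. It is not the actual dispatcher.py: the report functions, the "report" and "q" query parameters, and the REPORTS table are all hypothetical names chosen for the example.

```python
import json
from urllib.parse import parse_qs

# Hypothetical report generators: each takes the parsed query parameters
# and returns a plain result dict (stand-ins for the modules in reports/).
def concordance(params):
    return {"report": "concordance", "query": params.get("q", "")}

def kwic(params):
    return {"report": "kwic", "query": params.get("q", "")}

# Hypothetical routing table from report name to report generator.
REPORTS = {"concordance": concordance, "kwic": kwic}

def application(environ, start_response):
    """Minimal WSGI entry point that routes a request to a report."""
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    # Keep only the first value of each query parameter.
    params = {key: values[0] for key, values in qs.items()}
    report = REPORTS.get(params.get("report", "concordance"))
    if report is None:
        status = "404 Not Found"
        body = b"unknown report"
    else:
        status = "200 OK"
        body = json.dumps(report(params)).encode("utf-8")
    start_response(status, [("Content-Type", "text/plain; charset=utf-8"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

Any WSGI server can host a callable like this; the point is only that a single entry point interprets the query and hands it to the matching report.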

Another design decision we made, somewhat late in the development process, was to rely on a CSS/JS framework for the layout of our HTML. We decided to use Bootstrap for its flexibility and responsiveness. As a result, PhiloLogic4 should work on any screen, be it phone, tablet or computer, although some functionality (such as KWIC reports) is hidden on smaller screens due to the limited space available.

Finally--and I will go into much further detail in a separate post--we've designed a RESTful API that provides access to the full functionality of our web app. This is made possible by delaying for as long as possible the process of choosing to render search results as HTML or JSON. Basically, we expose the same results object to the HTML renderer (the Mako templates) that we do to any potential client. This design feature has allowed us to build a PhiloReader Android client application, focused on reading, by calling the relevant APIs needed for such functionality.
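
The "decide late between HTML and JSON" idea can be sketched as follows. This is only an illustration under stated assumptions: string.Template stands in for Mako so the example has no dependencies, and the result dict, its keys, and the render function are hypothetical rather than PhiloLogic4's actual objects.

```python
import json
from html import escape
from string import Template

# Stand-in for a Mako template (assumption: the real app uses Mako files).
ROW = Template("<li>$title: $context</li>")

def render(result, fmt="html"):
    """Render the same JSON-compatible result object as HTML or as JSON."""
    if fmt == "json":
        # API clients receive the result object unchanged.
        return json.dumps(result)
    rows = "".join(
        ROW.substitute(title=escape(hit["title"]), context=escape(hit["context"]))
        for hit in result["results"]
    )
    return "<ul>" + rows + "</ul>"

# Hypothetical result object containing only JSON-expressible literals.
result = {"query": "raison",
          "results": [{"title": "Document A", "context": "... la raison ..."}]}
```

Because the result object contains nothing but literals, the JSON branch needs no conversion step, which is what makes it cheap to serve a browser and an API client from the same report.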

In my next post on the Web Application, I will go through the various configuration options available. 

PhiloLogic4 Load Script Architecture

Clovis and I have been doing a great deal of work lately on PhiloLogic4's document-loading process, and I feel that it's matured enough to start documenting in detail.  The best place to start is with the standard PhiloLogic4 load_script.py, which you can look at on github if you don't have one close at hand:


The load script works more or less like the old philoload script, with some important differences:

  1. The load script is not installed system-wide--you generally want to keep it near your data, with any other scripts. 
  2. The load script has no global configuration file--all configuration is kept separate in each copy of the script that you create.
  3. The PhiloLogic4 Parser class is fully configurable from the load script--you can change any Xpaths you want, or even supply a replacement Parser class if you need to.
  4. The load script is designed to be short, and easy to understand and modify.
The most important pieces of information in any load script are the system setup variables at the top of the file.  These will give immediate errors if they aren't set up right.  

database_root is the filesystem path to the web-accessible directory where your PhiloLogic4 database will live, like /var/www/philologic/. Your webserver process will need read access to it, and you will need write access to create the database. Don't forget to keep the slash at the end of the directory, or you'll get errors.  

url_root is the HTTP URL that the database_root directory is accessible at: http://your.server.com/philologic/ could be a reasonable mapping of the example above, but it will depend on your DNS setup, server configuration, and other hosting issues outside the scope of this document.

template_dir, which defaults to database_root + "/_system_dir/_install_dir/", is the directory containing all the scripts, reports, templates, and stylesheets that make up a PhiloLogic4 database application.  If you have customized behaviors or designs that you want reflected in all of the databases you build, you can keep those templates in a directory on their own where they won't get overwritten.  

(At the moment, you can't "clone" the templates from an existing database, because the actual database content can be very large, but we'd very much like to implement that feature in the future to allow for easy reloads.)
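
To make the three setup variables concrete, here is a small sketch using the example values from the text. The check_setup helper is hypothetical and not part of the load script; it only encodes the constraints mentioned above, such as the trailing slash.

```python
# Example values taken from the text; adjust them to your own server.
database_root = "/var/www/philologic/"
url_root = "http://your.server.com/philologic/"
# Default quoted in the text (the doubled slash is harmless on POSIX systems).
template_dir = database_root + "/_system_dir/_install_dir/"

def check_setup(database_root, url_root):
    """Return a list of problems with the setup variables (empty if none)."""
    problems = []
    if not database_root.startswith("/"):
        problems.append("database_root should be an absolute filesystem path")
    if not database_root.endswith("/"):
        problems.append("database_root must end with a slash")
    if not url_root.startswith(("http://", "https://")):
        problems.append("url_root should be an HTTP URL")
    return problems
```

Running a check like this before loading is optional, but it catches the missing trailing slash before the loader does.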

Most of the rest of the file is configuration for the Loader class, which does all of the real work, but the config is kept here, in the script, so you don't have to maintain custom classes for every database. 

For now, it's just important to know what options can be specified in the load script:
  1. default_object_level defines the type of object returned for the purpose of most navigation reports--for most databases, this will be "doc", but you might want to use "div1" for dictionary or encyclopedia databases.
  2. navigable_objects is a list of the object types stored in the database and available for searching, reporting, and navigation--("doc","div1","div2","div3") is the default, but you might want to append "para" if you are parsing interesting metadata on paragraphs, like in drama.  Pages are handled separately, and don't need to be included here.
  3. filters and post_filters are lists of loader functions--their behavior and design will be documented separately, but they are basically lists of modular loader functions to be executed in order, and so shouldn't be modified carelessly.
  4. plain_text_obj is a very useful option that generates flat text file representations of all objects of a given type, like "doc" or "div1", usually for data mining with Mallet or some other tool.
  5. extra_locals is a catch-all list of extra parameters to pass on to your database later, if you need to--think of it as a "swiss army knife" for passing data from the loader to the database at run-time.
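
As a sketch, the options above might look like this for a drama collection. The values for default_object_level and navigable_objects come from the text; the exact form of plain_text_obj (a list of object types) is an assumption, and filters, post_filters, and extra_locals are left out because their formats are not described here.

```python
# Object type returned by most navigation reports ("div1" for dictionaries).
default_object_level = "doc"

# Default object types, as given in the text.
navigable_objects = ("doc", "div1", "div2", "div3")

# For drama, also index paragraphs so their metadata can be searched.
navigable_objects = navigable_objects + ("para",)

# Assumption: a list of object types to dump as flat text files.
plain_text_obj = ["doc"]
```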
The next section of the load script is setup for the XML Parser:

This is a bit complex, and will be explored in depth in a separate post, but the basic layout is this:
  1. xpaths is a list of 2-tuples that maps philologic object types to absolute XPaths--that is, XPaths evaluated where "." refers to the TEI document root element.  You can define multiple XPaths for the same type of object, but you will get much better and more consistent results if you do not.
  2. metadata_xpaths is a list of 3-tuples that map one or more XPaths to each metadata field defined on each object type.  These are evaluated relative to whatever XML element matched the XPath for the object type in question--so "." here refers to a doc, div1, or paragraph-level object somewhere in the xml.
  3. pseudo_empty_tags is a very obscure option for things that you want to treat as containers, even if they are encoded as self-closing tags.  
  4. suppress_tags is a list of tags in which you do not want to perform tokenization at all--that is, no words in them will be searchable via full-text search.  It does not prohibit extracting metadata from the content of those tags.
  5. word_regex and punct_regex are regular expression fragments that drive our tokenizer.  Each needs to consist of exactly one capturing subgroup so that our tokenizer can use them correctly. They are both fully unicode-aware--usually, the default \w class is fine for words, but in some cases you may need to add apostrophes and such to the word pattern.  Likewise, the punctuation regex pattern fully supports multi-byte utf-8 punctuation.  In both cases you should enter characters as unicode code points, not utf-8 byte strings.
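
To show why each fragment needs exactly one capturing subgroup, here is a minimal tokenizer sketch. It is not the PhiloLogic4 tokenizer: the two patterns are example values, and joining them with "|" is an assumption about how such fragments could be combined.

```python
import re

# Example fragments, each with exactly one capturing group.
word_regex = r"([\w']+)"       # \w plus the apostrophe
punct_regex = r"([.,;:?!])"    # a small set of punctuation marks

# With one group per fragment, group 1 is always a word, group 2 punctuation.
token_re = re.compile(word_regex + "|" + punct_regex, re.UNICODE)

def tokenize(text):
    """Yield (kind, token, character offset) for each word or punctuation mark."""
    for match in token_re.finditer(text):
        if match.group(1) is not None:
            yield ("word", match.group(1), match.start(1))
        else:
            yield ("punct", match.group(2), match.start(2))
```

If a fragment contained two capturing groups, the group numbers would shift and the tokenizer could no longer tell words from punctuation, which is why the one-group rule matters.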
The next section consists of just a few scary incantations that shouldn't be modified:

But the following 2 sections are where all the work gets done, and an important place to perform modifications.   First, we construct the Loader object, passing it all the configuration variables we have constructed so far:

Then we operate the Loader object step-by-step:

And that's it!  

Usually, these load functions should all be executed in the same order, but it is worth paying special attention to the load_metadata variable that is constructed right before l.parse_files is called.  This variable controls the entire parsing process, and is incredibly powerful.  Not only does it let you define any order in which to load your files, but you can also supply any document-level metadata you wish, and change the xpaths, load_filters, or parser class used per file, which can be very useful on complex or heterogeneous data sets.  However, this often requires either some source of stand-off metadata or pre-processing/parsing stage.  
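
Here is a sketch of how stand-off metadata could be merged into such a variable. It assumes load_metadata is a list of dicts, one per file, with a "filename" key; that structure, the "date" field, and the build_load_metadata helper are all assumptions made for illustration.

```python
import os

def build_load_metadata(files, standoff):
    """Build one metadata dict per file and sort them into load order.

    files    -- paths of the documents to load
    standoff -- hypothetical dict mapping file names to document-level metadata
    """
    load_metadata = []
    for path in files:
        name = os.path.basename(path)
        entry = {"filename": name}
        entry.update(standoff.get(name, {}))
        load_metadata.append(entry)
    # Load order: by date string, then file name; files without a date come first.
    load_metadata.sort(key=lambda entry: (entry.get("date", ""), entry["filename"]))
    return load_metadata
```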

For this purpose, we've added a powerful new Loader function called sort_by_metadata which integrates the functions of a PhiloLogic3 style metadata guesser and sorter, while still being modular enough to be replaced entirely when necessary.  We'll describe it in more detail in a later post, but for now, you can look at the new artfl_load_script to get a sense of how to construct a more robust, fault-tolerant loader using this new function.


Up next: the architecture of the PhiloLogic Loader class itself.

shlax and ElementTree

I've just pushed a few commits to the central philo4 repository;
mostly small bugfixes to the makefile and the parser, but I added a convenience method to the shlax XML parser.

As you may know, Python has a really nice XML library called ElementTree, but it has a few quirks:
1) it uses standard, "fussy" XML parsers that choke on the slightest flaw, and
2) it has a formally correct but incomprehensible approach to namespaces that is exceedingly impractical for day-to-day TEI hacking.

In this update, I've added a shlaxtree module to the philo4 distribution that hooks our fault-tolerant, namespace-agnostic XML parser up to ElementTree's XPath evaluator and serialization facilities. It generally prefers the 1.3 version of ElementTree, which is standard in python 2.7, but is a simple install in 2.6 and 2.5.

Basically, the method philologic.shlaxtree.parse() will take in a file object, and return the root node of the xml document in the file, assuming it found one. You can use this to make a simple bibliographic extractor like so:

#!/usr/bin/env python
import philologic.shlaxtree as st
import sys
import codecs

for filename in sys.argv[1:]:
    # parse each file and print its TEI header, title, and author
    infile = codecs.open(filename,"r","utf-8")
    root = st.parse(infile)
    header = root.find("teiHeader")
    print(st.et.tostring(header))
    print(header.findtext(".//titleStmt/title"))
    print(header.findtext(".//titleStmt/author"))

Not bad for 10 lines, no? What's really cool is that you can modify trees, nodes, and fragments before writing them out, with neat recursive functions and what not. I've been using it for converting old SGML dictionaries to TEI--once you get the hang of it, it's much easier than regular expressions, and much easier to maintain and modify as well.
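
As an illustration of the kind of recursive tree rewriting mentioned above, here is a sketch that uses the standard library's ElementTree directly rather than shlaxtree, so it only accepts well-formed input. The tag names and the mapping are hypothetical, not taken from an actual dictionary conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from legacy tag names to TEI-style names.
RENAME = {"entry": "entryFree", "hw": "orth"}

def rename_tags(elem, mapping):
    """Recursively rename an element and all of its descendants in place."""
    elem.tag = mapping.get(elem.tag, elem.tag)
    for child in elem:
        rename_tags(child, mapping)
    return elem

root = ET.fromstring("<dict><entry><hw>chat</hw><def>cat</def></entry></dict>")
rename_tags(root, RENAME)
converted = ET.tostring(root, encoding="unicode")
```

Text and attributes are untouched; only the element names change, which is the sort of edit that is awkward to do reliably with regular expressions.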

shlax: a shallow, lazy XML parser in python

Recently, I stumbled upon a paper from the dawn age of XML:

"REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron

It describes how to do something I'd never seen done before: parse the entirety of standard XML syntax in a single regular expression.

We've all written short regexes to find some particular feature in an XML document, but we've also all seen those fail because of oddities of whitespace, quoting, linebreaks, etc., that are perfectly legal, but hard to account for in a short, line-by-line regular expression.

Standard XML parsers, like expat, are fabulous, well maintained, and efficient. However, they have a common Achilles heel: the XML standard's insistence that XML processors "MUST" report a fatal error if a document contains unbalanced tags. For working with HTML or SGML based documents, this is disastrous!

In contrast, Cameron's regex-based parser is extremely fault-tolerant--it extracts as much structure from the document as possible, and reports the rest as plain text. Further, it supports "round-tripping": the ability to exactly re-generate a document from parser output, which standard parsers typically lack. As a corollary of this property, it becomes possible to report absolute byte offsets, which is a "killer feature" for the purposes of indexing.

Because of all these benefits, I've opted to translate his source code from javascript to python. I call my modified implementation "shlax" [pronounced like "shellacs", sort of], a shallow, lazy XML parser. "Shallow" meaning that it doesn't check for well-formedness, and simply reports tokens, offsets, and attributes as best it can. "Lazy" meaning that it iterates over the input, and yields one object at a time--so you don't have to write 8 asynchronous event handlers to use it, as in a typical SAX-style parser. This is often called a "pull" parser, but "shpux" doesn't sound as good, does it?

If you're interested, you can look at the source at the libphilo github repo. The regular expression itself is built up over the course of about 30 expressions, to allow for maintainability and readability. I've made some further modifications to Cameron's code to fit our typical workflow. I've buffered the text input, which allows us to iterate over a file-handle, rather than a string--this saves vast amounts of memory for processing large XML files, in particular. And I return "node" objects, rather than strings, that contain several useful items of information:
  1. the original text content of the node
  2. the "type" of the node: text, StartTag, EndTag, or Markup [for DTDs, comments, etc.]
  3. any attributes the node has
  4. the absolute byte offset in the string or file
You don't need anything more than that to power PhiloLogic. If you'd like to see an example of how to use it, take a look at my DirtyParser class, which takes as input a set of xpaths to recognize for containers and metadata, and outputs a set of objects suitable for the index builder I wrote about last time.
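
The shallow, lazy approach can be sketched in a few lines. This is a toy: the regular expression is a deliberately tiny stand-in for Cameron's REX grammar, not the shlax source, and the Node tuple and shallow_parse name are made up for the example. It also works on a string, so the offsets are character offsets, where the real parser reports byte offsets.

```python
import re
from collections import namedtuple

# The four pieces of information listed above.
Node = namedtuple("Node", "content type attributes offset")

# Comments, other markup, tags, runs of text, and finally a stray "<".
# Every character of the input falls into exactly one token.
TOKEN = re.compile(r"<!--.*?-->|<[!?][^>]*>|</?[^<>]+>|[^<]+|<", re.S)
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*("[^"]*"|'[^']*')""")

def shallow_parse(text):
    """Lazily yield one Node per token, without checking well-formedness."""
    for match in TOKEN.finditer(text):
        token = match.group(0)
        attributes = {}
        if token.startswith("<!") or token.startswith("<?"):
            kind = "Markup"
        elif token.startswith("</") and token.endswith(">"):
            kind = "EndTag"
        elif token.startswith("<") and token.endswith(">"):
            kind = "StartTag"
            # Strip the surrounding quotes from each attribute value.
            attributes = {name: value[1:-1] for name, value in ATTR.findall(token)}
        else:
            kind = "text"
        yield Node(token, kind, attributes, match.start())
```

Because every input character ends up in exactly one token, joining the content of all nodes reproduces the input exactly, which is the round-tripping property. Unbalanced tags are simply reported as they appear instead of raising an error.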

Oh, and about performance: shlax is noticeably slower than Mark's perl loader. I've tried to mitigate that in a variety of ways, but in general, python's regex engine is not as fast as perl's. On the other hand, I've recently had a lot of success with running a load in parallel on an 8-core machine, which I'll write about when the code settles. That said, if efficiency is a concern, our best option would be to use well-formed XML with a standard parser.

So, my major development push now is to refactor the loader into a framework that can handle multiple parser backends, flexible metadata recognizers, and multiple simultaneous parser processes. I'll be posting about that as soon as it's ready.

A Unified Index Construction Library

I've spent the last two weeks replacing PhiloLogic's index-construction routines, following my prior work on the query and database interfaces.

The legacy index-packing code dates back to sometime before PhiloLogic 2, and is spread over 3 executable programs linked together by a Makefile and some obscure binary state files.

Unfortunately, the 3 programs all link to different versions of the same compression library, so they couldn't simply be refactored and recompiled as a single unit.

Instead, I worked backwards from the decompression routines I wrote last month, to write a new index construction library from scratch.

Thus, I had the luxury of being able to define an abstract, high-level interface that meets my four major goals:

1) simple, efficient operation
2) flexible enough for various index formats
3) easy to bind to other languages
4) fully compatible with 3-series PhiloLogic

The main loop is below. It's pretty clean. All the details are handled by a hit-buffer object named "hb" that does compression, memory management, and database interfacing.
while(1) {
    // as long as we read lines from standard input.
    if (fgets(line,511,stdin) == NULL) {
        break; // stop at end of input.
    }
    // scan for hits in standard Philo3 format.
    state = sscanf(line,
                   "%s %d %d %d %d %d %d %d %d %s\n",
                   word, &hit[0],...);

    if (state == 10) {
        // if we read a valid hit
        if ((strcmp(word,hb->word))) {
            //if we have a new word...
            hitbuffer_finish(hb); // write out the current buffer.
            hitbuffer_init(hb, word); // and reinitialize
            uniq_words += 1LLU; //LLU for a 64-bit unsigned int.
        }
        hitbuffer_inc(hb, hit); //add the hit to whichever word you're on.
        totalhits += 1LLU;
    }
    else {
        fprintf(stderr, "Couldn't understand hit.\n");
    }
}

The code is publicly available on github, but I'm having some problems with their web interface. I'll post a link once it's sorted out.

Vector Processing for OHCO

I've posted an expanded version of my CI Days talk on Google docs. I'd recommend looking at the speaker notes (click "actions" on the bottom left) since I won't be narrating it in person.

The presentation is an attempt to describe, somewhat formally, how PhiloLogic is capable of performing as well as it does. This comes from spending three years learning how Leonid's search core works, and attempting to extend and elucidate whatever I can. It's also the intellectual framework that I'm using to plan new features, like search on line and meter position, metadata, joins, etc. Hopefully, I can get someone who's better at math than I am to help me tighten up the formalities.

Basically, I refer to the infamous OHCO thesis as a useful axiom for translating the features of a text into a set of numerical objects, and then compare the characteristics of this representation to XML or Relational approaches. I'd love to know how interesting/useful/comprehensible others find the presentation, or the concept. What needs more explanation? What gets tedious?

If you look at the speaker notes, you can see me derive a claim that PhiloLogic runs 866 times faster than a relational database for word search. Math is fun!

PhiloLogic proto-binding for Python


In an earlier post, I mentioned that I'd try to call the philologic C routines via ctypes, a Python Foreign Function Interface library. I did, and it worked awesomely well! Ctypes lets you call C functions from python without writing any glue at all in some cases, giving you access to high-performance C routines in a clean, modern programming language. We'd ultimately want a much more hand-crafted approach, but for prototyping interfaces, this is a very, very useful tool.

First, I had to compile the search engine as a shared library, rather than an executable:

gcc -dynamiclib -std=gnu99 search.o word.o retreive.o level.o gmap.o blockmap.o log.o out.o plugin/libindex.a db/db.o db/bitsvector.o db/unpack.o -lgdbm -o libphilo.dylib

All that refactoring certainly paid off. The search4 executable will now happily link against the shared library with no modification, and so can any other program that wants high-speed text object search:


import sys,os
from ctypes import *
from ctypes.util import find_library

# First, we need to get the C standard library loaded in
# so that we can pass python's input on to the search engine.
stdlib = CDLL(find_library("c"))
stdlib.fdopen.restype = c_void_p  # fdopen returns a FILE*, not an int
stdin = stdlib.fdopen(sys.stdin.fileno(),"r")
# Honestly, that's an architectural error.
# I'd prefer to pass in strings, not a file handle

# Now load in philologic from a shared library
libphilo = cdll.LoadLibrary("./libphilo.dylib")

# Give it a path to the database. The C routines parse the db definitions.
db = libphilo.init_dbh_folder("/var/lib/philologic/databases/mvotest5/")

# now initialize a new search object, with some reasonable defaults.
s = libphilo.new_search(db,"phrase",None,1,100000,0,None)

# Read words from standard input.

# Then dump the results to standard output.
# Done.

That was pretty easy, right? Notice that there weren't any boilerplate classes. I could hold pointers to arbitrary data in regular variables, and pass them directly into the C subroutines as void pointers. Not safe, but very, very convenient.

Of course, this opens us up for quite a bit more work: the C library really needs a lot more ways to get data in and out than a pair of input/output file descriptors, I would say. In all likelihood, after some more experiments, we'll eventually settle on a set of standard interfaces, and generate lower-level bindings with SWIG, which would allow us to call philo natively from Perl or PHP or Ruby or Java or LISP or Lua or...anything, really.

Ctypes still has some advantages over automatically-generated wrappers, however. In particular, it lets you pass python functions back into C, allowing us to write search operators in python, rather than C--for example, a metadata join, or a custom optimizer for part-of-speech searching. Neat!
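
Passing a Python function back into C can be shown without libphilo at all. The sketch below uses the C standard library's qsort as a stand-in for a search routine that accepts a callback; it assumes a POSIX system where the C library can be located, and the c_sort and py_cmp names are made up for the example.

```python
from ctypes import CDLL, CFUNCTYPE, POINTER, c_int, c_size_t, c_void_p, sizeof
from ctypes.util import find_library

# Load the C standard library (assumption: a POSIX system).
libc = CDLL(find_library("c"))

# C signature of the comparison callback: int (*)(const int *, const int *)
CMPFUNC = CFUNCTYPE(c_int, POINTER(c_int), POINTER(c_int))

def py_cmp(a, b):
    """Comparison written in Python, called from inside C's qsort."""
    return (a[0] > b[0]) - (a[0] < b[0])

def c_sort(values):
    """Sort a list of ints with C's qsort, using the Python callback above."""
    array = (c_int * len(values))(*values)
    libc.qsort.restype = None
    libc.qsort.argtypes = [c_void_p, c_size_t, c_size_t, CMPFUNC]
    libc.qsort(array, len(array), sizeof(c_int), CMPFUNC(py_cmp))
    return list(array)
```

A search operator written in Python would follow the same pattern: declare the callback's C signature with CFUNCTYPE, wrap the Python function, and hand it to the C routine that expects a function pointer.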
