ARTFL Project Research Blog

PhiloLogic4: The Big Picture

Richard Wednesday, December 17, 2014 Leave a Comment

While Clovis and I continue to document various aspects of PhiloLogic4's architecture and design, it may be helpful to keep in mind a sort of top-level "bird's-eye view" of the system as a whole. PhiloLogic does a huge number of different things at different times, and it can be very difficult to keep them all organized. My best attempt to convey it in a single diagram is below:

As with PhiloLogic3, the foundation of all PhiloLogic services is a set of C functions, which are now collected together in a library called "libphilo", contained in the main PhiloLogic4 github repository. These provide the high-performance compression, indexing, and search algorithms that distinguish PhiloLogic from most other XML and database technologies.

This C library is the building block upon which all of PhiloLogic4's python library classes are built. The two most important are

the Loader class, which controls parsing and indexing TEI XML files, and
the DB class, which governs all access to a PhiloLogic database.

These classes themselves make use of other classes, most of which appear in the diagram above; it's extremely important to note that the Loader and the DB share almost no behaviors or components.

This separation is a point of departure from most other database systems: in PhiloLogic4, the set of components that produce a database is distinct from the set of components that query an existing database. We refer to the time when XML documents are ingested and indexed as load-time, and the time when a user queries the database as run-time or query-time.

Although one of the original design goals of PhiloLogic4 was to focus on the development of a more generalized library for TEI processing, it became clear at some point that a set of general behaviors was not enough, and that pragmatic development required two additional components:

a general-purpose document-ingesting script, capable of handling errors and ambiguity, and
a readymade web application suitable for most purposes, and customizable for others

These components were built as applications making use of the standard library components, and allow a PhiloLogic developer to specify all text- and language-specific features without modification of any shared functions.

The load_script has been described already in a previous post, but it is worth revisiting in this broader context. The load script is responsible for three fundamental tasks:

taking command-line arguments from the user, and passing all the supplied files into the loader class, along with additional parameters
storing all system-specific configuration parameters: hostname, filesystem locations, etc.
storing all text-specific configuration parameters: XPaths, tokenization regexes, special filters, etc.

When the load script has finished running, it moves the loaded database into an appropriate path in the web server's document tree, and creates a web application around it. This is the very same web application described in Clovis's recent post. It is created by copying a set of files stored elsewhere, typically in the PhiloLogic4 install directory, although specifying another set of files to "clone" from is possible. It is important to note that, by convention, we refer to the web application together with the database that it accesses as a "database", as one almost never exists without the other, and this is reflected in the diagram above.

The behavior of such a database/application is just as Clovis described it: all queries go to one of several "report generators", which interpret query parameters and access the database accordingly. They produce a result object, a python object that maps very closely to a JSON object--that is, a single dictionary literal consisting of other literals, without functions, tuples, lambdas, objects, and other such structures that cannot be expressed in JSON. This result object is then passed on to a Mako template file, which can transform the result into HTML viewable by a web browser, which is finally returned to the user--"finally" usually meaning under 100 milliseconds, of course.

Over the coming months, Clovis and I will be describing many of these components in detail, and this post may be updated as this larger documentation project proceeds; but for now, I hope it serves as a helpful overview of PhiloLogic4.

General Overview of PhiloLogic4's Web Architecture

Clovis Wednesday, December 03, 2014 2 comments

Very early in the development of PhiloLogic4, we decided to separate the core library (the C core and Python bindings) from the actual Web interface. While there is still a clear separation between the Web environment and the library code, the two are nevertheless interdependent, which is to say that one cannot function independently of the other (unless you intend to use the library functions on the command line...)

As such, the Web component of PhiloLogic4 was designed as a Web Application, and each database functions as its own individual Web App. This allows for greater flexibility and customization. With PhiloLogic4, we wanted the Web layer to be the only part of the code a database developer has to deal with. We even went so far as to offer configuration options that drastically change the behavior of our various utilities. Before I start diving into each individual component (in later posts), I wanted to give a general picture of the Web app, as well as an idea of its features and flexibility.

The application is at its core a Python WSGI app which handles (most) requests through a dispatcher.py script that interprets queries and reroutes them to the relevant parts of the application. The results of requests are rendered in HTML thanks to the use of Mako, a powerful and easy-to-use template library. A description of the general layout of the Web App will give a better idea of how the PhiloLogic4 Web App functions.

There are four distinct sections (besides CSS and JS resources) inside the application:

The reports directory, which contains the major search reports which fetch data from the database by interfacing with the core library, and then return a specialized results report. These reports include concordance, KWIC (Key Word In Context), collocation, and time series.

The templates directory, which contains all of the different Mako templates we use to generate an HTML page.

The functions directory, which contains all of the generic functions used by individual reports. These functions include parsing the query string, loading web configuration options, access control, etc.

The scripts directory, which contains standalone CGI scripts that are called directly from Javascript code on the client side. These functions bypass the dispatcher and have a very specialized purpose, such as returning the total number of hits for any given query, or switching from a concordance display to a KWIC display.

The first three directories contain all that is necessary to return initial results to the client. The CGI scripts contained in /scripts provide additional functionality made possible by the use of Javascript in our Web Client. Significant work has been done to provide a dynamic and interactive Web interface, and this was made possible via heavy use of Javascript throughout the application, something which I'll describe in greater detail in another post.

Another design decision we made, somewhat late in the development process, was to rely on a CSS/JS framework for the layout of our HTML. We decided to use Bootstrap for its flexibility and responsiveness. As a result, PhiloLogic4 should work on any screen, be it phone, tablet or computer, although some functionality (such as KWIC reports) is hidden on smaller screens due to the limited space available.

Finally-- and I will go into much further detail in a separate post--we've designed a RESTful API that provides access to the full functionality of our web app. This is made possible by delaying for as long as possible the process of choosing to render search results as HTML or JSON. Basically, we expose the same results object to the HTML renderer (the Mako templates) that we do to any potential client. This design feature has allowed us to build a PhiloReader Android client application, focused on reading, by calling the relevant APIs needed for such functionality.

In my next post on the Web Application, I will go through the various configuration options available.

PhiloLogic4 Load Script Architecture

Richard Wednesday, October 22, 2014 Leave a Comment

Clovis and I have been doing a great deal of work lately on PhiloLogic4's document-loading process, and I feel that it's matured enough to start documenting in detail. The best place to start is with the standard PhiloLogic4 load_script.py, which you can look at on github if you don't have one close at hand:

https://github.com/ARTFL-Project/PhiloLogic4/blob/master/scripts/load_script.py

The load script works more or less like the old philoload script, with some important differences:

The load script is not installed system-wide--you generally want to keep it near your data, with any other scripts.
The load script has no global configuration file--all configuration is kept separate in each copy of the script that you create.
The PhiloLogic4 Parser class is fully configurable from the load script--you can change any Xpaths you want, or even supply a replacement Parser class if you need to.
The load script is designed to be short, and easy to understand and modify.

The most important pieces of information in any load script are the system setup variables at the top of the file. These will give immediate errors if they aren't set up right.

database_root is the filesystem path to the web-accessible directory where your PhiloLogic4 database will live, like /var/www/philologic/--so your webserver process will need read access to it, and you will need write access to create the database--and don't forget to keep the slash at the end of the directory, or you'll get errors.

url_root is the HTTP URL that the database_root directory is accessible at: http://your.server.com/philologic/could be a reasonable mapping of the example above, but it will depend on your DNS setup, server configuration, and other hosting issues outside the scope of this document.

template_dir, which defaults to database_root + "/_system_dir/_install_dir/", is the directory containing all the scripts, reports, templates, and stylesheets that make up a PhiloLogic4 database application. If you have customized behaviors or designs that you want reflected in all of the databases you build, you can keep those templates in a directory on their own where they won't get overwritten.

(At the moment, you can't "clone" the templates from an existing database, because they actual database content can be very large, but we'd very much like to implement that feature in the future to allow for easy reloads.)

Most of the rest of the file is configuration for the Loader class, which does all of the real work, but the config is kept here, in the script, so you don't have to maintain custom classes for every database.

For now, it's just important to know what options can be specified in the load script:

default_object_level defines the type of object returned for the purpose of most navigation reports--for most database, this will be "doc", but you might want to use "div1" for dictionary or encyclopedia databases.
navigable_objects is a list of the object types stored in the database and available for searching, reporting, and navigation--("doc","div1","div2","div3") is the default, but you might want to append "para" if you are parsing interesting metadata on paragraphs, like in drama. Pages are handled separately, and don't need to be included here.
filters and post_filters are lists of loader functions--their behavior and design will be documented separately, but they are basically lists of modular loader functions to be executed in order, and so shouldn't be modified carelessly.
plain_text_obj is a very useful option that generates a flat text file representations of all objects of a given type, like "doc" or "div1", usually for data mining with Mallet or some other tool.
extra_locals is a catch_all list of extra parameters to pass on to your database later, if you need to--think of it as a "swiss army knife" for passing data from the loader to the database at run-time.

The next section of the load script is setup for the XML Parser:

This is a bit complex, and will be explored in depth in a separate post, but the basic layout is this:

xpaths is a list of 2-tuples that maps philologic object types to absolute XPaths--that is, XPaths evaluated where "." refers to the TEI document root element. You can define multiple XPaths for the same type of object, but you will get much better and more consistent results if you do not.
metadata_xpaths is a list of 3-tuples that map one or more XPaths to each metadata field defined on each object type. These are evaluated relative to whatever XML element matched the XPath for the object type in question--so "." here refers to a doc, div1, or paragraph-level object somewhere in the xml.
pseudo_empty_tags is a very obscure option for things that you want to treat as containers, even if they are encoded as self-closing tags.
suppress_tags is a list of tags in which you do not want to perform tokenization at all--that is, no words in them will be searchable via full-text search. It does not prohibit extracting metadata from the content of those tags.
word_regex and punct_regex are regular expression fragments that drive our tokenizer. Each needs to consist of exactly one capturing subgroup so that our tokenizer can use them correctly. They are both fully unicode-aware--usually, the default \w class is fine for words, but in some cases you may need to add apostrophes and such to the word pattern. Likewise, the punctuation regex pattern fully supports multi-byte utf-8 punctuation. In both cases you should enter characters as unicode code points, not utf-8 byte strings.

The next section consists of just a few scary incantations that shouldn't be modified:

But the following 2 sections are where all the work gets done, and an important place to perform modifications. First, we construct the Loader object, passing it all the configuration variables we have constructed so far:

Then we operate the Loader object step-by-step:

And that's it!

Usually, these load functions should all be executed in the same order, but it is worth paying special attention to the load_metadata variable that is constructed right before l.parse_files is called. This variable controls the entire parsing process, and is incredibly powerful. Not only does it let you define any order in which to load your files, but you can also supply any document-level metadata you wish, and change the xpaths, load_filters, or parser class used per file, which can be very useful on complex or heterogeneous data sets. However, this often requires either some source of stand-off metadata or pre-processing/parsing stage.

For this purpose, we've added a powerful new Loader function called sort_by_metadata which integrates the functions of a PhiloLogic3 style metadata guesser and sorter, while still being modular enough to replaced entirely when necessary. We'll describe it in more detail in a later post, but for now, you can look at the new artfl_load_script to get a sense of how to construct a more robust, fault-tolerant loader using this new function.

https://github.com/ARTFL-Project/PhiloLogic4/blob/master/scripts/artfl_load_script.py

Up next: the architecture of the PhiloLogic Loader class itself.

ARTFL Project Research Blog

PhiloLogic4: The Big Picture

General Overview of PhiloLogic4's Web Architecture

PhiloLogic4 Load Script Architecture

Labels

Popular Posts

Blog Archive

Developed by ARTFL