The load script works more or less like the old philoload script, with some important differences:
- The load script is not installed system-wide--you generally want to keep it near your data, with any other scripts.
- The load script has no global configuration file--all configuration is kept separate in each copy of the script that you create.
- The PhiloLogic4 Parser class is fully configurable from the load script--you can change any XPaths you want, or even supply a replacement Parser class if you need to.
- The load script is designed to be short, and easy to understand and modify.
- default_object_level defines the type of object returned by most navigation reports--for most databases, this will be "doc", but you might want to use "div1" for dictionary or encyclopedia databases.
- navigable_objects is a list of the object types stored in the database and available for searching, reporting, and navigation--("doc","div1","div2","div3") is the default, but you might want to append "para" if you are parsing interesting metadata on paragraphs, like in drama. Pages are handled separately, and don't need to be included here.
- filters and post_filters are lists of modular loader functions that are executed in order, so they shouldn't be modified carelessly--their behavior and design will be documented separately.
- plain_text_obj is a very useful option that generates flat text file representations of all objects of a given type, like "doc" or "div1", usually for data mining with Mallet or some other tool.
- extra_locals is a catch-all list of extra parameters to pass on to your database later, if you need to--think of it as a "swiss army knife" for passing data from the loader to the database at run-time.
- xpaths is a list of 2-tuples that map PhiloLogic object types to absolute XPaths--that is, XPaths evaluated where "." refers to the TEI document root element. You can define multiple XPaths for the same type of object, but you will get much better and more consistent results if you do not.
- metadata_xpaths is a list of 3-tuples that map one or more XPaths to each metadata field defined on each object type. These are evaluated relative to whatever XML element matched the XPath for the object type in question--so "." here refers to a doc, div1, or paragraph-level object somewhere in the XML.
- pseudo_empty_tags is a very obscure option for things that you want to treat as containers, even if they are encoded as self-closing tags.
- suppress_tags is a list of tags in which you do not want to perform tokenization at all--that is, no words in them will be searchable via full-text search. It does not prohibit extracting metadata from the content of those tags.
- word_regex and punct_regex are regular expression fragments that drive our tokenizer. Each must contain exactly one capturing group so that our tokenizer can use it correctly. They are both fully Unicode-aware--usually, the default \w class is fine for words, but in some cases you may need to add apostrophes and such to the word pattern. Likewise, the punctuation regex pattern fully supports multi-byte UTF-8 punctuation. In both cases you should enter characters as Unicode code points, not UTF-8 byte strings.
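
To illustrate how xpaths and metadata_xpaths relate, here is a minimal sketch, not PhiloLogic4's actual Parser. It assumes a tuple order of (object type, XPath) and (object type, XPath, field name), which the notes above do not specify, and it uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath. The sample TEI document and the extract function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI document used only for illustration.
TEI = (
    "<TEI><teiHeader><fileDesc><titleStmt>"
    "<title>Sample Plays</title><author>Anon</author>"
    "</titleStmt></fileDesc></teiHeader>"
    "<text><body>"
    "<div1><head>Act I</head><p>First.</p></div1>"
    "<div1><head>Act II</head><p>Second.</p></div1>"
    "</body></text></TEI>"
)

# Assumed layout: (object type, XPath evaluated from the document root).
xpaths = [
    ("doc", "."),
    ("div1", "./text/body/div1"),
]

# Assumed layout: (object type, XPath relative to the matched object, field name).
metadata_xpaths = [
    ("doc", "./teiHeader/fileDesc/titleStmt/title", "title"),
    ("doc", "./teiHeader/fileDesc/titleStmt/author", "author"),
    ("div1", "./head", "head"),
]


def extract(root):
    """Return one metadata record per object matched by `xpaths`."""
    records = []
    for obj_type, obj_path in xpaths:
        # Object XPaths are evaluated with "." bound to the root element.
        for element in root.findall(obj_path):
            record = {"type": obj_type}
            for meta_type, meta_path, field in metadata_xpaths:
                if meta_type != obj_type:
                    continue
                # Metadata XPaths are evaluated with "." bound to the object.
                match = element.find(meta_path)
                if match is not None and match.text:
                    record[field] = match.text.strip()
            records.append(record)
    return records


records = extract(ET.fromstring(TEI))
for record in records:
    print(record)
```

The same "./head" path yields a different value for each div1 because it is resolved against each matched element rather than against the document root.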
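
The word_regex and punct_regex requirements can be sketched as follows. This is an illustrative tokenizer under stated assumptions, not the one PhiloLogic4 ships: the two patterns, the way they are joined, and the tokenize function are all hypothetical. Each fragment has exactly one capturing group, and the extra characters are written as Unicode code points.

```python
import re

# Hypothetical fragments, each with exactly one capturing group.
# \u2019 is the right single quotation mark, added so that elided
# forms such as "L’été" stay together as one word.
word_regex = r"([\w\u2019']+)"
# \u00ab and \u00bb are the French guillemets.
punct_regex = r"([.,;:!?\u00ab\u00bb])"

# Assumed combination: alternation, so group 1 is a word and group 2 is
# punctuation. This only works if each fragment has a single group.
token_pattern = re.compile(word_regex + "|" + punct_regex)


def tokenize(text):
    """Split text into ("word", ...) and ("punct", ...) tokens."""
    tokens = []
    for match in token_pattern.finditer(text):
        if match.group(1) is not None:
            tokens.append(("word", match.group(1)))
        else:
            tokens.append(("punct", match.group(2)))
    return tokens


tokens = tokenize("L\u2019\u00e9t\u00e9, d\u00e9j\u00e0!")
print(tokens)
```

If a fragment contained a second capturing group, the group numbers in the combined pattern would shift and words could be misread as punctuation, which is one plausible reason for the single-group rule.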