This is just to let you know that we now have an EPUB-to-TEI converter. It can be found here:
http://artfl.googlecode.com/files/epub_parser.tar
As you'll notice, there are three files in this archive. The first, epub_parser.sh, is the only one you need to edit: specify the paths (where the EPUB files are and where you want your TEI files to go) without slashes, then execute epub_parser.sh. The second, parser.pl, is called by epub_parser.sh. The third, entities.pl, handles HTML entities and is also called by epub_parser.sh. Before running the converter, make sure all three scripts are in the same directory.
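The setup described above can be checked before launching the converter. The following is only a sketch, written in Python rather than the shell and Perl used in the archive. The helper names `missing_scripts` and `strip_trailing_slash` are hypothetical and are not part of the archive, and removing trailing slashes is just one reading of "without slashes".

```python
import os

# The three files shipped in the archive, as listed above.
REQUIRED_SCRIPTS = ("epub_parser.sh", "parser.pl", "entities.pl")


def missing_scripts(script_dir):
    """Return the names of required scripts that are not in script_dir."""
    return [
        name
        for name in REQUIRED_SCRIPTS
        if not os.path.isfile(os.path.join(script_dir, name))
    ]


def strip_trailing_slash(path):
    """Drop trailing slashes from a path (assumed meaning of "without slashes").

    A bare "/" is returned unchanged so the root directory stays valid.
    """
    stripped = path.rstrip("/")
    return stripped if stripped else path
```

An empty list from `missing_scripts(".")` means all three scripts sit in the current directory. A path such as the made-up `/data/epub/` would be written into epub_parser.sh as `strip_trailing_slash("/data/epub/")`.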
A sample PhiloLogic load can be found here:
http://artflx.uchicago.edu/philologic/epubtest.whizbang.form.html
Of course, this is just a proof of concept and will be used only for text search and machine learning purposes. Some things will still have to be tuned. Note that I put a div1 every ten pages, since there is no way to recognize chapters in the original EPUB files.
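To illustrate the div1-every-ten-pages idea, here is a minimal sketch in Python. It is not the code in parser.pl. The function name `wrap_pages_in_div1`, its `pages_per_div` parameter, and the assumption that each page is already available as a string of markup are all hypothetical.

```python
def wrap_pages_in_div1(pages, pages_per_div=10):
    """Group page strings into <div1> blocks of at most pages_per_div pages.

    `pages` is assumed to be a list of already-converted page strings;
    the last block may hold fewer pages than the others.
    """
    if pages_per_div < 1:
        raise ValueError("pages_per_div must be at least 1")
    divs = []
    for start in range(0, len(pages), pages_per_div):
        chunk = pages[start:start + pages_per_div]
        divs.append("<div1>\n" + "\n".join(chunk) + "\n</div1>")
    return "\n".join(divs)
```

With 25 pages this produces three div1 blocks holding 10, 10, and 5 pages. The divisions are arbitrary and do not correspond to chapters, so they only make the text easier to load and navigate.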