txtorg

Software for preprocessing textual data in multiple languages for textual analysis.

Produced by Christopher Lucas, Alex Storer, and Dustin Tingley. Thanks to Sam Brotherton for work on earlier versions.

Send questions and suggestions to Christopher Lucas.

Cite: BibTeX | Endnote

Linux installation

On Linux, installation is straightforward. txtorg no longer depends on non-Python packages, so you can install it as you would any other Python module, as shown below.

$ curl -OL https://github.com/ChristopherLucas/txtorg/archive/master.zip && unzip master.zip && cd txtorg-master && python setup.py install --user

If you want to extend txtorg to Lucene, you may do so with the instructions below. We only recommend this where absolutely necessary.

PyLucene Installation Instructions for Ubuntu/Debian

PyLucene depends on gcc/g++, the JDK, Ant, and the Python development packages. If you're using Ubuntu/Debian, there is a PyLucene deb that you can install, which pulls in any missing dependencies when installed through APT. Otherwise, you'll have to build PyLucene, and any missing dependencies, from source. We'll first show you how to install the PyLucene deb, using either APT or dpkg, then how to build PyLucene from source. You only need to do one of these. Once PyLucene is in place, you can install txtorg as you would any other Python package, as described in the final section.

You can either install PyLucene with APT or you can download the deb and install it. The former is probably best but we'll describe both for good measure.

APT

First, confirm that your repository contains PyLucene Version 3. You can do this from a terminal with the following command.

$ apt-cache policy pylucene

If that returned PyLucene Version 3.X.X, install PyLucene with the following command.

$ sudo apt-get install pylucene

And that should do it! Proceed to install txtorg.
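
If you would rather script the version check than read the output by eye, the following is a minimal sketch. It assumes the apt-cache policy output contains a "Candidate:" line and that the version carries no epoch prefix (such as "1:"); the function name candidate_major is hypothetical and not part of txtorg or APT.

```shell
# Hypothetical helper: read `apt-cache policy` output on stdin and print the
# major version of the candidate package (the part before the first dot).
# Assumes a "Candidate:" line whose version has no epoch prefix.
candidate_major() {
  awk '/Candidate:/ { split($2, parts, "[.]"); print parts[1] }'
}
```

After defining the function in your shell, pipe the policy output into it and install only if it prints 3.

$ apt-cache policy pylucene | candidate_major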

dpkg

If your repository doesn't contain PyLucene Version 3, you can download the prebuilt deb with wget and install it with dpkg, as shown below. If you are not running a 64-bit machine, change the link accordingly.

$ wget http://ftp.us.debian.org/debian/pool/main/p/pylucene/pylucene_3.5.0-1.2_amd64.deb
$ sudo dpkg -i pylucene_3.5.0-1.2_amd64.deb

You can now install txtorg.

Building PyLucene from Source

If you are using a distribution of Linux other than Ubuntu/Debian, you'll have to build PyLucene and its dependencies from source. The command below assumes that you're using APT and that the dependencies are in your repositories. If you're using another package management system, or if the packages are not in your repositories, you'll have to adapt the process accordingly.

$ sudo apt-get install openjdk-7-jdk ant g++ python-dev

Next, you need to build PyLucene. First, download a PyLucene 3 source tarball (any version 3 release will do). From a shell, navigate to the directory in which you saved the tarball, then proceed as follows.

$ cd this_is_where/I_saved_my_tarball # Navigate to the folder with the tarball
$ tar -zxvf pylucene-[PyLucene Version Here]-src.tar.gz # Untar the tarball
$ cd pylucene-[PyLucene Version Here]/jcc # Change into the jcc directory
$ sudo python setup.py build
$ sudo python setup.py install
$ cd .. # Change back to the parent directory

Next, open the Makefile in a text editor, uncomment the section shown in the block below, then save and close the file. Note that you'll uncomment a different section if you're using a 32-bit machine.

# Linux (Ubuntu 11.10 64-bit, Python 2.7.2, OpenJDK 1.7, setuptools 0.6.16)
# Be sure to also set JDK['linux2'] in jcc's setup.py to the JAVA_HOME value
# used below for ANT (and rebuild jcc after changing it).
PREFIX_PYTHON=/usr
ANT=JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 /usr/bin/ant
PYTHON=$(PREFIX_PYTHON)/bin/python
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=4	    
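
The JAVA_HOME value on the ANT line has to match the JDK actually installed on your machine. One way to find it, sketched below, is to resolve the javac symlink and drop the trailing bin/javac. This assumes GNU readlink (for the -f flag) and a JDK laid out as JAVA_HOME/bin/javac; the function name java_home_from_javac is hypothetical.

```shell
# Hypothetical helper: print the JDK root for a given javac path by
# resolving symlinks and dropping the trailing /bin/javac.
# Assumes GNU readlink (-f) and the usual <JAVA_HOME>/bin/javac layout.
java_home_from_javac() {
  real_javac=$(readlink -f "$1") || return 1
  dirname "$(dirname "$real_javac")"
}
```

After defining the function in your shell, call it with the javac found on your PATH and compare the result with the path in the Makefile.

$ java_home_from_javac "$(command -v javac)"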

Back in the terminal, finish the PyLucene installation process by issuing the following commands.

$ make
$ sudo make install
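
Before moving on, it can be worth confirming that the build is importable. The sketch below wraps an import test in a small function. PyLucene installs a Python module named lucene; the function name module_available and its interpreter argument are hypothetical.

```shell
# Hypothetical helper: succeed if the given Python interpreter ($1) can
# import the named module ($2), fail otherwise. Python's output is silenced.
module_available() {
  "$1" -c "import $2" >/dev/null 2>&1
}
```

After defining the function in your shell, run it against the interpreter you used for the build.

$ module_available python lucene && echo "PyLucene is importable"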

Now, install txtorg.

Installing txtorg

Now that you've met the dependencies for the package, you can install txtorg locally as you would any other Python module.

$ curl -OL https://github.com/ChristopherLucas/txtorg/archive/master.zip && unzip master.zip && cd txtorg-master && python setup.py install --user