txtorg

Software for preprocessing textual data in multiple languages for textual analysis.

Produced by Christopher Lucas, Alex Storer, and Dustin Tingley. Thanks to Sam Brotherton for work on earlier versions.

Send questions and suggestions to Christopher Lucas.

Cite: BibTeX | Endnote

Windows installation

To install the Whoosh version of txtorg, simply follow these short instructions.

  1. Install Python (ideally 2.7 but definitely not 3.x)
  2. Download the zip file and unzip it. This should create a directory named "ChristopherLucas-txtorg-[LATEST GIT COMMIT NUMBER]".
  3. To install the package, first open a command prompt (Start > All Programs > Accessories > Command Prompt). Next, navigate to the directory "ChristopherLucas-txtorg-[LATEST GIT COMMIT NUMBER]" (the unzipped directory, not the .zip file). To do so, use the cd command followed by the full path to the directory. Then install the package with the command python setup.py install. These steps are shown in the code block below.
    $ cd this_is_where/I_unzipped_the_file/ChristopherLucas-txtorg-[LATEST GIT COMMIT NUMBER] # Navigate to the folder
    $ python setup.py install

  4. Open the directory where you unzipped the file, go to bin, and rename txtorg to txtorg.py. Execute this file to run txtorg. By default, you should be able to do so by double-clicking it. If you've changed the default behavior for Python scripts, you may have to execute the file from the command line with 'python txtorg.py'.

And that should do it!
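
If the install fails, a common cause is that the python command launches an unsupported interpreter. The following is a minimal sketch for checking this, not part of txtorg; the function names is_supported_python and describe_python are hypothetical, and the script only encodes the requirement from step 1 (Python 2.x, ideally 2.7).

```python
import sys


def is_supported_python(version_info):
    """Return True for Python 2.x, the only line these instructions support."""
    return version_info[0] == 2


def describe_python(version_info):
    """Return a one-line verdict for the given (major, minor, ...) tuple."""
    major, minor = version_info[0], version_info[1]
    if not is_supported_python(version_info):
        return "Python %d.%d found: not supported, install Python 2.7" % (major, minor)
    if minor != 7:
        return "Python %d.%d found: may work, but 2.7 is recommended" % (major, minor)
    return "Python 2.7 found: OK"


if __name__ == "__main__":
    # Report on whichever interpreter is running this script.
    print(describe_python(sys.version_info))
```

Saving this as a file (for example check_python.py, a name chosen here for illustration) and running it with 'python check_python.py' reports on the interpreter that the python command actually starts.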

To use txtorg with Lucene as well as Whoosh, follow the steps below instead. We do not recommend this unless you need a language that Whoosh does not support. If you do need Lucene, consider working on a Linux machine, where installation is considerably easier.

  1. Install Python (ideally 2.7 but definitely not 3.x)
  2. Install setuptools: Download https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py and double-click it! This will install easy_install.exe, which you'll need in step 5.
  3. Install Java (if it's not installed)
  4. Locate the jvm.dll file (mine is in C:\Program Files (x86)\Java\jre1.6.0_24\bin\client). You'll need this in the next step.
  5. Add the dll location and the easy_install location to the Path. easy_install.exe is probably located in C:\Python27\Scripts, but if not, use the search box on the Start menu to find it. If you haven't edited the Path variable before, follow these steps to do so.
    • From the Desktop, right-click My Computer and click Properties
    • Click Advanced System Settings, then Environment Variables
    • Click PATH in the list of System Variables
    • Click Edit, and change the path by appending the dll location and the easy_install location, as shown in the code block below. IMPORTANT: file paths must be separated by semicolons. The example below contains two paths, not one. Substitute your own locations for the two shown.
      ;C:\Program Files (x86)\Java\jre1.6.0_24\bin\client;C:\Python27\ArcGIS10.2\Scripts
  6. Download this file and double click it
  7. Download the zip file and unzip it. This should create a directory named "ChristopherLucas-txtorg-4dbad8b".
  8. Finally, you must install txtorg (as you would any Python package). To install the package, first open a command prompt (Start > All Programs > Accessories > Command Prompt). Next, navigate to the directory "ChristopherLucas-txtorg-4dbad8b". To do so, use the cd command followed by the full path to the directory. Then install the package with the command python setup.py install. These steps are shown in the code block below.
    $ cd this_is_where/I_unzipped_the_file/ChristopherLucas-txtorg-4dbad8b # Navigate to the folder
    $ python setup.py install

  9. Open the directory where you unzipped the file, go to bin, and rename txtorg to txtorg.py. Double-click this file to run txtorg.

And that should do it!
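
A frequent problem with the Path step is an entry that is misspelled or is missing its semicolon. As a rough check, the sketch below splits a Path string on semicolons and reports which directories actually contain a given file. It is an illustration only: find_on_path is a hypothetical helper, not part of txtorg, and it assumes the Windows-style semicolon separator described in the Path instructions.

```python
import os


def find_on_path(filename, path_string, sep=";"):
    """Return the directories in path_string that contain filename."""
    hits = []
    for entry in path_string.split(sep):
        # Tolerate stray whitespace and quoted entries.
        entry = entry.strip().strip('"')
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Check the two files that the Lucene install depends on.
    path = os.environ.get("PATH", "")
    for name in ("jvm.dll", "easy_install.exe"):
        found = find_on_path(name, path)
        print("%s: %s" % (name, ", ".join(found) if found else "NOT FOUND on Path"))
```

Run the script from a new command prompt, since a prompt that was already open will not see changes made to the Path afterwards.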

Updating Versions on a Windows Machine

Because txtorg is still in active development, you should update often (for notifications about updates, sign up for our mailing list). On a Windows machine, update by following the steps below.

  1. Delete the directory containing the current install. You can leave the program files; these will be overwritten.
  2. Download the zip file and unzip it.
  3. From the command line, navigate to the directory created when you unzipped the file. This directory should contain setup.py. Install txtorg following the same process used for the initial install, as shown in the code block below.
    $ cd this_is_where/I_unzipped_the_file/ChristopherLucas-txtorg-4dbad8b # Navigate to the folder
    $ python setup.py install

  4. Open the directory where you unzipped the file, go to bin, and rename txtorg to txtorg.py. Double-click this file to run txtorg.