Software for preprocessing textual data in multiple languages for textual analysis.

Produced by Christopher Lucas, Alex Storer, and Dustin Tingley. Thanks to Sam Brotherton for work on earlier versions.

Send questions and suggestions to Christopher Lucas.

Cite: BibTeX | Endnote

Index-based text management

txtorg is a Python-based utility that leverages Whoosh and Apache Lucene to facilitate text preprocessing and management. In the newest version (released Winter 2016), users can use a pure-Python version of txtorg that relies on Whoosh, while those in need of especially fast processing or additional language support can install Lucene to use with txtorg. txtorg outputs processed text in a variety of formats for use in a wide array of analytical software, including (but not limited to) the structural topic model. It scales to large corpora and has a graphical user interface that anyone can use. With Lucene and Whoosh, txtorg can support a wide range of languages. For more information on txtorg and text analysis, especially (but not exclusively) with data in political science, we point users to a working paper that describes the software and various applications in greater detail.

For regular updates and software support, sign up for our mailing list.


In response to demand for software with fewer dependencies, we've updated txtorg so that users need only install txtorg itself to use its base functionality. Users may extend that functionality by also installing Lucene. We've done our best to simplify and document the installation process for Linux, Windows, and Mac, but if you have any trouble with the installation or if you have ideas for improving it, send Chris an email! Your feedback is very helpful.

Please find the appropriate installation instructions for your system below.



This section explains how to use txtorg. After you've finished the installation process, use the table of contents below to navigate the documentation. For the impatient, the basic workflow proceeds as follows. First, create a new corpus. Second, import documents. Third, rebuild the index files (this is very important). Fourth, select documents. Finally, export the term-document matrix (TDM).

  1. Create a Corpus
  2. Import Documents
  3. Analyzer Selection (Specify Non-English Language)
  4. Select Documents
  5. Export TDM
  6. Example
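
To make the workflow above concrete, here is a minimal pure-Python sketch of the same five steps: create a corpus, import documents, rebuild the index, select documents, and export a TDM. It is not txtorg's code and does not use Whoosh or Lucene. The `Corpus` class, its method names, the `analyze` tokenizer, and the sample documents are all hypothetical and exist only to illustrate the idea of index-based text management.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for an analyzer: keep runs of Unicode letters, lowercased.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def analyze(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


class Corpus:
    """Illustrative corpus with an inverted index (not txtorg's implementation)."""

    def __init__(self):
        self.documents = {}                # doc_id -> {"text": ..., "metadata": ...}
        self.index = defaultdict(Counter)  # term -> Counter({doc_id: count})

    def import_documents(self, docs):
        """Store (doc_id, text, metadata) triples; the index is NOT updated here."""
        for doc_id, text, metadata in docs:
            self.documents[doc_id] = {"text": text, "metadata": metadata}

    def rebuild_index(self):
        """Re-tokenize every stored document and rebuild the inverted index."""
        self.index = defaultdict(Counter)
        for doc_id, doc in self.documents.items():
            for term in analyze(doc["text"]):
                self.index[term][doc_id] += 1

    def select(self, term=None, **metadata):
        """Return sorted ids of documents containing `term` and matching metadata."""
        if term is None:
            ids = set(self.documents)
        else:
            ids = set(self.index.get(term.lower(), {}))
        return sorted(
            doc_id for doc_id in ids
            if all(self.documents[doc_id]["metadata"].get(key) == value
                   for key, value in metadata.items())
        )

    def export_tdm(self, doc_ids):
        """Return a CSV string: one row per term, one column per selected document."""
        selected = set(doc_ids)
        terms = sorted(term for term, postings in self.index.items()
                       if selected & set(postings))
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(["term"] + list(doc_ids))
        for term in terms:
            writer.writerow([term] + [self.index[term].get(d, 0) for d in doc_ids])
        return out.getvalue()


# 1. Create a corpus, 2. import documents (sample data invented for this sketch).
corpus = Corpus()
corpus.import_documents([
    ("d1", "Taxes rise as budget talks stall", {"year": 2015}),
    ("d2", "Budget vote delayed", {"year": 2016}),
    ("d3", "Storm delays vote", {"year": 2016}),
])

# 3. Rebuild the index; until this runs, term searches find nothing.
corpus.rebuild_index()

# 4. Select documents by term and metadata, 5. export the TDM.
selected = corpus.select(term="vote", year=2016)
tdm_csv = corpus.export_tdm(selected)
print(tdm_csv)
```

In this sketch, importing only stores the documents and the index is built in a separate step, so a term search on a freshly imported corpus returns nothing until `rebuild_index()` has run. That is one plausible reading of why the rebuild step is flagged as very important above; txtorg's actual behavior may differ.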