txtorg

Software for preprocessing textual data in multiple languages for text analysis.

Produced by Christopher Lucas, Alex Storer, and Dustin Tingley. Thanks to Sam Brotherton for work on earlier versions.

Send questions and suggestions to Christopher Lucas.

Cite: BibTeX | Endnote

Create a Corpus

Begin by creating a new corpus. One of txtorg's strengths is its capacity to manage multiple corpora simultaneously. You can create a corpus for one project, begin another, and later return to the original without reindexing, thus unifying the text management process into a single piece of software.

To create a new corpus, click 'File' -> 'New Corpus'. You will be prompted to choose a directory in which to save the corpus, and then to name it. You may put the corpus wherever you like. The corpus name may only contain alphanumeric characters.

Import Documents

Next, you must import documents. This section lays out the various formatting and import options and concludes with example scripts demonstrating the construction of files that can be used to import documents into txtorg.

Formatting Requirements

To import documents, click on the corpus into which you want to import documents (in the leftmost 'Corpus' frame) so that it is highlighted. Then click 'Corpus' -> 'Import Documents'. You'll be shown a window labeled 'SELECT PREPROCESSING OPTIONS', at the bottom of which you select the 'Corpus Format'. Note that the txtorg installation contains examples for all three import options, described in the following paragraphs. The example is composed of excerpts from The Brothers Karamazov (now in the public domain) and the books and chapters to which those excerpts belong (the metadata).

Import an Entire Directory

The first import format option is 'Import an entire directory'. If you'd like to import all documents in a directory and in any child directories located within it, select this option. Note that these documents must be .txt files; txtorg will skip all other file extensions but will import every .txt file located within the directory. Also note that this import option does not support metadata. An example of this import format can be found in the 'examples/brothersk/' directory contained in the zip file used to install txtorg.
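Before importing, you can preview which files this option would pick up. The sketch below reproduces the same recursive .txt scan; the function name is ours, not part of txtorg:

```python
import os

def txt_files(directory):
    """Recursively collect the .txt files txtorg would import,
    skipping every other file extension."""
    matches = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.txt'):
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Running this on 'examples/brothersk/' should list exactly the example documents, and nothing else.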

Import from a CSV file (not including content)

The second import format option is 'Import from a CSV file (not including content)'. In this format, files are imported via a CSV whose first column, 'filepath', points to each of the documents to be imported. The remaining columns in the CSV are metadata fields, such as author name or date. In the installation directory, '/examples/brothersk_without_content.csv' demonstrates this import format. This file is also shown below. Note that txtorg cannot recognize relative directory paths (such as those using a period to signify the current directory) on some operating systems, so to import the example corpus with this CSV, you must edit the paths accordingly.

filepath,book,chapter
[full_path_to_directory]/examples/brothersk/1.txt,2,2
[full_path_to_directory]/examples/brothersk/2.txt,2,5
[full_path_to_directory]/examples/brothersk/3.txt,2,7
[full_path_to_directory]/examples/brothersk/4.txt,3,3
[full_path_to_directory]/examples/brothersk/5.txt,3,9
[full_path_to_directory]/examples/brothersk/6.txt,4,3
[full_path_to_directory]/examples/brothersk/7.txt,4,8
[full_path_to_directory]/examples/brothersk/8.txt,4,8
[full_path_to_directory]/examples/brothersk/9.txt,7,3
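Rather than editing the paths by hand, you can rewrite the path column programmatically. A minimal sketch (the function name and file names are our own, not part of txtorg):

```python
import csv
import os

def absolutize(in_csv, out_csv, path_field='filepath'):
    """Copy an import CSV, rewriting its path column to absolute paths."""
    with open(in_csv, newline='') as src, open(out_csv, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # resolve each path relative to the current working directory
            row[path_field] = os.path.abspath(row[path_field])
            writer.writerow(row)
```

Run it from the directory containing 'examples/' so that the relative paths resolve correctly.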

Import from a CSV file (including content)

The third import format option is 'Import from a CSV file (including content)'. In this format, the documents themselves are a field in the CSV: the values of that field are the actual text of the corpus, and the other columns again correspond to document metadata. In the installation directory, '/examples/brothersk_with_content.csv' displays an example of this import format. This file is also shown below and can be imported without modification. Note that you'll be prompted to select the field containing the content, so it need not be the first column in the CSV.

quote,book,chapter
The man who lies to himself and listens to his own lie comes to such a pass that he cannot distinguish the truth within him or around him and so loses all respect for himself and for others. And having no respect he ceases to love and in order to occupy and distract himself without love he gives way to passions and coarse pleasures and sinks to bestiality in his vices all from continual lying to other men and to himself,2,2
All this exile to hard labor and formerly with floggings does not reform anyone and above all does not even frighten almost any criminal and the number of crimes not only does not diminish but increases all the more. Surely you will admit that. And it turns out that society thus is not protected at all for although the harmful member is mechanically cut off and sent away far out of sight another criminal appears at once to take his place perhaps even two others. If anything protects society even in our time and even reforms the criminal himself and transforms him into a different person again it is Christ's law alone which manifests itself in the acknowledgement of one's own conscience,2,5
You're a Karamazov too! In your family sensuality is carried to the point of fever. So these three sensualists are now eyeing each other with knives in their boots. The three of them are at loggerheads and maybe you're the fourth,2,7
I'm a Karamazov. . . . when I fall into the abyss I go straight into it head down and heels up and I'm even pleased that I'm falling in such a humiliating position and for me I find it beautiful. And so in that very shame I suddenly begin a hymn. Let me be cursed let me be base and vile but let me also kiss the hem of that garment in which my God is clothed; let me be following the devil at the same time but still I am also your son Lord and I love you and I feel a joy without which the world cannot stand and be,3,3
Viper will eat viper and it would serve them both right!,3,9
My friends ask gladness from God. Be glad as children as birds in the sky. And let man's sin not disturb you in your efforts do not fear that it will dampen your endeavor and keep it from being fulfilled do not say 'Sin is strong impiety is strong the bad environment is strong and we are lonely and powerless the bad environment will dampen us and keep our good endeavor from being fulfilled.' Flee from such despondency my children! There is only one salvation for you: take yourself up and make yourself responsible for all the sins of men. For indeed it is so my friend and the moment you make yourself sincerely responsible for everything and everyone you will see at once that it is really so that it is you who are guilty on behalf of all and for all. Whereas by shifting your own laziness and powerlessness onto others you will end by sharing in Satan's pride and murmuring against God,4,3
his whole heart blazed up and turned towards some kind of light and he wanted to live and live to go on and on along some path towards the new beckoning light and to hurry hurry right now at once!,4,8
Everything is permitted,4,8
Just know one thing Rakitka I may be wicked but still I gave an onion,7,3

Building an Import CSV

To import a corpus with metadata, you must construct a CSV like the two outlined in the previous section. To demonstrate how users might make appropriate import CSVs, we provide examples in R and Python. For both examples, we scrape the Brothers Karamazov corpus from the web, then write it to the working directory in a format commonly used to denote metadata. Specifically, metadata is denoted in the filename. There are two metadata fields, the book and chapter in which the selected quotation can be found. The files are named [BOOK]_[CHAPTER].txt, where [BOOK] and [CHAPTER] denote the respective values for the particular document.

You can of course construct the CSV by whatever method you find most convenient, and you need only write one CSV. Our examples, in R and Python, write a CSV for importing with filenames and another for importing from a CSV that contains the actual documents.

Constructing an Import CSV with R

library('XML')

###################################
# SCRAPE CORPUS AND WRITE TO DISK #
###################################

base.url <- 'http://www.christopherlucas.org/data'

# Get links to docs
links.page <- paste(base.url, 'brothersk', sep = '/')
page <- htmlParse(links.page)
links <- getHTMLLinks(page)

# Create the directory structure, scrape the corpus, and write it to disk.
# Two documents share the name 4_8.txt, so one is stored in a subdirectory.
dir.create('scraped_docs')
setwd('./scraped_docs')
dir.create('4_8')
for(link in links){
    doc <- htmlParse(paste(base.url, link, sep = '/'))
    write(xpathSApply(doc, '//p', xmlValue), link)
}

###############################################
# FROM CORPUS ON DISK, MAKE TXTORG IMPORT CSV #
###############################################

files <- list.files(path = getwd(), recursive = TRUE)

# Import without content
m <- c()
for(file in files){
    book <- substr(file, 1, 1)
    chapter <- substr(file, 3, 3)
    m <- rbind(m, c(paste(getwd(), file, sep = '/'), book, chapter))
}
colnames(m) <- c('filepath', 'book', 'chapter')
write.csv(m, 'txtorgCSV_without_content.csv', quote=FALSE, row.names = FALSE)

# Import with content in csv
m <- c()
for(file in files){
    book <- substr(file, 1, 1)
    chapter <- substr(file, 3, 3)
    content <- paste(readLines(file), collapse = ' ')
    m <- rbind(m, c(content, book, chapter))
}
colnames(m) <- c('content','book','chapter')
write.csv(m, 'txtorgCSV_with_content.csv', quote=FALSE, row.names = FALSE)

Constructing an Import CSV with Python

import csv
import os
import urllib.request

import lxml.html

###################################
# SCRAPE CORPUS AND WRITE TO DISK #
###################################

base_url = 'http://www.christopherlucas.org/data/'

# Get links to docs
with urllib.request.urlopen(base_url + 'brothersk') as page:
    content = page.read()

tree = lxml.html.fromstring(content)
links = [target for (element, attribute, target, pos) in tree.iterlinks()]

# Two documents share the name 4_8.txt, so one is stored in a subdirectory
os.makedirs('corpus/4_8')
os.chdir('corpus')

for link in links:
    with urllib.request.urlopen(base_url + link) as page:
        # page.read() returns bytes, so write the file in binary mode
        with open(link, 'wb') as f:
            f.write(page.read())

###############################################
# FROM CORPUS ON DISK, MAKE TXTORG IMPORT CSV #
###############################################

# Import without content
with open('txtorgCSV_without_content.csv', 'w', newline='') as f:
    names = ['filepath', 'book', 'chapter']
    dw = csv.DictWriter(f, names)
    dw.writeheader()
    for root, dirnames, filenames in os.walk('.'):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            book = filename[0]
            chapter = filename[2]
            fname = os.path.abspath(os.path.join(root, filename))
            dw.writerow({'filepath': fname, 'book': book, 'chapter': chapter})

# Import with content in csv
with open('txtorgCSV_with_content.csv', 'w', newline='') as f:
    names = ['content', 'book', 'chapter']
    dw = csv.DictWriter(f, names)
    dw.writeheader()
    for root, dirnames, filenames in os.walk('.'):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            book = filename[0]
            chapter = filename[2]
            with open(os.path.join(root, filename)) as doc:
                content = doc.read().rstrip()
            dw.writerow({'content': content, 'book': book, 'chapter': chapter})

Encodings

txtorg supports a wide range of encodings through the chardet library in Python. Those encodings can be seen in full in the encodings drop-down menu of the import documents window and are shown in the table below. In general, UTF-8 is preferable if you have a choice. If you select 'Automatically Detect Encodings', txtorg will guess the encoding of your corpus, with some error. Note that the corpus must be in a single encoding; txtorg will fail if you ask it to detect the encodings of a corpus containing multiple encodings. Because encodings cannot be detected perfectly, it is better to designate the encoding explicitly whenever possible.
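If you want to sanity-check a file's encoding yourself before importing, a rough standard-library-only sketch follows (the candidate list and function name are our own, not txtorg's detection logic, which uses chardet):

```python
def guess_encoding(path, candidates=('utf-8', 'latin-1')):
    """Return the first candidate encoding that decodes the whole file
    without error, or None if none do. Order matters: latin-1 accepts
    any byte sequence, so it must come last."""
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

A file that passes as UTF-8 is safe to import with the UTF-8 setting; a file that only passes as latin-1 should be converted or imported with its actual encoding designated.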

Supported Encodings

Dictionary Replacement

Often, corpora contain terms that you may wish to replace with others, either for cleaning purposes or because you'd like to index a phrase. For example, an investigator may be interested in occurrences of the phrase "foreign aid", rather than simply the terms "foreign" and "aid". Because txtorg only indexes unigrams, to include n-grams (n > 1) in the final TDM, an investigator should combine the n-grams of interest into a single token. For example, "foreign aid" becomes "foreignaid". The investigator can then use this combined term to measure frequencies of the phrase "foreign aid".

txtorg supports this sort of preprocessing by allowing the user to upload a two-column CSV, where the first column is the string to be replaced and the second column is the term with which it is to be replaced. 'examples/replace_dict.csv' displays a simple example for use with the included Brothers Karamazov corpus. This example is displayed below; it will replace all occurrences of the term 'karamazov' with 'smerdyakov' (another character's name).

karamazov,smerdyakov

IMPORTANT: Note that the CSV must be lowercase, even if the terms that are to be replaced contain uppercase letters.

You'll note that there is a 'Simple replace' option in the window. If selected, all occurrences of the terms in the first column will be replaced with those in the second column. If not selected, txtorg will first tokenize the documents, then replace occurrences of the tokens in the first column with those in the second. Thus, if you want to replace bi- or tri-grams with a single term, select 'Simple replace'. However, this option is prone to false positives for shorter terms, so think carefully about the implications of both options.
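The tradeoff between the two modes can be seen in a small sketch of the two behaviors (our own illustration, not txtorg's code):

```python
def simple_replace(text, pairs):
    """Raw substring replacement: catches multi-word phrases, but can
    also match inside longer words (false positives)."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

def tokenized_replace(text, pairs):
    """Whole-token replacement: safe for single words, but a multi-word
    phrase never matches a single token, so it is left untouched."""
    mapping = dict(pairs)
    return ' '.join(mapping.get(tok, tok) for tok in text.split())
```

With the pair ('foreign aid', 'foreignaid'), only simple replacement works; with the pair ('aid', 'assistance'), simple replacement also corrupts "said" into "sassistance", while tokenized replacement does not.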

Custom Python Script

txtorg supports custom preprocessing scripts written in Python. You may wish to run your documents through a script of your own to remove markup or to do some other unsupported preprocessing. If so, select 'Select Python script' and choose the Python script you'd like to run on your corpus. txtorg will read the script, then call a function 'custom()' on each document before adding it to the corpus. Thus, your script must contain a function 'custom()', which must take as input a document (type str) and output a preprocessed document (also type str). All preprocessing must occur within 'custom()', so you may wish to write your preprocessing routines as separate functions, then wrap them in a function named 'custom()'. However, we leave this up to the user, as the main point of this functionality is flexibility!
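For instance, a minimal script satisfying that contract might strip markup and lowercase each document. The helper below is purely illustrative; only the 'custom()' signature (str in, str out) is required by txtorg:

```python
import re

def strip_tags(doc):
    """Remove anything that looks like an HTML/XML tag."""
    return re.sub(r'<[^>]+>', ' ', doc)

def custom(doc):
    """txtorg calls this on each document before indexing it."""
    doc = strip_tags(doc)
    doc = doc.lower()
    return ' '.join(doc.split())   # collapse runs of whitespace
```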

If you write a custom preprocessing script, please send it to us, as we'd like to provide a repository of scripts created by users.

Automated Spelling Correction

txtorg supports automated spellchecking in English. The spellchecker is quite rudimentary, so you ought not to use it if your corpus contains many proper nouns or other complicated terms. However, if you'd like to run it, simply select the corresponding radio button, and incorrectly spelled words will be replaced by their correctly spelled counterparts.

Rebuild Index File

IMPORTANT: After importing documents, you must select the corpus, then click 'Corpus' -> 'Rebuild Index File'. Also, note that in the examples above, both the second and the third import options yield the same corpus, while the first is identical except that it has no metadata.

Analyzer Selection (Specify Non-English Language)

At the core of txtorg is Apache Lucene (Cutting et al., 2013), a high-performance text search engine library. By drawing on the active open source Lucene community, txtorg is able to provide support for a diverse set of languages. txtorg currently includes support for Arabic, Bulgarian, Portuguese (separate tools for Brazil and Portugal), Catalan, Chinese, Japanese, Korean, Czech, Danish, German, Greek, English, Spanish, Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian, Italian, Latvian, Dutch, Norwegian, Romanian, Russian, Swedish, Thai, and Turkish, among others.

txtorg leverages the dedicated language-specific preprocessing utilities (stemming, segmentation, etc.) that have been created by the open source Lucene community. Thus, if you select the Czech analyzer, for instance, txtorg will automatically process your text according to best practices for Czech-language text. You can find more information about the analyzers here.

To select an analyzer, select the corpus by clicking on it, then click 'Corpus' -> 'Change Analyzer'. A window will appear, and in the left menu you may select an analyzer by clicking on it. For English text, we recommend the EnglishAnalyzer. You can observe how each analyzer tokenizes the text by typing text into the 'Sample' window, then clicking 'Tokenize'. The tokens output by the selected analyzer will appear in the window below. Once you've decided on an analyzer, click 'OK'. txtorg will then reindex the documents given the new analyzer, so you must wait a moment (several moments for large corpora). A window will appear when the reindexing is complete, after which you may search for terms.

Select Documents

To export a TDM, you must first select the documents to include. In txtorg, this is done with Lucene queries, which are documented on this page and elsewhere on the internet. txtorg supports all valid Lucene queries. Queries are entered in the search box.

Select All Docs

To select all documents, search for 'all' (without quotes).

Terms

Fields

Lucene supports fielded (meta) data. When performing a search you can either specify a field, or use the default field (the content of the document).
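With the example corpus above (metadata fields 'book' and 'chapter', as defined in the import CSVs), queries such as the following are all valid:

```
karamazov                 searches the default field (document content)
book:2                    all documents from book 2
book:4 AND chapter:8      documents from book 4, chapter 8
karamazov AND NOT book:2  a content term, restricted by metadata
```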

More Complicated Searches

txtorg supports all valid Lucene queries, and Lucene queries can be incredibly complex. We point users to the official documentation here for more information on the full range of search options.

Export TDM

After you have selected the documents you want, you can export the documents as a document-term matrix or you can export the full documents in their unprocessed form.

Filter On Term Frequency

Infrequent and extremely frequent words often provide little additional information and can be stripped from the TDM. To support this, txtorg can constrain the document-term matrix to terms within a user-specified frequency range. That is, users may restrict the document-term matrix to terms that appear at least a certain number of times but no more than a specified number of times. To do so, in the rightmost window, simply enter the lower and upper bounds in the 'at least' and 'at most' boxes. By default, these values refer to the number of documents in which a term appears.

TDM Format

Next, click 'Export TDM'. You will be asked to select one of three formats. 'Standard STM' is the Blei et al. format, also used in the stm package for R. If you intend to read the TDM into a topic modeling package, you probably want this format. 'Delimited STM' is the same, but the sparse matrix is delimited by commas rather than spaces. And 'Flat CSV file' is simply a standard CSV, which you might want for simpler applications with just a few documents, or on the rare occasions when the matrix is not sparse.

Export Full Documents

To export the full documents, simply click 'Export Files'. This can be especially helpful if you'd like to back out the full text associated with a row in the document-term matrix. For instance, you may want to use this with the plotQuote() function in the stm package.

Example