
Corpus tools

Locally available corpus software

Name | Description | Where | Manual
Stuttgart Corpus Workbench (CQP, XKWIC) | Regular expression searches, sorting, frequencies, subcorpora | Turing | http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
Gsearch | Tag and word searches; syntactic searches with a self-defined grammar | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-gsearch.html
CorpusSearch Lite | [v1.1] Searches corpora in the Penn Treebank format. It is not corpus-specific and will work on any corpus in the correct format. It can be used to search any of the English Parsed Corpora series. | AFS | http://www-users.york.ac.uk/~lang22/YCOE/doc/corpussearch/CSRefToc.htm
TIGERsearch | [v2.1] Searches or browses syntactically annotated and POS-tagged corpora; graphical user interface; graphical tree display | AFS | http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/manual.html
TGrep & TGrep2 | [v1 and v2] Search syntactically annotated and POS-tagged corpora (NB: corpora must be pre-indexed) | AFS | http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf
The grep family: grep, egrep, sgrep, cgrep, agrep | Non-syntactic regular expression searches of text files | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-grep.html
UNIX commands: wc, freq, cat | Counting words and lines (wc), frequency counts (freq), concatenating and printing text files (cat) | AFS |
Thorsten Brants's part-of-speech tagger (TnT) | POS tagging; preparation of corpora | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-tnt.html
Tregex | Searches syntactically annotated and POS-tagged corpora | AFS | http://nlp.stanford.edu/software/tregex/The_Wonderful_World_of_Tregex.ppt
TDTlite | A collection of scripts for extracting data from large corpora and combining the output into a comprehensive database in a format suitable for import into statistical analysis programs | AFS |
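
For readers who have not used the non-syntactic tools before, the following is a minimal Python sketch of the two operations they cover: a line-based regular expression search in the spirit of grep, and a word frequency count in the spirit of wc and freq. It is an illustration only, not how any of the listed tools is implemented. The function names `grep_lines` and `word_frequencies` and the sample sentences are made up for this example.

```python
import re
from collections import Counter

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def word_frequencies(lines):
    """Count lowercased word tokens across all lines."""
    counts = Counter()
    for line in lines:
        # Crude tokenizer (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Hypothetical three-line "corpus" used only for demonstration.
sample = [
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "Nothing happened today.",
]

hits = grep_lines(r"\bcat\b", sample)   # lines containing the word "cat"
freqs = word_frequencies(sample)        # token -> count

print(hits)
print(freqs.most_common(2))
```

On a real corpus you would read the lines from a file instead of a list. The tokenizer here is deliberately simple and will not match what a tagger or treebank tokenizer produces.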

Other software

The list below is by no means complete, so ask your colleagues, fellow students, or the Corpus TA for more leads.

Name | Description | Where | Manual
Xwaves | Sound file player; manipulation; reads annotated files | Phonetics Lab |
Praat | Sound file player; phonetic analysis; manipulation; reads/creates annotated files | Phonetics Lab | guide, tutorials

Scripts and little helpers

If you find a useful script that is not listed here, please let us know and, if possible, send along a short description of it (2-5 lines). This will help others a lot.

  • Brett Kessler's Search
    From Brett Kessler's webpage: This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: Net::FTP, LWP::Simple, and LWP::UserAgent.
  • Chris Manning's sgrep.prl script
    Based on an earlier version by Tom Veatch. Performs whole-sentence matching on newswire corpora.
  • Chris Manning's extractbody.prl script
    Extracts a particular SGML element (given stand-off annotation) and can thus be used to prefilter LDC newswire.
  • Jason Brenier's ExtractUnitAcoustics script
    On AFS at /afs/ir/data/linguistic-data/Switchboard/swbd-tools/ExtractUnitAcoustics
    Extracts acoustic features from Switchboard data. See the 00README.TXT file in that directory for instructions.
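
To make the two newswire helpers above more concrete, here is a rough Python sketch of the general ideas they are described as implementing: pulling the content of one SGML element out of a document, and then returning whole sentences that match a pattern. This is not the code of sgrep.prl or extractbody.prl. The function names, the `TEXT` element name, the sample document, and the naive sentence splitter are all assumptions made for illustration.

```python
import re

def extract_element(sgml, tag):
    """Return the text inside every <tag>...</tag> element (non-nested only)."""
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    return [body.strip() for body in pattern.findall(sgml)]

def matching_sentences(text, pattern):
    """Return the whole sentences that contain a match for the pattern."""
    # Normalize whitespace, then split naively after ., ! or ? (an assumption;
    # abbreviations such as "Mr." will be split incorrectly).
    flat = " ".join(text.split())
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    regex = re.compile(pattern)
    return [s for s in sentences if regex.search(s)]

# Hypothetical newswire-style document, invented for this example.
doc = """<DOC>
<HEADLINE>Markets rally</HEADLINE>
<TEXT>
Stocks rose sharply on Monday. Analysts were surprised.
Bond prices fell. Stocks in Asia followed.
</TEXT>
</DOC>"""

bodies = extract_element(doc, "TEXT")
result = matching_sentences(bodies[0], r"\bStocks\b")
print(result)
```

Prefiltering to the body element first keeps headline and header text out of the sentence search, which is the reason the two steps are shown together here.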