Locally available corpus software
Name | Description | Where | Manual |
---|---|---|---|
Stuttgart Corpus Workbench (CQP, XKWIC) | Regular expression searches, sorting, frequencies, subcorpora | Turing | http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ |
Gsearch | Tag and word searches; syntactic searches with self-defined grammar | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-gsearch.html |
CorpusSearch Lite [v1.1] | searches corpora in the Penn Treebank format; it is not corpus-specific and works on any corpus in the correct format, including any of the English Parsed Corpora series | AFS | http://www-users.york.ac.uk/~lang22/YCOE/doc/corpussearch/CSRefToc.htm |
TIGERsearch [v2.1] | searches or browses syntactically & POS-tagged corpora; graphical user interface; graphic tree display | AFS | http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/manual.html |
TGrep & TGrep2 [v1 and v2] | searches syntactically & POS-tagged corpora (NB: corpora must be pre-indexed) | AFS | http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf |
The grep-family: grep, egrep, sgrep, cgrep, agrep | non-syntactic regular expression searches of text-files | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-grep.html |
UNIX commands: wc, freq, cat | counting lines, words, and characters (wc); frequency counts (freq); concatenating and displaying text files (cat) | AFS | |
Thorsten Brants's part-of-speech tagger (TnT) | POS tagging; preparation of corpora | AFS | http://web.stanford.edu/dept/linguistics/corpora/cas-tut-tnt.html |
Tregex | searches syntactically & POS-tagged corpora | AFS | http://nlp.stanford.edu/software/tregex/The_Wonderful_World_of_Tregex.ppt |
TDTlite | a collection of scripts that allow you to extract data from large corpora and combine the output into a comprehensive database in a format suitable for importing into statistical analysis programs | AFS | |
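For readers unfamiliar with what a "non-syntactic regular expression search" or a frequency count involves, the following is a minimal Python sketch of both tasks on plain text. It is only an illustration and is not one of the tools listed above; the function names `grep_lines` and `word_frequencies` and the sample sentences are hypothetical, and the tokenization is deliberately crude.

```python
import re
from collections import Counter


def grep_lines(lines, pattern):
    """Return (line_number, line) pairs whose text matches the regex,
    roughly what `grep -n` reports for a text file."""
    regex = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1)
            if regex.search(line)]


def word_frequencies(lines):
    """Count lowercased word tokens; a rough stand-in for a frequency list."""
    counts = Counter()
    for line in lines:
        # Crude tokenization (assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Hypothetical sample data standing in for the lines of a corpus file.
sample = [
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "Nothing happened today.",
]

hits = grep_lines(sample, r"\bcat\b")
freqs = word_frequencies(sample)

for number, line in hits:
    print(f"{number}: {line}")
print(freqs.most_common(2))
```

On a real corpus file, the `sample` list would be replaced by the lines read from that file. Searches that depend on syntactic structure need one of the tree-aware tools in the table instead.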
Other software
The list below is by no means complete, so ask your colleagues, fellow students, or the Corpus TA for more leads.
Name | Description | Where | Manual |
---|---|---|---|
Xwaves | sound file player; manipulation; reads annotated files | Phonetics Lab | |
Praat | sound file player; phonetic analysis; manipulation; reads/creates annotated files | Phonetics Lab | guide tutorials |
Scripts and little helpers
If you find a useful script that is not listed here, please let us know and, if possible, send along a short description of the script (2-5 lines); this will help others a lot.
- Brett Kessler's Search
From Brett Kessler's webpage: This program searches text corpora for arbitrary regular expressions and produces a report in HTML format. It can read local files, or those available by HTTP or FTP, and it knows how to unpack ZIP files. It requires Perl 5, and the following network modules: Net::FTP, LWP::Simple, and LWP::UserAgent.
- Chris Manning's sgrep.prl script
Based on an earlier version by Tom Veatch. Does whole sentence matching of newswire corpus.
- Chris Manning's extractbody.prl script
Takes out a particular SGML element (given stand-off annotation), and thus can prefilter LDC newswire.
- Jason Brenier's ExtractUnitAcoustics script
On AFS at /afs/ir/data/linguistic-data/Switchboard/swbd-tools/ExtractUnitAcoustics
Extracts acoustic features from Switchboard data. See 00README.TXT file in the directory for instructions.
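To give a sense of the kind of task the first script in this list performs (a regular expression search whose results are written out as an HTML report), here is a minimal Python sketch. It is not Brett Kessler's program and does not reproduce its options or output format; the function name `html_report` and the sample lines are hypothetical, and it handles only in-memory text, not HTTP, FTP, or ZIP input.

```python
import html
import re


def html_report(lines, pattern):
    """Build a small HTML report listing every line that matches the regex,
    with each match wrapped in <b> tags."""
    regex = re.compile(pattern)
    items = []
    for number, line in enumerate(lines, start=1):
        if not regex.search(line):
            continue
        # Escape the text piece by piece so only our own <b> tags are markup.
        parts, last = [], 0
        for match in regex.finditer(line):
            if match.start() == match.end():
                continue  # skip empty matches
            parts.append(html.escape(line[last:match.start()]))
            parts.append("<b>" + html.escape(match.group(0)) + "</b>")
            last = match.end()
        parts.append(html.escape(line[last:]))
        items.append(f"<li>line {number}: {''.join(parts)}</li>")
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


# Hypothetical sample lines standing in for corpus text.
report = html_report(
    ["a < b and colour", "color is fine", "no match here"],
    r"colou?r",
)
print(report)
```

Escaping each fragment separately keeps characters such as `<` in the corpus text from being interpreted as HTML in the report.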