The most popular parts of our corpus collection are stored on AFS at /afs/ir/data/linguistic-data/. The corpus TA also has hard copies of every corpus in our collection and can help you find whatever you may be looking for.
LDC Corpora
Most of our corpora are provided by the Linguistic Data Consortium (LDC), and we have nearly all of the LDC corpora released since about 2000.
See the full catalog of LDC corpora
On AFS
All LDC corpora that have been uploaded are stored in the /ldc directory, in subdirectories whose names begin with the LDC catalog code. For example, you can find the Chinese PropBank corpus (LDC2005T23) at:
/afs/ir/data/linguistic-data/ldc/LDC2005T23-Chinese-PropBank-1.0
Only high-demand LDC corpora are uploaded to AFS. If you find something in the catalog that you can't find on AFS, contact the corpus TA.
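The naming convention above makes it possible to look up a corpus by its catalog code. The following is a minimal sketch, not an official tool: `find_ldc_corpus` and `AFS_LDC_ROOT` are hypothetical names, and the code assumes each uploaded corpus is a directory sitting directly under the /ldc directory with a name that starts with the LDC code, as in the example path.

```python
from pathlib import Path

# Root of the LDC uploads on AFS, taken from the example path above.
AFS_LDC_ROOT = Path("/afs/ir/data/linguistic-data/ldc")


def find_ldc_corpus(ldc_code, root=AFS_LDC_ROOT):
    """Return the directories under `root` whose names start with `ldc_code`.

    Hypothetical helper. Assumes corpora live directly under `root` and are
    named "<LDC code>-<description>". Returns an empty list when the root is
    missing (e.g. AFS is not mounted) or nothing matches.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and p.name.startswith(ldc_code)
    )


if __name__ == "__main__":
    matches = find_ldc_corpus("LDC2005T23")
    if matches:
        print(matches[0])
    else:
        print("Not found on AFS; contact the corpus TA.")
```

An empty result does not mean the corpus is missing from the collection, only that it has not been uploaded to AFS.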
On the NLP machines
A complete inventory of LDC corpora is also maintained on the NLP group’s internal machines, at:
/scr/corpora/ldc/
Non-LDC Corpora
* Corpora marked with an asterisk have access restrictions.
Read instructions for accessing corpora
Name | Annotation | Language | AFS location |
---|---|---|---|
Aleksova's corpus | | Bulgarian (spoken) | |
American Heritage Talking Dictionary (3rd edition) | | English | |
ATIS | Syntax, POS, some argument structure | English | |
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | |
British National Corpus (BNC) World Edition | | English | BNC-world |
British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English | |
Brown Corpus | Syntax, POS, some argument structure | English | Brown |
Buckeye Corpus* | POS, phones, aligned speech, speakers | American English (spoken) | BuckeyeFull |
Census 1990 Names | | English | IE/census1990names |
CHRISTINE Corpus | POS, parsed, speakers [extra annotations of spoken BNC] | English (spoken) | CHRISTINE |
CMU Pronouncing Dictionary | Phonology, stress | English | CMU-Pronouncing-Dict |
Columbia Quoted Speech Attribution Corpus | Entities, quotes | English | Columbia-Quoted-Speech-Attribution |
Cornell SMART Archive | | English | SMART-Archive |
Corpus de Français Parlé Parisien des années 2000 | Interviews of Parisians within the past decade. Audio files and transcripts are available for download. See here. | French (spoken) | |
Corpus de la parole | Corpus of spoken languages in modern-day France. Contains audio interviews, some with transcripts. See here. | French (spoken) | |
Corpus of Contemporary American English (COCA) | Word lemmas, POS, relations | American English | COCA |
Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) | |
Corpus of Historical American English (COHA) | Word lemmas, POS, relations | American English | COHA |
Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) | |
DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English | |
EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu | |
Enron Email Corpus | | English | Enron-Email-Corpus |
Excite log | | English | IR |
FrameNet Lexical Semantics Database | | English | FrameNet |
International Computer Archive of Modern and Medieval English | | English | ICAME |
International Corpus of English - British Component | (use tgrep2) | English | ICE-GB |
International Corpus of English - Singapore Component | (use tgrep2) | English | ICE-Singapore |
IViE | Prosody, phonetic, etc. | British dialects | |
John Rylands Univ Corpus of late 18c prose | | Early Modern English | Rylands18cProse |
Kristie Seymore's Information Extraction Data | | English | IE/Kristie-Seymore-IE |
KIEL Corpus of Spontaneous Speech | Aligned recordings, phones, speakers. Also includes German lexicon | German (spoken) | KIEL-Spontaneous |
Lexique | French lexical database: orthography, phonology, morphology, syntactic category, lemma, frequency | French | Lexique |
LUCY | POS, parsed [extra annotations of written BNC] | English | LUCY |
Mooney Job Data | | English | IE/Mooney-Job-Data |
MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German | MuchMore |
MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene | MULTEXT |
NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | NEGRA |
Nihon Kokugo Daijiten | | Japanese | KokugoDaijiten |
Parallel Pan American Health Corpus | Parallel Spanish-English text from The Pan American Health Organization, Conferences and General Services Division | English, Spanish | PanAmericanHealthOrg |
PARC 700 Dependency Bank | 700 dependency-parsed sentences from Wall Street Journal | English | PARC700DepsBank |
PPCME2* | diachronic corpus | | PPCME2 |
PropBank | predicate structure enriched treebank | English | Proposition-Bank-1 |
Remedia Story Comprehension* | | English | QA |
Reuters Corpus* | | English | Reuters-Corpus |
RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | |
Stanford Speed-date Corpus* | Recordings, transcripts, and speaker information for a series of speed-dates | English (spoken) | SpeedDate |
Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) | Switchboard |
Switchboard LINK Project Corpus* | Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2) | English (spoken) | Treebank/LINK-swbd |
SUSANNE Corpus, Release 5 | POS, parsed [extra annotations of Brown Corpus] | English | SUSANNE |
TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | |
TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English | TIGERCorpus |
TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English | |
Unified Medical Language System (UMLS) | | English | UMLS |
Verbmobil Dialogs | | German, English, Japanese | Verbmobil-Dialogs |
Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English | Treebank |
Wolverhampton Coreference | coreference and anaphora | English | Wolverhampton-Coreference |
WordNet | lexical information database | English | WordNet |
YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English | |
Yomiuri Shinbun | | Japanese | YomiuriShinbun |