A corpus most commonly refers to a collection of structured text. If your question is not closely related to programming, or you are just looking for a freely available corpus for any purpose, please consider asking it on a more suitable site instead.

A corpus most commonly refers to a collection of structured text (although audio corpora, for example, exist too). A text corpus can consist of anything from the raw text of a set of newspaper articles to documents whose words are labeled with their part of speech, grammatical function, narrative function, and other annotations. A corpus may contain texts in a single language or in multiple languages.

Common Uses and Applications

Text corpora are commonly used in computational linguistics and natural language processing research. They are often annotated, or 'labeled', to identify attributes such as the topics or themes of the documents in the corpus, or the part of speech of each word. Labeled corpora are often expensive to produce because a human must manually examine and classify the contents.

A labeled corpus can serve as a training dataset for machine learning or natural language processing algorithms, for example an algorithm for classifying documents. Suppose a corpus consists of 200 newspaper articles: 50 about sports, 50 about politics, 50 about the arts, and 50 about finance. Those 200 labeled articles could be fed into an algorithm that examines them and identifies the attributes of each category, 'learning' what each of the four categories looks like. Once this learning has occurred, a new unlabeled corpus of newspaper articles could be fed into the algorithm, which would use the knowledge learned from the labeled corpus to classify each article as sports, politics, the arts, or finance.
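The train-then-classify workflow described above can be sketched in a few lines. This is only an illustration, not a recommended implementation: the eight-sentence corpus is invented for the example, the algorithm chosen here is a simple multinomial naive Bayes with add-one smoothing, and the names `LABELED`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy labeled corpus of (text, category) pairs, invented for illustration.
LABELED = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final game", "sports"),
    ("parliament passed the budget vote after debate", "politics"),
    ("the senator won the election campaign", "politics"),
    ("the gallery opened a new painting exhibition", "arts"),
    ("the orchestra performed a new symphony", "arts"),
    ("the bank raised interest rates on loans", "finance"),
    ("stock markets fell as investors sold shares", "finance"),
]


def train(labeled):
    """Count words per category ('learning' what each category looks like)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled:
        words = text.split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(LABELED)
print(classify("investors sold shares as markets fell", model))
```

A real corpus would need proper tokenization and far more than two documents per category; the point is only that the labels drive the counts, and the counts drive the prediction for unlabeled text.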

Examples of Corpora

The Brown Corpus consists of 500 samples of writing published in 1961, grouped into 15 different genres, including news reportage, editorials, religion, learned (academic) writing, and several kinds of fiction. In addition to being divided into genres, the Brown Corpus has been tagged with a notation that identifies the part of speech of every word in the corpus. Each word is followed by a '/' symbol and then its part-of-speech tag. For example, a singular noun is identified by the tag 'nn', while a possessive singular noun is identified by the tag 'nn$'.

Sample from the Brown Corpus:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj 
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./.
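As a rough sketch of how the word/tag notation in the sample can be read programmatically, the snippet below splits each token at its last '/' character. The names `SAMPLE` and `parse_tagged` are hypothetical, and the approach assumes tokens are separated by whitespace as in the excerpt above.

```python
# The excerpt shown above, copied verbatim.
SAMPLE = """The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./."""


def parse_tagged(text):
    """Turn 'word/tag' tokens into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word containing '/' stays intact.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(SAMPLE)

# Select the words tagged exactly as singular nouns ('nn').
singular_nouns = [word for word, tag in pairs if tag == "nn"]
print(singular_nouns)
```

Matching on the exact tag 'nn' deliberately skips compound tags such as 'nn-tl'; a prefix match would be needed to include those.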

WordNet is a large database of English words grouped into sets of synonyms. WordNet keeps a separate structure for nouns, verbs, adjectives, and adverbs. The noun and verb hierarchies are built from 'is a' (hypernym) relationships, where a child node has an 'is a' relationship with its parent node. Other relationships (antonyms, meronyms, etc.) are annotated, too.

Sample from WordNet via Wikipedia:

 dog, domestic dog, Canis familiaris
    => canine, canid
       => carnivore
         => placental, placental mammal, eutherian, eutherian mammal
           => mammal
             => vertebrate, craniate
               => chordate
                 => animal, animate being, beast, brute, creature, fauna
                   => ...
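The 'is a' chain in the sample can be walked with a simple parent lookup. The sketch below does not query the real WordNet database: the `IS_A` dictionary is a hand-made stand-in that copies only the first word of each level from the excerpt above, and `hypernym_chain` and `is_a_kind_of` are hypothetical helper names.

```python
# Hand-copied from the excerpt above (first word of each level only).
# This is a toy stand-in, not the real WordNet data.
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(word, is_a):
    """Follow 'is a' links upward until a word has no recorded parent."""
    chain = [word]
    while chain[-1] in is_a:
        chain.append(is_a[chain[-1]])
    return chain


def is_a_kind_of(word, ancestor, is_a):
    """True if `ancestor` appears somewhere above `word` in the chain."""
    return ancestor in hypernym_chain(word, is_a)[1:]


print(" => ".join(hypernym_chain("dog", IS_A)))
print(is_a_kind_of("dog", "mammal", IS_A))
```

In the real database a word can belong to several synonym sets and a set can have more than one parent, so a single parent per word is a simplification made for this example.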
