Questions tagged [corpus]

A corpus most commonly refers to a collection of structured text. If your question is not closely related to programming, or you are just looking for a freely available corpus for any purpose, please consider asking it on https://opendata.stackexchange.com instead.

A corpus most commonly refers to a collection of structured text (although audio corpora, for example, exist too). Text corpora can consist of anything from the raw text of newspaper articles to documents whose words are labeled with their part of speech, grammatical function, narrative function, and a number of other annotations. A corpus may contain texts in a single language, or it may contain texts written in multiple languages.

Common Uses and Applications

Text corpora are commonly used in computational linguistics and natural language processing research. Often they are annotated or 'labeled' to identify various attributes, such as the topics or themes of the documents in the corpus, or the part of speech of each word. Labeled corpora are often expensive to produce because they require a human to manually examine and classify the corpus.

A labeled corpus could be used as a training dataset for various machine learning or natural language processing algorithms. For example, a labeled corpus could be used to train an algorithm for classifying documents. A corpus could consist of 200 newspaper articles, 50 of which are about sports, 50 about politics, 50 about the arts, and 50 about finance. Those 200 labeled newspaper articles could be fed into an algorithm that examines the articles and identifies the attributes of each category, 'learning' what each of the four categories looks like. Once this learning has occurred, a new unlabeled corpus of newspaper articles could be fed into the algorithm, which, based on the knowledge learned from the labeled corpus, could then classify each article as falling under one of the four categories of sports, politics, the arts, or finance.
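
The train-then-classify workflow described above can be sketched with a very small multinomial Naive Bayes classifier. This is only an illustration under stated assumptions: the eight example sentences, the function names `train` and `classify`, and the choice of Naive Bayes with add-one smoothing are all hypothetical and stand in for whatever algorithm and real 200-article corpus would actually be used.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per category from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy stand-in for the labeled newspaper corpus.
labelled_corpus = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final", "sports"),
    ("parliament passed the budget vote", "politics"),
    ("the senator announced her election campaign", "politics"),
    ("the gallery opened a new painting exhibition", "arts"),
    ("the orchestra performed a new symphony", "arts"),
    ("the bank raised interest rates again", "finance"),
    ("shares fell as the market closed lower", "finance"),
]

model = train(labelled_corpus)
print(classify("the goal decided the match", model))
print(classify("the bank raised rates", model))
```

With so little training data the predictions depend heavily on individual words; a real labeled corpus would be far larger and would usually be tokenized more carefully than with `str.split`.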

Examples of Corpora

The Brown Corpus consists of 500 samples of writing published in 1961, grouped into 15 different genres including sports, politics, sciences, and fiction. In addition to being divided into genres, the Brown Corpus has also been tagged with a special notation that identifies the part of speech of every word in the corpus. Each word is followed by a '/' symbol and then its part-of-speech tag. For example, a singular noun is identified by the symbol 'nn', while a possessive singular noun is identified by the symbol 'nn$'.

Sample from the Brown Corpus:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj 
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./.
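
As a rough sketch of how the word/tag notation in the sample above can be read programmatically, the snippet below splits each token on its last '/' to recover (word, tag) pairs. It uses only the Python standard library; the function name `parse_tagged` is an assumption made for this illustration and is not part of any corpus toolkit.

```python
# The sample sentence from the Brown Corpus, joined into one string.
sample = (
    "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
    "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
    "primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' "
    "that/cs any/dti irregularities/nns took/vbd place/nn ./."
)

def parse_tagged(text):
    """Split whitespace-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, in case a word itself contains '/'.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

tagged = parse_tagged(sample)

# Words tagged exactly 'nn' (singular noun); 'nn-tl' and 'nns' are excluded.
nouns = [word for word, tag in tagged if tag == "nn"]
print(nouns)
```

Filtering on the exact tag 'nn' leaves out title-case variants such as 'nn-tl' and plural nouns tagged 'nns'; matching on a tag prefix instead would be one way to include them.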

WordNet is a large database of English words grouped into sets of synonyms. WordNet consists of a separate structured hierarchy for nouns, verbs, adjectives, and adverbs. Each hierarchy is structured with 'is a' relationships, where a child node has an 'is a' relationship with its parent node. Other relationships (antonyms, meronyms, etc.) are annotated, too.

Sample from WordNet via Wikipedia:

 dog, domestic dog, Canis familiaris
    => canine, canid
       => carnivore
         => placental, placental mammal, eutherian, eutherian mammal
           => mammal
             => vertebrate, craniate
               => chordate
                 => animal, animate being, beast, brute, creature, fauna
                   => ...
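
To make the 'is a' structure concrete, the sketch below stores the chain from the sample above as a plain child-to-parent dictionary and walks it upward. It does not use the WordNet database or any WordNet library: the dictionary `parent_of`, the functions `hypernym_chain` and `is_a`, and the choice to represent each synonym set by its first word are all assumptions made for this illustration.

```python
# Child -> parent links taken from the sample; each synonym set is
# represented by its first word only. The chain stops at "animal"
# because the sample is truncated there.
parent_of = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}

def hypernym_chain(word):
    """Return the word followed by all of its ancestors, nearest first."""
    chain = [word]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain

def is_a(word, ancestor):
    """True if 'ancestor' appears somewhere above 'word' in the hierarchy."""
    return ancestor in hypernym_chain(word)[1:]

print(" => ".join(hypernym_chain("dog")))
print(is_a("dog", "mammal"))
print(is_a("mammal", "dog"))
```

Because the 'is a' relationship only points from child to parent, the check is directional: a dog is a mammal, but a mammal is not necessarily a dog.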

654 questions

4 answers

Creating a new corpus with NLTK

I know the usual answer to a question like this is to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm kind of new to Python. I have a bunch of .txt files and I want to be able to use the corpus…

asked Feb 09 '11 at 23:19

alvas

4 answers

How can I change the default Mysql connection timeout when connecting through python?

I connected to a mysql database using python con = _mysql.connect('localhost', 'dell-pc', '', 'test') The program that I wrote takes a lot of time in full execution i.e. around 10 hours. Actually, I am trying to read distinct words from a corpus.…

python mysql corpus

asked Feb 06 '13 at 10:30

Animesh Pandey

4 answers

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

My project uses the NLTK. How can I list the project's corpus & model requirements so they can be automatically installed? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, any way to freeze that same list…

installation packages nltk requirements corpus

asked Apr 30 '11 at 18:34

Bluu

4 answers

DocumentTermMatrix error on Corpus argument

I have the following code: # returns string w/o leading or trailing whitespace trim <- function (x) gsub("^\\s+|\\s+$", "", x) news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings. corpus_clean <- tm_map(news_corpus,…

r tm corpus

asked Jun 12 '14 at 18:44

user1477388

6 answers

How to create a word cloud from a corpus in Python?

In "Creating a subset of words from a corpus in R", the answerer easily converts a term-document matrix into a word cloud. Is there a similar function in Python libraries that takes either a raw text file or NLTK corpus or Gensim…

python nltk corpus gensim word-cloud

asked May 20 '13 at 08:51

alvas

3 answers

Is there any Treebank for free?

Is there any place where I can download a treebank of English phrases for free or for less than $100? I need training data containing a bunch of syntactically parsed sentences (>1000) in English, in any format. Basically all I need is just the words in these sentences being…

nlp tagging corpus

asked Jan 21 '12 at 00:09

YMC

3 answers

How to strip headers/footers from Project Gutenberg texts?

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language learning project, but I can't seem to come up with an unsupervised, reliable approach. The best heuristic I've come up with so far is…

nlp text-processing heuristics corpus stripping

asked Aug 12 '09 at 22:48

heartpunk

6 answers

R tm package vcorpus: Error in converting corpus to data frame

I am using the tm package to clean up some data using the following code: mycorpus <- Corpus(VectorSource(x)) mycorpus <- tm_map(mycorpus, removePunctuation) I then want to convert the corpus back into a data frame in order to export a text file…

r tm corpus

asked Jul 11 '14 at 18:11

lmcshane

6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords tm_map(abs, removeWords, stopwords("english")) Is there a way to add my own custom stop words to this list?

r text-mining stop-words corpus tm

asked Aug 26 '13 at 14:22

Brian

2 answers

The similar method from the nltk module produces different results on different machines. Why?

I have taught a few introductory classes to text mining with Python, and the class tried the similar method with the provided practice texts. Some students got different results for text1.similar() than others. All versions and etc. were the…

python nlp nltk similarity corpus

asked Nov 06 '15 at 02:57

David Beales

1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrixs created with the tm package in R. I want to find the 10 most frequent terms in each set of documents to ultimately end up with an output table like: corpus1 corpus2 "beach" "city" "sand" "sidewalk" ... …

r text-mining corpus tm

asked Mar 19 '13 at 17:12

elfs

4 answers

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { …

r data.table corpus term-document-matrix qdap

asked Aug 15 '14 at 16:57

user1477388

1 answer

Classification using movie review corpus in NLTK/Python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from…

python nlp nltk sentiment-analysis corpus

asked Jan 14 '14 at 06:08

user3128184

1 answer

R RKEA - Not enough training instances with class labels (required: 1, provided: 0)!

I'm trying to get RKEA to work in R Studio. Here's my current code: #Imports packages library(RKEA) library(tm) #Creates a corpus of training sentences data <- c("This is a sentence", "I am in an office", "I'm working on a…

r keyword extraction tm corpus

asked Oct 17 '17 at 14:01

peter337

3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…

r text-mining tm corpus term-document-matrix

asked Aug 28 '14 at 14:36

Brian P
