Questions tagged [corpus]

A corpus most commonly refers to a collection of structured text. If your question is not closely related to programming, or you are just looking for a freely available corpus, please consider asking it on https://opendata.stackexchange.com instead.

A corpus most commonly refers to a collection of structured text (although audio corpora, for example, also exist). Text corpora can consist of anything from the raw text of newspaper articles to documents whose words are labelled with their part of speech, grammatical function, narrative function, and a number of other annotations. A corpus may contain texts in a single language, or it may contain texts written in multiple languages.

Common Uses and Applications

Text corpora are commonly used in computational linguistics and natural language processing research. Often they are annotated or 'labelled' to identify various attributes, such as the topics or themes of the documents in the corpus, or the part of speech of each word. Labelled corpora are often expensive to produce because they require humans to examine and annotate the texts manually.

A labelled corpus can be used as a training dataset for various machine learning or natural language processing algorithms, for example an algorithm for classifying documents. Such a corpus could consist of 200 newspaper articles, 50 of which are about sports, 50 about politics, 50 about the arts, and 50 about finance. Those 200 labelled articles could be fed into an algorithm that examines them and identifies the attributes of each category, 'learning' what each of the four categories looks like. Once this learning has occurred, a new unlabelled corpus of newspaper articles could be fed into the algorithm, which, based on what it learned from the labelled corpus, could then classify each article as falling under one of the four categories: sports, politics, the arts, or finance.
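The train-then-classify workflow described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a recommended implementation: the eight toy "articles" are invented, the names `train` and `classify` are hypothetical, and multinomial naive Bayes with add-one smoothing is just one of many algorithms that could fill the role of "some algorithm" in the description.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy labelled corpus: (text, category) pairs invented for illustration.
LABELLED = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final", "sports"),
    ("parliament passed the new election law", "politics"),
    ("the senator announced her campaign for president", "politics"),
    ("the gallery opened an exhibition of modern painting", "arts"),
    ("the orchestra performed a new symphony", "arts"),
    ("the bank raised interest rates again", "finance"),
    ("shares fell as investors sold stock", "finance"),
]

def train(labelled):
    """Count words per category, documents per category, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labelled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(LABELLED)
print(classify("the striker scored a goal", model))
```

With a real corpus the whitespace tokenizer would normally be replaced by a proper one, and part of the labelled data would be held out to measure accuracy.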

Examples of Corpora

The Brown Corpus consists of 500 samples of writing published in 1961, grouped into 15 genres including news reportage, editorials, religion, and fiction. In addition to being divided into genres, the Brown Corpus has been tagged with a notation that identifies the part of speech of every word in the corpus. Each word is followed by a '/' symbol and then its part-of-speech tag. For example, a singular noun is identified by the tag 'nn', while a possessive singular noun is identified by the tag 'nn$'.

Sample from the Brown Corpus:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj 
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./.
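As a rough sketch of how the word/tag notation in this sample can be read programmatically, the snippet below splits each token on its last '/' and counts the tags. It uses only the Python standard library; `parse_tagged` is a hypothetical helper name, and the sample string is simply the sentence shown above.

```python
from collections import Counter

SAMPLE = """The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd
Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj
primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/''
that/cs any/dti irregularities/nns took/vbd place/nn ./."""

def parse_tagged(text):
    """Turn 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a slash inside the word would survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(SAMPLE)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts.most_common(3))
```

Splitting on the last slash rather than the first is a deliberate choice here: the tag never contains a '/', while a word occasionally might.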

WordNet is a large database of English words grouped into sets of synonyms. WordNet maintains separate structured hierarchies for nouns, verbs, adjectives, and adverbs. Each hierarchy is built from 'is a' relationships, where a child node has an 'is a' relationship with its parent node. Other relationships (antonyms, hypernyms, etc.) are annotated, too.

Sample from WordNet via Wikipedia:

 dog, domestic dog, Canis familiaris
    => canine, canid
       => carnivore
         => placental, placental mammal, eutherian, eutherian mammal
           => mammal
             => vertebrate, craniate
               => chordate
                 => animal, animate being, beast, brute, creature, fauna
                   => ...
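To make the 'is a' chain above concrete, here is a small sketch that stores each child-to-parent link in a dictionary and walks upward from a term. This is not the WordNet API: the dictionary is hand-copied from the sample (one name per synset, stopping where the sample is elided), and `hypernym_path` and `is_a` are hypothetical helper names.

```python
# Child -> parent links hand-copied from the sample above (first name of each
# synset only). The chain stops at "animal" because the sample is elided there.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}

def hypernym_path(term, parents):
    """Follow 'is a' links upward until a term has no recorded parent."""
    path = [term]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def is_a(term, ancestor, parents):
    """True if `ancestor` appears strictly above `term` in the hierarchy."""
    return ancestor in hypernym_path(term, parents)[1:]

print(" => ".join(hypernym_path("dog", HYPERNYM)))
print(is_a("dog", "mammal", HYPERNYM))
```

A single-parent dictionary is enough for this sample; a fuller model would need to allow several parents per node.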
654 questions
84 votes, 4 answers

Creating a new corpus with NLTK

I reckoned that often the answer to my title is to go and read the documentations, but I ran through the NLTK book but it doesn't give the answer. I'm kind of new to Python. I have a bunch of .txt files and I want to be able to use the corpus…
alvas
76 votes, 4 answers

How can I change the default Mysql connection timeout when connecting through python?

I connected to a mysql database using python con = _mysql.connect('localhost', 'dell-pc', '', 'test') The program that I wrote takes a lot of time in full execution i.e. around 10 hours. Actually, I am trying to read distinct words from a corpus.…
Animesh Pandey
56 votes, 4 answers

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

My project uses the NLTK. How can I list the project's corpus & model requirements so they can be automatically installed? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, any way to freeze that same list…
Bluu
56 votes, 4 answers

DocumentTermMatrix error on Corpus argument

I have the following code: # returns string w/o leading or trailing whitespace trim <- function (x) gsub("^\\s+|\\s+$", "", x) news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings. corpus_clean <- tm_map(news_corpus,…
user1477388
45 votes, 6 answers

How to create a word cloud from a corpus in Python?

From Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud easily. Is there a similar function from python libraries that takes either a raw word textfile or NLTK corpus or Gensim…
alvas
24 votes, 3 answers

Is there any Treebank for free?

Is any place I can download Treebank of English phrases for free or less than $100? I need training data containing bunch of syntactic parsed sentences (>1000) in English in any format. Basically all I need is just words in this sentences being…
YMC
20 votes, 3 answers

How to strip headers/footers from Project Gutenberg texts?

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language learning project, but I can't seem to come up with an unsupervised, reliable approach. The best heuristic I've come up with so far is…
heartpunk
17 votes, 6 answers

R tm package vcorpus: Error in converting corpus to data frame

I am using the tm package to clean up some data using the following code: mycorpus <- Corpus(VectorSource(x)) mycorpus <- tm_map(mycorpus, removePunctuation) I then want to convert the corpus back into a data frame in order to export a text file…
lmcshane
16 votes, 6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords tm_map(abs, removeWords, stopwords("english")) Is there a way to add my own custom stop words to this list?
Brian
15 votes, 2 answers

The similar method from the nltk module produces different results on different machines. Why?

I have taught a few introductory classes to text mining with Python, and the class tried the similar method with the provided practice texts. Some students got different results for text1.similar() than others. All versions and etc. were the…
David Beales
15 votes, 1 answer

Make dataframe of top N frequent terms for multiple corpora using tm package in R

I have several TermDocumentMatrixs created with the tm package in R. I want to find the 10 most frequent terms in each set of documents to ultimately end up with an output table like: corpus1 corpus2 "beach" "city" "sand" "sidewalk" ... …
elfs
13 votes, 4 answers

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { …
user1477388
13 votes, 1 answer

Classification using movie review corpus in NLTK/Python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from…
user3128184
12 votes, 1 answer

R RKEA - Not enough training instances with class labels (required: 1, provided: 0)!

I'm trying to get RKEA to work in R Studio. Here's my current code: #Imports packages library(RKEA) library(tm) #Creates a corpus of training sentences data <- c("This is a sentence", "I am in an office", "I'm working on a…
peter337
12 votes, 3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…
Brian P