Questions tagged [quanteda]

The `quanteda` package provides a fast, flexible toolset for the management, processing, and quantitative analysis of textual data in R.

quanteda

The quanteda package was written by Kenneth Benoit and Paul Nulty.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata for documents and for the collection as a whole. quanteda includes tools that make it easy and fast to manipulate the texts in a corpus, for instance by tokenizing them (with or without stopword removal or stemming) or by segmenting them into sentence or paragraph units.
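That workflow can be sketched with the current (v3) quanteda API; the texts, document names, and docvars below are purely illustrative:

```r
library(quanteda)

# a toy corpus with one document-level variable per text
corp <- corpus(c(d1 = "Textual data are fun to analyse.",
                 d2 = "quanteda manages texts as a corpus."),
               docvars = data.frame(year = c(2020, 2021)))
toks <- tokens(corp, remove_punct = TRUE)     # tokenize
toks <- tokens_remove(toks, stopwords("en"))  # optional stopword removal
toks <- tokens_wordstem(toks)                 # optional stemming
corpus_reshape(corp, to = "sentences")        # segment by sentence units
```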

quanteda is carefully designed to work with Unicode and UTF-8 encodings, and is based on the stringi package which in turn is based on the ICU libraries.

quanteda implements bootstrapping methods for texts that make it easy to resample texts from pre-defined units, facilitating the computation of confidence intervals on textual statistics using non-parametric bootstrapping applied to the original texts as data. quanteda also includes a suite of sophisticated tools for extracting features of the texts into a quantitative matrix; these features can be defined according to a dictionary or thesaurus, including the declaration of collocations to be treated as single features.

Once converted into a quantitative matrix (known as a "dfm", for document-feature matrix), the textual features can be analyzed using quantitative methods for describing, comparing, or scaling texts, or used to train machine-learning methods for class prediction.
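A minimal sketch of building a dfm from tokens (quanteda v3; texts are illustrative):

```r
library(quanteda)

toks <- tokens(c(d1 = "a b b c", d2 = "b c c d"))
dfmat <- dfm(toks)    # documents x features count matrix
topfeatures(dfmat)    # most frequent features overall
```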

473 questions
6 votes · 1 answer

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the Quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the…
J. Trimarco • 149 • 1 • 8
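Stemming is usually applied at the tokens stage, before building the dfm for topic modeling; a hedged sketch with an illustrative sentence:

```r
library(quanteda)

toks <- tokens("Running runners run the running track")
tokens_wordstem(toks, language = "en")  # Snowball stemmer via SnowballC
```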
6 votes · 1 answer

Quanteda: how to remove my own list of words

Since there is no ready implementation of stopwords for Polish in quanteda, I would like to use my own list. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines. How can I remove the…
Jacek Kotowski • 724 • 12 • 40
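One approach: read the space-separated word list into a character vector and pass it to `tokens_remove()`; the file name and example text below are assumptions:

```r
library(quanteda)

# "polish_stopwords.txt" is an assumed file of space-separated words
my_stops <- scan("polish_stopwords.txt", what = "character", quiet = TRUE)
toks <- tokens("przykładowy tekst do analizy")
tokens_remove(toks, my_stops)  # case-insensitive glob matching by default
```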
5 votes · 1 answer

tidytext, quanteda, and tm returning different tf-idf scores

I am trying to work on tf-idf weighted corpus (where I expect tf to be a proportion by document rather than simple count). I would expect the same values to be returned by all the classic text mining libraries, but I am getting different values. Is…
Radim • 378 • 1 • 10
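In quanteda the weighting scheme is explicit, which is one likely source of cross-package differences; a sketch with a toy dfm:

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a a b", d2 = "a c")))
# proportional tf and log10 idf; other packages may default to other schemes
dfm_tfidf(dfmat, scheme_tf = "prop", scheme_df = "inverse")
```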
5 votes · 1 answer

R: removal of regex from Quanteda DFM, Sparse Document-Feature Matrix, object?

Quanteda package provides the sparse document-feature matrix DFM and its methods contain removeFeatures. I have tried dfm(x, removeFeatures="\\b[a-z]{1-3}\\b") to remove too short words as well as dfm(x, keptFeatures="\\b[a-z]{4-99}\\b") to preserve…
hhh • 44,388 • 56 • 154 • 251
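In current quanteda, feature removal is done on the dfm with `dfm_remove()`/`dfm_keep()` rather than a `removeFeatures` argument; note also that the quantifier in the question should be `{1,3}` (comma), not `{1-3}`. A hedged sketch:

```r
library(quanteda)

dfmat <- dfm(tokens("an owl saw the little mouse"))
# drop features of 1-3 characters ({1,3} with a comma, not a hyphen)
dfm_remove(dfmat, "^[a-z]{1,3}$", valuetype = "regex")
# or keep by minimum feature length directly:
dfm_keep(dfmat, min_nchar = 4)
```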
5 votes · 1 answer

Feature selection in document-feature matrix by using chi-squared test

I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-square test. I know there were already a lot of people asked this…
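Chi-squared feature association is available as keyness; in quanteda v3 this function lives in the companion package quanteda.textstats. A sketch using a built-in corpus:

```r
library(quanteda)
library(quanteda.textstats)  # textstat_keyness() moved here in v3

dfmat <- dfm(tokens(data_corpus_inaugural[1:10]))
# chi-squared association of each feature with the target document(s)
head(textstat_keyness(dfmat, target = 1, measure = "chi2"))
```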
5 votes · 1 answer

what is the best way to remove non-ASCII characters from a text Corpus when using Quanteda in R?

I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+00F8. I am using Quanteda and I have imported…
Ricardo • 81 • 1 • 5
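One common approach is to clean the character vector with base R before building the corpus; `iconv()` transliteration is platform-dependent, and the text below is illustrative:

```r
# base R, applied before corpus construction
txt <- "bl\u00e5b\u00e6r \u00f8l"
iconv(txt, from = "UTF-8", to = "ASCII//TRANSLIT")  # transliterate where possible
gsub("[^\\x01-\\x7F]", "", txt, perl = TRUE)        # or drop non-ASCII outright
```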
4 votes · 2 answers

Remove digits glued to words for quanteda objects of class tokens

A related question can be found here but does not directly tackle this issue I discuss below. My goal is to remove any digits that occur with a token. For instance, I want to be able to get rid of the numbers in situations like: 13f, 408-k, 10-k,…
Francesco Grossetti • 1,206 • 6 • 14
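Tokens containing digits can be dropped with an unanchored regex; a sketch with an illustrative sentence:

```r
library(quanteda)

toks <- tokens("see filings 13f 408-k 10-k for details")
# remove every token that contains a digit (regex match is unanchored)
tokens_remove(toks, "\\d", valuetype = "regex")
```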
4 votes · 2 answers

Error in reading Chinese in txt: corpus() only works on character, corpus, Corpus, data.frame, kwic objects

I try to produce a wordcloud and obtain word frequency for a Chinese speech using R, jiebaR and corpus, but cannot make a corpus. Here is my code: library(jiebaR) library(stringr) library(corpus) cutter <- worker() v36 <- readLines('v36.txt',…
ronzenith • 105 • 8
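The error message quoted is quanteda's, suggesting a name clash between packages; one hedged sketch passes jiebaR's segmented output to quanteda via `as.tokens()` (the input file name is the asker's):

```r
library(jiebaR)
library(quanteda)

cutter <- worker()
v36 <- readLines("v36.txt", encoding = "UTF-8")   # assumed input file
toks <- as.tokens(lapply(v36, segment, jiebar = cutter))
dfmat <- dfm(toks)
```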
4 votes · 1 answer

How to do fuzzy pattern matching with quanteda and kwic?

I have texts written by doctors and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I search for in their text). Say I want to search for the word 'suicidal'. I would then use the kwic…
Joran • 43 • 3
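`kwic()` supports regex patterns, which gives a simple form of fuzzy matching; a sketch with an illustrative sentence:

```r
library(quanteda)

toks <- tokens("the patient denied feeling suicidal at intake")
# regex matching catches variants such as "suicidal", "suicide"
kwic(toks, pattern = "suicid", valuetype = "regex", window = 5)
```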
4 votes · 2 answers

Naive Bayes in Quanteda vs caret: wildly different results

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the build-in naive bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to…
JBGruber • 8,083 • 1 • 13 • 35
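quanteda's built-in classifier is `textmodel_nb()`, which in v3 lives in quanteda.textmodels; the training texts and labels below are toy data:

```r
library(quanteda)
library(quanteda.textmodels)  # textmodel_nb() moved here in v3

dfmat <- dfm(tokens(c("good great fine", "bad awful poor",
                      "nice good", "sad bad")))
y <- factor(c("pos", "neg", "pos", "neg"))
mod <- textmodel_nb(dfmat, y)   # multinomial by default
predict(mod, newdata = dfmat)
```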
4 votes · 2 answers

Convert dfmSparse from Quanteda package to Data Frame or Data Table in R

I have a dfmSparse object (large, with 2.1GB) which is tokenized and with ngrams (unigrams, bigrams, trigrams and fourgrams), and I want to convert it to a data frame or a data table object with the columns: Content and Frequency. I tried to…
Diego Gaona • 448 • 4 • 18
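For a feature/frequency table, `textstat_frequency()` avoids densifying a large sparse dfm; `convert()` produces a full data frame but can exhaust memory on a 2.1 GB object. A sketch on a toy dfm:

```r
library(quanteda)
library(quanteda.textstats)

dfmat <- dfm(tokens_ngrams(tokens("a b c a b"), n = 1:4))
textstat_frequency(dfmat)          # feature / frequency / docfreq table
convert(dfmat, to = "data.frame")  # full (dense!) document-feature table
```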
4 votes · 1 answer

How do you use a LIWC-formatted dictionary with the R package Quanteda?

As LIWC software and dictionaries are proprietary, I was pleased to see they seemed to play well with the still-in-development but excellent R package Quanteda. The documentation for the R package Quanteda demonstrates its use with a LIWC-format…
Joshua Rosenberg • 3,364 • 4 • 26 • 61
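quanteda's `dictionary()` constructor accepts the LIWC format directly; the file path and dictionary version below are illustrative (the .dic files themselves require a LIWC licence):

```r
library(quanteda)

# assumed local path to a licensed LIWC dictionary file
liwc <- dictionary(file = "LIWC2007.dic", format = "LIWC")
dfmat <- dfm(tokens("cried happy worried"))
dfm_lookup(dfmat, dictionary = liwc)   # counts per dictionary category
```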
4 votes · 2 answers

Generating all word unigrams through trigrams in R

I am trying to generate a list of all unigrams through trigrams in R to, eventually, make a document-phrase matrix with columns including all single words, bigrams, and trigrams. I expected to find an easy package for this, and have not succeeded. …
miratrix • 181 • 2 • 12
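`tokens_ngrams()` accepts a vector of n-gram lengths, so all unigrams through trigrams can be generated in one call:

```r
library(quanteda)

toks <- tokens("the quick brown fox jumps")
tokens_ngrams(toks, n = 1:3)   # unigrams, bigrams, and trigrams together
```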
3 votes · 1 answer

Merge two dataframe by rows using common words

df1 <- data.frame(freetext = c("open until monday night", "one more time to insert your coin"), numid = c(291,312)) df2 <- data.frame(freetext = c("open until night", "one time to insert your be"), aid = c(3,5)) I would line to merge the two…
foc • 907 • 1 • 9 • 25
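One plain-R sketch of such a merge scores every df1/df2 pair by the number of shared words and keeps the best match per row; the `shared` helper is illustrative, not a library function:

```r
df1 <- data.frame(freetext = c("open until monday night",
                               "one more time to insert your coin"),
                  numid = c(291, 312), stringsAsFactors = FALSE)
df2 <- data.frame(freetext = c("open until night",
                               "one time to insert your be"),
                  aid = c(3, 5), stringsAsFactors = FALSE)

# count words occurring in both texts
shared <- function(a, b)
  length(intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]]))

# for each df1 row, index of the df2 row with the most shared words
best <- sapply(df1$freetext,
               function(a) which.max(sapply(df2$freetext, shared, a = a)))
cbind(df1, df2[best, ])
```

Ties go to the first match under `which.max`; a real application would likely also lowercase and strip punctuation first.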
3 votes · 1 answer

How to initialize second glove model with solution from first?

I am trying to implement one of the solutions to the question about How to align two GloVe models in text2vec?. I don't understand what are the proper values for input at GlobalVectors$new(..., init = list(w_i, w_j). How do I ensure the values for…
Ben • 38,669 • 17 • 120 • 206