Questions tagged [countvectorizer]

This tag is for questions on the process of turning a collection of text documents into numerical feature vectors using the class CountVectorizer from Python's scikit-learn library.
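As a rough illustration of what "turning text documents into numerical feature vectors" means, here is a minimal pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform` are hypothetical, and the output is a dense list of lists rather than a sparse matrix. The only piece borrowed from the library is the regular expression, which matches CountVectorizer's documented default `token_pattern` and shows why one-letter words such as "I" or "a" are dropped by default.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's documented default
# token_pattern: a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Return one dense count vector per document."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            if term in vocabulary:  # terms unseen during fitting are ignored
                row[vocabulary[term]] = n
        rows.append(row)
    return rows


docs = ["The sky is blue.", "The sun is bright."]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
print(vocabulary)
print(matrix)
```

In scikit-learn itself, the analogous steps are `CountVectorizer().fit(docs)` followed by `.transform(docs)`, which returns a SciPy sparse matrix instead of nested lists.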

296 questions
16
votes
3 answers

How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing. My…
14
votes
2 answers

List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer

I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example 'and' 123 times, 'to' 100 times, 'for' 90 times,…
12
votes
1 answer

Sklearn: adding lemmatizer to CountVectorizer

I added lemmatization to my CountVectorizer, as explained on this Sklearn page. from nltk import word_tokenize from nltk.stem import WordNetLemmatizer class LemmaTokenizer(object): def __init__(self): self.wnl =…
Rens
10
votes
2 answers

CountVectorizer does not print vocabulary

I have installed Python 2.7, numpy 1.9.0, scipy 0.15.1 and scikit-learn 0.15.2. Now when I do the following in Python: train_set = ("The sky is blue.", "The sun is bright.") test_set = ("The sun in the sky is bright.", "We can see the shining sun,…
Archana
9
votes
1 answer

Empty vocabulary for single letter by CountVectorizer

Trying to convert string into numeric vector, ### Clean the string def names_to_words(names): print('a') words = re.sub("[^a-zA-Z]"," ",names).lower().split() print('b') return words ### Vectorization def Vectorizer(): …
LookIntoEast
8
votes
1 answer

sklearn partial fit of CountVectorizer

Does CountVectorizer support partial fit? I would like to train the CountVectorizer using different batches of data.
Donbeo
7
votes
1 answer

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, i.e. they all have the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the…
Logan Yang
7
votes
4 answers

Apply CountVectorizer to column with list of words in rows in Python

I made a preprocessing part for text analysis and after removing stopwords and stemming like this: test[col] = test[col].apply( lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words]) train[col] =…
Yury Wallet
6
votes
1 answer

Pyspark - Sum over multiple sparse vectors (CountVectorizer Output)

I have a dataset with ~30k unique documents that were flagged because they have a certain keyword in them. Some of the key fields in the dataset are document title, filesize, keyword, and excerpt (50 words around keyword). Each of these ~30k unique…
6
votes
1 answer

How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?

Is there any way to preserve the punctuation marks !, ?, " and ' from my text documents using CountVectorizer or TfidfVectorizer parameters in scikit-learn?
5
votes
1 answer

Encoding text in ML classifier

I am trying to build an ML model. However, I am having difficulty understanding where to apply the encoding. Please see below the steps and functions to replicate the process I have been following. First I split the dataset into train and test: #…
5
votes
2 answers

CountVectorizer converts words to lower case

In my classification model, I need to preserve uppercase letters, but when I use sklearn's CountVectorizer to build the vocabulary, uppercase letters are converted to lowercase! To exclude implicit tokenization, I built a tokenizer which just passes the text…
user_007
4
votes
1 answer

Reduce Dimension of word-vectors from TfidfVectorizer / CountVectorizer

I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF…
4
votes
2 answers

Python: CountVectorizer ignores one letter word "I"

I have a list called dictionary1. I use the following code to get sparse count matrices of texts: cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None) cv1.fit_transform(dictionary1) I notice however that …
SAFEX
4
votes
1 answer

Vectorize list of lists using CountVectorizer() & TfidfVectorizer()

So I have the following list of lists which is tokenized: tokenized_list = [['ALL', 'MY', 'CATS', 'IN', 'A', 'ROW'], ['WHEN', 'MY', 'CAT', 'SITS', 'DOWN', ',', 'SHE', 'LOOKS', 'LIKE', 'A', 'FURBY', 'TOY',…
explorer_x