This tag is for questions on the process of turning a collection of text documents into numerical feature vectors using the class CountVectorizer from Python's scikit-learn library.
Questions tagged [countvectorizer]
296 questions
16
votes
3 answers
How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?
I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.
My…
Kevin Markham
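A minimal sketch of one way to chain these steps, assuming the difficulty is that SimpleImputer returns a 2D array while CountVectorizer expects a 1D sequence of strings. The sample data, the fill value "missing", and the use of FunctionTransformer(np.ravel) to flatten the column are illustrative assumptions, not taken from the question; a one-column DataFrame selection such as df[["text"]] should behave similarly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical one-column text data with a missing value (object dtype).
X = np.array([["the sky is blue"], [np.nan], ["the sun is bright"]], dtype=object)

pipe = make_pipeline(
    # Replace missing entries with a constant string.
    SimpleImputer(strategy="constant", fill_value="missing"),
    # Flatten the (n_samples, 1) output to (n_samples,) for the vectorizer.
    FunctionTransformer(np.ravel),
    CountVectorizer(),
)

dtm = pipe.fit_transform(X)
vocabulary = pipe.named_steps["countvectorizer"].vocabulary_
```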
14
votes
2 answers
List the words in a vocabulary according to occurrence in a text corpus, with Scikit-Learn CountVectorizer
I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequency in the text corpus, in order to select stop-words. For example
'and' 123 times, 'to' 100 times, 'for' 90 times,…
user1506145
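One possible approach, sketched on a made-up corpus: sum the columns of the document-term matrix to get corpus-wide totals, then pair them with the terms in column order. The documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus for illustration.
docs = ["to be and not to be", "to do and to see", "for you and for me"]

cv = CountVectorizer()
X = cv.fit_transform(docs)

# Column sums give the total count of each term across all documents.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index, so sorting by index recovers column order.
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# Most frequent terms first.
term_counts = sorted(zip(terms, totals.tolist()), key=lambda pair: pair[1], reverse=True)
```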
12
votes
1 answer
Sklearn: adding lemmatizer to CountVectorizer
I added lemmatization to my CountVectorizer, as explained on this Sklearn page.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl =…
Rens
10
votes
2 answers
CountVectorizer does not print vocabulary
I have installed Python 2.7, numpy 1.9.0, scipy 0.15.1 and scikit-learn 0.15.2.
Now when I do the following in Python:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun,…
Archana
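The excerpt is cut off, so the exact problem is unknown. Assuming the question is about where the fitted vocabulary lives, a small sketch: after fit, the term-to-column mapping is stored in the vocabulary_ attribute (with a trailing underscore). Only train_set is reused from the excerpt.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
vectorizer.fit(train_set)

# The learned mapping of term -> column index is available only after fitting.
vocabulary = vectorizer.vocabulary_
```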
9
votes
1 answer
Empty vocabulary for single letter by CountVectorizer
I am trying to convert a string into a numeric vector:
### Clean the string
def names_to_words(names):
    print('a')
    words = re.sub("[^a-zA-Z]"," ",names).lower().split()
    print('b')
    return words
### Vectorization
def Vectorizer():
    …
LookIntoEast
8
votes
1 answer
sklearn partial fit of CountVectorizer
Does CountVectorizer support partial fit?
I would like to train the CountVectorizer using different batches of data.
Donbeo
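As far as I know, CountVectorizer has no partial_fit method, because its vocabulary is learned from the whole corpus. A common workaround, sketched here under that assumption, is the stateless HashingVectorizer, which can transform each batch independently. The batches and the n_features value are made up for illustration.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical batches of documents arriving separately.
batches = [
    ["the sky is blue", "the sun is bright"],
    ["the sun in the sky is bright"],
]

# No fitting is needed: terms are hashed directly to column indices.
# norm=None and alternate_sign=False keep raw non-negative counts.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)

parts = [hv.transform(batch) for batch in batches]
X = vstack(parts)
```

The trade-off is that hashed columns cannot be mapped back to terms, and distinct terms may collide in the same column.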
7
votes
1 answer
Scala Spark - split vector column into separate columns in a Spark DataFrame
I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the…
Logan Yang
7
votes
4 answers
Apply CountVectorizer to column with list of words in rows in Python
I wrote a preprocessing step for text analysis that removes stop words and applies stemming, like this:
test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])
train[col] =…
Yury Wallet
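When each row already holds a list of tokens, one option is to pass a callable analyzer that returns the list unchanged, so CountVectorizer skips its own preprocessing and tokenization. A minimal sketch with made-up token lists:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows that are already tokenized (and stemmed).
tokenized_docs = [["cat", "sit", "mat"], ["dog", "chase", "cat"]]

# A callable analyzer receives each raw document and returns its tokens;
# here the document is already the token list, so it is returned as-is.
cv = CountVectorizer(analyzer=lambda tokens: tokens)
X = cv.fit_transform(tokenized_docs)

terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
```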
6
votes
1 answer
Pyspark - Sum over multiple sparse vectors (CountVectorizer Output)
I have a dataset with ~30k unique documents that were flagged because they have a certain keyword in them. Some of the key fields in the dataset are document title, filesize, keyword, and excerpt (50 words around keyword). Each of these ~30k unique…
Derek Jedamski
6
votes
1 answer
How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?
Is there any way to preserve the punctuation marks !, ?, " and ' from my text documents using CountVectorizer or TfidfVectorizer parameters in scikit-learn?
Suhairi Suhaimin
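One way this might be done is by extending token_pattern so that the listed punctuation marks count as tokens in their own right. The pattern below is an illustrative assumption (the default word pattern plus a character class for the four marks), and the sample sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default word pattern, plus an alternative matching a single ! ? " or '.
pattern = r"(?u)\b\w\w+\b|[!?\"']"

cv = CountVectorizer(token_pattern=pattern)
cv.fit(['Wow! Is that "real"?', "It's fine."])

vocabulary = cv.vocabulary_
```

The same token_pattern argument is accepted by TfidfVectorizer.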
5
votes
1 answer
Encoding text in ML classifier
I am trying to build an ML model. However, I am having difficulty understanding where to apply the encoding.
Please see below the steps and functions to replicate the process I have been following.
First I split the dataset into train and test:
#…
LdM
5
votes
2 answers
CountVectorizer converts words to lower case
In my classification model, I need to maintain uppercase letters, but when I use sklearn's CountVectorizer to build the vocabulary, uppercase letters are converted to lowercase!
To exclude implicit tokenization, I built a tokenizer which just passes the text…
user_007
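CountVectorizer lowercases text by default (lowercase=True), and that step runs before tokenization, so a custom tokenizer alone does not prevent it. A short sketch with made-up sentences showing the effect of lowercase=False:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents where case carries meaning.
docs = ["Apple released the apple", "US and us"]

# Disable the default lowercasing so case variants stay distinct terms.
cv = CountVectorizer(lowercase=False)
cv.fit(docs)

vocabulary = cv.vocabulary_
```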
4
votes
1 answer
Reduce Dimension of word-vectors from TfidfVectorizer / CountVectorizer
I want to use the TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That means, I want a vector for a term where the documents are the features. That's simply the transpose of a TF-IDF…
Highchiller
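A rough sketch of one option, assuming the goal is a low-dimensional vector per term: transpose the TF-IDF matrix so terms become rows, then apply TruncatedSVD, which works on sparse input. The corpus and the choice of two components are arbitrary illustrations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus.
docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun",
]

X = TfidfVectorizer().fit_transform(docs)  # shape: (n_docs, n_terms)
term_doc = X.T                             # shape: (n_terms, n_docs)

# Reduce each term's document-space vector to 2 dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(term_doc)
```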
4
votes
2 answers
Python: CountVectorizer ignores one letter word "I"
I have a list called dictionary1. I use the following code to get sparse count matrices of texts:
cv1 = sklearn.feature_extraction.text.CountVectorizer(stop_words=None)
cv1.fit_transform(dictionary1)
I notice however that …
SAFEX
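The default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, which would explain why a one-letter word such as "I" disappears. A minimal sketch comparing the default with a pattern that also accepts single characters (the sample sentence is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I think I can"]

# Default pattern: tokens need at least two word characters.
cv_default = CountVectorizer()
cv_default.fit(docs)

# Relaxed pattern: single-character tokens are kept as well.
cv_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cv_single.fit(docs)

default_vocab = cv_default.vocabulary_
single_vocab = cv_single.vocabulary_
```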
4
votes
1 answer
Vectorize list of lists using CountVectorizer() & TfidfVectorizer()
So I have the following list of lists which is tokenized:
tokenized_list = [['ALL', 'MY', 'CATS', 'IN', 'A', 'ROW'], ['WHEN', 'MY',
                  'CAT', 'SITS', 'DOWN', ',', 'SHE', 'LOOKS', 'LIKE', 'A',
                  'FURBY', 'TOY',…
explorer_x