Questions tagged [nltk]

The Natural Language Toolkit is a Python library for computational linguistics.

The Natural Language Toolkit (NLTK) is a Python library for computational linguistics. It is currently available for Python 2.7 and 3.2+.

NLTK provides a wide range of common natural language processing tools: a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. It also bundles many common corpora, including the Brown Corpus, Reuters, and WordNet, as well as a few non-English corpora in Portuguese, Polish, and Spanish.
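
As a quick illustration of a few of these tools (this assumes the punkt and averaged_perceptron_tagger resources have already been fetched with nltk.download()):

    import nltk

    # Tokenize a sentence and tag each token with its part of speech
    tokens = nltk.word_tokenize("NLTK makes natural language processing in Python approachable.")
    print(nltk.pos_tag(tokens))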

The book Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 US licence. A citable paper, NLTK: The Natural Language Toolkit, was first published in 2003 and again in 2006, so that researchers can acknowledge NLTK's contribution to ongoing research in computational linguistics.

NLTK is currently distributed under the Apache 2.0 licence.

6577 questions
344 votes · 7 answers

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: Entropy is the sum of the probability of each label times the log probability of that same label How can I apply entropy and maximum entropy in terms of text mining? Can…
TIMEX
  • 217,272
  • 324
  • 727
  • 1,038
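
For reference, a small sketch of the entropy formula quoted in this question, computed over a list of labels (the function name and example labels are illustrative only):

    import math
    from collections import Counter

    def label_entropy(labels):
        # H = -sum over labels of p(label) * log2(p(label))
        counts = Counter(labels)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(label_entropy(["pos", "pos", "neg", "neg"]))  # 1.0 bit: maximally uncertain
    print(label_entropy(["pos", "pos", "pos", "pos"]))  # 0.0 bits: no uncertainty
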
163 votes · 16 answers

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > ********************************************************************* …
Martin
  • 1,633
  • 2
  • 12
  • 5
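
The usual cause of this LookupError is simply that the punkt model has not been downloaded yet; a minimal sketch of the common fix:

    import nltk

    # One-time download of the missing resource into the NLTK data directory
    nltk.download('punkt')

    import nltk.data
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    print(tokenizer.tokenize("Hello there. How are you today?"))
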
155 votes · 10 answers

What is the difference between lemmatization and stemming?

When do I use each ? Also...is the NLTK lemmatization dependent upon Parts of Speech? Wouldn't it be more accurate if it was?
TIMEX
  • 217,272
  • 324
  • 727
  • 1,038
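
A short sketch of the difference, and of how the WordNet lemmatizer's output depends on the part of speech it is given (assumes the wordnet corpus has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Stemming chops suffixes heuristically; lemmatization maps to a dictionary form
    print(stemmer.stem("running"))                   # run
    print(stemmer.stem("better"))                    # better (no dictionary knowledge)
    print(lemmatizer.lemmatize("better", pos="a"))   # good (adjective lemma)
    print(lemmatizer.lemmatize("running", pos="v"))  # run (verb lemma)
    print(lemmatizer.lemmatize("running"))           # running (defaults to noun, so unchanged)
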
151 votes · 15 answers

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = bigrams(string) print string_bigrams I am aware that…
Shifu
  • 1,863
  • 2
  • 14
  • 15
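
One way to generalise beyond bigrams is nltk.util.ngrams, sketched here for four-grams (assumes the punkt tokenizer is available for word_tokenize):

    from nltk import word_tokenize
    from nltk.util import ngrams

    text = "I really like python, it's pretty awesome."
    tokens = word_tokenize(text)

    # n can be anything: 4 for four-grams, 5 for five-grams, and so on
    fourgrams = list(ngrams(tokens, 4))
    print(fourgrams[:3])
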
149 votes · 8 answers

What are all possible pos tags of NLTK?

How do I find a list with all possible pos tags used by the Natural Language Toolkit (nltk)?
OrangeTux
  • 9,840
  • 7
  • 45
  • 67
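
NLTK can print its own tagset documentation; a brief sketch (requires the tagsets resource, e.g. nltk.download('tagsets')):

    import nltk

    nltk.help.upenn_tagset()        # every Penn Treebank tag with a description and examples
    nltk.help.upenn_tagset('NN.*')  # or only the tags matching a regular expression
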
148 votes · 10 answers

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary. I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task. def is_english_word(word): pass # how to I implement…
Barthelemy
  • 7,041
  • 6
  • 30
  • 35
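
Two common approaches, sketched with the words corpus and WordNet (both need a one-time nltk.download; the helper mirrors the stub in the question):

    from nltk.corpus import words, wordnet

    # Requires: nltk.download('words') and nltk.download('wordnet')
    english_vocab = set(w.lower() for w in words.words())

    def is_english_word(word):
        return word.lower() in english_vocab

    print(is_english_word("house"))        # True
    print(is_english_word("qwzjk"))        # False
    print(bool(wordnet.synsets("house")))  # True: WordNet has at least one synset for it
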
130 votes · 11 answers

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also…
lizarisk
  • 6,692
  • 9
  • 42
  • 66
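
Two common ways to do this, sketched below: tokenize on word characters only, or tokenize normally and then filter:

    from nltk.tokenize import RegexpTokenizer, word_tokenize

    text = "Hello there! This is a test, isn't it?"

    # Option 1: a regex tokenizer that keeps runs of word characters only
    tokenizer = RegexpTokenizer(r'\w+')
    print(tokenizer.tokenize(text))

    # Option 2: tokenize normally, then drop tokens that contain no letters
    print([t for t in word_tokenize(text) if any(c.isalpha() for c in t)])
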
120 votes · 14 answers

How to remove stop words using nltk or python

So I have a dataset that I would like to remove stop words from using stopwords.words('english') I'm struggling how to use this within my code to just simply take out these words. I have a list of the words from this dataset already, the part i'm…
Alex
  • 1,595
  • 5
  • 14
  • 15
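
A small sketch of filtering a tokenized text against stopwords.words('english') (assumes the stopwords and punkt resources have been downloaded):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))

    text = "This is a sample sentence showing off stop word filtration."
    filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
    print(filtered)
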
115 votes · 8 answers

How to check which versions of nltk and scikit-learn are installed?

In shell script I am checking whether this packages are installed or not, if not installed then install it. So withing shell script: import nltk echo nltk.__version__ but it stops shell script at import line in linux terminal tried to see in this…
nlper
  • 1,987
  • 5
  • 21
  • 35
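
The version checks have to run inside Python, not directly in the shell; a minimal sketch:

    import nltk
    import sklearn

    print("nltk:", nltk.__version__)
    print("scikit-learn:", sklearn.__version__)

    # From a shell script, the same check as a one-liner:
    #   python -c "import nltk, sklearn; print(nltk.__version__, sklearn.__version__)"
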
111 votes · 26 answers

pip issue installing almost any library

I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong and have opted out to easy_install to get most of what I needed done, which has generally worked. However,…
contentclown
  • 1,141
  • 2
  • 8
  • 8
107 votes · 18 answers

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent call last): File "mapper_local_v1.0.py", line 16,…
Supreeth Meka
  • 1,679
  • 2
  • 12
  • 16
99 votes · 6 answers

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the…
add-semi-colons
  • 14,928
  • 43
  • 126
  • 211
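
A sketch of the missing final step using scikit-learn, with a toy document set chosen only for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
        "Stock prices fell sharply today.",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)

    # Cosine similarity of the first document against every document (including itself)
    print(cosine_similarity(tfidf[0:1], tfidf))
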
92 votes · 18 answers

How to use Stanford Parser in NLTK using Python

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)
ThanaDaray
  • 1,573
  • 4
  • 20
  • 28
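
One way this is commonly done in newer NLTK versions is through nltk.parse.corenlp.CoreNLPParser, talking to a CoreNLP server; the URL below assumes a server already running locally on the default port:

    from nltk.parse.corenlp import CoreNLPParser

    parser = CoreNLPParser(url='http://localhost:9000')

    tree = next(parser.raw_parse("The quick brown fox jumps over the lazy dog."))
    tree.pretty_print()
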
90 votes · 7 answers

How to configure the NLTK data directory from code?

How to configure the NLTK data directory from code?
Juanjo Conti
  • 25,163
  • 37
  • 101
  • 128
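
A brief sketch: nltk.data.path is an ordinary list of search directories, and downloads can be pointed at a custom directory as well (the path below is only an example):

    import nltk

    # Make NLTK look in a custom directory first
    nltk.data.path.insert(0, '/home/me/nltk_data')

    # Optionally download resources straight into that directory
    nltk.download('punkt', download_dir='/home/me/nltk_data')
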
84 votes · 4 answers

Creating a new corpus with NLTK

I reckoned that often the answer to my title is to go and read the documentations, but I ran through the NLTK book but it doesn't give the answer. I'm kind of new to Python. I have a bunch of .txt files and I want to be able to use the corpus…
alvas
  • 94,813
  • 90
  • 365
  • 641
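
A sketch using PlaintextCorpusReader pointed at a directory of .txt files (the path is a placeholder; sents() additionally needs the punkt model):

    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpus = PlaintextCorpusReader('/path/to/txt/files', r'.*\.txt')

    print(corpus.fileids())      # the .txt files the reader found
    print(corpus.words()[:20])   # tokenized words across the corpus
    print(corpus.sents()[:2])    # sentence-segmented text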