I want to filter a pandas DataFrame column containing tweets (3+ million rows) by dropping the tweets that do not contain a given keyword or keywords. To do this, I'm running the following loop (sorry, I'm new to Python):

filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1 
    else:
        indicator = 0
    filter_word_indicators.append(indicator)

The idea is to then drop the tweets whose indicator equals 0. The problem is that this loop takes forever to run. I'm sure there is a better way to drop tweets that do not contain my 'filter_word', but I don't know how to code it. Any help would be great.

Dan
Tom
  • is this python 2 or 3? Also, do you have a sense of what percentage of the tweets have the word vs not ? – JacobIRR Jun 19 '19 at 22:59
  • Python 3. I anticipate that only around 1% will have the keywords I intend on filtering on. – Tom Jun 19 '19 at 23:07
  • Can you post some sample inputs and outputs. I suggest adding code to create a dataframe with say 3 fake tweets that are only a couple of words long as well as the desired result after the filtering. Don't use actual long tweets. – Dan Jun 19 '19 at 23:24

2 Answers

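Check out pandas.Series.str.contains, which returns a boolean mask you can index with. To keep only the tweets that contain the keyword (that is, to drop those that do not), use:

df[df.tweets.str.contains('filter_word')]

Prefixing the mask with ~ inverts it and selects the tweets that do not contain the keyword:

df[~df.tweets.str.contains('filter_word')]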

MWE

In [0]: df = pd.DataFrame(
            [[1, "abc"],
             [2, "bce"]],
            columns=["number", "string"]
        )    
In [1]: df
Out[1]: 
   number string
0       1    abc
1       2    bce

In [2]: df[~df.string.str.contains("ab")]
Out[2]: 
   number string
1       2    bce
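
Since the question mentions possibly filtering on several keywords, here is a minimal sketch of how the same call might be extended to that case. The sample tweets and the keyword list are made up for illustration, and the sketch assumes the column may contain missing values.

```python
import re

import pandas as pd

# Hypothetical sample data; the real frame has about 3 million rows.
df = pd.DataFrame(
    {"tweets": ["I love pandas", "Rust is fast", None, "PANDAS at the zoo", "c++ rocks"]}
)

# Hypothetical keywords; re.escape keeps characters such as "+" literal.
keywords = ["pandas", "c++"]
pattern = "|".join(re.escape(k) for k in keywords)

# case=False ignores capitalisation; na=False treats missing tweets as non-matches.
mask = df["tweets"].str.contains(pattern, case=False, na=False)

kept = df[mask]      # tweets containing at least one keyword
dropped = df[~mask]  # everything else
```

Passing na=False matters here because a mask containing missing values cannot be used for boolean indexing.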

Timing

I ran a small timing test on the following synthetic DataFrame of three million random strings, each the length of a tweet (280 characters):

import random
import string

import pandas as pd

df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)

Using the keyword abc, I compared the original solution, map + regex, and the solution proposed here (str.contains). The results were as follows.

original       99s
map + regex    21s
str.contains  2.8s
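
To repeat this kind of comparison on your own machine, something like the following sketch could be used. It is not the script that produced the numbers above: the row count is reduced to 10,000, the `timed` helper is hypothetical, the exact form of the map + regex variant is an assumption, and the original loop is left out because its cost grows quadratically with the number of rows.

```python
import random
import re
import string
import time

import pandas as pd

random.seed(0)   # fixed seed so reruns use the same strings
n_rows = 10_000  # assumption: far fewer rows than the 3,000,000 used above

df = pd.DataFrame(
    ["".join(random.choices(string.ascii_lowercase, k=280)) for _ in range(n_rows)],
    columns=["strings"],
)
keyword = "abc"


def timed(func):
    """Return func's result and the wall-clock seconds it took."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# map + regex: one Python-level regex call per row.
regex = re.compile(keyword)
mask_map, t_map = timed(lambda: df["strings"].map(lambda s: bool(regex.search(s))))

# str.contains: the pandas string method applied to the whole column.
mask_contains, t_contains = timed(lambda: df["strings"].str.contains(keyword))

print(f"map + regex   {t_map:.4f}s")
print(f"str.contains  {t_contains:.4f}s")
```

Both variants should produce the same mask; only the elapsed times differ, and those depend on hardware and pandas version.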
PidgeyUsedGust

I created the following example:

df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])

You can just create a simple function using a regular expression (more flexible when capitalization varies):

import re

def tweetsFilter(s, keyword):
    return bool(re.search(keyword, s, flags=re.IGNORECASE))  # case-insensitive, matches anywhere in s

Calling this function on each tweet yields a boolean Series marking the strings that contain the keyword. Using map may speed up your script, but you should time it on your own data:

keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]

This gives:

    Sentence
1   Why did pressing the joystick button spit out ...
2   Why tighten down in a criss-cross pattern?
9   Why are < or > required to use /dev/tcp
17  Why do all the teams that I have worked with a...
20  Why does Linux list NVMe drives as /dev/nvme0 ...
22  Why do some professors with PhDs leave their p...
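
One caveat with building the pattern from the keyword: characters such as ? or + are treated as regex syntax. If the keyword should be matched literally, a sketch along these lines may help. The `make_filter` helper is hypothetical and not part of the answer above; it also compiles the pattern once instead of once per row.

```python
import re

import pandas as pd

# A few titles from the example above, to keep the sketch small.
df = pd.DataFrame(
    {
        "Sentence": [
            "How many children?",
            "Why are < or > required to use /dev/tcp",
            "Cut the gold chain",
        ]
    }
)


def make_filter(keyword):
    """Build a case-insensitive literal-substring test (hypothetical helper)."""
    # re.escape makes "?" match a literal question mark instead of acting as a quantifier.
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    return lambda s: bool(pattern.search(s))


sel = df.Sentence.map(make_filter("CHILDREN?"))
result = df[sel]
```

Here result holds only the "How many children?" row, since the question mark is matched as an ordinary character.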
B.Gees