
I am trying to reduce the noise in a large dataset that contains grammatical keywords. Is there a way to trim rows from the dataset based on a particular set of keywords?

Input: 

id1, id2, keyword, freq, gp1, gps2 
222, 111, #paris, 100, loc1, loc2 
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4

From here I need to remove rows with words like have, she, and, did: common keywords which are not useful to me. I am trying to eliminate the entire row for each such keyword, to remove the noise from the dataset for future analysis.

What would be a simple way to eliminate such rows, given a set of chosen keywords?

Appreciate the suggestions, thanks in advance!

Sitz Blogz
  • You mention two things in the comments: (a) that you are reading from a CSV file and (b) that it's really big (2 GB). This may lead to other solutions being better than a Pandas Dataframe, as this is a really simple operation on a huge data set. Are you on Windows or Unix? – Teodor May 21 '17 at 14:02
  • Windows, Anaconda 3.x with updated packages – Sitz Blogz May 21 '17 at 14:04

4 Answers


Assuming you have a dataframe df... use isin to test which rows' keyword is (or isn't) in a list or set of words, then use Boolean indexing to filter the dataframe.

list_of_words = ['she', 'have', 'did', 'and']
df[~df.keyword.isin(list_of_words)]
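
If the 2 GB CSV mentioned in the comments is too large to load in one go, the same isin filter can be applied while streaming the file in chunks. A rough sketch, assuming hypothetical file names big.csv and clean.csv, a chunk size of 100,000 rows, and that the column parses as keyword:

import pandas as pd

# Sketch only: file names, chunk size and column name are assumptions.
list_of_words = ['she', 'have', 'did', 'and']

first_chunk = True
with open('clean.csv', 'w', newline='') as fout:
    for chunk in pd.read_csv('big.csv', skipinitialspace=True, chunksize=100000):
        # Keep only rows whose keyword is not in the word list
        kept = chunk[~chunk['keyword'].isin(list_of_words)]
        kept.to_csv(fout, index=False, header=first_chunk)
        first_chunk = False

Only one chunk is held in memory at a time, so peak memory stays small regardless of the file size.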
piRSquared
  • I recommend this over my answer. Building a filter as a Numpy array may be useful for other problems. – Teodor May 21 '17 at 13:42
  • I get a small (5 - 15 %) speedup when I change `list_of_words = ['she', 'have', 'did', 'and']` to `list_of_words = set(['she', 'have', 'did', 'and'])`. Sets are optimized for fast membership lookup. – Teodor May 21 '17 at 13:57

I did something similar not too long ago. I was pleasantly surprised by how well Pandas and Numpy play together, and the resulting speed when sticking to vectorized operations.

The example below doesn't require any other files than the source file. Modify table to your needs.

from io import StringIO

import pandas as pd
import numpy as np

src = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4
"""

src_handle = StringIO(src)

blacklist_words = """
have she and did
""".split()

# Separate by comma and remove whitespace
table = pd.read_table(src_handle, sep=r",\s*", engine="python")

# You can create a single filter by straight-out comparison
filter_have = table["keyword"] == "have"

# Which you can use as a key directly
print(table[filter_have])

# We'll solve this by building the filter you need and applying it.

def filter_on_blacklisted_words(keyword, blacklist_words, dataframe):
    """Filter a Pandas dataframe by removing any rows that has column {keyword}
    in blacklist. Try to keep things vectorized for performance.
    """

    # Start with an all-False filter (nothing blacklisted yet), sized from the
    # dataframe column we're checking. Zeros/False are falsey.
    blacklist_filter = np.zeros(len(dataframe[keyword]), dtype=bool)

    for word in blacklist_words:
        blacklist_filter = np.logical_or(blacklist_filter,
                                         dataframe[keyword] == word)
    return dataframe[np.logical_not(blacklist_filter)]

print(filter_on_blacklisted_words("keyword", blacklist_words, table))
Teodor
  • This looks great. My doubt is about loading a CSV of about 2 GB and making changes in the same file. Will this be able to do it? – Sitz Blogz May 21 '17 at 13:44
  • I'd just try it. If it doesn't work, you can load the blacklist into memory and traverse the 2 GB CSV line by line. The bottleneck here will be reading the file anyway. – Teodor May 21 '17 at 13:56

Given data:

df = pd.DataFrame({
    'keyword': ['#paris', 'have', '#USA', 'she', 'mobile']
})
stopwords = set(['have', 'she', 'and', 'did'])

The following method tests whether a stopword is part of the keyword (a substring match) or not:

df = df[df['keyword'].str.contains('|'.join(stopwords)) == False]

Output:

  keyword
0  #paris
2    #USA
4  mobile

The next method tests whether a stopword matches the keyword exactly (1:1) or not:

df = df.drop(df[df['keyword'].map(lambda word: word in stopwords)].index)

Output:

  keyword
0  #paris
2    #USA
4  mobile
Darius
  • Thank you for the answer. I need to keep all the columns as well, and as mentioned I have a large dataset; I need a simple way to only remove the rows with those keywords and keep the remaining data as it is. – Sitz Blogz May 21 '17 at 14:10
  • Just add the missing columns to the given data. It works. – Darius May 21 '17 at 14:20

A new take, given the memory requirements. I'm adding this as a separate answer since the old one is still useful for small files. This one reads the input file line by line instead of loading the whole file into memory.

Save the program to filterbigcsv.py, then run it with python filterbigcsv.py big.csv clean.csv to read from big.csv and write to clean.csv. For a 1.6 GB test file, this takes a minute on my system. Memory usage is constant at 3 MB.

This script should handle any file size; you'll just have to wait longer for it to finish.

import sys


input_filename = sys.argv[1]
output_filename = sys.argv[2]


blacklist = set("""
have she and did
""".strip().split())


blacklist_column_index = 2 # Third column, zero indexed


with open(input_filename, "r") as fin, \
     open(output_filename, "w") as fout:
    for line in fin:
        if line.split(",")[blacklist_column_index].strip(", ") in blacklist:
            pass # Don't pass through
        else:
            fout.write(line) # Print line as it was, with its original line ending
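
If any field ever contains an embedded comma (the sample data has none, so this is only a precaution), a variant built on the csv module splits the columns more robustly. A sketch under that assumption; note that it rewrites rows without the space after each comma:

import csv
import sys

input_filename = sys.argv[1]
output_filename = sys.argv[2]

blacklist = set("have she and did".split())
blacklist_column_index = 2  # Third column, zero indexed

with open(input_filename, "r", newline="") as fin, \
     open(output_filename, "w", newline="") as fout:
    reader = csv.reader(fin, skipinitialspace=True)
    writer = csv.writer(fout)
    for row in reader:
        if len(row) <= blacklist_column_index:
            continue  # skip blank or short lines
        # Keep the row unless its keyword column is blacklisted
        if row[blacklist_column_index].strip() not in blacklist:
            writer.writerow(row)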
Teodor
  • Awesome!! You are a saver. Thank you so much; this is what I have been looking for. Using an IDE with files this big just freezes the screen and goes into endless processing. – Sitz Blogz May 21 '17 at 14:51
  • Glad to help :) – Teodor May 21 '17 at 14:58
  • Thank you again. If you have time, could you also look into this question? For that one I am mainly trying to reduce the dataset. http://stackoverflow.com/questions/44077739/extract-edge-and-communities-from-list-of-nodes/44080933?noredirect=1#comment75201862_44080933