2

Been searching for a while in order to understand how to do this basic task without any success which is very strange.

I have a dataset where some of the rows contain '-', I have no clue under which columns these values lie.

How do I search in the whole dataset (including all columns) for '-' and drop the rows containing this value?

thank you!

Fiddelush
  • 29
  • 3
  • https://stackoverflow.com/questions/56754831/drop-columns-if-rows-contain-a-specific-value-in-pandas – yulGM Nov 21 '20 at 23:24

3 Answers3

2
df = pd.DataFrame([['-', 2, '-', 0],
                   [3, 4, '-', 1],
                   ['-', '-', '-', 5],
                   ['-', 3, '-', 4]],
                  columns=list('ABCD'))

then df looks like

df = 
    A   B   C   D
0   -   2   -   0
1   3   4   -   1
2   -   -   -   5
3   -   3   -   4

you can replace all '-' with nan values with df.replace() function. you can do help(df.replace) do learn more

df = df.replace('-', np.nan)
df=
A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   NaN NaN NaN 5
3   NaN 3.0 NaN 4

to drop columns with these '-' you can drop them with df.dropna(how='all')

0

Please Try

df[~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)]

How it works

    df.apply(lambda x: x.str.contains('-'))#Gives a boolean return of where there is '-'

   df.apply(lambda x: x.str.contains('-')).sum(1)# Sums the boolean in each row. 

   df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)#tests and returns true where the value was greater than zero, meaning anywhere where there was '-' returned at least 1

   ~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)- Reverse the above, giving you rows where there was no '-'


   df[~df.apply(lambda x: x.str.contains('-')).sum(1).gt(0)]# Mask the rows where there was no'-' into a dataframe
wwnde
  • 14,189
  • 2
  • 8
  • 21
0

This is a bit more robust than wwnde's answer, as it will work if some of the columns aren't originally strings:

df.loc[~df.apply(lambda x: any('-' in str(col) for col in x), axis = 1)]

If you have data that's stored as datetime, it will display as having -, but will return an error if you check for inclusion without converting to str first. Negative numbers will also return True once converted to str. If you want different behavior, you'll have to do something more complicated, such as

df.loc[~df.apply(lambda x: any('-' in col if isinstance(col, str) else False for col in x), axis = 1)]

Acccumulation
  • 2,309
  • 1
  • 5
  • 10