15

I have a 227x4 DataFrame with country names and numerical values to clean (wrangle ?).

Here's an abstraction of the DataFrame:

import pandas as pd
import random
import string
import numpy as np
pdn = pd.DataFrame(["".join([random.choice(string.ascii_letters) for i in range(3)]) for j in range (6)], columns =['Country Name'])
measures = pd.DataFrame(np.random.random_integers(10,size=(6,2)), columns=['Measure1','Measure2'])
df = pdn.merge(measures, how= 'inner', left_index=True, right_index =True)

df.iloc[4,1] = 'str'
df.iloc[1,2] = 'stuff'
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4

How do I replace string values with np.nan in all columns without touching the country names?

I tried using a boolean mask:

mask = df.loc[:,measures.columns].applymap(lambda x: isinstance(x, (int, float))).values
print(mask)

[[ True  True]
 [ True False]
 [ True  True]
 [ True  True]
 [False  True]
 [ True  True]]

# I thought the following would replace by default false with np.nan in place, but it didn't
df.loc[:,measures.columns].where(mask, inplace=True)
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4


# this give a good output, unfortunately it's missing the country names
print(df.loc[:,measures.columns].where(mask))

  Measure1 Measure2
0        6        3
1        3      NaN
2        7        2
3        7        8
4      NaN        3
5        7        4

I have looked at several questions related to mine ([1], [2], [3], [4], [5], [6], [7], [8]), but could not find one that answered my concern.

Bharath
  • 26,906
  • 5
  • 45
  • 85
Malik Koné
  • 515
  • 4
  • 15
  • "A meta-question, Is it normal that it takes me more than 3 hours to formulate a question here (including research) ?" – Yes. The success of [so] and the entire Stack Exchange network is predicated on the high quality of its content, both questions and answers. You can't throw together a high quality question in a couple of minutes. Personally, I'd put the required effort more on the order of days than hours. I certainly have spent an entire day or more on an answer, and I expect the asker to expend at least an order of magnitude more effort, since he's the one getting the benefit. – Jörg W Mittag Oct 29 '17 at 17:57
  • Side note: meta-questions should be asked on [meta]. – Jörg W Mittag Oct 29 '17 at 17:58
  • @JörgWMittag I was just counting the time put to write the question after I gave up trying on my own. If I had to count that it would be in days indeed. I'll make a question in meta when I have a few more hours in front of me. I was feeling dumb taking so much time to ask my question. But I feel better now and the quality of the answer is a proof that it was well worth the effort. Thank you! – Malik Koné Oct 29 '17 at 18:09

3 Answers3

11

Assign only columns of interest:

cols = ['Measure1','Measure2']
mask = df[cols].applymap(lambda x: isinstance(x, (int, float)))

df[cols] = df[cols].where(mask)
print (df)
  Country Name Measure1 Measure2
0          uFv        7        8
1          vCr        5      NaN
2          qPp        2        6
3          QIC       10       10
4          Suy      NaN        8
5          eFS        6        4

A meta-question, Is it normal that it takes me more than 3 hours to formulate a question here (including research) ?

In my opinion yes, create good question is really hard.

jezrael
  • 629,482
  • 62
  • 918
  • 895
  • I like you asnwer, but why `df2= df.loc[:,measures.columns].where(mask, inplace=True)` does not do the replacement ? While `df.loc[:,measures.columns].where(mask)` prints out correctly. – Malik Koné Oct 29 '17 at 14:45
  • Because `inplace` always return `None`, so `df2` is None – jezrael Oct 29 '17 at 14:47
  • I've edited the question.. I don't understand why df.loc[:,measures.columns].where(mask, inplace=True) does not modify df ? – Malik Koné Oct 29 '17 at 15:11
  • 1
    I think there is problem with assign to copy of `df`, same problem like `fillna` in [this](https://stackoverflow.com/a/43642196/2901002). If change your code to `df[measures.columns].where(mask)` get warning. – jezrael Oct 29 '17 at 15:11
9
cols = ['Measure1','Measure2']
df[cols] = df[cols].applymap(lambda x: x if not isinstance(x, str) else np.nan)

or

df[cols] = df[cols].applymap(lambda x: np.nan if isinstance(x, str) else x)

Result:

In [22]: df
Out[22]:
  Country Name  Measure1  Measure2
0          nBl      10.0       9.0
1          Ayp       8.0       NaN
2          diz       4.0       1.0
3          aad       7.0       3.0
4          JYI       NaN      10.0
5          BJO       9.0       8.0
MaxU
  • 173,524
  • 24
  • 290
  • 329
  • But why the negation `x if not isinstance(x, str)` instead of `x if isinstance(int,float) else `np.nan` ? – Malik Koné Oct 29 '17 at 15:04
  • 1
    That will replace all numbers with nan if you dont need negation then `x: np.nan if isinstance(x, str) else x` – Bharath Oct 29 '17 at 15:06
  • I don't want to replace number.. I want to replace non numbers with nan – Malik Koné Oct 29 '17 at 15:07
  • @MalikKoné, I think you want to use [Bharath shetty's solution](https://stackoverflow.com/a/47001627/5741205) – MaxU Oct 29 '17 at 15:09
  • All three answers are very interesting for me... My focus is on understanding I don't have to optimization of physical resources yet. :o) – Malik Koné Oct 29 '17 at 15:15
  • @MalikKoné, if your goal is to clean up numeric columns - i.e. replace all values to numeric dtypes, and replacing those that can't be converted to numeric dtypes to NaN's, then Bharath shetty's solution is the most idiomatic approach. If you want to replace cells of specific dtypes with NaN's then you can choose between jezrael's and mine solution... – MaxU Oct 29 '17 at 15:17
8

Use numeric with errors coerce i.e

cols = ['Measure1','Measure2']
df[cols] = df[cols].apply(pd.to_numeric,errors='coerce')
 Country Name  Measure1  Measure2
0          PuB       7.0       6.0
1          JHq       2.0       NaN
2          opE       4.0       3.0
3          pxl       3.0       6.0
4          ouP       NaN       4.0
5          qZR       4.0       6.0
leo
  • 6,927
  • 6
  • 41
  • 62
Bharath
  • 26,906
  • 5
  • 45
  • 85
  • 2
    I think we can get rid of `lambda` in this case: `df[cols] = df[cols].apply(pd.to_numeric, errors='corece')` – MaxU Oct 29 '17 at 15:15
  • @Bharathshetty, your answer is too good (if that is possible). I will indeed coerce the string to numeric values but this was not clear to me when I formulated the question. My focus was on how to use the boolean mask and why the inplace did not work. – Malik Koné Oct 29 '17 at 18:21
  • 1
    @Bharathshetty I think one should read `errors=coerce` instead of `errors=corece` – Malik Koné Oct 29 '17 at 18:29
  • That was a small typo. Sorry for that – Bharath Oct 30 '17 at 06:05