1

I have a pandas df like this:

import pandas as pd
df = pd.DataFrame({'sales':[114,114,114,113,12,10,8500,8666]})

enter image description here

It is cumulative data from sales. There is a bug in the data : Some observations are wrong. (No value can be greater than the previous one : i.e., 8500 and 8666 are not correct) I want to replace these values with the previous one.

I tried this

df['above'] = df.sales.shift(1)
df.loc[df.above < df.sales, 'sales'] = df.above

But this code works only one time, how to extend this code to replace all values until the end of the series (with the previous one) ?

The desired result should looks like :

enter image description here

SciPy
  • 3,241
  • 3
  • 15
  • 15

2 Answers2

3

You can replace values to NaNs by condition and then use ffill - fillna with method='ffill':

df.loc[df.above < df.sales, 'sales'] = np.nan
df['sales'] = df['sales'].ffill()
print (df)
   sales   above
0  114.0     NaN
1  114.0   114.0
2  114.0   114.0
3  113.0   114.0
4   12.0   113.0
5   10.0    12.0
6   10.0    10.0
7   10.0  8500.0

What is same as using mask, which by default replace to NaNs, last cast to int if always integer values:

df = pd.DataFrame({'sales':[114,114,114,113,12,10,8500,8666]})

df['sales'] = df['sales'].mask(df.sales.shift(1) < df.sales).ffill().astype(int)
print (df)
   sales
0    114
1    114
2    114
3    113
4     12
5     10
6     10
7     10

EDIT:

df = pd.DataFrame({'sales':[114,114,114,113,12,10,8500,8400]})

df['sales'] = df['sales'].mask((df.sales.shift(1) < df.sales).cumsum() > 0).ffill().astype(int)
print (df)
   sales
0    114
1    114
2    114
3    113
4     12
5     10
6     10
7     10

Detail:

print ((df.sales.shift(1) < df.sales).cumsum())
0    0
1    0
2    0
3    0
4    0
5    0
6    1
7    1
Name: sales, dtype: int32

print ((df.sales.shift(1) < df.sales).cumsum() > 0)
0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
Name: sales, dtype: bool
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • Thanks but it doesn't work when a value is lower than the value above : for example with this df = pd.DataFrame({'sales':[114,114,114,113,12,10,8500,8400]}) – SciPy Dec 04 '17 at 09:26
  • So I think if `cumulative data` it is not possible. Please give me some time. – jezrael Dec 04 '17 at 09:30
  • One question, do you need replace from first `True`s of mask to end of DataFrame? – jezrael Dec 04 '17 at 09:33
  • yes ............Because the bug appears only at the end of the series (but the values are weird.) – SciPy Dec 04 '17 at 09:35
0

How about this?

df.loc[df['sales'] > 200] = 10

ref: https://stackoverflow.com/a/38467449/4772042

'200' is the proper value I caught. Change it to the desired value.

Cho
  • 105
  • 3
  • 11