Some of the numeric strings in my DataFrames contain commas, which represent either the decimal separator or the thousands separator; others contain none. The DataFrame below is an example of the range of price formats I receive via an API, which vary depending on the currency. These are prices, and they always have 2 decimal places. I need to convert the string prices to floats so I can sum them, separate them into other DataFrames, or plot them. I have created a loop to do the replacement, but is there a quicker way to do this without the loop?

My DataFrame and working loop are as follows:

import pandas as pd

data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)

fees = []
sales = []

for items in df['amount']:
    if items[-7:-6] == ',':
        items = float(items.replace(',', '').replace(' ',''))
    # elif: after the first branch items is already a float and cannot be sliced
    elif items[-3:-2] == ',':
        items = float(items.replace(',', '.').replace(' ',''))
    items = float(items)
    if items <= 0:
        fees.append(items)
    else:
        sales.append(items)

I have attempted to do this without the loop but can't seem to work out where I have gone wrong.

df["amount"] = np.where((df['amount'][-7:-6] == ','),
                       df["amount"][-7:-6].str.replace(',', '').replace(' ',''),
                       df["amount"])

df["amount"] = np.where((df['amount'][-3:-2] == ','),
                       df["amount"][-3:-2].str.replace(',', '').replace(' ',''),
                       df["amount"])

Any help would be much appreciated. Thank you in advance

DidSquids
  • What is your expected output? – ManojK Apr 06 '20 at 15:52
  • A good question is where those erroneous number representations come from. You can try to fix it in the dataframe, but how will you be able to guess whether the string `'1,234'` is the integer value `1234` (comma as thousands separator) or the decimal `1.234` (comma as decimal separator)? – Serge Ballesta Apr 06 '20 at 15:58
  • Thanks for the comments and apologies for me not explaining where the numbers are from. These are prices and the decimals will always be 2. So I need to output the string prices into floats so I can sum them or separate them for use in other dataframes or plot them into graphs. – DidSquids Apr 06 '20 at 16:14

3 Answers

You can use lambdas instead of numpy:

lambda1 = lambda items: float(str(items).replace(',', '').replace(' ','')) if str(items)[-7:-6] == ',' else items
lambda2 = lambda items: float(str(items).replace(',', '.').replace(' ','')) if str(items)[-3:-2] == ',' else items
to_float = lambda items: float(items)

df['amount_clean'] = df["amount"].map(lambda1).map(lambda2).map(to_float) 

=========================================================================

Edit: what are lambdas

In Python, lambda functions are small anonymous functions consisting of a single expression (see https://www.w3schools.com/python/python_lambda.asp).

Example with condition:

lambda x: x + 1 if x < 0 else x

This is equivalent to:

def my_lambda_function(x):
    if x < 0:
        return x + 1
    else:
        return x

When passed to the column of a pandas dataframe via the map function, the lambda expression will be applied to the value in each row of the column.
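
As a small illustration of that last point, here is a minimal sketch applying the conditional lambda from above with `map`. The Series `s` and its values are made up for the example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical sample data, only for illustration
s = pd.Series([-3, 0, 5])

# map calls the lambda once per element; negative values get 1 added
result = s.map(lambda x: x + 1 if x < 0 else x)

print(result.tolist())  # [-2, 0, 5]
```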

Hope this helps!

sigma1510
  • Excellent, thank you so much. I have just started coding so I need to familiarise myself with lambdas so I can understand how this works. Appreciate the answer. – DidSquids Apr 06 '20 at 16:43

Since you mention that the last two digits are always the decimal places, the ',' before them needs to be replaced with '.' so the string can be converted to float. However, you also have values like 1,000.00, which would become invalid if every ',' were replaced with '.'. You can therefore use a regex to identify which commas to replace:

data = {'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']}
df = pd.DataFrame(data)
df

First, the regex matches a ',' followed by exactly two digits at the end of the string; the replace function then replaces the match with a '.' followed by the captured digits (99 from ,99).

df['amount'] = df['amount'].str.replace(r'(,)(\d{2}$)', r'.\2', regex=True)
# `\2` in `r'.\2'` refers to the second captured group of the regex;
# regex=True is needed because recent pandas versions default to regex=False

Then, to convert 1,000.00 to float, we replace the remaining ',' with an empty string:

df['amount'] = df['amount'].str.replace(',', '')

And then convert the data type to float:

df['amount'] = df['amount'].astype(float)

print(df)
    amount
0     7.99
1     6.99
2     9.99
3    -6.99
4  1000.00
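
The steps of this answer can also be chained into one expression, and the fees/sales split from the question's loop can be done with boolean masks. This is only a minimal sketch under the question's stated assumption that every price ends in exactly two decimal digits; the name `cleaned` is illustrative, and the space-stripping step mirrors the `.replace(' ', '')` in the original loop.

```python
import pandas as pd

df = pd.DataFrame({'amount': ['7,99', '6,99', '9.99', '-6,99', '1,000.00']})

# Assumption: a comma directly before the final two digits is a decimal
# separator; any other comma is a thousands separator.
cleaned = (
    df['amount']
    .str.replace(' ', '', regex=False)               # drop stray spaces
    .str.replace(r',(\d{2})$', r'.\1', regex=True)   # decimal comma -> dot
    .str.replace(',', '', regex=False)               # remove thousands commas
    .astype(float)
)

# Boolean masks take the place of the if/else inside the loop
fees = cleaned[cleaned <= 0].tolist()
sales = cleaned[cleaned > 0].tolist()
```

With the sample data, `fees` holds the single negative price and `sales` the four positive ones, so `cleaned.sum()` or plotting can work directly on floats.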
ManojK
  • Excellent, thank you so much for the answer, really appreciate it. I'm new to this so I need to learn about regex. – DidSquids Apr 06 '20 at 16:57
  • [This](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) can help a lot – ManojK Apr 06 '20 at 17:00

Try using split and join,

df.amount.str.split(',').str.join('').astype(float)

Output

0     799.00
1     699.00
2       9.99
3    -699.00
4    1000.00
Name: amount, dtype: float64
Vishnudev
  • Thank you for your help. My problem is that not all the commas are decimals. The prices in Euros normally have commas as decimals, but sometimes the thousands are marked with a comma too in other currencies. All prices have 2 decimals so the last 2 digits are always decimals. – DidSquids Apr 06 '20 at 17:35
  • Okay. It seems very specific to your case. – Vishnudev Apr 06 '20 at 17:46