7

Morning.

I have condensed a much larger situation down to the following:

I have one file with a dataframe with some values in it.

df = pd.DataFrame(
{'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

df:
      ben       joe
0   [dog]     [dog]
1  [fish]     [cat]
2  [fish]    [fish]
3  [bear]  [rabbit]

The type of data contained within this dataframe is as follows:

type(df.iloc[2,1]),df.iloc[2,1]
>>> (list, ['fish'])

When I save the dataframe to excel using pd.to_excel():

writer1 = pd.ExcelWriter('Input Output Test.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()

I immediately read this back in to the same file as follows:

dfi = pd.read_excel(open('Input Output Test.xlsx'), sheetname='Sheet1')

I again test the type of the data:

type(dfi.iloc[2,1]),dfi.iloc[2,1]
>>> (unicode, u"['fish']")

The data is now in unicode format. This is problematic, as, when I compare the two dataframes as follows, all results are false, due to the non matching string formats:

np.where(df['joe'] == dfi['joe'],True,False)
dfi:
        ben         joe   test
0   ['dog']     ['dog']  False
1  ['fish']     ['cat']  False
2  ['fish']    ['fish']  False
3  ['bear']  ['rabbit']  False

What is going on during the read and write process that is causing this change, and how I change it to maintain str format post save?

E: Unfortunately, the nature of my problem dictates that the dataframe must be saved, and manipulated within a different file.

Edit, in response to EdChum's comment: if I instead store these strings as strings, not lists: I still get the same error:

df = pd.DataFrame({'joe': ['dog', 'cat', 'fish', 'rabbit'], 'ben': ['dog', 'fish', 'fish', 'bear']})

    ben     joe
0   dog     dog
1  fish     cat
2  fish    fish
3  bear  rabbit

writer1 = pd.ExcelWriter('Input Output Test Joe.xlsx')
df.to_excel(writer1,'Sheet1')
writer1.save()

dfi = pd.read_excel(open('Input Output Test Joe.xlsx','rb'), sheetname='Sheet1')

type(dfi.iloc[2, 1]), dfi.iloc[2, 1]
(unicode, u'fish')

And again, the comparison fails.

Edit: Evaluation of Unicode to a regular string can also be achieved via ast.literal_eval() as described here: Convert string representation of list to list in Python or as EdChum's suggestion.

Note, If you use to_csv() and read_csv() this issue is not present.

But why does to_excel() / re_excel() change the original code?

Community
  • 1
  • 1
Chuck
  • 3,166
  • 5
  • 28
  • 65
  • not sure if what you posted is correct but storing a list as the datatype when the list only contains a single string seems fundamentally wrong, can you not just store them as normal strings whereby all the comparisons will just work – EdChum Feb 03 '17 at 11:24
  • Hi @EdChum Thanks for the feedback. You mentioned this to me in a previous question actually... unfortunately it's a drawback of the data and methods I have inherited - I must work in the format I have been given. In addition, for the above dataframe, if I remove the lists and just make them strings, I still get this error. – Chuck Feb 03 '17 at 11:27
  • I don't suppose you could switch to Python 3... – PM 2Ring Feb 03 '17 at 11:29
  • I think you can just do `df.applymap(lambda x: x[0])` to convert the dtype from a list to just plain string then you can do `np.where(df['ben'] == df['joe'],True,False)` which will just work – EdChum Feb 03 '17 at 11:30
  • does passing `encoding='ascii'` to `to_excel` work? – EdChum Feb 03 '17 at 11:34
  • What you say is true, Unfortunately, the nature of my problem dictates that the dataframe must be saved, and manipulated within a different file. – Chuck Feb 03 '17 at 11:34
  • @EdChum No, it does not. – Chuck Feb 03 '17 at 11:36
  • can you try without the `writer1 = pd.ExcelWriter('Input Output Test.xlsx')` line so just `df.to_excel(Input Output Test.xlsx,'Sheet1', encoding='ascii')` – EdChum Feb 03 '17 at 11:37
  • @EdChum Just tried - no luck. – Chuck Feb 03 '17 at 11:42
  • Sorry out of suggestions, I don't have the pre-requisite libs installed in order to try to reproduce this, best of luck – EdChum Feb 03 '17 at 11:43
  • @EdChum Thanks for your suggestions anyway! – Chuck Feb 03 '17 at 11:48
  • I can reproduce the last @EdChum example, and for me the comparison works, even thought `df` is `str` and `dfi` is unicode type. – Fabio Lamanna Feb 03 '17 at 11:59
  • @FabioLamanna The test works as it compares columns `joe` and `ben` within the **same** dataframe `df`. In my example (which is just for brevity's sake) I am trying to compare the column `joe` in `df` and column `joe` in `dfi`. `dfi` suffers from the unicode error I have described. – Chuck Feb 03 '17 at 12:11
  • @CharlesMorris actually I did compare them with your code `np.where(df['joe'] == dfi['joe'],True,False)` and it returns `array([ True, True, True, True], dtype=bool)` – Fabio Lamanna Feb 03 '17 at 12:15
  • @FabioLamanna My apologies. You are correct! The conversion does work, even across the weird Unicode format. EdChum, if you want to post that as a workaround answer, I'll accept, Thank you, as usual for your most helpful commentary. I do wonder why the original problem exists still though. – Chuck Feb 03 '17 at 12:25

1 Answers1

2

But why does to_excel() / re_excel() change the original code?

I don't know. I briefly looked at the source of to_excel from_excel, but couldn't find any clues.
Setting engine='xlsxwriter' and leaving encoding to default seems to do it, i.e.:

import pandas as pd
df = pd.DataFrame(
    {'joe': [['dog'], ['cat'], ['fish'], ['rabbit']], 'ben': [['dog'], ['fish'], ['fish'], ['bear']]})

with pd.ExcelWriter ('Input Output Test.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', engine='xlsxwriter')

dfi = pd.read_excel('Input Output Test.xlsx')

assert eval(dfi.iloc[2,1]) == df.iloc[2,1]
# True
Eric Aya
  • 68,765
  • 33
  • 165
  • 232
Pedro Lobito
  • 75,541
  • 25
  • 200
  • 222