8

I want to set the value of a pandas column to a list of strings. However, my attempts didn't succeed because pandas treats the assigned value as an iterable, and I get: ValueError: Must have equal len keys and value when setting with an iterable.

Here is an MWE:

>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6

>>> df['new_col'] = None
>>> df.loc[df.col1 == 1, 'new_col'] = ['a', 'b']
ValueError: Must have equal len keys and value when setting with an iterable

I also tried setting the dtype to list with df.new_col = df.new_col.astype(list), and that didn't work either.

I am wondering what would be the correct approach here.


EDIT

The answer provided here: Python pandas insert list into a cell using at didn't work for me either.

Unni

5 Answers

10

Don't do this.

Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
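A quick way to see this cost (a minimal sketch; the size and data are arbitrary) is to compare the same values held as an int64 column versus an object column of lists:

```python
import numpy as np
import pandas as pd

# A numeric column is stored as a contiguous int64 block;
# a column of lists is stored as pointers to Python list objects.
n = 100_000
numeric = pd.Series(np.arange(n))              # dtype: int64
as_lists = pd.Series([[i] for i in range(n)])  # dtype: object

print(numeric.dtype, as_lists.dtype)  # int64 object

# Vectorised arithmetic works on the numeric series...
doubled = numeric * 2

# ...but the object series only supports slow, element-wise operations.
doubled_lists = as_lists.apply(lambda lst: [x * 2 for x in lst])

# Memory footprint (deep=True counts the Python objects themselves);
# the object series is several times larger.
print(numeric.memory_usage(deep=True))
print(as_lists.memory_usage(deep=True))
```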

See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.

That said, since you are going against the purpose and design of Pandas, you are not alone: many others have faced the same problem and asked similar questions.

jpp
  • Thank you so much for the answer. Now, I will have to go through a guilt-ridden coding session or restructure the entire thing. Tough choice! – Unni Sep 28 '18 at 23:31
  • 1
    On a side note, what is the recommended approach if one has to store an arbitrarily long sequence of values under one column? – Unni Sep 28 '18 at 23:33
  • 2
    @Unni, Pandas is probably *not* the right structure for you. The name Pandas is derived from [panel data](https://en.wikipedia.org/wiki/Panel_data). As such, it's designed for structured data stored in *arrays*. Each row in this array is indexed and can't be arbitrarily long. `list`, possibly combined with `dict` may be more appropriate. – jpp Sep 28 '18 at 23:43
  • What if your cell values are vectors/tensors? – Ark-kun Aug 17 '20 at 02:59
  • @Ark-kun, If your vectors are the same length, use a dataframe. Otherwise, use something else, like a dictionary. – jpp Aug 17 '20 at 07:51
  • I want to make sure I understand correctly: For most ML tasks I will need to split the feature tensors into "InputA_1"..."InputA_256"..."InputB_1"..."InputB_256"..."Label_1"..."Label_1000". So, I need to split my 3 N-dimensional NumPy arrays into thousand individual column arrays. Is that correct? – Ark-kun Aug 17 '20 at 08:55
  • @Ark-kun, In the case you describe, one dataframe can hold your InputA and InputB vectors (since they both have 256 entries). Another dataframe can hold the 1000-length vector. – jpp Aug 17 '20 at 09:30
  • @jpp Thank your for your patience. Features and labels also have the "row"/index dimension with size 1000000 for InputA, InputB and Labels. Maybe it's easier to understand like this: data={'InputA': np.rand(1000000, 256), 'InputB': np.rand(1000000, 256), 'Label': np.rand(1000000, 1000)}. So the Dataframe index dimension has size 1000000. – Ark-kun Aug 18 '20 at 18:15
  • Instead of storing `[a, b]` in a single dataframe cell, I created two columns for `a` and `b`. – Alfred Wallace Oct 14 '20 at 04:49
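The fixed-width workaround from the last comment (assuming every list has exactly two elements; the column names here are illustrative) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Instead of storing ['a', 'b'] in one cell, use one column per element.
df['new_col_a'] = None
df['new_col_b'] = None
df.loc[df.col1 == 1, ['new_col_a', 'new_col_b']] = ['a', 'b']
print(df)
```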
6

Not easy; one possible solution is to create a helper Series:

df.loc[df.col1 == 1, 'new_col'] = pd.Series([['a', 'b']] * len(df))
print (df)
   col1  col2 new_col
0     1     4  [a, b]
1     2     5     NaN
2     3     6     NaN

Another solution, if you need to set the missing values to empty lists too, is a list comprehension:

#df['new_col'] = [['a', 'b'] if x == 1 else np.nan for x in df['col1']]

df['new_col'] = [['a', 'b'] if x == 1 else [] for x in df['col1']]
print (df)
   col1  col2 new_col
0     1     4  [a, b]
1     2     5      []
2     3     6      []

But then you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks.
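If you accept the object-dtype trade-off anyway, one more workaround (a sketch of the approach from the `at` question linked in the EDIT; it assumes the column already exists with object dtype, which is why a bare `df.at` call on a fresh frame can fail) is to set matching cells one at a time:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Create the column first so it has object dtype; .at can then
# store an arbitrary Python object (here, a list) in a single cell.
df['new_col'] = None
for idx in df.index[df.col1 == 1]:
    df.at[idx, 'new_col'] = ['a', 'b']

print(df)
#    col1  col2 new_col
# 0     1     4  [a, b]
# 1     2     5    None
# 2     3     6    None
```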

jezrael
  • Thank you so much. That works! I would go with the first one and set the rest of them to `NaN` or `None`. I should probably consider two separate columns if I know the length of the list is bound by two all the time. Do you think updating every column at once like this would be slow on large data due to additional memory fetch? – Unni Sep 28 '18 at 23:24
0

Your answer is simple: select the column and convert it to a list:

my_list = df["col1"].tolist()



>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> df
   col1  col2
0     1     4
1     2     5
2     3     6
>>> my_list = df["col1"].tolist()
>>> my_list
[1, 2, 3]
Karn Kumar
  • I wanted to set the column value as a `list`. Not get them as a `list`. This doesn't answer the question. – Unni Sep 28 '18 at 23:25
0

You can try the code below:

import numpy as np
import pandas as pd

list1 = [1, 2, 3]
list2 = [4, 5, 6]
# Join each list into a single comma-separated string and use the strings as column names
col = [",".join(map(str, list1)), ",".join(map(str, list2))]
df = pd.DataFrame(np.random.randint(low=0, high=1, size=(5, 2)), columns=col)
print(df)

Hope this gives the expected output.

Pranay
0

You can also use np.where:

df['new_col'] = np.where(df.col1 == 1,  pd.Series([['a', 'b']]) , np.nan)
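One caveat worth knowing (my note, not part of the answer above): np.where broadcasts the length-1 Series, so every matching row ends up referencing the *same* list object, and mutating one cell mutates them all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 3]})
df['new_col'] = np.where(df.col1 == 1, pd.Series([['a', 'b']]), np.nan)

# Both matching rows point at the same list object...
print(df.at[0, 'new_col'] is df.at[1, 'new_col'])  # True

# ...so appending through one cell changes the other as well.
df.at[0, 'new_col'].append('c')
print(df.at[1, 'new_col'])  # ['a', 'b', 'c']
```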
Loochie