2

I know parts of this question might be simple, but I am a beginner in this and would really appreciate the simplest possible solution: I have an excel (.xlsx file) where one of its columns has its cells each have a list of lists of numbers (with the numbers space-separated and there is even a space at the end of each list). So, the column looks something like this:

ColumnHeader  
[[[9 9 9 9 9 13 ][11 11 11 11 11 11 ][11 11 11 11 11 11 ][9 9 9 9 9 9 ]  
[[[9 9 9 9 9 9 ][9 9 9 9 9 9 ]]]  
[[[9 9 9 9 ][14 14 14 14 ][13 13 13 13 ]]]  

Note how each list has a different number of lists. Also, note that each list of lists has an extra [ and ] before and after it, respectively.

What I would like to do is to ideally read the whole xlsx file in python (remember that there are other columns in the file that have just numbers), store it in a pandas dataframe, but have this column above be stored as a list of lists. So, if I later print this column, I would get something like the below (and that series if converted to a list would be a list of list of lists:

ColumnHeader  
[[9,9,9,9,9,13],[11,11,11,11,11,11],[11,11,11,11,11,11],[9,9,9,9,9,9]]  
[[9,9,9,9,9,9],[9,9,9,9,9,9]]  
[[9,9,9,9],[14,14,14,14],[13,13,13,13]]  

If I just straight forwardly read the xlsx file into a pandas dataframe, it obviously reads this column as text, which is not what I desire.

Any help on this would be highly appreciated.

Aly

Aly
  • 193
  • 2
  • 9

2 Answers2

1

I suggest that you load the incriminated column as a string and then you convert it to a nested list using this functionality. Define a function that takes a string and returns a list:

import pandas as pd
import ast
# Load some test data     
df = pd.DataFrame({'fake_list' : ['[[[9 9 9 9 9 13 ][11 11 11 11 11 11 ][11 11 11 11 11 11 ][9 9 9 9 9 9 ]]]',
                                '[[[9 9 9 9 9 9 ][9 9 9 9 9 9 ]]] ', 
                                '[[[9 9 9 9 ][14 14 14 14 ][13 13 13 13 ]]]'],
                   'a': [1,2,3],
                   'b': [4,5,6]})

def fix_list(s):
    s1 = s.strip() #strip white space at the edge of the string
    s1 = s1[1:-1]  # remove edge parenthesis 
    s1 = s1.replace(' ',',').replace('][', '],[')  # make some replacements so that it looks like a nested list
    return ast.literal_eval(s1) # transform string to a nested list

And then apply the function to the column you need to transform:

df['true_list'] = df['fake_list'].apply(fix_list)
print df.true_list[0]
# [[9, 9, 9, 9, 9, 13], [11, 11, 11, 11, 11, 11], [11, 11, 11, 11, 11, 11], [9, 9, 9, 9, 9, 9]]

Alternatively, you can transform the incriminated column while reading from excel using converters:

 df = pd.read_excel('file.xlsx', converters = {'fake_list':fix_list()} 
VinceP
  • 1,584
  • 15
  • 28
  • That was very useful, to the point, and exactly answers my question. Thanks a lot :) – Aly Sep 19 '17 at 16:34
1

You can do it without panda, just using the built-in csv lib

from csv import reader

# read csv file as a list of lists
with open('students.csv', 'r') as read_obj:
    # pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj)
    # Pass reader object to list() to get a list of lists
    list_of_rows = list(csv_reader)
    print(list_of_rows)

If you wish to exclude the first row, use .pop function

list_of_rows.pop(0)

based on: https://thispointer.com/python-read-csv-into-a-list-of-lists-or-tuples-or-dictionaries-import-csv-to-list/