Create new schema or column names on pyspark Dataframe

Question

I saw this post and it was somewhat helpful except that I need to change the headers of a dataframe using a list, because it's long and changes with every dataset I input, so I can't really write out/ hard-code in the new column names.

Ex:

df = sqlContext.read.load("./assets/"+filename, 
                          format='com.databricks.spark.csv', 
                          header='false', 
                          inferSchema='false')
devices = df.first()
metrics = df.take(2)[1]
# Adding the two header rows together as one as a way of later searching through and sorting rows
# delimiter is "..." since it doesn't occur anywhere in the data and we don't have to wory about multiple splits
header = [str(devices[i]) +"..."+ str(metrics[i]) for i in range(len(devices))]

df2 = df.toDF(header)

Then of course I get this error:

IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (278):

The length of header = 278 and the number of columns is the same. So, the real question is, how do I do a non-hard-coded re-naming of headers in a dataframe when I have a list of the new names?

I'm suspecting I have to make the input not in the form of an actual list object, but how do I do this without iterating through each column (with a selectexpr or alias and creating several new dfs (immutable) with one new updated column at a time? (yuck)

What will be returned in `len(devices)`? – T. Gawęda Aug 31 '17 at 15:56 — T. Gawęda, Aug 31 '17 at 15:56

MaFF · Answer 1 · 2017-09-01T05:35:40.907

0

You can iterate through the old column names and give them your new column names as aliases. A good way to do this is to use function zip in python.

First let's create our column names lists:

old_cols = df.columns
new_cols = [str(d) + "..." + str(m) for d, m in zip(devices, metrics)]

Although I'm assuming "..." refers to another python object, because "..." wouldn't be a good character sequence in a column name.

Finally:

df2 = df.select([df[oc].alias(nc) for oc, nc in zip(old_cols, new_cols)])

edited Sep 01 '17 at 05:35

answered Aug 31 '17 at 16:31

MaFF

6,794
2
21
33

When I run the last part (df2 = df.select....) I get this error: "TypeError: 'Column' object is not callable". I tried converting each to a list object but it gives the same error, so it must be with the original df column? – Alexsandra Guerra Aug 31 '17 at 23:37
I'm sorry I forgot the brackets. Iterating though a list `[f(c) for c in list]` without brackets won't work. I've modified my answer – MaFF Sep 01 '17 at 05:35

score 0 · Answer 2 · answered Sep 01 '17 at 02:21

I tried a different approach. Since I wanted to simulate the hard coded list (and not actual list object), I used the exec() statement with a string created with all the linked headers.

Note: this is limitted to 255 columns. So if you want more than that, you'll have to break it up

for i in range(len(header)):
    # For the first of the column names, need to initiate the string header_str
    if i == 0:
        header_str = "'" + str(header[i])+"',"
    # For the last of the names, need a different string to close it without a comma
    elif i == len(header)-1:
        header_str = header_str + "'" + header[i] + "'"
    #For everything in the middle: just add it all together the same way
    else:
        header_str = header_str + "'" + header[i] + "',"

exec("df2 = df.toDF("+ header_str +")")

Create new schema or column names on pyspark Dataframe

2 Answers2