
I have a function that replaces the column headers of a DataFrame with a new set of headers supplied as a list.

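from functools import reduce  # built in on Python 2, must be imported on Python 3

def updateHeaders(dataFrame, newHeader):
    oldColumns = dataFrame.schema.names
    # Rename the columns one at a time, pairing old and new names by position
    dfNewCol = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newHeader[idx]), range(len(oldColumns)), dataFrame)
    return dfNewCol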

I get the newHeader list from another function. The first header in the list is named Action. Later I apply a filter function that keeps only certain rows, drops the Action column, and returns a new DataFrame:

def willBeInserted(dataFrame):
    insertData = ["I"] # Some rows of 'Action' column include value "I"
    insertDF = dataFrame.filter(dataFrame.Action.isin(insertData)).drop('Action')
    return insertDF

Then I call the two functions:

DF1 = updateHeaders(someDF, headerList) #Update the headers
DF2 = willBeInserted(DF1) #Drop 'Action' column and create new DF

The result is the following error:

pyspark.sql.utils.AnalysisException: u"Reference 'Action' is ambiguous, could be: Action#29, Action#221.;"

I tried the approaches suggested in this link and in other similar questions, but the error persists. Any ideas?

desertnaut
ylcnky

1 Answer


Here is some code that renames columns with a small helper function (a plain Python function, not a Spark UDF). Hope this helps:

dataDf=spark.createDataFrame(data=[('Alice',4.300,None),('Bob',float('nan'),897)],schema=['name','High','Low'])
dataDf.show()

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice| 4.3|null|
|  Bob| NaN| 897|
+-----+----+----+


newColNames=['FistName','newHigh','newLow']

def changeColNames(df,newColNameLst):
    for field,newCol in zip(df.schema.fields,newColNameLst):
        df = df.withColumnRenamed(str(field.name), newCol)
    return df

df2=changeColNames(dataDf,newColNames)
df2.show()

+--------+-------+------+
|FistName|newHigh|newLow|
+--------+-------+------+
|   Alice|    4.3|  null|
|     Bob|    NaN|   897|
+--------+-------+------+
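
The error in the question means the DataFrame ended up with two columns that both resolve to Action. One possible cause, which is an assumption because the question does not show the contents of headerList, is a repeated name in the new header list. Note also that zip in changeColNames stops silently at the shorter input if the list lengths differ. A minimal sketch of a check to run before renaming is shown here; check_new_headers is a hypothetical helper, not part of PySpark, and it compares names case-insensitively on the assumption that Spark is running with its default spark.sql.caseSensitive=false:

```python
from collections import Counter


def check_new_headers(old_names, new_names, case_sensitive=False):
    """Return a list of (old, new) rename pairs, or raise ValueError.

    Hypothetical helper, not part of PySpark. case_sensitive=False assumes
    Spark's default setting, spark.sql.caseSensitive=false.
    """
    # A length mismatch would otherwise be truncated silently by zip().
    if len(old_names) != len(new_names):
        raise ValueError(
            "expected %d headers, got %d" % (len(old_names), len(new_names))
        )
    # Normalise the names before counting so 'Action' and 'action' collide.
    keys = [n if case_sensitive else n.lower() for n in new_names]
    duplicates = sorted(k for k, count in Counter(keys).items() if count > 1)
    if duplicates:
        raise ValueError("duplicate headers: %s" % ", ".join(duplicates))
    return list(zip(old_names, new_names))


# Usage with plain lists; with a DataFrame, pass df.columns as old_names.
pairs = check_new_headers(["name", "High", "Low"], ["FistName", "newHigh", "newLow"])
print(pairs)
```

If the check raises, the header list itself needs fixing before any rename is attempted. If it passes, the duplicate column is coming from somewhere else, for example from the way the source DataFrame was built.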
Grant Shannon