244

I come from a pandas background and am used to reading data from CSV files into a DataFrame and then simply changing the column names to something useful with the simple command:

df.columns = new_column_name_list

However, the same doesn't work for PySpark DataFrames created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This basically defines the DataFrame twice: it first infers the schema, then renames the columns in that schema, and finally loads the DataFrame again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

My Spark version is 1.5.0.

Shubhanshu Mishra

18 Answers

404

There are many ways to do that:

  • Option 1. Using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    
    # Output
    #+-------+----------+
    #|   Name|askdaosdka|
    #+-------+----------+
    #|Alberto|         2|
    #| Dakota|         2|
    #+-------+----------+
    
    #root
    # |-- Name: string (nullable = true)
    # |-- askdaosdka: long (nullable = true)
    
    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
    #root
    # |-- name: string (nullable = true)
    # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; notice that this method allows you to "overwrite" the same column. The example below uses range (in Python 2 you can use xrange instead).

    from functools import reduce
    
    oldColumns = data.schema.names
    newColumns = ["name", "age"]
    
    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
    df.printSchema()
    df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

    from pyspark.sql.functions import col
    
    data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    data.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

    sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
    df2.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
Taiwo O. Adetiloye
Alberto Bonsanto
  • I did it with a `for` loop + `withColumnRenamed`, but your `reduce` option is very nice :) – Felipe Gerard Nov 03 '16 at 20:35
  • @FelipeGerard that is a very bad idea – Alberto Bonsanto Nov 03 '16 at 21:17
  • Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same! – Felipe Gerard Nov 03 '16 at 21:41
  • @FelipeGerard Please check [this post](https://stackoverflow.com/questions/35066231/stack-overflow-while-processing-several-columns-with-a-udf), bad things may happen if you have many columns. – Alberto Bonsanto Nov 03 '16 at 21:48
  • That happens when you use `withColumn` many times. I used `withColumnRenamed` just like you, but iterated with a loop instead of a `reduce`. I'll bet you the DF is the exact same one if you do `for idx in range(n): data = data.withColumnRenamed(oldColumns[idx], newColumns[idx])` than what you posted :) – Felipe Gerard Nov 03 '16 at 21:52
  • @FelipeGerard I was just saying – Alberto Bonsanto Nov 03 '16 at 21:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127333/discussion-between-felipe-gerard-and-alberto-bonsanto). – Felipe Gerard Nov 03 '16 at 21:55
  • @AlbertoBonsanto How to select column as alias if there are more than 100 columns which is the best option –  Apr 02 '17 at 15:37
  • hi @Alberto Bonsanto, Thanks for your answer. Using your withColumnRenamed() answer I receive a mistake: "NameError: name 'reduce' is not defined". Sure I am missing something trivial. – NuValue May 23 '18 at 12:26
  • @NuValue, you should first run `from functools import reduce` – joaofbsm Aug 01 '18 at 05:11
  • Hi @AlbertoBonsanto thank you great answer. I have a 3K+ rows dataframe and my code is inspired from your reduce solution, however it seems to run forever, do you know which one I should use for my use case? (nb: i also have 2K+ columns) – Simon30 Jun 19 '19 at 15:23
  • I found the solution: use df.toDF(*new_cols) instead of weird reduce – Simon30 Jun 19 '19 at 15:50
  • In PySpark 2.4 with Python 3.6.8 the only method that works out of these is `df.select('id').withColumnRenamed('id', 'new_id')` and `spark.sql("SELECT id AS new_id FROM df")` – rjurney Jul 01 '19 at 03:42
  • At python 3.7 I recommend Option 3. it work flawlessly (option 1 and 2 has some issues ex.: library places moved https://stackoverflow.com/questions/8689184/nameerror-name-reduce-is-not-defined-in-python ) – sogu Aug 13 '19 at 17:36
  • I think I'll use this `df.selectExpr(*["{0} as {0}_other".format(x) for x in df.columns])` to rename all the columns at once (I'll be doing a `crossJoin` with itself) – KGS Dec 12 '19 at 13:19
  • Option 2: use `range()` instead of `xrange()` for Python 3.* – Pete Jan 28 '20 at 18:39
  • option 2 works for me. IMO option 2 is the best if you want to rename all your columns. – Frank Aug 21 '20 at 21:54
  • Option 3 is the right way to do it in spark (scala/python). Check this out: https://stackoverflow.com/a/62728542/8551891 – Krunal Patel May 18 '21 at 15:38
225
df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

Advantage of using this way: with a long list of columns you may want to change only a few column names. This can be very convenient in those scenarios. It is also very useful when joining tables with duplicate column names.
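
For example, a minimal sketch of the duplicate-column case (the DataFrames and column names here are hypothetical, and spark is assumed to be an existing SparkSession):

left = spark.createDataFrame([(1, "a")], ["id", "value"])
right = spark.createDataFrame([(1, "b")], ["id", "value"])

# Rename the ambiguous column on one side before joining,
# so the result has the columns id, value, value_right.
joined = left.join(right.withColumnRenamed("value", "value_right"), on="id")
joined.show()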

Sotos
Pankaj Kumar
  • is there a variant of this solution that leaves all other columns unchanged? with this method, and others, only the explicitly named columns remained (all others removed) – Quetzalcoatl Dec 22 '17 at 05:22
  • +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed. – mnis.p Jul 18 '18 at 05:51
  • @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names – user989762 Aug 24 '18 at 09:07
  • @user989762: agreed; my initial understanding was incorrect on this one...! – Quetzalcoatl Aug 24 '18 at 17:27
  • This is great for renaming a few columns. See my answer for a solution that can programatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run `withColumnRenamed` (running `withColumnRenamed` that many times would also be inefficient, [as explained here](https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015)). – Powers Jul 19 '20 at 22:31
  • @Powers : I checked your article but it contains a command to cast all the string types to integers. Could you plz write here how one needs to use your Medium article's code to rename, say only two columns as written in the above answer? – Sheldore Jan 31 '21 at 00:37
81

If you want to change all column names, try df.toDF(*cols).
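
For instance, a minimal sketch, assuming the new names are supplied in the same positional order as the existing columns:

new_column_name_list = ["name", "age"]   # one new name per existing column, in order
df = df.toDF(*new_column_name_list)      # returns a new DataFrame with all columns renamed positionally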

Petter Friberg
user8117731
  • this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution. – Quetzalcoatl Mar 29 '18 at 01:17
  • I think this should be selected as the best answer – HanaKaze Jan 24 '20 at 23:11
  • For me I was getting the header names from a pandas dataframe, so I just used `df = df.toDF(*my_pandas_df.columns)` – Nic Scozzaro Jun 08 '20 at 05:52
  • This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having `cols` be the new column names, and just assuming the order of names in `cols` corresponds to the column order of the dataframe? – rbatt Jun 23 '20 at 22:15
  • This is clearly the best answer if you want to pass in a list of columns – poiter Apr 26 '21 at 16:28
  • @rbatt Using `df.select` in combination with `pyspark.sql.functions col-method` is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: https://stackoverflow.com/a/62728542/8551891 – Krunal Patel May 17 '21 at 16:40
60

In case you would like to apply a simple transformation to all column names, this code does the trick (here, replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

pbahr
  • This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1 – Powers Jul 19 '20 at 22:25
18

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
Ratul Ghosh
18

df.withColumnRenamed('age', 'age2')

Def_Os
  • [Pankaj Kumar's answer](https://stackoverflow.com/a/36302241) and [Alberto Bonsanto's answer](https://stackoverflow.com/a/34077809) (which are from 2016 and 2015, respectively) already suggest using `withColumnRenamed`. – Andrew Myers Jul 12 '18 at 01:08
  • Thanks, yes but there are a couple of different syntax's, maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname) i think it depends on which version of pyspark your using – Sahan Jayasumana Oct 12 '18 at 23:43
  • This is not a different syntax. The only difference is you did not store your column names in an array. – Ed Bordin Jan 08 '19 at 06:00
16

This is the approach that I used:

Create a PySpark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

Create a DataFrame:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

View the df with its column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

Create a list with the new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

Change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

View the df with the new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+
Grant Shannon
11

I made an easy-to-use function to rename multiple columns of a PySpark DataFrame, in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful, both lists must be the same length.
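
If you want the function to fail fast instead of silently ignoring extra names, here is a small variation (the length check is an addition, not part of the original function above):

def renameCols(df, old_columns, new_columns):
    # zip() stops at the shorter list, so check the lengths explicitly
    assert len(old_columns) == len(new_columns), "old_columns and new_columns must have the same length"
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df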

Vedom
Manrique
  • Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because `old_columns` would be the same as `df.columns`. – Darth Egregious Sep 26 '19 at 14:14
8

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
scottlittle
5

You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list.

To rename all columns:

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename only some columns:

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
Clock Slave
  • I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the `toDF` function and a list comprehension that implements whatever logic is desired was much more succinct. for example, `def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF([c + suffix for c in spark_df.columns])` – John Haberstroh Oct 01 '20 at 01:52
  • Since `mapping` is a dictionary, why can't you simply use `mapping[c]` instead of `mapping.get(c, c)`? – Sheldore Jan 31 '21 at 00:41
4

Method 1:

df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2: If you want to do some computation on the column and rename the result (assumes import pyspark.sql.functions as F):

df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
3

I use this one:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
Isma
mike
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Isma Jan 31 '18 at 15:19
3

We can use various approaches to rename a column.

First, let's create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)], 
                                  ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.

neeraj bhadani
3

You can put this into a for loop, and use zip to pair each column name from the two arrays.

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
Haha TTpro
2

I like to use a dict to rename the columns of the df.

rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    # fall back to the existing name, so columns missing from the dict are left unchanged
    df = df.withColumnRenamed(col, rename.get(col, col))
Michael H.
1

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
ganeiy
1

There are multiple approaches you can use (they assume from pyspark.sql.functions import col):

  1. df1 = df.withColumn("new_column", col("old_column")).drop("old_column")

  2. df1 = df.withColumn("new_column", col("old_column"))

  3. df1 = df.select(col("old_column").alias("new_column"))

aboger
pankajs
  • why have to use withColumn to create another duplicate column with different name when you can use withColumnRenamed ? – Haha TTpro Oct 22 '20 at 03:40
0

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = StructType([
    StructField("employee_name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("state", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("age", IntegerType(), True),
    StructField("bonus", IntegerType(), True)
  ])

OurData = spark.createDataFrame(data=CreatingDataFrame, schema=schema)
OurData.show()

GrouppedBonusData = OurData.groupBy("department").sum("bonus")
GrouppedBonusData.show()
GrouppedBonusData.printSchema()

# Rename the aggregated column "sum(bonus)" via select + alias
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

GrouppedBonusData.groupBy("department").count().show()

GrouppedSalaryData = OurData.groupBy("department").sum("salary")
GrouppedSalaryData.show()

# Rename the aggregated column "sum(salary)" the same way
SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()

Rahul