244

I come from a pandas background and am used to reading data from CSV files into a DataFrame and then simply changing the column names to something useful with the simple command:

df.columns = new_column_name_list

However, the same doesn't work for PySpark DataFrames created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This basically defines the DataFrame twice: it first infers the schema, then renames the columns in that schema, and finally loads the DataFrame again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

My Spark version is 1.5.0.

Shubhanshu Mishra

18 Answers

404

There are many ways to do that:

  • Option 1. Using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)], 
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    
    # Output
    #+-------+----------+
    #|   Name|askdaosdka|
    #+-------+----------+
    #|Alberto|         2|
    #| Dakota|         2|
    #+-------+----------+
    
    #root
    # |-- Name: string (nullable = true)
    # |-- askdaosdka: long (nullable = true)
    
    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
    #root
    # |-- name: string (nullable = true)
    # |-- age: long (nullable = true)
    
  • Option 2. Using withColumnRenamed; notice that this method allows you to "overwrite" the same column. The example below uses range (in Python 2 you can use xrange instead).

    from functools import reduce
    
    oldColumns = data.schema.names
    newColumns = ["name", "age"]
    
    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
    df.printSchema()
    df.show()
    
  • Option 3. Using alias; in Scala you can also use as.

    from pyspark.sql.functions import col
    
    data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    data.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

    sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    
    df2.show()
    
    # Output
    #+-------+---+
    #|   name|age|
    #+-------+---+
    #|Alberto|  2|
    #| Dakota|  2|
    #+-------+---+
    
Taiwo O. Adetiloye
Alberto Bonsanto
  • I did it with a `for` loop + `withColumnRenamed`, but your `reduce` option is very nice :) – Felipe Gerard Nov 03 '16 at 20:35
  • @FelipeGerard that is a very bad idea – Alberto Bonsanto Nov 03 '16 at 21:17
  • Well since nothing gets done in Spark until an action is called on the DF, it's just less elegant code... In the end the resulting DF is exactly the same! – Felipe Gerard Nov 03 '16 at 21:41
  • @FelipeGerard Please check [this post](https://stackoverflow.com/questions/35066231/stack-overflow-while-processing-several-columns-with-a-udf), bad things may happen if you have many columns. – Alberto Bonsanto Nov 03 '16 at 21:48
  • That happens when you use `withColumn` many times. I used `withColumnRenamed` just like you, but iterated with a loop instead of a `reduce`. I'll bet you the DF is the exact same one if you do `for idx in range(n): data = data.withColumnRenamed(oldColumns[idx], newColumns[idx])` than what you posted :) – Felipe Gerard Nov 03 '16 at 21:52
  • @FelipeGerard I was just saying – Alberto Bonsanto Nov 03 '16 at 21:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/127333/discussion-between-felipe-gerard-and-alberto-bonsanto). – Felipe Gerard Nov 03 '16 at 21:55
  • @AlbertoBonsanto How to select column as alias if there are more than 100 columns which is the best option –  Apr 02 '17 at 15:37
  • hi @Alberto Bonsanto, Thanks for your answer. Using your withColumnRenamed() answer I receive a mistake: "NameError: name 'reduce' is not defined". Sure I am missing something trivial. – NuValue May 23 '18 at 12:26
  • @NuValue, you should first run `from functools import reduce` – joaofbsm Aug 01 '18 at 05:11
  • Hi @AlbertoBonsanto thank you great answer. I have a 3K+ rows dataframe and my code is inspired from your reduce solution, however it seems to run forever, do you know which one I should use for my use case? (nb: i also have 2K+ columns) – Simon30 Jun 19 '19 at 15:23
  • I found the solution: use df.toDF(*new_cols) instead of weird reduce – Simon30 Jun 19 '19 at 15:50
  • In PySpark 2.4 with Python 3.6.8 the only method that works out of these is `df.select('id').withColumnRenamed('id', 'new_id')` and `spark.sql("SELECT id AS new_id FROM df")` – rjurney Jul 01 '19 at 03:42
  • At python 3.7 I recommend Option 3. it work flawlessly (option 1 and 2 has some issues ex.: library places moved https://stackoverflow.com/questions/8689184/nameerror-name-reduce-is-not-defined-in-python ) – sogu Aug 13 '19 at 17:36
  • I think I'll use this `df.selectExpr(*["{0} as {0}_other".format(x) for x in df.columns])` to rename all the columns at once (I'll be doing a `crossJoin` with itself) – KGS Dec 12 '19 at 13:19
  • Option 2: use `range()` instead of `xrange()` for Python 3.* – Pete Jan 28 '20 at 18:39
  • option 2 works for me. IMO option 2 is the best if you want to rename all your columns. – Frank Aug 21 '20 at 21:54
  • Option 3 is the right way to do it in spark (scala/python). Check this out: https://stackoverflow.com/a/62728542/8551891 – Krunal Patel May 18 '21 at 15:38
225
df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

Advantage of using this way: with a long list of columns you may want to change only a few column names. This can be very convenient in those scenarios. It is also very useful when joining tables with duplicate column names.
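
For example, a minimal sketch of the duplicate-column case (the DataFrames and column names here are hypothetical, and spark is assumed to be an existing SparkSession):

left = spark.createDataFrame([(1, "a")], ["id", "value"])
right = spark.createDataFrame([(1, "b")], ["id", "value"])

# Rename the ambiguous column on one side before joining,
# so the result has the columns id, value, value_right.
joined = left.join(right.withColumnRenamed("value", "value_right"), on="id")
joined.show()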

Sotos
Pankaj Kumar
  • is there a variant of this solution that leaves all other columns unchanged? with this method, and others, only the explicitly named columns remained (all others removed) – Quetzalcoatl Dec 22 '17 at 05:22
  • +1 it worked fine for me, just edited the specified column leaving others unchanged and no columns were removed. – mnis.p Jul 18 '18 at 05:51
  • @Quetzalcoatl This command appears to change only the specified column while maintaining all other columns. Hence, a great command to rename just one of potentially many column names – user989762 Aug 24 '18 at 09:07
  • @user989762: agreed; my initial understanding was incorrect on this one...! – Quetzalcoatl Aug 24 '18 at 17:27
  • This is great for renaming a few columns. See my answer for a solution that can programatically rename columns. Say you have 200 columns and you'd like to rename 50 of them that have a certain type of column name and leave the other 150 unchanged. In that case, you won't want to manually run `withColumnRenamed` (running `withColumnRenamed` that many times would also be inefficient, [as explained here](https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015)). – Powers Jul 19 '20 at 22:31
  • @Powers : I checked your article but it contains a command to cast all the string types to integers. Could you plz write here how one needs to use your Medium article's code to rename, say only two columns as written in the above answer? – Sheldore Jan 31 '21 at 00:37
81

If you want to change all column names, try df.toDF(*cols).
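
For instance, a minimal sketch, assuming the new names are supplied in the same positional order as the existing columns:

new_column_name_list = ["name", "age"]   # one new name per existing column, in order
df = df.toDF(*new_column_name_list)      # returns a new DataFrame with all columns renamed positionally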

Petter Friberg
user8117731
  • this solution is the closest to df.columns = new_column_name_list per the OP, both in how concise it is and its execution. – Quetzalcoatl Mar 29 '18 at 01:17
  • I think this should be selected as the best answer – HanaKaze Jan 24 '20 at 23:11
  • For me I was getting the header names from a pandas dataframe, so I just used `df = df.toDF(*my_pandas_df.columns)` – Nic Scozzaro Jun 08 '20 at 05:52
  • This answer confuses me. Shouldn't there be a mapping from old column names to new names? Does this work by having `cols` be the new column names, and just assuming the order of names in `cols` corresponds to the column order of the dataframe? – rbatt Jun 23 '20 at 22:15
  • This is clearly the best answer if you want to pass in a list of columns – poiter Apr 26 '21 at 16:28
  • @rbatt Using `df.select` in combination with `pyspark.sql.functions col-method` is a reliable way to do this since it maintains the mapping/alias applied & thus the order/schema is maintained after the rename operations. Checkout the comment for code snippet: https://stackoverflow.com/a/62728542/8551891 – Krunal Patel May 17 '21 at 16:40
60

In case you would like to apply a simple transformation to all column names, this code does the trick (here, replacing all spaces with underscores):

new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.

pbahr
  • This code generates a simple physical plan that's easy for Catalyst to optimize. It's also elegant. +1 – Powers Jul 19 '20 at 22:25
18

If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])
Ratul Ghosh
18

df.withColumnRenamed('age', 'age2')

Def_Os
  • [Pankaj Kumar's answer](https://stackoverflow.com/a/36302241) and [Alberto Bonsanto's answer](https://stackoverflow.com/a/34077809) (which are from 2016 and 2015, respectively) already suggest using `withColumnRenamed`. – Andrew Myers Jul 12 '18 at 01:08
  • Thanks, yes but there are a couple of different syntax's, maybe we should collect them into a more formal answer? data.withColumnRenamed(oldColumns[idx], newColumns[idx]) vs data.withColumnRenamed(columnname, new columnname) i think it depends on which version of pyspark your using – Sahan Jayasumana Oct 12 '18 at 23:43
  • This is not a different syntax. The only difference is you did not store your column names in an array. – Ed Bordin Jan 08 '19 at 06:00
16

This is the approach that I used:

Create a PySpark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

Create a DataFrame:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

View the df with its column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

Create a list with the new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

Change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

View the df with the new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+
Grant Shannon
11

I made an easy-to-use function to rename multiple columns of a PySpark DataFrame, in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful, both lists must be the same length.
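
If you want the function to fail fast instead of silently ignoring extra names, here is a small variation (the length check is an addition, not part of the original function above):

def renameCols(df, old_columns, new_columns):
    # zip() stops at the shorter list, so check the lengths explicitly
    assert len(old_columns) == len(new_columns), "old_columns and new_columns must have the same length"
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df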

Vedom
Manrique
  • Nice job on this one. A bit of overkill for what I needed though. And you can just pass the df because `old_columns` would be the same as `df.columns`. – Darth Egregious Sep 26 '19 at 14:14
8

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')
scottlittle
5

You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list.

To rename all columns:

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename only some columns:

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])
Clock Slave
  • I like that this uses the select statement with aliases and uses more of an "immutable" type of framework. I did, however, find that the `toDF` function and a list comprehension that implements whatever logic is desired was much more succinct. for example, `def append_suffix_to_columns(spark_df, suffix): return spark_df.toDF([c + suffix for c in spark_df.columns])` – John Haberstroh Oct 01 '20 at 01:52
  • Since `mapping` is a dictionary, why can't you simply use `mapping[c]` instead of `mapping.get(c, c)`? – Sheldore Jan 31 '21 at 00:41
4

Method 1:

df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2: If you want to do some computation on the column and rename the result (assumes import pyspark.sql.functions as F):

df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
df = df.drop("old_column_name")
3

I use this one:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()
Isma
mike
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Isma Jan 31 '18 at 15:19
3

We can use various approaches to rename a column.

First, let's create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)], 
                                  ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. Below are a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.

neeraj bhadani
3

You can put this into a for loop, and use zip to pair each column name from the two arrays.

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)
Haha TTpro
2

I like to use a dict to rename the columns of the df.

rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    # fall back to the existing name, so columns missing from the dict are left unchanged
    df = df.withColumnRenamed(col, rename.get(col, col))
Michael H.
1

For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()
ganeiy
1

There are multiple approaches you can use (they assume from pyspark.sql.functions import col):

  1. df1 = df.withColumn("new_column", col("old_column")).drop("old_column")

  2. df1 = df.withColumn("new_column", col("old_column"))

  3. df1 = df.select(col("old_column").alias("new_column"))

aboger
pankajs
  • why have to use withColumn to create another duplicate column with different name when you can use withColumnRenamed ? – Haha TTpro Oct 22 '20 at 03:40
0

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col

CreatingDataFrame = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = StructType([
    StructField("employee_name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("state", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("age", IntegerType(), True),
    StructField("bonus", IntegerType(), True)
  ])

OurData = spark.createDataFrame(data=CreatingDataFrame, schema=schema)
OurData.show()

GrouppedBonusData = OurData.groupBy("department").sum("bonus")
GrouppedBonusData.show()
GrouppedBonusData.printSchema()

# Rename the aggregated column "sum(bonus)" via select + alias
BonusColumnRenamed = GrouppedBonusData.select(col("department").alias("department"), col("sum(bonus)").alias("Total_Bonus"))
BonusColumnRenamed.show()

GrouppedBonusData.groupBy("department").count().show()

GrouppedSalaryData = OurData.groupBy("department").sum("salary")
GrouppedSalaryData.show()

# Rename the aggregated column "sum(salary)" the same way
SalaryColumnRenamed = GrouppedSalaryData.select(col("department").alias("Department"), col("sum(salary)").alias("Total_Salary"))
SalaryColumnRenamed.show()

Rahul