95

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}
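
Note that withColumnRenamed returns a new DataFrame rather than mutating df in place, so the loop above discards each rename. A minimal sketch of a working version:

// carry the renamed DataFrame through each step instead of discarding it
val lowered = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase)
}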
Sam

6 Answers

245

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
//  |-- id: long (nullable = false)
//  |-- x1: string (nullable = true)
//  |-- x2: string (nullable = true)
//  |-- x3: double (nullable = false)

If you want to rename individual columns, you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which can be combined with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
zero323
  • Hi @zero323, when using withColumnRenamed I am getting AnalysisException can't resolve 'CC8. 1' given input columns... It fails even though CC8.1 is available in the DataFrame. Please guide. – unk1102 Jun 09 '17 at 12:14
  • @u449355 It is not clear to me whether this is a nested column or one containing dots. In the latter case backticks should work (at least in some basic cases). – zero323 Jun 09 '17 at 12:54
  • What does `: _*)` mean in `df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)`? – Anton Kim Jul 27 '17 at 23:04
  • If I want to change the records in a particular column, how can I do that? For example, the name is 'abcd' and I want to advance each character, like 'bcde'. How can I do this in Spark? –  Jan 18 '18 at 07:07
  • Since 2.0.0, withColumnRenamed returns a new Dataset, so you should reassign the value, as in `df = lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))` – user238607 Apr 12 '18 at 14:57
  • To answer Anton Kim's question: `: _*` is the Scala so-called "splat" operator. It basically explodes an array-like thing into an uncontained list, which is useful when you want to pass the array to a function that takes an arbitrary number of args but doesn't have a version that takes a `List[]`. If you're at all familiar with Perl, it is the difference between `some_function(@my_array) # "splatted"` and `some_function(\@my_array) # not splatted ... in perl the backslash "\" operator returns a reference to a thing`. – Mylo Stone Apr 12 '18 at 23:13
  • How about a Java version? – Celik Aug 08 '18 at 06:32
  • This statement is really obscure to me: `df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)`. Could you decompose it, please? Especially the `lookup.getOrElse(c, c)` part. – Imad Dec 28 '18 at 14:57
  • df.withColumnRenamed("_1", "x1") worked for me. Thanks, nice and simple. – wayneeusa Feb 27 '19 at 03:17
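
To unpack the one-liner several comments ask about (a sketch, using the same df and lookup as in the answer above):

import org.apache.spark.sql.functions.col

// For each existing column name c, getOrElse(c, c) returns the new name
// if c is a key in lookup, and falls back to c itself otherwise.
val cols = df.columns.map(c => col(c).as(lookup.getOrElse(c, c)))

// cols is an Array[Column]; `: _*` (the "splat" operator) expands it into
// the varargs expected by select(cols: Column*).
df.select(cols: _*)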
19

For those of you interested in the PySpark version (it's actually the same in Scala – see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
 |-- merchant_id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- merchant: string (nullable = true)

Tagar
  • You must be careful using `toDF()` to rename columns in a DataFrame: this method works much slower than the others. I have a DataFrame containing 100M records; a simple count query over it takes ~3s, whereas the same query with the `toDF()` method takes ~16s. But when I use the `select col AS col_new` method for renaming I get ~3s again. More than 5 times faster! Spark 2.3.2.3 – Ihor Konovalenko Aug 29 '19 at 09:02
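
For reference, the `select col AS col_new` pattern the comment above refers to looks roughly like this in Scala (a sketch; here it simply lowercases every name):

import org.apache.spark.sql.functions.col

// a single projection with aliases instead of toDF
df.select(df.columns.map(c => col(c).as(c.toLowerCase)): _*)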
6
import org.apache.spark.sql.DataFrame

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
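For example, a sketch with two hypothetical tables, orders and refunds, both of which have id and amount columns:

val ordersP  = aliasAllColumns(orders,  p = "o_")
val refundsP = aliasAllColumns(refunds, p = "r_")

// after prefixing, the shared names no longer collide in the join result
val joined = ordersP.join(refundsP, ordersP("o_id") === refundsP("r_id"))
joined.select(ordersP("o_amount"), refundsP("r_amount"))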

Mylo Stone
1

Suppose the dataframe df has 3 columns id1, name1, price1 and you wish to rename them to id2, name2, price2:

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I found this approach useful in many cases.

0

To join two tables without renaming the join key (every other column gets a suffix):

// method 1: create a new DF with toDF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed (note that day1 must be a var,
// since each call returns a new DataFrame)
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

Works!

Colin Wang
0
Sometimes the column names in a SQL Server or MySQL table are in the format below:

Ex: Account Number, customer number

But Hive tables do not support column names containing spaces, so use the solution below to rename your old column names.

Solution:

val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
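
A quick usage sketch (the sample data and column names are made up):

val raw = Seq((1, 2)).toDF("Account Number", "customer number")

val renamedColumns = raw.columns.map(c => raw(c).as(c.replaceAll(" ", "_").toLowerCase()))
raw.select(renamedColumns: _*).printSchema
// root
//  |-- account_number: integer (nullable = false)
//  |-- customer_number: integer (nullable = false)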