95

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name.

for( i <- 0 to origCols.length - 1) {
  df.withColumnRenamed(
    df.columns(i), 
    df.columns(i).toLowerCase
  );
}
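
Note that withColumnRenamed returns a new DataFrame rather than mutating df in place, so the loop above discards each rename. A minimal sketch of a working version:

// carry the renamed DataFrame through each step instead of discarding it
val lowered = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumnRenamed(c, c.toLowerCase)
}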
Sam

6 Answers

245

If the structure is flat:

val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
//  |-- _1: long (nullable = false)
//  |-- _2: string (nullable = true)
//  |-- _3: string (nullable = true)
//  |-- _4: double (nullable = false)

the simplest thing you can do is to use the toDF method:

val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)

dfRenamed.printSchema
// root
//  |-- id: long (nullable = false)
//  |-- x1: string (nullable = true)
//  |-- x2: string (nullable = true)
//  |-- x3: double (nullable = false)

If you want to rename individual columns, you can use either select with alias:

df.select($"_1".alias("x1"))

which can be easily generalized to multiple columns:

import org.apache.spark.sql.functions.col

val lookup = Map("_1" -> "foo", "_3" -> "bar")

df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)

or withColumnRenamed:

df.withColumnRenamed("_1", "x1")

which can be combined with foldLeft to rename multiple columns:

lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))

With nested structures (structs) one possible option is renaming by selecting a whole structure:

val nested = spark.read.json(sc.parallelize(Seq(
    """{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))

nested.printSchema
// root
//  |-- foobar: struct (nullable = true)
//  |    |-- foo: struct (nullable = true)
//  |    |    |-- bar: struct (nullable = true)
//  |    |    |    |-- first: double (nullable = true)
//  |    |    |    |-- second: double (nullable = true)
//  |-- id: long (nullable = true)

@transient val foobarRenamed = struct(
  struct(
    struct(
      $"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.second".as("y")
    ).alias("point")
  ).alias("location")
).alias("record")

nested.select(foobarRenamed, $"id").printSchema
// root
//  |-- record: struct (nullable = false)
//  |    |-- location: struct (nullable = false)
//  |    |    |-- point: struct (nullable = false)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
//  |-- id: long (nullable = true)

Note that it may affect nullability metadata. Another possibility is to rename by casting:

nested.select($"foobar".cast(
  "struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)

or:

import org.apache.spark.sql.types._

nested.select($"foobar".cast(
  StructType(Seq(
    StructField("location", StructType(Seq(
      StructField("point", StructType(Seq(
        StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema

// root
//  |-- record: struct (nullable = true)
//  |    |-- location: struct (nullable = true)
//  |    |    |-- point: struct (nullable = true)
//  |    |    |    |-- x: double (nullable = true)
//  |    |    |    |-- y: double (nullable = true)
zero323
  • Hi @zero323, when using withColumnRenamed I am getting AnalysisException can't resolve 'CC8. 1' given input columns... It fails even though CC8.1 is available in the DataFrame. Please guide. – unk1102 Jun 09 '17 at 12:14
  • @u449355 It is not clear to me whether this is a nested column or one containing dots. In the latter case backticks should work (at least in some basic cases). – zero323 Jun 09 '17 at 12:54
  • What does `: _*)` mean in `df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)`? – Anton Kim Jul 27 '17 at 23:04
  • If I want to change the records in a particular column, how can I do that? For example, the name is 'abcd' and I want to advance each character, like 'bcde'. How can I do this in Spark? –  Jan 18 '18 at 07:07
  • Since 2.0.0, withColumnRenamed returns a new Dataset, so you should reassign the value, as in `df = lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))` – user238607 Apr 12 '18 at 14:57
  • To answer Anton Kim's question: `: _*` is the Scala so-called "splat" operator. It basically explodes an array-like thing into an uncontained list, which is useful when you want to pass the array to a function that takes an arbitrary number of args but doesn't have a version that takes a `List[]`. If you're at all familiar with Perl, it is the difference between `some_function(@my_array) # "splatted"` and `some_function(\@my_array) # not splatted ... in perl the backslash "\" operator returns a reference to a thing`. – Mylo Stone Apr 12 '18 at 23:13
  • How about a Java version? – Celik Aug 08 '18 at 06:32
  • This statement is really obscure to me: `df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)`. Could you decompose it, please? Especially the `lookup.getOrElse(c, c)` part. – Imad Dec 28 '18 at 14:57
  • df.withColumnRenamed("_1", "x1") worked for me. Thanks, nice and simple. – wayneeusa Feb 27 '19 at 03:17
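
To unpack the one-liner several comments ask about (a sketch, using the same df and lookup as in the answer above):

import org.apache.spark.sql.functions.col

// For each existing column name c, getOrElse(c, c) returns the new name
// if c is a key in lookup, and falls back to c itself otherwise.
val cols = df.columns.map(c => col(c).as(lookup.getOrElse(c, c)))

// cols is an Array[Column]; `: _*` (the "splat" operator) expands it into
// the varargs expected by select(cols: Column*).
df.select(cols: _*)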
19

For those of you interested in the PySpark version (it's actually the same in Scala – see the comment below):

    merchants_df_renamed = merchants_df.toDF(
        'merchant_id', 'category', 'subcategory', 'merchant')

    merchants_df_renamed.printSchema()

Result:

root
 |-- merchant_id: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- subcategory: string (nullable = true)
 |-- merchant: string (nullable = true)

Tagar
  • You must be careful using `toDF()` to rename columns in a DataFrame: this method works much slower than the others. I have a DataFrame containing 100M records; a simple count query over it takes ~3s, whereas the same query with the `toDF()` method takes ~16s. But when I use the `select col AS col_new` method for renaming I get ~3s again. More than 5 times faster! Spark 2.3.2.3 – Ihor Konovalenko Aug 29 '19 at 09:02
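
For reference, the `select col AS col_new` pattern the comment above refers to looks roughly like this in Scala (a sketch; here it simply lowercases every name):

import org.apache.spark.sql.functions.col

// a single projection with aliases instead of toDF
df.select(df.columns.map(c => col(c).as(c.toLowerCase)): _*)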
6
import org.apache.spark.sql.DataFrame

def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
  t.select(t.columns.map { c => t.col(c).as(p + c + s) }: _*)

In case it isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
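For example, a sketch with two hypothetical tables, orders and refunds, both of which have id and amount columns:

val ordersP  = aliasAllColumns(orders,  p = "o_")
val refundsP = aliasAllColumns(refunds, p = "r_")

// after prefixing, the shared names no longer collide in the join result
val joined = ordersP.join(refundsP, ordersP("o_id") === refundsP("r_id"))
joined.select(ordersP("o_amount"), refundsP("r_amount"))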

Mylo Stone
1

Suppose the dataframe df has 3 columns id1, name1, price1 and you wish to rename them to id2, name2, price2:

val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)

I found this approach useful in many cases.

0

To join two tables without renaming the join key (every other column gets a suffix):

// method 1: create a new DF with toDF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)

// method 2: use withColumnRenamed (note that day1 must be a var,
// since each call returns a new DataFrame)
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
    day1 = day1.withColumnRenamed(x, y)
}

Works!

Colin Wang
0
Sometimes the column names in a SQL Server or MySQL table are in the format below:

Ex: Account Number, customer number

But Hive tables do not support column names containing spaces, so use the solution below to rename your old column names.

Solution:

val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
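
A quick usage sketch (the sample data and column names are made up):

val raw = Seq((1, 2)).toDF("Account Number", "customer number")

val renamedColumns = raw.columns.map(c => raw(c).as(c.replaceAll(" ", "_").toLowerCase()))
raw.select(renamedColumns: _*).printSchema
// root
//  |-- account_number: integer (nullable = false)
//  |-- customer_number: integer (nullable = false)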