
In pandas you can rename all columns in one go, "in place", using:

new_column_name_list =['Pre_'+x for x in df.columns]
df.columns = new_column_name_list

Can we do the same in PySpark without having to create a new dataframe at the end? It seems inefficient because we would have two dataframes with the same data but different column names, leading to bad memory utilization.

The link below answers the question, but it's not in place.

How to change dataframe column names in pyspark?

EDIT: My question is clearly different from the question in the link above.

GeorgeOfTheRF
  • Please read my question again. I have clearly mentioned how that question is different from what I am asking. – GeorgeOfTheRF Jun 15 '17 at 09:23
  • The answers in the linked question seem to answer your question, e.g. `data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))` – Yaron Jun 15 '17 at 09:24
  • No it doesn't, because a new dataframe is created – GeorgeOfTheRF Jun 15 '17 at 10:22
  • Aliasing creates a new `DataFrame` object, but it doesn't create a copy of the data. Unless you're worrying about local driver memory (in that case there is no good news for you) this is a duplicate. – zero323 Jun 15 '17 at 11:36
  • This will do: `left_cols = df.columns` followed by `df = df.selectExpr([col + ' as left_' + col for col in left_cols])` (see the runnable sketch below these comments) – Nidhi Mar 23 '21 at 13:47
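
For reference, a runnable PySpark sketch of the selectExpr approach from the comments. The `Pre_` prefix mirrors the pandas example in the question; the SparkSession and sample data here are just assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Build "old_name as Pre_old_name" expressions and re-select.
# This returns a new DataFrame object, but the underlying data is not copied.
df = df.selectExpr(*[c + " as Pre_" + c for c in df.columns])

df.printSchema()  # columns are now Pre_id, Pre_name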

1 Answer


This is how you could do it in Scala Spark: build the new column names dynamically, zip them with the existing columns, and select with alias.

import org.apache.spark.sql.functions.col

// existing columns as Column objects
val to = df2.columns.map(col(_))

// new names: column1, column2, ...
val from = (1 to to.length).map(i => s"column$i")

// alias each existing column with its new name and select
df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*).show

Previous column names:

"age", "names"

After the change:

"column1", "column2"

However, a dataframe cannot be updated in place since dataframes are immutable, but the result can be assigned to a new variable (or back to the same one) for further use. Only the dataframes you actually use are loaded in memory, so this won't be an issue.
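
In PySpark the same pattern applies. A minimal sketch (assuming `df` is an existing DataFrame): rebinding the result to the same variable name is as close to "in place" as it gets, and only a new DataFrame object and query plan are created, not a copy of the data:

# Rename every column by prefixing it, then rebind to the same variable.
# The old DataFrame object becomes unreferenced and can be garbage collected.
new_names = ["Pre_" + c for c in df.columns]
df = df.toDF(*new_names)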

Hope this helps

koiralo
  • Based on the above code we cannot rename on the existing dataframe itself, right? We will finally have to say `df3 = df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*)` to make the change permanent – GeorgeOfTheRF Jun 15 '17 at 10:26
  • Will `df2 = df2.select(to.zip(from).map { case (x, y) => x.alias(y) }: _*)` work? – GeorgeOfTheRF Jun 15 '17 at 10:32
  • This won't work because the Spark df is immutable? – GeorgeOfTheRF Jun 15 '17 at 10:58
  • Yes, it changes the column names all at once, but it does not change the original dataframe; it returns a new dataframe, since dataframes are immutable. – koiralo Jun 15 '17 at 11:26