Spark Dataframe distinguish columns with duplicated name

Question

As far as I know, multiple columns in a Spark DataFrame can have the same name, as shown in the snapshot below:

[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]

The result above was created by joining a DataFrame to itself; you can see there are 4 columns, two named a and two named f.

The problem is that when I try to do more calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'); both return the error message below:

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

Is there any way in the Spark API to distinguish between columns with duplicated names? Or maybe some way to change the column names?

score 108 · Answer 1 · edited Oct 22 '19 at 15:17

Let's start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])

There are a few ways you can approach this problem. First of all you can unambiguously reference child table columns using parent columns:

df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

You can also use table aliases:

from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

Finally you can programmatically rename columns:

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+

Thanks for your edits showing so many ways of getting the correct column in these ambiguous cases. I do think your examples should go into the Spark programming guide. I've learned a lot! – resec, Nov 18 '15 at 13:03
Small correction: `df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))` instead of `df2_r = df1.select(*(col(x).alias(x + '_df2') for x in df2.columns))`. For the rest, good stuff. – Vzzarr, Oct 21 '19 at 13:36
I agree that this should be part of the Spark programming guide. Pure gold. I was finally able to untangle the source of ambiguity: selecting columns by the old names before doing the join. With the solution of programmatically appending suffixes to the column names before doing the join, all the ambiguity went away. – Pablo Adames, Apr 12 '20 at 19:53
@resec: Did you understand why the renaming `df1_a = df1.alias("df1_a")` was needed and why we can't use `df1` and `df2` directly? This answer did not explain why the renaming was needed to make `select('df1_a.f')` work. – Sheldore, Feb 01 '21 at 17:35
@Sheldore It applies to the original problem, where there is one table `df` being joined with itself. Perhaps the solution would make more sense if it had been written with `df.alias("df1_a")` and `df.alias("df2_a")`. – timctran, Mar 03 '21 at 23:41

Glennie Helles Sindholt · Accepted Answer · 2020-10-01T14:17:59.517

64

I would recommend that you change the column names for your join.

df1.select(col("a") as "df1_a", col("f") as "df1_f")
   .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a") === col("df2_a"))

The resulting DataFrame will have schema

(df1_a, df1_f, df2_a, df2_f)

edited Oct 01 '20 at 14:17

answered Nov 18 '15 at 11:33

You may need to fix your answer since the quotes aren't adjusted properly between column names. – Sameh Sharaf Jan 20 '18 at 10:13
@SamehSharaf I assume that you are the one down voting my answer? But the answer is in fact 100% correct - I'm simply using the scala `'`-shorthand for column selection, so there is in fact no problem with quotes. – Glennie Helles Sindholt Jan 20 '18 at 11:57
@GlennieHellesSindholt, fair point. It is confusing because the answer is tagged as `python` and `pyspark`. – Jorge Leitao Apr 08 '18 at 09:59
What if each dataframe contains 100+ columns and we just need to rename one column name that is the same? Surely, can't manually type in all those column names in the select clause – bikashg Feb 25 '20 at 16:31
In that case you could go with `df1.withColumnRenamed("a", "df1_a")` – Glennie Helles Sindholt Feb 26 '20 at 14:00
@GlennieHellesSindholt Would you be able to write a PySpark equivalent of this answer, please? – Dee Jun 30 '20 at 08:27
@Dee Just have a look at the answer below from zero323. – Glennie Helles Sindholt Jul 01 '20 at 11:44
@GlennieHellesSindholt Wondering if schema change approach could solve my issue: https://stackoverflow.com/questions/63966039/pyspark-multiple-joins-column-row-values-reducing-actions – Abhi Sep 20 '20 at 21:28

Paul Bendevis · Answer 3 · 2019-06-24T16:02:23.770

36

There is a simpler way than writing aliases for all of the columns you are joining on:

df1.join(df2,['a'])

This works if the key column you are joining on has the same name in both tables; the joined result then contains that column only once.

See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html

edited Jun 24 '19 at 16:02

answered Jun 19 '18 at 16:55

this is the actual answer as of Spark 2+ – Matt Nov 13 '18 at 16:44
And for Scala: df1.join(df2, Seq("a")) – mauriciojost Jan 28 '19 at 12:36
page was moved to: https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html – bogdan.rusu Jun 21 '19 at 14:27
Glad I kept scrolling, THIS is the much better answer. If columns have different names, then there is no ambiguity issue. If columns have the same name, do this. With this method there is little reason to ever need to deal with ambiguous column names. – Paul Fornia Jan 26 '21 at 16:56

score 7 · Answer 4 · answered Aug 22 '16 at 09:14

You can use the def drop(col: Column) method to drop the duplicated column, for example:

DataFrame:df1

+-------+-----+
| a     | f   |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+

DataFrame:df2

+-------+-----+
| a     | f   |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+

When I join df1 with df2, the resulting DataFrame looks like this:

val newDf = df1.join(df2,df1("a")===df2("a"))

DataFrame:newDf

+-------+-----+-------+-----+
| a     | f   | a     | f   |
+-------+-----+-------+-----+
|107831 | ... |107831 | ... |
|107831 | ... |107831 | ... |
+-------+-----+-------+-----+

Now we can use the def drop(col: Column) method to drop the duplicated columns 'a' and 'f', as follows:

val newDfWithoutDuplicate = df1.join(df2,df1("a")===df2("a")).drop(df2("a")).drop(df2("f"))

Would this approach work if you are doing an outer join and the two columns have some dissimilar values? – prafi, Jul 07 '20 at 00:19
You may not want to drop the columns if the two sides are different relations that happen to share a schema. – thebluephantom, Aug 12 '20 at 20:08

score 5 · Answer 5 · edited Feb 20 '19 at 13:40

After digging into the Spark API, I found that I can first use alias to create an alias for the original dataframe, then use withColumnRenamed to manually rename every column on the alias. The join can then be done without causing column name duplication.

More detail can be found in the Spark DataFrame API documentation for:

pyspark.sql.DataFrame.alias

pyspark.sql.DataFrame.withColumnRenamed

However, I think this is only a troublesome workaround, and I am wondering whether there is a better way to solve my problem.
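The renaming step described above can be automated rather than done by hand. What follows is only a minimal sketch: the helper name with_prefixed_columns and the prefixes are made up for illustration, and the helper relies on nothing beyond df.columns and df.withColumnRenamed.

```python
from functools import reduce


def with_prefixed_columns(df, prefix):
    """Return df with every column renamed to prefix + original name.

    Hypothetical helper (not part of Spark); it only uses df.columns and
    df.withColumnRenamed, so it should be applied BEFORE the join, while
    the column names are still unique.
    """
    return reduce(
        # Apply one rename per column, threading the renamed frame through.
        lambda acc, name: acc.withColumnRenamed(name, prefix + name),
        df.columns,
        df,
    )


# Possible usage for the self-join from the question (assumes a DataFrame df
# with columns a and f; the prefixes are arbitrary):
#   left = with_prefixed_columns(df, "left_")
#   right = with_prefixed_columns(df, "right_")
#   joined = left.join(right, left["left_a"] == right["right_a"])
```

Because every column gets a side-specific name before the join, no reference in the joined result is ambiguous. The cost is one withColumnRenamed call per column.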

score 5 · Answer 6 · answered Jul 26 '18 at 12:26

This is how we can join two DataFrames on columns that have the same names in PySpark.

df = df1.join(df2, ['col1','col2','col3'])

If you do printSchema() after this then you can see that duplicate columns have been removed.
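The same check can be done programmatically instead of reading the printSchema() output. This is a minimal sketch in plain Python; the helper name duplicated_column_names is made up for illustration, and it assumes only that df.columns returns a list of column-name strings.

```python
from collections import Counter


def duplicated_column_names(columns):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    # dict.fromkeys drops repeats while preserving the original order.
    return [name for name in dict.fromkeys(columns) if counts[name] > 1]


# Possible usage (assumes df is the joined DataFrame):
#   duplicated_column_names(df.columns)  # an empty list means no ambiguous names
```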

Nikhil Redij

score 4 · Answer 7 · answered Apr 27 '18 at 02:26

Suppose the DataFrames you want to join are df1 and df2, and you are joining them on column 'a'. Then you have 2 methods:

Method 1

df1.join(df2,'a','left_outer')

This is an awesome method and it is highly recommended.

Method 2

df1.join(df2,df1.a == df2.a,'left_outer').drop(df2.a)

score 2 · Answer 8 · answered Aug 28 '19 at 11:45

This might not be the best approach, but if you want to rename the duplicate columns (after the join), you can do so using this tiny function.

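def rename_duplicate_columns(dataframe):
    """Suffix repeated column names with their occurrence number (a, a -> a, a2)."""
    seen = {}
    new_columns = []
    for name in dataframe.columns:
        seen[name] = seen.get(name, 0) + 1
        # Keep the first occurrence unchanged; later ones get 2, 3, ... appended.
        new_columns.append(name if seen[name] == 1 else name + str(seen[name]))
    # toDF assigns the new names by position, so duplicated names are not a problem.
    return dataframe.toDF(*new_columns)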

score 2 · Answer 9 · answered Dec 10 '19 at 13:36

If only the key column has the same name in both tables, try the following (approach 1):

left.join(right, 'key', 'inner')

rather than this (approach 2):

left.join(right, left.key == right.key, 'inner')

Pros of using approach 1:

the 'key' column shows up only once in the final dataframe
the syntax is easy to use

Cons of using approach 1:

it only helps with the key column
in a left join, if you plan to use the null count of the right key, this will not work, because only one key column remains. In that case, rename one of the keys as mentioned above.

score 1 · Answer 10 · answered Sep 05 '19 at 17:03

If you have a more complicated use case than the one described in Glennie Helles Sindholt's answer, e.g. a few other non-join columns also share names and you want to distinguish them while selecting, it's best to use aliases, e.g.:

df3 = df1.select("a", "b").alias("left")\
   .join(df2.select("a", "b").alias("right"), ["a"])\
   .select("left.a", "left.b", "right.b")

df3.columns
['a', 'b', 'b']

score 0 · Answer 11 · answered Feb 03 '21 at 14:11

What worked for me

import databricks.koalas as ks

df1k = df1.to_koalas()
df2k = df2.to_koalas()
df3k = df1k.merge(df2k, on=['col1', 'col2'])
df3 = df3k.to_spark()

All of the columns except for col1 and col2 had "_x" appended to their names if they had come from df1 and "_y" appended if they had come from df2, which is exactly what I needed.
