
I have a DataFrame like this (inspired by this question, with a slightly different setup):

from pyspark.sql import Row

df3 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=107531, f=2),
    Row(a=125231, f=2)
])

Based on this, I create two new DataFrames, each a subset of the original:

from pyspark.sql.functions import col

df1 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 125231) & (col('f') == 2))))

df2 = (df3
  .filter(((col('a') == 107831) & (col('f') == 3))|
          ((col('a') == 107531) & (col('f') == 2))))
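
For clarity, here is what I expect the two subsets to contain (row order from show() is not guaranteed):

df1.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |125231|  2|
# +------+---+

df2.show()
# +------+---+
# |     a|  f|
# +------+---+
# |107831|  3|
# |107531|  2|
# +------+---+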

Then, I would like to join those two DataFrames and obtain the f columns from each of them, as in:

a = (df1
  .join(df2, (df1['a'] == df2['a']), how='full')
  .select(df1['f'], df2['f'])
  .collect())

However, I get [Row(f=None, f=None), Row(f=2, f=2), Row(f=3, f=3)]

Instead of the desired [Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)], or expressed as a table:

+------+----+------+----+
|     a|   f|     a|   f|
+------+----+------+----+
|107831|   3|107831|   3|
|  null|null|107531|   2|
|125231|   2|  null|null|
+------+----+------+----+

Does anyone know how to resolve this? Do I have to store df1 and df2 somewhere?

When I run the scenario as in the linked question above, I get the expected results:

df1 = hive_context.createDataFrame([
    Row(a=107831, f=3),
    Row(a=125231, f=2),
])

df2 = hive_context.createDataFrame([
    Row(a=107831, f=4),
    Row(a=107531, f=2),
])

a = (df1
  .join(df2, (df1['a'] == df2['a']), how='full')
  .select(df1['f'], df2['f'])
  .collect())
print(a)
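
For reference, this full join should produce, up to row ordering:

[Row(f=3, f=4), Row(f=None, f=2), Row(f=2, f=None)]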

I am running this on Python 3.6 and Spark 2.3.

  • The join conditions are different. – Ankit Kumar Namdeo Sep 04 '18 at 17:26
  • @AnkitKumarNamdeo That was a typo, thank you for catching it. However, it has no impact on the described behavior: PySpark still always takes the column from the DataFrame on the left side of the join. – ira Sep 04 '18 at 17:38
  • They both should produce the same output. Can you please run them again? I think the joining condition was the only problem. – Ankit Kumar Namdeo Sep 04 '18 at 17:52
  • It is not. I tried it on the actual dataset and on a sample dataset. It just doesn't work. – ira Sep 05 '18 at 07:50

1 Answer


In cases where there are duplicated column names, use aliases on your DataFrames to avoid the ambiguity:

a = df1.alias('l').join(df2.alias('r'), on='a', how='full').select('l.f', 'r.f').collect()
print(a)
# [Row(f=3, f=3), Row(f=None, f=2), Row(f=2, f=None)]
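
The ambiguity arises because df1 and df2 are both derived from df3, so the references df1['f'] and df2['f'] resolve to the same underlying column, and Spark ends up taking both from one side of the join. As a sketch of an alternative (not part of the original answer, assuming the same df1/df2 as in the question), you can also rename the ambiguous column on each side before joining:

# Rename 'f' on each side so the final select needs no aliases.
a = (df1.withColumnRenamed('f', 'f1')
     .join(df2.withColumnRenamed('f', 'f2'), on='a', how='full')
     .select('f1', 'f2')
     .collect())
print(a)
# Expected, up to row ordering: [Row(f1=3, f2=3), Row(f1=None, f2=2), Row(f1=2, f2=None)]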
  • I was using another approach: `df1.join(df2, (df1['a'] == df2['a']), how='full').toDF('a_1', 'f_1', 'a_2', 'f_2').select('f_1', 'f_2').collect()` However, I like yours better. Thank you! – ira Sep 06 '18 at 12:34