0

Experts, i have a list of columns which i need to use for a join expression while joining 2 Pyspark (1.6) dataframes (i can't hardcode this list in my join expression as it is part of common code). I have these 2 dataframes-

>>> df.show()
+---------+---------+------+
|column1  |column2  |column3|
+---------+---------+------+
|     abcd|      xyz|   12A|
|    abcd1|      xyz|   12A|
+---------+---------+------+

>>> df1.show()
+---------+---------+------+
|column1  |column2  |column3|
+---------+---------+------+
|    abcd1|      xyz|   12A|
+---------+---------+------+

I am doing a left join and i only want entries which are present in "df" only.

Columns to join on-

joinCols = ['column1','column2','column3']

Join Expression-

>>> df.join(df1,joinCols,"left").show()
+---------+---------+------+
|column1  |column2  |column3|
+---------+---------+------+
|    abcd1|      xyz|   12A|
|     abcd|      xyz|   12A|
+---------+---------+------+

Now this join works file, no problem. But i am only looking for one entry in output.

|     abcd|      xyz|   12A|

Since join by default drops the columns from other dataframe "df1", i am not able to put condition on above join expression. I tried-

df.join(df1,joinCols,"left").filter(isnull(df1["column1"])).show()

This above errors out because it doesn't have a column in output to apply filter expression on. Any suggestions?

Sidd
  • 207
  • 3
  • 17
  • I believe spark 1.6 doesn't support it – Sidd Sep 25 '19 at 15:34
  • just try this it will resolve your problem >>> df.join(df1,joinCols,"left").filter(isnull(df1.column1)).show() – Mahesh Gupta Sep 26 '19 at 05:23
  • 1
    Does this answer your question? [How to use join with many conditions in pyspark?](https://stackoverflow.com/questions/45812537/how-to-use-join-with-many-conditions-in-pyspark) – Danny Varod Mar 25 '21 at 14:07

1 Answers1

0

Since you are on Spark v1.6, you will need to perform a join with how="leftsemi" option (Till Spark v2.1). From Spark v2.2, you have left_semi.

You can find a more detailed explanation about a left semi join at: Difference between INNER JOIN and LEFT SEMI JOIN

pissall
  • 5,780
  • 1
  • 16
  • 35