Experts, I have a list of columns that I need to use as the join expression when joining two PySpark (1.6) dataframes (I can't hardcode this list in my join expression because it is part of common code). I have these two dataframes:
>>> df.show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|   abcd|    xyz|    12A|
|  abcd1|    xyz|    12A|
+-------+-------+-------+
>>> df1.show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|  abcd1|    xyz|    12A|
+-------+-------+-------+
I am doing a left join, and I want only the entries that are present in df but not in df1.
Columns to join on:
joinCols = ['column1','column2','column3']
Join expression:
>>> df.join(df1, joinCols, "left").show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|  abcd1|    xyz|    12A|
|   abcd|    xyz|    12A|
+-------+-------+-------+
Now this join works fine, no problem. But I am looking for only one entry in the output:
|   abcd|    xyz|    12A|
Since a join on a list of column names coalesces the join columns and drops the duplicates from the other dataframe (df1), I am not able to put that condition on the above join expression. I tried:
from pyspark.sql.functions import isnull
df.join(df1, joinCols, "left").filter(isnull(df1["column1"])).show()
This errors out because the joined output no longer contains any df1 column to apply the filter expression on. Any suggestions?