
I have the following code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

DataFrame df1 = sqlContext.read().json("../smthng/*.json");
DataFrame df2 = sqlContext.read().json("../else/*.json");

df1.registerTempTable("df1");
df2.registerTempTable("df2");

DataFrame df = sqlContext.sql("SELECT * " +
                             "FROM df1 " +
                             "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                             "WHERE df1.id IS NULL").drop("df1.id");

Here I'm trying to do an outer join and then drop one of the id columns. Apparently the join keeps both columns, and when I try to work with the result further on, Spark can't decide which one to use (I get errors like Reference 'id' is ambiguous, could be: id#59, id#376.). That's why I'm trying to drop one of these columns, but even though I use ....drop("df1.id");, it doesn't work. Any ideas how I can drop one of the id columns? Thank you!

abaghel
lte__

3 Answers


There is no column named "df1.id" in the DataFrame created by the join query, which is why drop is not working: after the join, both columns are simply named "id", not "df1.id" and "df2.id". Suppose your query returns the following DataFrame.

+---+----+-----+----+
| id|name|email|  id|
+---+----+-----+----+
|  2| xxx| null|null|
|  3| zzz| null|null|
+---+----+-----+----+ 

If you call df.printSchema() you will get

root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- email: string (nullable = true)
|-- id: integer (nullable = true)

So if you call drop("id"), it will drop both "id" columns and return

+----+-----+
|name|email|
+----+-----+
| xxx| null|
| zzz| null|
+----+-----+
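If you want to keep one of the two "id" columns instead of losing both, drop can also take a Column reference, which pins down exactly which duplicate to remove. This is a sketch that expresses the join through the DataFrame API rather than SQL, since a Column reference needs the original DataFrame; the last answer on this page uses the same drop(Column) call:

```java
// Build the same left outer join through the DataFrame API so we can
// refer to each side's "id" column unambiguously.
DataFrame joined = df1.join(df2, df1.col("id").equalTo(df2.col("id")), "left_outer");

// Drop only the "id" that came from df2; df1's "id" survives.
DataFrame result = joined.drop(df2.col("id"));
```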

If you want a single "id" column in the final DataFrame, change your query to something like this:

DataFrame df = sqlContext.sql("SELECT df1.id, df1.name, df2.email " +
                              "FROM df1 " +
                              "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                              "WHERE df1.id IS NULL");

and it will return

+---+----+-----+
| id|name|email|
+---+----+-----+
|  2| xxx| null|
|  3| zzz| null|
+---+----+-----+

The Spark DataFrame API provides methods like "alias" and "withColumnRenamed" to handle duplicate columns in a DataFrame. Please also refer to the SO post Spark Dataframe distinguish columns with duplicated name.
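As a sketch of the rename idea (assuming the same df1 and df2 as in the question), renaming one side's column before the join means the result never contains two columns named "id" in the first place:

```java
// Rename df2's "id" up front so the joined DataFrame has unique column names.
DataFrame df2r = df2.withColumnRenamed("id", "df2_id");

DataFrame joined = df1.join(df2r, df1.col("id").equalTo(df2r.col("df2_id")), "left_outer");

// "id" is now unambiguous; the helper column can be dropped by name.
DataFrame result = joined.drop("df2_id");
```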

abaghel

Use column names in the SQL and also use aliases:

DataFrame df = sqlContext.sql("SELECT Column1 as col1, Column2 as col2...ColumnN as colN " +
                             "FROM df1 " +
                             "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                             "WHERE df1.id IS NULL").drop("df1.id");

Hopefully this approach will not give you an error message like "ambiguous".

Avijit

Okay, I solved this by using the join like this:

DataFrame df = df1.join(df2, df1.col("id").equalTo(df2.col("customer_id")), "left")
            .select("*").where(df1.col("id").isNotNull())
            .drop(df1.col("id"));
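For completeness: when both DataFrames share the join column's name, Spark's join(right, usingColumn) overload keeps a single copy of that column, so there is nothing to drop afterwards. This sketch assumes df2 also had a column literally named "id" (in the code above it is customer_id), and note that this overload performs an inner join:

```java
// Equi-join on the shared column name; the result has one "id" column.
DataFrame joined = df1.join(df2, "id");
```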
lte__