
I have the following code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

DataFrame df1 = sqlContext.read().json("../smthng/*.json");
DataFrame df2 = sqlContext.read().json("../else/*.json");

df1.registerTempTable("df1");
df2.registerTempTable("df2");

DataFrame df = sqlContext.sql("SELECT * " +
                             "FROM df1 " +
                             "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                             "WHERE df1.id IS NULL").drop("df1.id");

Here I'm trying to do an outer join and then drop one of the id columns. Apparently the join keeps both columns, and when I try to work with the result further on, Spark can't decide which one to use (I get errors like Reference 'id' is ambiguous, could be: id#59, id#376.). That's why I'm trying to drop one of these columns, but even though I use ....drop("df1.id");, it doesn't work. Any ideas how I can drop one of the id columns? Thank you!

abaghel
lte__

3 Answers


There is no column named "df1.id" in the DataFrame created by the join query, which is why drop is not working: after the join, both columns are simply named "id", not "df1.id" and "df2.id". Suppose your query returns the following DataFrame.

+---+----+-----+----+
| id|name|email|  id|
+---+----+-----+----+
|  2| xxx| null|null|
|  3| zzz| null|null|
+---+----+-----+----+ 

If you call df.printSchema() you will get

root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- email: string (nullable = true)
|-- id: integer (nullable = true)

So if you call drop("id"), it will drop both "id" columns and return

+----+-----+
|name|email|
+----+-----+
| xxx| null|
| zzz| null|
+----+-----+
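If you want to keep one of the two "id" columns instead of losing both, drop can also take a Column reference, which pins down exactly which duplicate to remove. This is a sketch that expresses the join through the DataFrame API rather than SQL, since a Column reference needs the original DataFrame; the last answer on this page uses the same drop(Column) call:

```java
// Build the same left outer join through the DataFrame API so we can
// refer to each side's "id" column unambiguously.
DataFrame joined = df1.join(df2, df1.col("id").equalTo(df2.col("id")), "left_outer");

// Drop only the "id" that came from df2; df1's "id" survives.
DataFrame result = joined.drop(df2.col("id"));
```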

If you want a single "id" column in the final DataFrame, change your query to something like this:

DataFrame df = sqlContext.sql("SELECT df1.id, df1.name, df2.email " +
                              "FROM df1 " +
                              "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                              "WHERE df1.id IS NULL");

and it will return

+---+----+-----+
| id|name|email|
+---+----+-----+
|  2| xxx| null|
|  3| zzz| null|
+---+----+-----+

The Spark DataFrame API provides methods like "alias" and "withColumnRenamed" to handle duplicate columns in a DataFrame. Please also refer to the SO post Spark Dataframe distinguish columns with duplicated name.
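As a sketch of the rename idea (assuming the same df1 and df2 as in the question), renaming one side's column before the join means the result never contains two columns named "id" in the first place:

```java
// Rename df2's "id" up front so the joined DataFrame has unique column names.
DataFrame df2r = df2.withColumnRenamed("id", "df2_id");

DataFrame joined = df1.join(df2r, df1.col("id").equalTo(df2r.col("df2_id")), "left_outer");

// "id" is now unambiguous; the helper column can be dropped by name.
DataFrame result = joined.drop("df2_id");
```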

abaghel

Use column names in the SQL and also use aliases:

DataFrame df = sqlContext.sql("SELECT Column1 as col1, Column2 as col2...ColumnN as colN " +
                             "FROM df1 " +
                             "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
                             "WHERE df1.id IS NULL").drop("df1.id");

Hopefully this approach will not give you an error message like "ambiguous".

Avijit

Okay, I solved this by using the join like this:

DataFrame df = df1.join(df2, df1.col("id").equalTo(df2.col("customer_id")), "left")
            .select("*").where(df1.col("id").isNotNull())
            .drop(df1.col("id"));
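For completeness: when both DataFrames share the join column's name, Spark's join(right, usingColumn) overload keeps a single copy of that column, so there is nothing to drop afterwards. This sketch assumes df2 also had a column literally named "id" (in the code above it is customer_id), and note that this overload performs an inner join:

```java
// Equi-join on the shared column name; the result has one "id" column.
DataFrame joined = df1.join(df2, "id");
```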
lte__