I have the following code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

DataFrame df1 = sqlContext.read().json("../smthng/*.json");
DataFrame df2 = sqlContext.read().json("../else/*.json");

df1.registerTempTable("df1");
df2.registerTempTable("df2");

DataFrame df = sqlContext.sql("SELECT * " +
        "FROM df1 " +
        "LEFT OUTER JOIN df2 ON df1.id = df2.id " +
        "WHERE df1.id IS NULL").drop("df1.id");
Here, I'm doing a left outer join and then trying to drop one of the id columns. Apparently the join keeps both columns, and when I try to work with the result further on, Spark can't decide which one to use. I get errors like:

Reference 'id' is ambiguous, could be: id#59, id#376.;

That's why I'm trying to drop one of these columns, but even though I call .drop("df1.id"), it has no effect. Any ideas how I can drop one of the id columns?
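For reference, here is one workaround I've sketched out but haven't verified: renaming df2's id column before the join, so that drop() has an unambiguous, unqualified name to target. This assumes the same sqlContext, df1, and df2 as in the snippet above, and the Spark 1.x DataFrame API; the name "df2_id" is just an arbitrary placeholder I picked.

```java
// Sketch (unverified): rename df2's "id" before joining so the joined
// result has no duplicate column name, then drop the renamed column.
// Assumes df1, df2 and sqlContext from the snippet above (Spark 1.x).
DataFrame df2Renamed = df2.withColumnRenamed("id", "df2_id");

DataFrame joined = df1
        .join(df2Renamed, df1.col("id").equalTo(df2Renamed.col("df2_id")), "left_outer")
        .drop("df2_id"); // unambiguous now: only one column carries this name
```

Is something along these lines the intended way to do it, or is there a way to make drop() accept a qualified name like "df1.id" directly?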
Thank you!