Simple join of two Spark DataFrames failing with "org.apache.spark.sql.AnalysisException: Cannot resolve column name"

Question

Update: It turns out this has something to do with the way the Databricks Spark CSV reader creates the DataFrame. In the failing example below, I read the people and address CSV files using the Databricks CSV reader, then write the resulting DataFrames to HDFS in Parquet format.

I changed the code to build the data from an RDD instead (people.csv is handled the same way):

JavaRDD<Address> address = context.textFile("/Users/sfelsheim/data/address.csv").map(
            new Function<String, Address>() {
                public Address call(String line) throws Exception {
                    String[] parts = line.split(",");

                    Address addr = new Address();
                    addr.setAddrId(parts[0]);
                    addr.setCity(parts[1]);
                    addr.setState(parts[2]);
                    addr.setZip(parts[3]);

                    return addr;
                }
            });

I then write the resulting DataFrame to HDFS in Parquet format, and the join works as expected.

I am reading the exact same CSV in both cases.

I am running into an issue trying to perform a simple join of two DataFrames created from two different Parquet files on HDFS.

[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1

Using HDFS from Hadoop 2.7.0

Here is a sample to illustrate.

public void testStrangeness(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
    JavaSparkContext context = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(context);

    DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
    DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");

    people.printSchema();
    address.printSchema();

    // yeah, works
    DataFrame cartJoin = address.join(people);
    cartJoin.printSchema();

    // boo, fails 
    DataFrame joined = address.join(people,
            address.col("addrid").equalTo(people.col("addressid")));

    joined.printSchema();
}

Contents of people

first,last,addressid 
your,mom,1 
fred,flintstone,2

Contents of address

addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111

people.printSchema();

results in...

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

address.printSchema();

results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)


DataFrame cartJoin = address.join(people);
cartJoin.printSchema();

Cartesian join works fine, printSchema() results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

This join...

DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));

Results in the following exception.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)

I tried changing it so that people and address have a common key attribute (addressid) and used:

address.join(people, "addressid");

But got the same result.

Any ideas??

Thanks

score 2 · Answer 1 · answered Sep 10 '15 at 15:50

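It turns out the problem was that the CSV file was encoded as UTF-8 with a BOM, which the Databricks CSV implementation does not handle. The BOM appears to end up as an invisible character at the start of the first column name, which would explain why "addrid" shows up in the list of columns yet cannot be resolved. After I converted the files to UTF-8 without the BOM, everything works fine.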
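To check whether a BOM has leaked into a column name, it can help to print the names with every non-printable or non-ASCII character escaped. The following is a minimal sketch and not part of the original code: the class ColumnNameInspector and the method escapeNonAscii are hypothetical names, and the hard-coded array is an assumption about what address.columns() might return when the BOM is attached to the first header.

```java
final class ColumnNameInspector {

    /** Returns the name with every character outside printable ASCII shown as a hex escape. */
    static String escapeNonAscii(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                // Make invisible characters such as U+FEFF visible.
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: stand-in for address.columns() on the affected DataFrame.
        String[] columns = {"\uFEFFaddrid", "city", "state", "zip"};
        for (String column : columns) {
            System.out.println(escapeNonAscii(column));
        }
    }
}
```

If the first name prints with a leading FEFF escape, the column is not really called "addrid", even though both names look identical in the exception message.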

S. Felsheim


Can you explain what BOM is in this context? – dmux Sep 30 '16 at 13:18
BOM stands for byte order mark; see http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom – S. Felsheim Apr 25 '17 at 17:46
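As a small illustration of what the BOM looks like to a program: in UTF-8 it is the three bytes EF BB BF, and Java's standard UTF-8 decoder keeps it as the single character U+FEFF instead of dropping it. The sketch below uses hypothetical names (BomDemo, decodeUtf8, stripLeadingBom) and a hand-built byte array as an assumed example of a header that starts with a BOM.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {

    /** Decodes bytes as UTF-8; a leading BOM survives as the character U+FEFF. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Removes a single leading U+FEFF if one is present. */
    static String stripLeadingBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == '\uFEFF') ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Assumed example: the bytes EF BB BF followed by the ASCII text "addrid".
        byte[] header = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'd', 'd', 'r', 'i', 'd'};
        String decoded = decodeUtf8(header);

        System.out.println(decoded.length());                          // 7, not 6
        System.out.println(decoded.equals("addrid"));                  // false
        System.out.println(stripLeadingBom(decoded).equals("addrid")); // true
    }
}
```

The decoded string is one character longer than it looks, which is why a comparison against the plain name fails.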

score 0 · Answer 2 · answered Mar 10 '17 at 15:48

I was able to fix this using Notepad++: under the "Encoding" menu, I switched the file from "Encode in UTF-8 BOM" to "Encode in UTF-8".
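If editing each file by hand is not practical, the same conversion could be scripted. This is only a sketch under stated assumptions: the class BomStripper and its methods are hypothetical names, the source and target paths come from the command line, and the whole file is read into memory, so it suits small CSV files like the ones in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class BomStripper {

    /** True if the data begins with the UTF-8 BOM bytes EF BB BF. */
    static boolean startsWithBom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    /** Returns the data without a leading UTF-8 BOM; data without a BOM is returned as-is. */
    static byte[] stripBom(byte[] data) {
        return startsWithBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    /** Copies source to target, dropping a leading UTF-8 BOM if there is one. */
    static void rewriteWithoutBom(Path source, Path target) throws IOException {
        Files.write(target, stripBom(Files.readAllBytes(source)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: BomStripper <source.csv> <target.csv>");
            return;
        }
        rewriteWithoutBom(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing to a separate target file keeps the original untouched, so the result can be compared before the CSV is loaded again.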

PlutoTheCat

