
Update: It turns out this has something to do with the way the Databricks Spark CSV reader creates the DataFrame. In the example below that does not work, I read the people and address CSVs using the Databricks CSV reader, then write the resulting DataFrames to HDFS in Parquet format.
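The read-and-write step looked roughly like this (a sketch for the address file; it assumes the spark-csv 1.x API and the same sqlContext as in the sample further down):

    // The path that does NOT work: read the CSV with the Databricks
    // spark-csv reader, then persist the DataFrame to HDFS as Parquet.
    DataFrame addressDf = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")        // first line holds the column names
            .option("inferSchema", "true")   // let the reader guess column types
            .load("/Users/sfelsheim/data/address.csv");
    addressDf.write().parquet("hdfs://localhost:9000/datalake/sample/address.parquet");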

I changed the code to create the DataFrame like this (the people.csv is handled similarly):

JavaRDD<Address> address = context.textFile("/Users/sfelsheim/data/address.csv").map(
        new Function<String, Address>() {
            // split each CSV line into its fields and populate an Address bean
            public Address call(String line) throws Exception {
                String[] parts = line.split(",");

                Address addr = new Address();
                addr.setAddrId(parts[0]);
                addr.setCity(parts[1]);
                addr.setState(parts[2]);
                addr.setZip(parts[3]);

                return addr;
            }
        });

and then convert the RDD to a DataFrame and write it to HDFS in Parquet format (sketched below), and the join works as expected.
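The conversion and write step looks roughly like this (a sketch; it assumes Address is a plain serializable JavaBean, which is what createDataFrame needs for bean reflection):

    // Build a DataFrame from the JavaRDD<Address> via bean reflection,
    // then persist it to HDFS as Parquet.
    DataFrame addressDf = sqlContext.createDataFrame(address, Address.class);
    addressDf.write().parquet("hdfs://localhost:9000/datalake/sample/address.parquet");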

I am reading the exact same CSV files in both cases.


I am running into an issue trying to perform a simple join of two DataFrames created from two different Parquet files on HDFS.


[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1

Using HDFS from Hadoop 2.7.0


Here is a sample to illustrate.

public void testStrangeness(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
    JavaSparkContext context = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(context);

    DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
    DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");

    people.printSchema();
    address.printSchema();

    // yeah, works
    DataFrame cartJoin = address.join(people);
    cartJoin.printSchema();

    // boo, fails 
    DataFrame joined = address.join(people,
            address.col("addrid").equalTo(people.col("addressid")));

    joined.printSchema();
}

Contents of people

first,last,addressid 
your,mom,1 
fred,flintstone,2

Contents of address

addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111

people.printSchema(); 

results in...

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

address.printSchema();

results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)


DataFrame cartJoin = address.join(people);
cartJoin.printSchema();

Cartesian join works fine, printSchema() results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

This join...

DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));

Results in the following exception.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)

I tried changing it so people and address have a common key attribute (addressid) and used...

address.join(people, "addressid");

But got the same result.

Any ideas??

Thanks


2 Answers


Turns out the problem was that the CSV files were in UTF-8 format with a BOM. The Databricks CSV implementation does not handle UTF-8 with a BOM: the BOM characters become part of the first column name, so the schema really contains "\uFEFFaddrid", which prints exactly like "addrid" but does not match it (which is why the error lists "addrid" among the columns yet cannot resolve it). I converted the files to UTF-8 without the BOM and everything works fine.
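You can make the hidden character visible by printing each column name together with its length, something like:

    // A name that prints as "addrid" but reports length 7 contains
    // the invisible BOM character (U+FEFF) at the front.
    for (String name : address.columns()) {
        System.out.println("'" + name + "' (length " + name.length() + ")");
    }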


I was able to fix this by using Notepad++: under the "Encoding" menu, I switched it from "Encode in UTF-8 BOM" to "Encode in UTF-8".
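If you have many files to convert, something along these lines strips the BOM programmatically (BomStripper is my own illustration, not a library utility; it rewrites the file in place):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;

    public class BomStripper {

        // The UTF-8 BOM is the three bytes EF BB BF at the start of the file.
        public static void stripBom(Path file) throws IOException {
            byte[] bytes = Files.readAllBytes(file);
            if (bytes.length >= 3
                    && (bytes[0] & 0xFF) == 0xEF
                    && (bytes[1] & 0xFF) == 0xBB
                    && (bytes[2] & 0xFF) == 0xBF) {
                Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
            }
        }

        public static void main(String[] args) throws IOException {
            stripBom(Paths.get(args[0])); // e.g. java BomStripper address.csv
        }
    }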