
I have Hortonworks HDP 2.6.3 running Spark2 (v2.2). My test case is very simple:

  1. Create a Hive table with some random values (HiveServer2 at port 10000)

  2. Start the Spark Thrift Server on port 10016

  3. Run pyspark and query the Hive table via port 10016

However, I was unable to get the data from Spark due to a NumberFormatException.

Here is my test case:

  1. Create a Hive table with sample rows:

beeline> !connect jdbc:hive2://localhost:10000/default hive hive

create table test1 (id int, desc varchar(40));

insert into table test1 values (1,"aa"),(2,"bb");

  2. Run the Spark Thrift Server:

su - spark -c '/usr/hdp/2.6.3.0-235/spark2/sbin/start-thriftserver.sh --master yarn-client --executor-memory 512m --hiveconf hive.server2.thrift.port=10016'

  3. Run pyspark as the spark user: su - spark -c 'pyspark'

  4. Type in the code below:

    df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test1",user="hive", password="hive").load()

    df.select("*").show()

  5. I got this error:

17/12/15 08:04:13 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more 17/12/15 08:04:13 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more

17/12/15 08:04:14 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job Traceback (most recent call last): File "", line 1, in File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show print(self._jdf.showString(n, 20)) File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o75.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at 
org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837) at org.apache.spark.sql.Dataset.head(Dataset.scala:2154) at org.apache.spark.sql.Dataset.take(Dataset.scala:2367) at org.apache.spark.sql.Dataset.showString(Dataset.scala:245) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.sql.SQLException: Cannot convert column 1 to integerjava.lang.NumberFormatException: For input string: "id" at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:351) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:394) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$6.apply(JdbcUtils.scala:393) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ... 1 more Caused by: java.lang.NumberFormatException: For input string: "id" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:580) at java.lang.Integer.valueOf(Integer.java:766) at org.apache.hive.jdbc.HiveBaseResultSet.getInt(HiveBaseResultSet.java:346) ... 23 more
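Every one of these stack traces bottoms out in the same frame: Integer.parseInt being handed the literal string "id". In other words, the driver is returning the column name itself as row data. A tiny Python sketch of that final conversion failure (the helper get_int is hypothetical, just for illustration):

```python
# Rough analogue of HiveBaseResultSet.getInt receiving the text "id":
# Java's Integer.parseInt throws NumberFormatException; Python's int()
# raises ValueError on the same input.
def get_int(cell):
    """Convert one result-set cell to an integer (illustrative helper)."""
    return int(cell)

print(get_int("1"))        # a genuine value converts fine
try:
    get_int("id")          # the echoed column name does not
except ValueError as exc:
    print("conversion failed:", exc)
```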

  6. I suspected it had something to do with the id column, so I changed to: df.select("desc").show()

  7. Then I got this strange result:

+----+
|desc|
+----+
|desc|
|desc|
+----+

  8. If I go back to beeline and query via port 10016, everything works fine:

beeline> !connect jdbc:hive2://localhost:10016/default hive hive

select * from test1;

+-----+-------+--+
| id  | desc  |
+-----+-------+--+
| 1   | aa    |
| 2   | bb    |
+-----+-------+--+

  9. If I point pyspark at port 10000 instead, the same problem persists.

Could you please help me understand why and how to get the rows via Spark?

UPDATE 1

I followed @Achyuth's advice below in both cases, and they still don't work.

Case 1

Beeline:

create table test4 (id String, desc String);
insert into table test4 values ("1","aa"),("2","bb");
select * from test4;

Pyspark:

>>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test4",user="hive", password="hive").option("fetchsize", "10").load()
>>> df.select("*").show()
+---+----+
| id|desc|
+---+----+
| id|desc|
| id|desc|
+---+----+

For some reason, it returned the column names as rows?!
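A plausible mechanism (my assumption, not confirmed anywhere in this thread): Spark's JDBC reader quotes column names with double quotes when it builds its SELECT, and HiveQL treats double-quoted tokens as string literals, so the query effectively selects the constant strings 'id' and 'desc' once per stored row. The shape of that result is easy to reproduce with sqlite3 by selecting literals instead of columns:

```python
import sqlite3

# Simulate a query that selects string literals instead of columns: every
# stored row echoes back the literal text, matching the pyspark output above.
# (Column renamed to descr here because DESC is a reserved word in SQLite.)
con = sqlite3.connect(":memory:")
con.execute("create table test4 (id text, descr text)")
con.executemany("insert into test4 values (?, ?)", [("1", "aa"), ("2", "bb")])

rows = con.execute("select 'id', 'desc' from test4").fetchall()
print(rows)  # one ('id', 'desc') tuple per stored row
```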

Case 2

Beeline:

create table test5 (id int, desc varchar(40)) STORED AS ORC;
insert into table test5 values (1,"aa"),(2,"bb");
select * from test5;

Pyspark:

Still the same error: Caused by: java.lang.NumberFormatException: For input string: "id"

UPDATE 2

Create a table and insert values via Hive port 10000, then query it. This works fine via beeline:

beeline> !connect jdbc:hive2://localhost:10000/default hive hive
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> create table test2 (id String, desc String) STORED AS ORC;
No rows affected (0.3 seconds)
0: jdbc:hive2://localhost:10000/default> insert into table test2 values ("1","aa"),("2","bb");
INFO  : Session is already open
INFO  : Dag name: insert into table tes..."1","aa"),("2","bb")(Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id application_1514019042819_0006)

INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table default.test2 from webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test2/.hive-staging_hive_2017-12-23_04-29-54_569_601147868480753216-3/-ext-10000
INFO  : Table default.test2 stats: [numFiles=1, numRows=2, totalSize=317, rawDataSize=342]
No rows affected (15.414 seconds)
0: jdbc:hive2://localhost:10000/default> select * from table2;
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table2' (state=42S02,code=10001)
0: jdbc:hive2://localhost:10000/default> select * from test2;
+-----------+-------------+--+
| test2.id  | test2.desc  |
+-----------+-------------+--+
| 1         | aa          |
| 2         | bb          |
+-----------+-------------+--+
2 rows selected (0.364 seconds)

Also via beeline, I can use the Spark Thrift Server at 10016 to do the same thing, and it works fine:

beeline> !connect jdbc:hive2://localhost:10016/default hive hive
Connecting to jdbc:hive2://localhost:10016/default
1: jdbc:hive2://localhost:10016/default> create table test3 (id String, desc String) STORED AS ORC;
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (1.234 seconds)
1: jdbc:hive2://localhost:10016/default> insert into table test3 values ("1","aa"),("2","bb");
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (9.111 seconds)
1: jdbc:hive2://localhost:10016/default> select * from test3;
+-----+-------+--+
| id  | desc  |
+-----+-------+--+
| 1   | aa    |
| 2   | bb    |
+-----+-------+--+
2 rows selected (3.387 seconds)

This means Spark and the Thrift Server work fine. But with pyspark I hit the same problem: the results are empty:

>>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test3",user="hive", password="hive").load()
>>> df.select("*").show()
+---+----+
| id|desc|
+---+----+
+---+----+

UPDATE 3

DESCRIBE EXTENDED test3;

# Detailed Table Information  | CatalogTable(
    Table: `default`.`test3`
    Owner: hive
    Created: Sat Dec 23 04:37:14 PST 2017
    Last Access: Wed Dec 31 16:00:00 PST 1969
    Type: MANAGED
    Schema: [`id` string, `desc` string]
    Properties: [totalSize=620, numFiles=2, transient_lastDdlTime=1514032656, STATS_GENERATED_VIA_STATS_TASK=true]
    Storage(Location: webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3, InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, Serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde, Properties: [serialization.format=1]))

SHOW CREATE TABLE test3;

CREATE TABLE `test3`(`id` string, `desc` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'totalSize' = '620',
  'numFiles' = '2',
  'transient_lastDdlTime' = '1514032656',
  'STATS_GENERATED_VIA_STATS_TASK' = 'true'
)

su - spark -c 'hdfs dfs -cat webhdfs://demo.myapp.local:40070/apps/hive/warehouse/test3/part-00000'

[screenshot of the command's output]

HP.

2 Answers

Even though you create the Hive table with specific data types, the underlying data is stored in string format when it is inserted.

So when Spark tries to read the data, it uses the metastore to find the data types. The column is present as int in the Hive metastore but as a string in the file, and that is what throws the cast exception.

Solutions

Create the table with string columns and read it from Spark; it will work:

create table test1 (id String, desc String);

If you want the data types preserved, specify one of the file formats such as ORC or Parquet when creating the table, and then insert into it. You will be able to read the file from Spark without exceptions:

 create table test1 (id int, desc varchar(40)) STORED AS ORC;

Now the question is: why is Hive able to read it? Hive has good cast options available, while Spark doesn't.

loneStar
  • I tried both methods above and still had problems. Could you please look at my updated question? – HP. Dec 26 '17 at 23:52
  • can you please query the string table from the beeline hive and verify the output – loneStar Dec 27 '17 at 15:27
  • I had UPDATE 2 above. In sum, beeline works fine for either Hive 10000 or Spark Thrift at 10016. But getting the data in pyspark gave empty results. – HP. Dec 27 '17 at 19:05
  • can you create a channel so that we can chat? – loneStar Dec 27 '17 at 19:07
  • use show create table and with the hdfs location, do hdfs dfs -cat location. – loneStar Dec 27 '17 at 20:19
  • I had UPDATE 3 with the information you asked. Here is the chat channel https://chat.stackoverflow.com/rooms/162052/spark-2-2-thrift-error – HP. Dec 27 '17 at 20:56
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/162053/discussion-between-achyuth-and-hp). – loneStar Dec 27 '17 at 20:59
  • Thanks for the help. The answer is: spark = SparkSession.Builder().appName("test3").enableHiveSupport().getOrCreate(); spark.sql("select * from default.test3").collect() – HP. Dec 27 '17 at 22:29
  • @HP. did you manage to get it working with JDBC? especially the case where the returned result returns rows with the columns names, and the case where it returns empty dataframe – Kaygi22 Apr 29 '21 at 12:52
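The fix HP. arrives at in the comment thread above, bypassing jdbc:hive2 and reading the table through Spark's own Hive support, can be sketched like this (it assumes pyspark is installed and a metastore is reachable, so the import is guarded):

```python
# Sketch of the working approach from the comments: let Spark read
# default.test3 via its Hive integration instead of the JDBC data source.
try:
    from pyspark.sql import SparkSession
except ImportError:          # pyspark not installed in this environment
    SparkSession = None

rows = []
if SparkSession is not None:
    spark = (SparkSession.builder
             .appName("test3")
             .enableHiveSupport()   # talk to the metastore, not jdbc:hive2
             .getOrCreate())
    rows = spark.sql("select * from default.test3").collect()
```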

Just solved this after looking it up on the internet for hours!!!! (No, the internet didn't help.)

Since I am still a noob in Scala/Hive, I can't provide a good explanation. But I solved it by adding some external JAR files (Hive JDBC) provided by AWS, and I also changed the driver option to "com.amazon.hive.jdbc41.HS2Driver".

Go to the following link to download the drivers and also see the sample code.

http://awssupportdatasvcs.com/bootstrap-actions/Simba/latest/

Dharman