
I have been trying to use sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver") to read a Hive table into Spark, without any success. I have done some research and read the following:

How to connect to remote hive server from spark

Spark 1.5.1 not working with hive jdbc 1.2.0

http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html

I used the latest Hortonworks Sandbox 2.6 and asked the community there the same question:

https://community.hortonworks.com/questions/156828/pyspark-jdbc-py4jjavaerror-calling-o95load-javasql.html?childToView=156936#answer-156936

What I want to do is very simple via pyspark:

df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="sample_07",user="maria_dev", password="maria_dev").load()

That gave me this error:

17/12/30 19:55:14 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10016/default
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark-client/python/pyspark/sql/readwriter.py", line 139, in load
    return self._df(self._jreader.load())
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o119.load.
: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:143)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:748)

Using Beeline, it works fine:

beeline> !connect jdbc:hive2://localhost:10016/default maria_dev maria_dev
Connecting to jdbc:hive2://localhost:10016/default
Connected to: Spark SQL (version 2.1.1.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10016/default> select * from sample_07 limit 2;
+----------+-------------------------+------------+---------+--+
|   code   |       description       | total_emp  | salary  |
+----------+-------------------------+------------+---------+--+
| 00-0000  | All Occupations         | 134354250  | 40690   |
| 11-0000  | Management occupations  | 6003930    | 96150   |
+----------+-------------------------+------------+---------+--+

I could also do this:

spark = SparkSession.Builder().appName("testapp").enableHiveSupport().getOrCreate()
spark.sql("select * from default.sample_07").collect()

But this goes through the Hive metastore directly. I would like to use JDBC against the Spark Thrift Server instead, for fine-grained security.

I could do PostgreSQL like so:

sqlContext.read.format("jdbc").options(driver="org.postgresql.Driver")

I could also use Scala's java.sql.{DriverManager, Connection, Statement, ResultSet} to open a JDBC connection as a client to the Spark Thrift Server, but that pulls all the data into client memory and then requires re-creating the DataFrame manually.
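
For illustration, here is a minimal sketch of that client-side approach in Scala (connection details mirror my setup above; it assumes the Hive JDBC driver is on the classpath and an existing spark session in spark-shell). It shows why the approach is unattractive: every row is pulled through the JDBC client into memory and the DataFrame has to be rebuilt by hand.

import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

// Placeholder connection details, mirroring the question.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10016/default", "maria_dev", "maria_dev")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("select code, description, total_emp, salary from sample_07")

// Every row is materialized in the client's memory...
val rows = ArrayBuffer[(String, String, Long, Long)]()
while (rs.next()) {
  rows += ((rs.getString(1), rs.getString(2), rs.getLong(3), rs.getLong(4)))
}
rs.close(); stmt.close(); conn.close()

// ...and only then turned back into a DataFrame by hand.
val df = spark.createDataFrame(rows).toDF("code", "description", "total_emp", "salary")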

So the question is: is there a way to create a Spark DataFrame from Hive table data without loading all the data into memory in a JDBC client (as with the Scala approach) and without using SparkSession.Builder() as in the examples above? My use case is that I need fine-grained security.

  • Asking three different questions is a clear sign your question is too broad, please edit to restrict it to a single question, and post separate questions for the others. – Mark Rotteveel Jan 04 '18 at 07:13
  • I updated the question – HP. Jan 04 '18 at 09:05
  • For people who don't use Hortonworks - could you include the Spark version? – Alper t. Turker Feb 01 '18 at 11:32
  • Is Beeline installed on the local host only? Even if it is installed locally, can you try giving its IP? As far as I know, Beeline might not be taking the connection through localhost. Please try once and let me know if it works. Maybe in yarn-client mode it is not picking up localhost. – Manu Gupta Feb 04 '18 at 09:46
  • This is Spark 2.2. But frankly it doesn't matter much. I can change the version if the latest and greatest just works. I tried localhost and FQDN, it is the same... – HP. Feb 06 '18 at 09:12
  • @HP. Sorry for the delayed response. Feel free to contact me for chat – loneStar Feb 09 '18 at 19:38
  • @HP. I think I found a solution for this issue – loneStar Feb 27 '18 at 18:41
  • @Achyuth which solution? – HP. Mar 14 '18 at 23:46

2 Answers


I'm not sure I understand your question correctly, but from what I understand you need to get a Hive table into a DataFrame, and for that you don't need a JDBC connection. In your example links they are trying to connect to different databases (RDBMS), not Hive.

Please see the approach below; using a HiveContext you can get the table into a DataFrame.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext

def main(args: Array[String]): Unit = {

  val sparkConf = new SparkConf().setAppName("APPName")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new HiveContext(sc)
  val sqlContext = new SQLContext(sc)

  // Read the Hive table into a DataFrame via the HiveContext
  val hive_df: DataFrame = hiveContext.sql("select * from schema.table")

  // Other way:
  // val hive_df = hiveContext.table("SchemaName.TableName")

  // Print the first row
  hive_df.first()
  // Count the rows of the DataFrame
  hive_df.count()

}

If you really want to use a JDBC connection, I have the example below that I used for an Oracle database, which might help you.

val oracle_data = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:username/password@//hostname:2134/databaseName", "dbtable" -> "(your query) tmp", "driver" -> "oracle.jdbc.driver.OracleDriver"));
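
For what it's worth, the same read can also be written with the DataFrameReader API used elsewhere in the question. This is only a sketch: the connection details and query are placeholders, and a query passed as dbtable to the JDBC source has to be wrapped in parentheses and given an alias.

// Same Oracle read via the DataFrameReader API; connection details are placeholders.
val oracle_df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:username/password@//hostname:2134/databaseName")
  .option("dbtable", "(select * from some_table) tmp")  // a subquery must be aliased
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()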
roh
  • I am aware of your example above. It is basically using Spark to get data directly from the Hive metastore. My requirement is to use JDBC with user/password to the Spark Thrift Server, as the system might have LDAP or Active Directory authentication for fine-grained security (row/column filtering and masking). Going directly to the metastore would work, but that is one user accessing one cluster. – HP. Jan 08 '18 at 22:06
  • I have multiple users so I need to pass credentials. Another thought I have is to use Kerberos to secure the Hive metastore, but that doesn't give me fine-grained security (it either denies or allows access at the table level, not the row/column level). – HP. Jan 08 '18 at 22:06

Actually I looked into this. Hortonworks and Cloudera are dropping support for connecting to Hive from Spark through the Thrift Server.

So you are trying to do something that is no longer supported.

https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_thrift_server

The link says the Thrift Server is disabled, but that applies specifically to connecting to Hive from Spark. I am able to connect to all other types of databases from Spark, except Hive.

So you have to work with a different style of authorization.

Since the Spark session connects to Hive directly, they are removing the Thrift support.

From your previous question, it is able to read the data, but it reads the wrong data: Spark 2.2 Thrift server error on dataframe NumberFormatException when query Hive table

Code

>>> df = sqlContext.read.format("jdbc").options(driver="org.apache.hive.jdbc.HiveDriver", url="jdbc:hive2://localhost:10016/default", dbtable="test4",user="hive", password="hive").option("fetchsize", "10").load()
>>> df.select("*").show()
+---+----+
| id|desc|
+---+----+
| id|desc|
| id|desc|
+---+----+

The problem here is in how Hive quotes identifiers:

The default way the default dialect quotes identifiers is with double quotes. A SQL query like SELECT "dw_date" FROM table… will be parsed by Hive as selecting a string literal instead of a column named "dw_date". Replacing the quotes with backticks seems to resolve the issue. However, in my test, the column names returned from Hive are all prefixed with the table name, like table.dw_date, and you can't directly wrap backticks around the whole thing like `table.dw_date`. Instead, we need to wrap each part individually.

Code

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

case object HiveDialect extends JdbcDialect {
  // Apply this dialect to any Hive JDBC URL
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  // Quote each dotted part separately with backticks, e.g. table.dw_date -> `table`.`dw_date`
  override def quoteIdentifier(colName: String): String = {
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

Please follow the posts below to implement the solution.

https://medium.com/@viirya/custom-jdbc-dialect-for-hive-5dbb694cc2bd

https://medium.com/@huaxing/customize-spark-jdbc-data-source-to-work-with-your-dedicated-database-dialect-beec6519af27

Register the dialect

JdbcDialects.registerDialect(HiveDialect)

Then the Hive JDBC read works.
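
For completeness, here is a sketch of the JDBC read once the dialect is registered. The connection details mirror the question, and it assumes a Spark 2.x spark session (with an older shell, sqlContext.read works the same way):

// Read the Hive table over JDBC through the Spark Thrift Server,
// after JdbcDialects.registerDialect(HiveDialect) has been called.
val hiveJdbcDf = spark.read
  .format("jdbc")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("url", "jdbc:hive2://localhost:10016/default")
  .option("dbtable", "sample_07")
  .option("user", "maria_dev")
  .option("password", "maria_dev")
  .load()

hiveJdbcDf.show(2)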

loneStar
  • That is true for Cloudera, but I doubt it is the case for Hortonworks https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_spark-component-guide/content/config-sts.html unless you have a reliable source from inside the company. Regardless, what troubles me is that they don't state anywhere in the documentation that Spark to Thrift JDBC is not supported. If they did say that, I would be happy and move on... :) – HP. Feb 13 '18 at 17:29
  • Hi, when I run the above code I am getting the error "error: not found: type JdbcDialect" in spark-shell. Can you please help me fix this? – Gowtham SB May 20 '19 at 17:14
  • @GowthamSB Included the import – loneStar May 20 '19 at 18:15