I am using Spark on a Google Cloud Dataproc cluster and I would like to access Bigtable in a PySpark job. Do we have any Bigtable connector for Spark like Google BigQuery connector?

How can we access Bigtable from a PySpark application?


1 Answer


Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.

HBase currently provides only Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) through the SparkContext.newAPIHadoopRDD method, as sketched below. However, converting the resulting records into something usable in Python is difficult.
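To make that concrete, here is a minimal sketch of a Bigtable read via the HBase TableInputFormat. It assumes the bigtable-hbase (HBase-compatible) client jar and Spark's examples jar (for the Python converters) are on the cluster classpath; the google.bigtable.* property names depend on the client version, and all project/instance/table IDs are placeholders:

    # Sketch: reading Bigtable through the HBase MapReduce input format.
    # Assumes the bigtable-hbase client jar is on the classpath; the
    # property names and IDs below are placeholders that depend on the
    # client version you deploy.
    from pyspark import SparkContext

    sc = SparkContext()

    conf = {
        "google.bigtable.project.id": "my-project",    # placeholder
        "google.bigtable.instance.id": "my-instance",  # placeholder
        "hbase.mapreduce.inputtable": "my-table",      # placeholder
    }

    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        # Converters from Spark's examples jar turn the HBase key and
        # Result objects into Python strings; without converters the
        # records are not directly usable from Python.
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf)

    print(rdd.take(5))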

HBase is developing Spark SQL APIs, but they have not yet been integrated into a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0), and I have not used it, so I cannot speak to how easy it is to use.

Alternatively, you could use a Python-based Bigtable client and simply use PySpark for parallelism.
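As a sketch of that approach (assuming the google-cloud-bigtable package is installed on every worker and using recent versions of that client; all IDs and the column-family name are placeholders), each partition opens its own client and writes its rows:

    # Sketch: writing to Bigtable with the plain Python client, using
    # PySpark only for parallelism. Assumes google-cloud-bigtable is
    # installed on all workers; all names below are placeholders.
    from pyspark import SparkContext

    def write_partition(records):
        # The client is not serializable, so create one per partition
        # on the worker rather than on the driver.
        from google.cloud import bigtable
        client = bigtable.Client(project="my-project")
        table = client.instance("my-instance").table("my-table")
        for row_key, value in records:
            row = table.direct_row(row_key)
            row.set_cell("cf1", b"col", value)
            row.commit()

    sc = SparkContext()
    data = sc.parallelize(
        [("row-%04d" % i, b"value") for i in range(100)])
    data.foreachPartition(write_partition)

Reads work the same way: call table.read_rows() (or read_row for point lookups) inside mapPartitions so each partition reuses one client.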

Patrick Clay
  • Hello Patrick. I am trying to write to Bigtable using PySpark. Is there an example of how it can be done using SparkContext.newAPIHadoopRDD? – MANISH ZOPE Apr 16 '18 at 13:27