
I am building a Spark Structured Streaming job that does the following.

Streaming source:

val small_df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder brokers
  .option("subscribe", "my_topic")                  // placeholder topic
  .load()

small_df.createOrReplaceTempView("small_df")

A static DataFrame loaded from Phoenix:

val phoenixDF = spark.read.format("org.apache.phoenix.spark")
  .option("table", "my_table")
  .option("zkUrl", "zk")
  .load()

phoenixDF.createOrReplaceTempView("phoenix_tbl")

Then, a Spark SQL statement joins the two views on the primary key, using the small streaming dataset to filter records from the Phoenix table:

val filteredDF = spark.sql("select phoenix_tbl.* from small_df join phoenix_tbl on small_df.id = phoenix_tbl.id")
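
The scan behaviour can be checked from the plan; explain() works on a streaming DataFrame before the query is started, and any predicates pushed to Phoenix show up in the scan node (an empty PushedFilters list means a full scan on the Phoenix side):

// Print the parsed/analyzed/physical plans; the Phoenix scan node lists
// PushedFilters, so an empty list indicates a full table scan.
filteredDF.explain(true)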

Observations:

Spark does a full table scan for the join; only a direct filter on the primary key is pushed down to Phoenix as a range scan.
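
For comparison, a static predicate on the primary key does get pushed down and served as a range scan; a hypothetical example (the bounds are made up):

import org.apache.spark.sql.functions.col

// Static filter on the primary key: the Phoenix connector pushes the
// predicate down, so it executes as a range scan (bounds are hypothetical).
val rangeScanDF = phoenixDF.filter(col("id") >= "key_001" && col("id") <= "key_100")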

Since small_df is a streaming Dataset, I cannot apply a static filter up front, so I rely on the join to filter records from the Phoenix table. That, however, ends up as a full table scan, which is not feasible.

More details on the requirement

How can I perform range scan in this case?

I am doing something similar to what is discussed here, with the only difference that my small_df is a streaming dataset.
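
One idea I am considering is foreachBatch (Spark 2.4+), so that each micro-batch's keys can be collected and pushed down to Phoenix as an IN predicate, which Phoenix can serve as point/range scans. A rough sketch only; the sink path is a placeholder, and it assumes small_df already exposes an id column (as in the SQL above) and that each batch's key set is small:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: per micro-batch, collect the (assumed small) set of join keys
// and push them down to Phoenix as an IN predicate, so Phoenix can do
// point/range scans instead of a full table scan.
val query = small_df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Distinct keys in this micro-batch; assumes the set fits on the driver.
    val ids = batchDF.select("id").distinct().collect().map(_.get(0))

    val phoenixSlice = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "my_table")
      .option("zkUrl", "zk")
      .load()
      .filter(col("id").isin(ids: _*)) // pushed down to Phoenix

    // Placeholder sink -- replace with the real downstream processing.
    phoenixSlice.write.mode("append").parquet("/tmp/filtered_output")
  }
  .start()

Whether this is workable depends on the per-batch key cardinality; with large key sets the IN list becomes impractical.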

