Scala Spark Filter RDD using Cassandra

Question

I am new to the Spark Cassandra connector and Scala. I have an existing RDD, let's say:

(url_hash, url, created_timestamp)

I want to filter this RDD based on url_hash. If a url_hash already exists in the Cassandra table, I want to drop that row from the RDD so that only the new URLs are processed.

The Cassandra table looks like the following:

 url_hash | url | created_timestamp | updated_timestamp

Any pointers would be great.

I tried something like this:

   case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
   def timestamp = new java.util.Date()
   val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
   val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
   val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
   val newUrlsRDD = rdd1.subtractByKey(rdd3)

I am getting a Cassandra error:

java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info.If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper

As far as I can tell, there are no null values in the Cassandra table.

What have you tried? Turn the Cassandra table into another RDD, `map` both so they have `url_hash` as the key, then use `subtractByKey`? — The Archetypal Paul, Feb 08 '17 at 20:29
Thanks for the pointer. I updated the question with what I tried. Now I am getting a null pointer exception — Abhishek, Feb 09 '17 at 00:14

score 1 · Accepted Answer · edited Oct 27 '17 at 01:29

Thanks, The Archetypal Paul!

I hope somebody finds this useful. I had to wrap the columns that can come back null (full_url and created_ts) in Option in the case class.

Better solutions are welcome.

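import java.util.Date
import com.datastax.spark.connector._

// Nullable columns are wrapped in Option so the connector can map null values.
case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

def timestamp = new Date()
// Incoming URLs keyed by hash.
val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
// Existing rows from Cassandra, keyed the same way.
val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
// Keep only the URLs whose hash is not already in Cassandra.
val newUrlsRDD = rdd1.subtractByKey(rdd3)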
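
For anyone who wants to try the keyed subtraction without a Cassandra cluster, here is a minimal local-mode sketch. It rests on several assumptions that are not in the code above: the `calcSHA256` body is only a guess at what the original helper does, the `example` URLs are made up, and an in-memory `Seq` stands in for the rows that `sc.cassandraTable` would return. It only illustrates how `subtractByKey` behaves when the right-hand values are `Option`-typed.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Date

import org.apache.spark.{SparkConf, SparkContext}

object NewUrlsSketch {
  // Hypothetical stand-in for calcSHA256 (the original helper is not shown):
  // hex-encoded SHA-256 of the UTF-8 bytes of the input string.
  def calcSHA256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("new-urls-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val now = new Date()

      // Incoming URLs keyed by hash, shaped like rdd1 above.
      val incoming = sc
        .parallelize(Seq("http://a.example", "http://b.example", "http://c.example"))
        .map(url => (calcSHA256(url), (url, now)))

      // In-memory stand-in for the Cassandra rows (rdd3 above).
      // Columns that may be null are Option-typed, so a missing value is None.
      val existing = sc.parallelize(Seq[(String, (Option[String], Option[Date]))](
        (calcSHA256("http://a.example"), (Some("http://a.example"), Some(now))),
        (calcSHA256("http://b.example"), (None, None))
      ))

      // subtractByKey compares keys only, so the differing value types are fine.
      val newUrls = incoming.subtractByKey(existing)

      // Prints only http://c.example, the one URL whose hash is not in `existing`.
      newUrls.values.map(_._1).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because `subtractByKey` ignores the values on the right-hand side, the `(None, None)` row still removes `http://b.example` from the result; the `Option` wrapping only matters for getting the rows out of Cassandra in the first place.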
