
I am new to the Spark Cassandra connector and Scala. I have an existing RDD of tuples, let's say:

(url_hash, url, created_timestamp)

I want to filter this RDD based on url_hash. If a url_hash already exists in the Cassandra table, I want to drop that row from the RDD so that I only process the new URLs.

The Cassandra table looks like the following:

 url_hash | url | created_timestamp | updated_timestamp

Any pointers would be great.

I tried something like this:

    case class UrlInfoT(url_sha256: String, full_url: String, created_ts: Date)
    def timestamp = new java.util.Date()
    val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
    val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
    val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
    val newUrlsRDD = rdd1.subtractByKey(rdd3)

I am getting this Cassandra error:

java.lang.NullPointerException: Unexpected null value of column full_url in keyspace.url_info. If you want to receive null values from Cassandra, please wrap the column type into Option or use JavaBeanColumnMapper

There are no null values in the Cassandra table.

Abhishek
  • What have you tried? Turn the Cassandra table into another RDD, `map` both so they have `url_hash` as the key, then use `subtractByKey`? – The Archetypal Paul Feb 08 '17 at 20:29
  • Thanks for the pointer. I updated the question with what I tried. Now I am getting a NullPointerException. – Abhishek Feb 09 '17 at 00:14

1 Answer


Thanks, The Archetypal Paul!

I hope somebody finds this useful. I had to wrap the column types in Option in the case class, as the error message suggests.

Looking forward to better solutions.

    import java.util.Date
    import com.datastax.spark.connector._

    // Option lets the connector map a null column to None instead of throwing.
    case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

    def timestamp = new Date()
    val rdd1 = rdd.map(row => (calcSHA256(row(1)), (row(1), timestamp)))
    val rdd2 = sc.cassandraTable[UrlInfoT]("keyspace", "url_info").select("url_sha256", "full_url", "created_ts")
    val rdd3 = rdd2.map(row => (row.url_sha256, (row.full_url, row.created_ts)))
    // Keep only the URLs whose hash is not already in Cassandra.
    val newUrlsRDD = rdd1.subtractByKey(rdd3)
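
To see why the snippet above still works when the two sides carry different value types, here is a minimal sketch that imitates the key-based subtraction with plain Scala collections, so it runs without Spark or Cassandra. The `SubtractByKeySketch` object, its `subtractByKey` helper, and the sample hashes and URLs are hypothetical stand-ins for illustration only, not part of the connector or of Spark.

```scala
import java.util.Date

// Hypothetical stand-in for a row read from Cassandra; nullable columns are Option.
final case class UrlInfoT(url_sha256: String, full_url: Option[String], created_ts: Option[Date])

object SubtractByKeySketch {
  // Keep only the left-hand pairs whose key does not appear on the right.
  // Only keys are compared, so the value types V and W may differ.
  def subtractByKey[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
    val seen = right.map(_._1).toSet
    left.filterNot { case (k, _) => seen.contains(k) }
  }

  def main(args: Array[String]): Unit = {
    val now = new Date()
    // Assumed sample data: two incoming URLs keyed by a made-up hash.
    val incoming = Seq(
      "h1" -> ("http://a.example", now),
      "h2" -> ("http://b.example", now)
    )
    // A stored row whose full_url column is null is represented as None.
    val stored = Seq(UrlInfoT("h1", None, Some(now)))
      .map(r => (r.url_sha256, (r.full_url, r.created_ts)))

    val fresh = subtractByKey(incoming, stored)
    println(fresh.map(_._1)) // prints List(h2)
  }
}
```

Because only the keys are compared, a `None` in `full_url` or `created_ts` on the Cassandra side has no effect on which URLs count as new.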
Shaido
Abhishek