
How would I search by range on multiple key columns? As I see it, we cannot have a query that might go to multiple partitions. Are there any possible solutions to achieve the same query results?

CREATE TABLE RangeSearch (segment TEXT, date TIMESTAMP, dec decimal, value text, PRIMARY KEY (segment,date,dec));


SELECT * FROM RangeSearch
WHERE segment='SEG1'
AND date>='2015-01-01' AND date<='2015-12-12'
AND dec>=100 AND dec<=200;
Prab

2 Answers


Short answer: there is no way to have an efficient range query for two or more clustering columns in Cassandra.

Long answer: to understand the core reason for this limitation, consider how Cassandra stores rows inside a single partition.

All rows inside a partition in a single SSTable file are sorted by their clustering keys. For your schema:

2015-01-01 100
2015-01-01 200
2015-01-01 300
2015-01-02 100
2015-01-02 200
2015-01-02 300
2015-01-03 100
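
This ordering is just lexicographic ordering on the clustering-key tuple, which you can reproduce in plain Scala (no Cassandra involved; the rows below are made up to match the example):

```scala
// Hypothetical rows from the example partition, keyed by (date, dec).
// Cassandra keeps them sorted by clustering key; here we reproduce
// that with the natural lexicographic ordering on tuples.
val rows = Seq(
  ("2015-01-02", 200), ("2015-01-01", 300), ("2015-01-03", 100),
  ("2015-01-01", 100), ("2015-01-02", 300), ("2015-01-01", 200),
  ("2015-01-02", 100)
)
val sorted = rows.sorted  // first by date, ties broken by dec
sorted.foreach { case (date, dec) => println(s"$date $dec") }
```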

There is a way to read a single slice of data, by using a range query like SELECT * FROM RangeSearch WHERE segment='SEG1' AND date>='2015-01-01' AND date<'2015-01-03';. For this query Cassandra issues a single disk seek, because it knows exactly where the required data starts and ends (it's sorted on disk).

You can even issue a two-column slice query like SELECT * FROM RangeSearch WHERE segment='SEG1' AND date>='2015-01-01' AND date<'2015-01-03' AND dec>=100 AND dec<=200. You would probably expect these results:

2015-01-01 100
2015-01-01 200
2015-01-02 100
2015-01-02 200

But you'll get a slightly surprising output:

2015-01-01 100
2015-01-01 200
2015-01-01 300
2015-01-02 100
2015-01-02 200

The difference is the row 2015-01-01 300, and it appears in the output for a reason: your query is split into two parts, the start of the slice (2015-01-01, 100) and the end of the slice (2015-01-02, 200). Cassandra then reads all the data between these two points in a single disk seek.
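This slice resolution can be sketched in plain Scala (an assumed model of the behavior, not connector code): the two per-column bounds collapse into one (start, end) pair, and everything between them in clustering order is returned:

```scala
// Sorted rows of the example partition, keyed by (date, dec).
val sorted = Seq(
  ("2015-01-01", 100), ("2015-01-01", 200), ("2015-01-01", 300),
  ("2015-01-02", 100), ("2015-01-02", 200), ("2015-01-02", 300),
  ("2015-01-03", 100)
)
// The two-column predicate collapses to a single contiguous slice.
val start = ("2015-01-01", 100) // lower date bound, lower dec bound
val end   = ("2015-01-02", 200) // upper date bound, upper dec bound
val ord   = Ordering[(String, Int)]
val slice = sorted.filter(r => ord.gteq(r, start) && ord.lteq(r, end))
// The slice includes ("2015-01-01", 300) even though dec > 200:
// the dec bound is not re-applied within each date.
```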

The original range query over two clustering columns would require many separate disk seeks to complete; such queries are considered too expensive for Cassandra to support.
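One common workaround (a hedged sketch of my own, not part of the answer above) is to slice on the first clustering column server-side and apply the second range client-side:

```scala
// Rows that the single-column date slice would return (one disk seek).
val dateSlice = Seq(
  ("2015-01-01", 100), ("2015-01-01", 200), ("2015-01-01", 300),
  ("2015-01-02", 100), ("2015-01-02", 200), ("2015-01-02", 300)
)
// Apply the dec range in the application instead of in CQL.
val result = dateSlice.filter { case (_, dec) => dec >= 100 && dec <= 200 }
```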

shutty

Yes, you can do it by using Spark with Scala and the spark-cassandra-connector!

Then you could use something like:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.cassandra.CassandraSQLContext
import org.joda.time.format.DateTimeFormat

val conf = new SparkConf().setAppName(appName)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val cc = new CassandraSQLContext(sc)

// Day formatter (the original snippet used `sdfDay` without defining it).
val sdfDay = DateTimeFormat.forPattern("yyyy-MM-dd")

// All days between startDate and endDate, inclusive.
def getDays(startDate: String, endDate: String): Array[String] = {
    var dates = Array[String]()
    var sDate = sdfDay.parseDateTime(startDate)
    val eDate = sdfDay.parseDateTime(endDate)

    while (sDate.isBefore(eDate.plusDays(1))) {
        dates :+= sdfDay.print(sDate) // zero-padded yyyy-MM-dd
        sDate = sDate.plusDays(1)
    }
    dates
}

// Build the query. Each OR list is wrapped in parentheses so the ORs
// don't leak across the ANDs.
def prepareQuery(whereElement: String, andDateElement: String, andDecElement: String,
                 tableName: String, days: Array[String], decs: Array[String]): String = {
    var q = "select * from " + tableName + " where " + whereElement
    q += " and (" + days.map(d => andDateElement + " = '" + d + "'").mkString(" or ") + ")"
    q += " and (" + decs.map(d => andDecElement + " = " + d).mkString(" or ") + ")" // dec is a decimal, no quotes
    q
}

val days = getDays(startDate, endDate)
val decs = Array("100", "200", "300", "400") // or any function

val q = prepareQuery(whereElement = "segment='SEG1'", andDateElement = "date",
                     andDecElement = "dec", tableName = "RangeSearch",
                     days = days, decs = decs)

val selected_data = cc.sql(q).map(row => (row(0).toString, row(1).toString, row(2).toString, row(3).toString))

And there you have your selected-data RDD, which you can collect:

val data = selected_data.collect

I had a similar problem...

Reshef