3

I store my data in a single Cassandra 2.0.10 table. There is one column (named score), type of integer, can take any values. I need to write a background job that would assign a value to another column, rank, giving value 1 for the row with the highest value in the score field, value 2 for the next to highest and so on. The row with the smallest score value must get the the total row count assigned to the rank. It is currently defined in CQL as

CREATE TABLE players
    (user int, rank int, score int, details blob, PRIMARY KEY(user))

Bet it something like PostgreSQL, I would do something like

select id, rank from players order by score desc offset A limit 100;

using increasing values for A and this way iterating the database in pages of size 100. It would give me top 100 players in one query, top 100 to 200 in the second, etc. Then I can fire update statements by id, one by one or in batches.

When I try to do the same in Cassandra CQL, turns out that many needed features are not supported (no order, not offset, no clear way how to visit all rows). I tried to build the index for the score column but this was not helpful.

This rank assignment is a helper job. It is no problem for it to take days or even weeks to iterate. It is ok to have it slightly inconsistent as scores may change while the job is running. It is not the main feature of the application. The main features do not use ranges queries and Cassandra works well there.

Is it possible to implement this rank assignment combining Java and CQL somehow or the limitations are severe enough I need to use a different database engine?

Audrius Meskauskas
  • 18,378
  • 9
  • 63
  • 78

1 Answers1

1

As per my experience, Cassandra does not fit well for such types of tasks. You can definitely make this working, but the solution will not be simple and effective. There is no problem to iterate over all rows in one table to update ranks, however there is a problem to iterate all the rows in the order of your ranks. You could potentially keep two tables:

players(id, rank) and rank_to_id(rank, id_list). Then you should query the second page using:

select * from rank_to_id where rank > 100 limit 100

The responsibility of your rank assigner will be to update both tables correctly when rank is changing. Basically by this you will implement a simple database index which PostgreSQL has out of the box.

Also I'd recommended you to take a look at Redis DB instead. It has such a great data type as Sorted Set which implements almost exactly what you need: http://redis.io/commands#sorted_set. However it depends on the data volume you have. Redis is in-memory database.

PostgreSQL also might be a good solution. Why don't you want to use it?