3

I recently created a keyspace and a column family in cassandra. I have the following

CREATE TABLE reports (
  id timeuuid PRIMARY KEY,
  report varchar
)

I want to select the report according to a range of time. so my query is the following;

select dateOf(id), id 
from keyspace.reports 
where token(id) > token(maxTimeuuid('2013-07-16 16:10:48+0300'));

It returns;

dateOf(id)                | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
 2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

So, it's wrong.

When I try to use the following cql;

select dateOf(id), id from keyspace.reports 
where token(id) > token(minTimeuuid('2013-07-16 16:12:48+0300'));

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b3f6d00-ee19-11e2-8734-8d331d938752
 2013-07-16 16:10:13+0300 | 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

select dateOf(id), id from keyspace.reports
where token(id) > token(minTimeuuid('2013-07-16 16:13:48+0300'));

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 16:10:37+0300 | 1b275870-ee19-11e2-b3f3-af3e3057c60f
 2013-07-16 16:10:48+0300 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

Is it random ? Why isn't it giving meaningful outputs ?

What's the best solution for this in cassandra ?

aacanakin
  • 2,684
  • 4
  • 20
  • 39

1 Answers1

3

You are using the token function, which isn't really useful in your context (querying between times using mintimeuuid and maxtimeuuid) and is generating random-looking, and incorrect output:

From the CQL documentation:

The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. The RandomPartitioner and Murmur3Partitioner do not yield a meaningful order.

If you are looking to retrieve based on all records between two dates it might make more sense to model your data as a wide row, with one record per column, rather than one record per row, e.g., creating the table:

CREATE TABLE reports (
  reportname text,
  id timeuuid,
  report text,
  PRIMARY KEY (reportname, id)
)

, populating the data:

insert into reports2(reportname,id,report) VALUES ('report', 1b3f6d00-ee19-11e2-8734-8d331d938752, 'a');
insert into reports2(reportname,id,report) VALUES ('report', 0d4b20e0-ee19-11e2-bbb3-e3eef18ad51b, 'b');
insert into reports2(reportname,id,report) VALUES ('report', 1b275870-ee19-11e2-b3f3-af3e3057c60f, 'c');
insert into reports2(reportname,id,report) VALUES ('report', 21f9a390-ee19-11e2-89a2-97143e6cae9e, 'd');

, and querying (no token calls!):

select dateOf(id),id from reports2 where reportname='report' and id>maxtimeuuid('2013-07-16 16:10:48+0300');

, which returns the expected result:

 dateOf(id)               | id
--------------------------+--------------------------------------
 2013-07-16 14:10:48+0100 | 21f9a390-ee19-11e2-89a2-97143e6cae9e

The downside to this is that all of your reports are in the one row, of course you can now store lots of different reports (keyed by reportname here). To get all reports called mynewreport in August 2013 you could query using:

select dateOf(id),id from reports2 where reportname='mynewreport' and id>=mintimeuuid('2013-08-01+0300') and id<mintimeuuid('2013-09-01+0300');
lorcan
  • 3,190
  • 3
  • 21
  • 30
  • if we are doing mintimeuuid on partition key, assuming it's of type timeuuid (than clustering key like in the example), would there be performance penalty? – ken Nov 08 '16 at 08:37