2

I am modelling a Cassandra schema to get a bit more familiar on the subject and was wondering what is the best practice regarding creating indexes.

For example:

create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;

Or i can create a entire new table:

create table grouptoemail(groupid int, email text,  primary key(groupid, email));
select * from grouptoemail where groupid=2;

They both do the job.

I would expect creating a new table is faster cause now groupid becomes the partition key. But i'm not sure what "magic" is happening when creating a index and if this magic has a downside.

TinusSky
  • 1,607
  • 5
  • 21
  • 31

2 Answers2

2

It depends on the cardinality of groupid. The cassandra docs:

When not to use an index

Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

Naturally, there is no support for counter columns, in which every value is distinct.

Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.

So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.

Week #4 of DataStax Academy's Java Developement with Apache Cassandra class talks about how to model these problems efficiently. Check that out if you get a chance.

Aaron
  • 46,155
  • 11
  • 102
  • 118
  • Thank the clarification Bryce, its a bit like normal secondary indexed, only with a twist. I'll check the given link tomorrow. – TinusSky Dec 11 '13 at 18:39
2

According to me your first approach is correct.

create table emailtogroup(email text, groupid int, primary key(email));

because 1) in your case email is sort of unique, good candidate for primary key and 2) multiple emails can belong to same group, good candidate for secondary index. Please refer to this post - Cassandra: choosing a Partition Key

The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.

The second form of table creation is useful for range scans. For example if you have a use case like

i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.

In that case you may have to design a table like

create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));

In this case all the email gropus which the user joined will be clustered on disk.(stored together on disk)

Community
  • 1
  • 1
Vishal John
  • 3,944
  • 20
  • 39
  • It is a bit difficult to get your thinking right on this subject John. If i understand correctly: uniqueness for partition keys, unless you want to slice, cause than the partition keys need to be together. And low uniqueness (low cardinality) for secondary indexes, cause well with high cardinality a index doesn't make much sense. It is starting to sink in :-) – TinusSky Dec 11 '13 at 18:44