
We are currently investigating Cassandra as the database for a large time series system.

I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.

What we have is high-velocity time series data coming in from many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.

We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.

We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id - that is, we want all readings for a station to be on the same node.

We currently have the following schema, but I would like to get some feedback.

CREATE TABLE weather_station_data (
    weather_station_id int,
    year int,
    day_of_year int,
    time timestamp,
    sensor_id int,
    temperature int,
    humidity int,
    light int,
    PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
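
For context, a typical read against this schema would fetch a single day for a station (the station id and date values here are just illustrative):

select * from weather_station_data
 where weather_station_id = 1234
   and year = 2016
   and day_of_year = 110;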

In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.

dacox
  • I'm curious: are temperature, humidity, and light defined as integers as a way to store fixed precision? In other words, is the precision hardcoded in the application code? – Sergei Rodionov Apr 19 '16 at 17:55
  • This is not my actual data model, just an example of a similar one. – dacox Apr 19 '16 at 18:12

2 Answers


According to the tutorial, if we choose to have weather_station_id as the only partition key, the partition will eventually be exhausted, i.e. C* has a practical limit of 2 billion cells per partition.

So IMO, your data model is bad.

However, it is unclear to me whether or not the date in their examples is part of the partition key.

The tutorial used

PRIMARY KEY ((weatherstation_id,date),event_time)

So, yes, they made the date part of the partition key.

we want all readings for a station to be on the same node.

I am not sure why you want such a requirement. You can always fetch weather data using multiple queries when you need more than one year.

select * from weather_station_data where weather_station_id=1234 and year=2013;
select * from weather_station_data where weather_station_id=1234 and year=2014;

So consider changing your structure to

PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
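
As a sketch, the full table with that key would look like this (same columns as in your question):

CREATE TABLE weather_station_data (
    weather_station_id int,
    year int,
    day_of_year int,
    time timestamp,
    sensor_id int,
    temperature int,
    humidity int,
    light int,
    PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (day_of_year DESC, time DESC, sensor_id DESC);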

Hope it helps!

chaitan64arun
  • It does help, but I am still curious. Is modifying the partition key the only way to limit the length of the row? I feel like there must be a way to map each station to its own partition, but still create new rows for each day, or month, etc. – dacox Apr 21 '16 at 16:09

In my opinion the DataStax model isn't really great. The problem with this model:

  • They are using the weather station as the partition key. All rows with the same partition key are stored on the same machine. This means: if you have 10 years of raw data in 100 ms steps, you will break Cassandra's limit really fast. 10 years × 365 days × 24 hours × 60 minutes × 60 seconds × 10 (for 100 ms steps) × 7 columns ≈ 22 billion cells, and the limit is 2 billion. In my opinion you will not use the benefits of Cassandra if you build this data model. You could just as well use a MongoDB, MySQL, or another database per weather station.

The better solution: ask yourself how you will query this data. If you say "I query all data per year", then use the year as part of the partition key as well. If you also need to query data spanning more than one year, you can issue two queries with different years. This works, and the performance is better. (The bottleneck is then maybe only the network to your client.)

  • One more tip: Cassandra isn't like MySQL; it's a denormalized database. This means it's not dirty to save your data more than once. If it is important for you to query your data per year, but also per hour, per day of year, or per sensor_id, you can create column families with different partition keys and primary key orders. It's okay to duplicate your data: Cassandra is optimized for write performance, not for reads, so it's often better to write the data in the right order than to read it in the right order. In Cassandra 3.0 there is a new feature for this automatic duplication, called materialized views (see the sketch below). And if you think "oh no, I will duplicate the needed storage", remember: storage is really cheap. It's okay to buy ten 1 TB HDDs; it costs almost nothing. The performance is what matters.
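
As a sketch of the materialized-view approach, assuming the table from the question (the view name is made up), a per-sensor view could look like:

CREATE MATERIALIZED VIEW weather_data_by_sensor AS
    SELECT * FROM weather_station_data
    WHERE weather_station_id IS NOT NULL
      AND year IS NOT NULL
      AND day_of_year IS NOT NULL
      AND time IS NOT NULL
      AND sensor_id IS NOT NULL
    PRIMARY KEY ((sensor_id, year), day_of_year, time, weather_station_id);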

I have one question for you: can you aggregate your data? Cassandra has a column type called counter. You can create a Java/Scala application where you aggregate your data while it is produced; you can use a streaming framework for this, Flink or Spark (if you need a bit more than only counting). One scenario: you aggregate your data per hour and per day. Your streaming app receives the data and keeps a variable for the hourly value, counting up or down or whatever. When the hour finishes, you put that row into your hourly column family and into your daily column family; in the daily column family you use a counter. I hope you understand what I mean.
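
A minimal sketch of such a daily counter table, reusing the column names from the question (the table name and the values in the update are made up):

CREATE TABLE weather_station_daily_counts (
    weather_station_id int,
    year int,
    day_of_year int,
    reading_count counter,
    PRIMARY KEY ((weather_station_id, year), day_of_year)
);

-- the streaming app bumps the counter once per reading
UPDATE weather_station_daily_counts
   SET reading_count = reading_count + 1
 WHERE weather_station_id = 1234 AND year = 2016 AND day_of_year = 110;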

Citrullin
  • Hi Phillip, thanks. I'm aware of the 2 billion hard limit for each partition key. My question was about ways of limiting row length by bucketing by day or year, for example. In the DataStax document, they limit the row length by date in the second example. However, it appears that they do so by making it part of the partition key. My question was whether or not it is possible to create a new row for each day, but still have it mapped to a node based solely on the weather station id for partitioning. – dacox Apr 20 '16 at 15:42
  • Do you want a way around the 2 billion limit? It's a soft 2 billion limit, but before you reach 2 billion rows × columns you will run into timeouts at the read level. Better for you: use the station id and the year as the partition key. This results in only 29030400 rows per partition (at 1-second resolution). If you need more than one year: write one more query. That isn't big trouble. – Citrullin May 08 '16 at 00:57
  • What you refer to as a problem in your first bullet point is part of the DataStax article. The second method in the article describes how you can partition your data using the date. – Demetris Jan 02 '17 at 08:28