We are currently investigating Cassandra as the database for a large time series system.
I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.
We have high-velocity time series data coming in from many weather stations. Each weather station has a number of "sensors", and each sensor collects three metrics: temperature, humidity, and light.
We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.
We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id, that is, we want all readings for a station to be on the same node.
We currently have the following schema, and I would like some feedback on it:
CREATE TABLE weather_station_data (
    weather_station_id int,
    year int,
    day_of_year int,
    time timestamp,
    sensor_id int,
    temperature int,
    humidity int,
    light int,
    PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
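For context, a read for one station and one day against this schema would presumably look something like the query below. This is only a sketch; the literal values are made up for illustration.

```cql
-- Illustrative values only: station 1, day 42 of 2015.
-- The partition key and the leading clustering columns are
-- restricted by equality, in the order they are declared.
SELECT time, sensor_id, temperature, humidity, light
FROM weather_station_data
WHERE weather_station_id = 1
  AND year = 2015
  AND day_of_year = 42;
```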
The document linked above uses this "limit partition row by date" concept. However, it is unclear to me whether the date in its examples is part of the partition key.
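To make the question concrete, here is a sketch of the alternative as I understand it, with the date moved into the partition key. The table name weather_station_data_by_day is my own and is not taken from the document. As far as I can tell, in this version each (station, year, day) combination becomes its own partition, so a station's data may be spread across nodes. In the schema above, year and day_of_year are only clustering columns, so all of a station's readings stay in a single partition.

```cql
-- Hypothetical variant for comparison; the table name is made up.
-- The date columns are part of the composite partition key here,
-- so each station/day pair is a separate partition.
CREATE TABLE weather_station_data_by_day (
    weather_station_id int,
    year int,
    day_of_year int,
    time timestamp,
    sensor_id int,
    temperature int,
    humidity int,
    light int,
    PRIMARY KEY ((weather_station_id, year, day_of_year), time, sensor_id)
) WITH CLUSTERING ORDER BY (time DESC, sensor_id DESC);
```

Is this second form what the document intends, or is the date meant to be a clustering column as in our schema?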