
I am facing a problem: a database for process plants. There are up to 50,000 sensors, sampled at a 50 ms rate. All measured values need to be stored for at least 3 years and must support real-time queries (i.e. users can see historical data with a delay of less than 1 second). I recently read an article about time-series databases, and many options are at hand: OpenTSDB, KairosDB, InfluxDB, ...
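For scale, here is a quick back-of-the-envelope calculation of what those numbers imply (the 1 byte per sample used for the size estimate is only an assumption to get a lower bound; real samples are larger):

sensors = 50000
samples_per_sec_per_sensor = 1000 // 50          # 50 ms sampling -> 20 samples/s per sensor
write_rate = sensors * samples_per_sec_per_sensor

seconds_in_3_years = 3 * 365 * 24 * 3600
total_samples = write_rate * seconds_in_3_years

print(write_rate)      # 1,000,000 samples per second to ingest
print(total_samples)   # ~9.5e13 samples over 3 years (~95 TB even at 1 byte per sample)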

I am not sure which one would be appropriate for this purpose. If anyone has experience with these, please help me!

UPDATE 15.06.25

Today I ran a test based on OpenTSDB. I used VirtualBox to create a cluster of 3 CentOS x64 VMs (1 master, 2 slaves). The host has 8 GB RAM and a Core i5 CPU. The master VM has 3 GB RAM, and each slave VM has 1.5 GB RAM. I wrote a Python program to send data to OpenTSDB, as shown below:

import socket
import time

# Send data points to OpenTSDB over its telnet-style "put" interface
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.10.55", 4242))

start_time = time.time()
start_epoch = 1434192418

for x in range(0, 1000000):
    curr_epoch = start_epoch + x

    # Build one batch of 10 "put" lines (TAG_1 .. TAG_10) for the same timestamp
    batch = ""
    for i in range(1, 11):
        batch += "put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch)

    # sendall() keeps writing until the whole batch has been sent
    s.sendall(batch.encode())

print("--- %s seconds ---" % (time.time() - start_time))

I ran the Python script on the host, and the work completed after ~220 seconds, so I got an average speed of ~45,000 records per second.

UPDATE 15.06.29

This time I used only 1 VM (5 GB RAM, 3 cores, CentOS x64, pseudo-distributed Hadoop). I ran 2 Python processes on the Windows 7 host to send the 2 halves of the data to OpenTSDB. The average speed of putting data was ~100,000 records per second.
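For reference, a minimal sketch of how the work can be split between 2 sender processes (here with multiprocessing; running two copies of the script with different tag ranges works the same way; host, port, and counts are the same as in the script above):

import socket
import time
from multiprocessing import Process

def send_tags(tag_ids, host="192.168.10.55", port=4242):
    # Each sender process opens its own connection and pushes its share of the tags
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    start_epoch = 1434192418
    for x in range(0, 1000000):
        curr_epoch = start_epoch + x
        batch = ""
        for i in tag_ids:
            batch += "put TAG_%d %d 12.9 stt=good\n" % (i, curr_epoch)
        s.sendall(batch.encode())
    s.close()

if __name__ == "__main__":
    start_time = time.time()
    # First process sends TAG_1..TAG_5, second process sends TAG_6..TAG_10
    p1 = Process(target=send_tags, args=(list(range(1, 6)),))
    p2 = Process(target=send_tags, args=(list(range(6, 11)),))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("--- %s seconds ---" % (time.time() - start_time))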

duong_dajgja
  • I think you'll get better answers there: http://softwarerecs.stackexchange.com/ – thomasb Jun 22 '15 at 15:41
  • If each sample is 1 byte, that is about 100 terabytes. – Jodrell Jun 22 '15 at 16:46
  • Could be less than 100 TB if the time series is regular, i.e. the 50 ms rate is fixed and guaranteed, and data folding is enabled with typical historian compression algorithms such as swinging door and de-duplication of repeated values. Lots of values will be unchanged, since the physical state doesn't change that often, provided a meaningful precision limit is in place. Could you share part of the dataset, e.g. a list of metrics/sensor types? – Sergei Rodionov Jun 22 '15 at 17:06
  • @SergeiRodionov We have 2 types of values: digital and analog. In a process plant, there are about 30,000 analog sensors and 20,000 digital sensors. Usually, the sampling rate of an analog signal is 50 ms and the sampling rate of a digital signal is 1 second or more. However, if a measured value does not change (compared with the previous measured value), we don't need to store it (that is what we have been doing; see the sketch after these comments). Each sample consists of a timestamp (millisecond precision), a tag ID (2 bytes), and a value (4 bytes for an analog signal, 1 byte for a digital one). – duong_dajgja Jun 23 '15 at 01:43
  • HBase should be able to handle a million small samples if they are batched together and distributed over a cluster of machines. But if you want the data to be available ASAP, then it's going to be more challenging. – FUD Jun 23 '15 at 06:37
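For context, a minimal sketch of the "store only on change" filter mentioned in the comments above (the dead-band parameter is an assumption; production historians typically layer swinging-door compression on top of this):

last_stored = {}

def should_store(tag_id, value, dead_band=0.0):
    # Emit a sample only if it differs from the last stored value for this tag
    # by more than the dead band (0.0 = store on any change).
    prev = last_stored.get(tag_id)
    if prev is None or abs(value - prev) > dead_band:
        last_stored[tag_id] = value
        return True
    return False

print(should_store("TAG_1", 12.9))   # True  (first sample is always stored)
print(should_store("TAG_1", 12.9))   # False (value unchanged, sample dropped)
print(should_store("TAG_1", 13.1))   # True  (value changed, sample stored)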

4 Answers


To handle a million writes per second, you will need to put some serious engineering in place.

Not all databases will be able to store that amount of data in a compact form.

For example ATSD uses 5 to 10 bytes per sample (float data type), depending on observed variance.

There are distributed (clustered) databases built on HBase that will be able to handle this kind of load.

For example, you can try looking at OpenTSDB and ATSD.

Update 1.

We have run the following test for your particular use case:

30,000 analog sensors writing float-type data, resulting in 540,000,000 records

20,000 digital sensors writing short-type data (zeros and ones), resulting in 552,000,000 records

The data took up 3.68 gigabytes; the compression was lossless.

That comes out to an average of 3.37 bytes per record.

This was a storage efficiency test.
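For clarity, the 3.37 bytes per record figure is simply the total on-disk size divided by the total record count (taking gigabytes to mean 10^9 bytes):

analog_records = 540000000
digital_records = 552000000
total_records = analog_records + digital_records    # 1,092,000,000 records

disk_bytes = 3.68 * 10**9                            # 3.68 gigabytes on disk
print(disk_bytes / total_records)                    # ~3.37 bytes per record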

Full disclosure, I work for the company that develops ATSD.

Thefonkleurrr
  • Sorry, could you please explain your test more clearly? What do you mean by 30,000 analog entities resulting in 540,000,000 records and 20,000 digital entities resulting in 552,000,000 records? And what does an average of 3.37 bytes per record mean? – duong_dajgja Jun 25 '15 at 01:43
  • The test consisted of 50,000 sensors writing 1 million data points per second. 30,000 of them were analog and 20,000 were digital. Analog sensors were writing float-type data and digital sensors were writing short-type data. A total of 1,092,000,000 data points were generated at a sampling rate of 50 ms, taking up about 3.68 gigabytes of disk space, which means that each data point took about 3.37 bytes of disk space. – Thefonkleurrr Jun 25 '15 at 07:41
  • Did you use ATSD? What was your hardware configuration? – duong_dajgja Jun 25 '15 at 08:24
  • [ATSD](https://axibase.com/products/axibase-time-series-database/) community edition was used for the test. It was not a throughput test, so a simple virtual machine was used. The goal was to see how much data can be stored, how efficiently it can be stored and for how long. – Thefonkleurrr Jun 25 '15 at 08:47

InfluxDB won't handle a sustained million writes per second right now, but that's within the performance target for later this year. The bigger challenge I see is the sheer volume of data you want to store. If you need to keep three years' worth at full resolution with no downsampling, that's hundreds of terabytes of data. If that's not all on SSD, the query performance will not be good. If that is all on SSD, that's an extremely expensive database. Also, with that much raw data, it would be very easy to craft a query that explodes the RAM no matter how much is installed.

I would say check back with the InfluxDB team in 8-12 weeks and we might have a better idea of how to handle your problem. My advice, though, is to find a way to split that up. If you really are sampling 50k machines at 50 ms intervals, it's a huge amount of data, network traffic, etc.
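To make the volume argument concrete, a rough sizing sketch (the 10 bytes per stored sample is purely an assumption for illustration, and 1 s averages are just one possible downsampling policy):

sensors = 50000
samples_per_sec = 20                      # 50 ms sampling
bytes_per_sample = 10                     # assumed on-disk cost per raw sample
seconds_in_3_years = 3 * 365 * 24 * 3600

raw_tb = sensors * samples_per_sec * seconds_in_3_years * bytes_per_sample / 1e12
downsampled_tb = raw_tb / samples_per_sec        # keep one 1 s average instead of 20 raw points

print(raw_tb)            # ~946 TB at full resolution
print(downsampled_tb)    # ~47 TB if downsampled to 1 s averages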

beckettsean
  • Yes. In a process plant (e.g. a thermal power plant), 50k sensors is a normal situation (as far as I know). The sampling rate is between 50 ms and a couple of seconds, depending on the type of signal (analog or digital). The historical data needs to be stored for at least 1 year and up to 3 years, depending on the plant. – duong_dajgja Jun 23 '15 at 01:50

Very interesting use case.

I cannot speak for the others since I have only seriously evaluated and used KairosDB & Cassandra, but I can speak from my own experience.

KairosDB + Cassandra could sustain the throughput, but for 1M writes per second you would need a cluster with several front-end servers (I'd recommend at least 10 nodes, but you have to test) and several back-end servers (they can be colocated if needed).

Anyway, with a throughput of 20 samples per second for each sensor, retrieving historical data could be a problem for long queries (1 s for historical data is an interesting target, but you have to determine how many samples fall within the query duration; see the quick illustration below).
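To put a number on that, a quick illustration of how many raw points a single long query has to pull back (assuming the 50 ms analog rate from the question):

samples_per_sec = 20          # 50 ms sampling rate
sensors_in_query = 1
query_hours = 24

points = sensors_in_query * samples_per_sec * query_hours * 3600
print(points)                 # 1,728,000 raw points for one sensor over one day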

As far as I can tell, storage efficiency will not be as good as ATSD's (probably twice the size, which is still good).

What I personally like about KairosDB & Cassandra, and why we adopted them: it's simple and truly distributed, KairosDB doesn't mess with the data, and the design is elegant without involving complex systems; Cassandra manages the data, KairosDB offers the TSDB services, and performance and efficiency are not bad at all. Moreover, Cassandra is much easier to manage as a cluster than its counterparts.

Loic

MemSQL can handle 1M writes a second. I would recommend using Kafka and MemSQL column stores for that. Depending on the type of queries, sub-second response is very possible.
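A minimal sketch of the ingestion side of that suggestion, using the kafka-python client (the broker address, topic name, and message layout are illustrative assumptions; MemSQL would then load the topic into a column-store table on the other side):

import json
import time
from kafka import KafkaProducer   # kafka-python client

# Illustrative producer: each sensor reading is published as a small JSON message.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {"tag": "TAG_1", "ts": int(time.time() * 1000), "value": 12.9, "stt": "good"}
producer.send("sensor-readings", reading)                     # assumed topic name
producer.flush()                                              # block until the message is delivered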