24

I'm a beginner with influxDB and after reading the Schema design documentation a question remain.

How to decide if you should use one measurement with multiple fields or multiple measurement with single field ?

I have multiple iot device which send every minute datas (temperature,humidity, pressure). All this datas have the exact same timestamp.

So i was wondering if d rather create one measurement like this :

    timestamp,iotid,temperature,humidity,pressure
-------------------------------------------------
    1501230195,iot1,70,         45,      850

Or 3 measurements (one for each value) , with the same tags but only one field in it ?

timestamp,iotid,temperature
----------------------------
    1501230195,iot1,70

timestamp,iotid,humidity
-------------------------
    1501230195,iot1,45

timestamp,iotid,pressure
-------------------------
    1501230195,iot1,850

Query-wise, i could retrieve only one value but also the 3 at the same time.

grunk
  • 13,700
  • 12
  • 61
  • 103

4 Answers4

19

Bit of an old question but this is probably relevant to anyone working on TSDBs.

When I first started, my appoach used to be that every data point went into a single field measurement. The assumption was that I'd combine the data I needed in a SQL statement at a later date. However, as anyone who's used a TSDB like influx knows that there are some serious limitations with one can do in the retrieval of data because of the design choices used in implementing a TSDB.

As I've moved forward in my project, here are the rules of thumb I have developed:

A measurement should contain all the dimensions required for it to make sense but no more.

Example: imagine a gas flow meter which gives 3 signals:

  • volumetric flow
  • temperature
  • total flow

In this scenario, volumetric flow and temperature should be two fields of a single measurement, and total flow should be its own measurement.

(if the reader doesn't like this example, think of a home electric meter that outputs amps and volts, and kw and pf).

Why would it be bad to store volumetric and temp in different series?

  1. Timing: if you store those two measurements in different series, they will have different index values (timestamp). Unless you take care to make sure they have explicitly specified timestamps, you run the risk of them being slightly offsampled. This can very well end up being a Bad Thing (tm) because you might be introducing a systematic measurement bias in your data. Even if it's not a bad thing, it's going to be super annoying if you ever want to reuse this data later on (e.g. to dump it in a csv file).

  2. Utility: if you want to deduce volumetric flow rate, you will have to get constant * temp * volume to get a correct value. Doing this with two separate measurements becomes a nightmare because, for instance, influxdb does not even support the operation. But even if it did, you'd have to make sure missing values of one of the fields aren't incorrectly handled and that grouping and aggregation is done right.

Why would it be bad to store all three in a single measurement?

You may very well have a use case in which you want to audit all three values at all times, but chances are this is not the case and you don't care about measuring total volume at the same kind of frequency that you'd like to measure flow itself.

Putting all the fields in a single measurement will force you to either put nulls in certain fields, or to always log a variable that barely changes. Either way, it's not efficient.

The important insight is that multi-dimensional entities require all their dimensions at the same time to make sense.

MB.
  • 1,027
  • 9
  • 17
  • Also, [there is no way to delete a single field from a measurement](https://github.com/influxdata/influxdb/issues/6150) so think twice before putting unrelated values in fields in the same measurement because once you want to erase one you need to erase all then recreate the others. – Jérôme Apr 10 '20 at 10:10
13

There is no right or wrong to go with either schema design but going with one measurement one field value is the more appropriate approach.

Why?

Storing multiple field values into a measurement is a very relational database thing. That is, a measurement should not be seen as a database table as it is a very different thing.

A measurement should be reserved explicitly for describing a type of data, like temperature or CPU usage.

If we design our schema using the one field value per measurement then we can describe the data in real English like;

At a certain point in time, the temperature is measured as data value=30. Noticed the term used here, point, data and measurement.

Whereas if you put multiple field values into a specific measurement then you will find it difficult to present the data in real English.

influxdb is a time series database so it is obvious that we should do it the time-series way.

Also, some of the time series data are actually measured down to the precision level of micro-seconds. In such fine grain timing, even for milliseconds it is unlikely for a set of data to share the same timing. Hence designing it as one measurement containing a sequence of data point is always the better choice.

Samuel Toh
  • 14,302
  • 3
  • 19
  • 35
  • 3
    So... we should follow this just so we can express data schema better in English...? – kovac Jan 31 '19 at 05:56
  • Obviously not. It really depends on your use case. For instance there is currently no way to do `cross-measurement` joins unless you use `flux` the new query language. You may also put different set of data in another measurement due to `retention policy`. However I believe it is best to organise your data in a sensible way, this is so that data can be discovered easily. – Samuel Toh Jan 31 '19 at 06:31
  • Exactly, I was wondering about some real technical limitations like cross-measurement joins, max number of fields, etc. One measurement can have multiple dimensions which seemed perfectly reasonable to have them as fields. – kovac Jan 31 '19 at 06:55
  • 1
    Why not a single measurement, a single fields and use tagsets to separate values? – Jérôme Apr 10 '20 at 20:41
3

This likely depends on your data, try both and see the storage requirements. For example if humidity does not change much, then it makes sense to separate it. But if some variables change in a similar time intervals, then it makes sense to combine them. It may also depend on your query patterns.

denfromufa
  • 4,995
  • 11
  • 66
  • 130
3

I thought to mention that there is a valid third option:

timestamp,iotid,measure,value
----------------------------
1501230195, iot1, temperare, 70
1501230195, iot1, humidity, 45
1501230195, iot1, pressure, 850
Francesco Meli
  • 1,985
  • 1
  • 16
  • 36
  • Hi @Francesco, do you know briefly if there are any advantages to doing it this way which may not be immediately obvious to a beginner (such as myself) at the start? I was thinking in particular when we come to querying the data in Grafana (variables, templating, group by, etc)... Thanks! – teeeeee Jun 07 '20 at 23:34