
I am looking to decide between Bigtable and BigQuery for my use case of time-series data.

I have gone through https://cloud.google.com/bigtable/docs/schema-design-time-series

This is for storing Omniture data, which contains information like a website visitor key (a Long key), the visitor's cookie ID (a Long key), and time-series web-hit data for that IP and cookie.

What can be used as the row key for Bigtable? As I learned from the best practices, I cannot use the timestamp or the cookie ID as a prefix; the key should instead start with an identifier (preferably alphabetic?) followed by a time-series suffix. The data has a volume of 500 million rows with 52 columns, stored in a SQL table today. I think the data might be updated by OLTP processing, but the table would later be queried on the time-series data for OLAP-style processing.

a) Would Bigtable be the best option here, or should I be using BigQuery, since later I would mostly be querying based on the time-series data? b) If using Bigtable, what would be the best row key, since the timestamp is the only meaningful filter I see for my data? I believe that using other fields in the table, such as the visitor key or cookie ID (Long IDs), as a prefix before the timestamp would still cause the entire dataset to fill up one node in Bigtable instead of being distributed.

Please let me know.

Roshan Fernando
  • I may be a noob to this but we had the same sort of question a few months ago where I work. We concluded that Bigtable is more for production applications that need fast retrieval, not so much for analytics use cases. It looks like you're looking at a data-science type of case. I suggest BigQuery. If you need the results in production you can probably think of Bigtable as a part of the architecture, but not as the data warehouse – FKrauss Sep 19 '18 at 08:38
  • Yes, my use case is for data warehouse/science. I am of the same view to use BigQuery. – Roshan Fernando Sep 19 '18 at 14:05
  • FKrauss - indeed we often see people use Bigtable as part of the production architecture, but not the data warehouse/query system. My answer goes into a few more details on why, which you may find interesting! I just want to push back a bit on the idea that Bigtable isn't for analytics use cases. It certainly is! But it's more in the model of large-scale MapReduce- or Dataflow-based analytics, since as you've discovered it doesn't always conform well to more SQL-y workloads. – Douglas McErlean Sep 19 '18 at 14:16

1 Answer


(I'm an engineer on the Cloud Bigtable Team)

As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Using visitorKey + cookie as a prefix before the timestamp sounds to me like it would avoid hotspotting issues, as there are almost certainly many more visitors to your site than there would be nodes in your cluster. Bigtable serves these sorts of time-series use cases all the time!
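
To make that concrete, here is a minimal sketch of one way such a row key could be built, assuming the google-cloud-bigtable Python client and hypothetical instance, table, column-family, and field names; the delimiter and timestamp encoding are illustrative choices, not a prescription:

```python
from datetime import datetime, timezone

from google.cloud import bigtable


def build_row_key(visitor_key: int, cookie_id: int, event_time: datetime) -> bytes:
    # Lead with the high-cardinality identifiers so writes spread across many
    # row ranges; put the timestamp last so one visitor's events stay
    # contiguous and can be read back as a single range scan.
    # Zero-padding keeps the lexicographic ordering consistent with the
    # numeric ordering of the Long IDs.
    return "{:020d}#{:020d}#{}".format(
        visitor_key,
        cookie_id,
        event_time.strftime("%Y%m%d%H%M%S"),
    ).encode("utf-8")


# Hypothetical write path.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("web_hits")

row = table.direct_row(build_row_key(12345, 67890, datetime.now(timezone.utc)))
row.set_cell("hits", "ip", "203.0.113.7")
row.commit()
```

With a key like this, reading one visitor's history becomes a prefix scan over `visitorKey#cookieId#`, while concurrent writes at any instant land on many different prefixes rather than a single hot range.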

However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:

  • Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan. And in general Bigtable is geared more towards streaming back a simple subset of the data quickly, say, to a Dataflow job, rather than embedding complex processing in the queries themselves. (A sketch contrasting the two appears just after this list.)
  • Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
  • Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates. Remember that Bigtable's original purpose was as a random access sink for web crawler updates in Google's search index!
  • Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster and let custom business logic in a Dataflow job do whatever you want to it.
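
To make the first bullet concrete, here is a hedged sketch of the contrast, assuming the google-cloud-bigquery and google-cloud-bigtable Python clients and hypothetical project, dataset, table, and key names: BigQuery answers an arbitrary column filter directly in SQL, while an efficient Bigtable read has to be expressed as a row-key lookup or a contiguous key-range scan (anything else degenerates into a full table scan).

```python
from google.cloud import bigquery, bigtable
from google.cloud.bigtable.row_set import RowSet

# BigQuery: an ad hoc filter on any column is just SQL.
bq = bigquery.Client(project="my-project")
job = bq.query(
    "SELECT visitor_key, hit_url "
    "FROM `my-project.analytics.web_hits` "
    "WHERE cookie_id = @cookie_id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cookie_id", "INT64", 67890)]
    ),
)
for row in job.result():
    print(row.visitor_key, row.hit_url)

# Bigtable: efficient reads are keyed. Here we scan one visitor's key prefix;
# filtering on a column that is not part of the row key would mean reading
# every row.
bt_table = (
    bigtable.Client(project="my-project").instance("my-instance").table("web_hits")
)
row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=b"00000000000000012345#",  # one visitor's key prefix...
    end_key=b"00000000000000012345$",    # ...up to the next lexicographic sibling
)
for row in bt_table.read_rows(row_set=row_set):
    print(row.row_key)
```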

You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
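
As a rough illustration of that export pattern (the answer suggests Dataflow; this sketch substitutes a plain Python loop for brevity, and every project, instance, table, and column name here is hypothetical), one could periodically scan Bigtable and load a flattened copy into BigQuery:

```python
from google.cloud import bigquery, bigtable

bt_table = (
    bigtable.Client(project="my-project").instance("my-instance").table("web_hits")
)
bq = bigquery.Client(project="my-project")

# A real job would track a watermark and only scan rows written since the
# last export; for brevity this reads everything and flattens each Bigtable
# row into a BigQuery-friendly dict.
rows_to_load = []
for bt_row in bt_table.read_rows():
    cells = bt_row.cells["hits"]
    rows_to_load.append({
        "row_key": bt_row.row_key.decode("utf-8"),
        "ip": cells[b"ip"][0].value.decode("utf-8"),
    })

# Stream the batch into the (hypothetical) mirror table in BigQuery.
errors = bq.insert_rows_json("my-project.analytics.web_hits_mirror", rows_to_load)
if errors:
    raise RuntimeError(f"BigQuery insert failed: {errors}")
```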

  • Ad hoc queries based on different columns, no multi-row OLTP transactions, no streaming of new events. Might have more inserts during certain parts of the day. Would be queried based on timestamp data - daily or between some day and some day. Would be some transformations and also joins. Looking at BigQuery with date partitioning. Please confirm. Please shed more light on hotspotting. How does visitorKey + cookie + timestamp avoid hotspotting, compared to just having the timestamp in BT? I don't think I can filter based on a particular row key given this combination. – Roshan Fernando Sep 19 '18 at 22:04
  • BigQuery sounds like the right tool for your workload due to the ad hoc queries and joins. For future reference around hotspotting in Bigtable, the concern is that we don't want all incoming writes to hit the same row range, because then the load can't be spread across servers. Prefixing your row key with a timestamp, especially a fine-grained one, is the classic example of this antipattern. With visitorKey + cookie + timestamp as your key there are ||visitorKey+cookie|| separate row ranges receiving updates at any given timestamp, so they can be split off across multiple servers. – Douglas McErlean Sep 21 '18 at 17:13
  • Thanks a bunch for clarification – Roshan Fernando Sep 21 '18 at 18:04
  • @DouglasMcErlean what kind of 1) `delay` in consistency and 2) performance drop (2x/10x?) can we expect when using BigQuery+BigTable? – rodi Mar 26 '21 at 23:17
  • 1) If you're using federated queries, none. If you're writing to Bigtable and then manually mirroring the contents to BigQuery, it depends entirely on how frequently you do the mirroring. 2) Federated queries are known to be slower than using BigQuery's native storage, because the format isn't optimized for BigQuery. But I'm not sure exactly how big the slowdown is as I'm not a BigQuery expert. If you mirror, this slowdown goes away because you're using BigQuery with its native storage format. – Douglas McErlean Mar 28 '21 at 15:02