87

I have a dataset of 1-minute data for 1000 stocks since 1998, which totals around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.

Most (99.9%) of the time I will perform only read requests.

What is the best way to store this data in a db?

  • 1 big table with 7.3B rows?
  • 1000 tables (one for each stock symbol) with 7.3M rows each?
  • any recommendation of database engine? (I'm planning to use Amazon RDS' MySQL)

I'm not used to dealing with datasets this big, so this is an excellent opportunity for me to learn. I would greatly appreciate your help and advice.

Edit:

This is a sample row:

'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0

Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, the rest are open-high-low-close prices, volume, and 3 integer columns.

Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"

About the hardware: I plan to use Amazon RDS, so I'm flexible on that.

Victor
  • 5
    Describe the expected typical query – William Pursell Mar 22 '12 at 01:30
  • 10
    "I think you should use MongoDB because it's web scale." – ta.speot.is Mar 22 '12 at 01:31
  • 8
    You probably want one big table, partitioned by stock symbol. – ta.speot.is Mar 22 '12 at 01:32
  • 1
    Dataset is huge! You may want to search around for datamining and analytics to see what you find. – Mike Purcell Mar 22 '12 at 01:33
  • 1
    I'm really glad you put quotes around that, @ta.speot.is. When I first read it, I thought "Oh no...." – Corbin Mar 22 '12 at 01:33
  • 2
    And a "standard RDBMS" with a single table is insufficient for this? (I only deal in the millions but "works for me". Might as well just try it and see. Remember to index/cluster/partition as required.) –  Mar 22 '12 at 01:43
  • Index, Index, Index. Index your indices if you can! – Corey Ogburn Mar 31 '12 at 08:45
  • 1
    Amazon RDS is extremely expensive for this unless you have very light usage. A Dell box to run this (48 core bulldozer with 256 GB ram costs about 15K) – Stephan Eggermont Apr 01 '12 at 13:04
  • What is your final implementation? I am facing the same problem. – Benny Jan 06 '16 at 06:42
  • Hello @Victor, first of all thanks for your question, it helped me a lot. I'm working on my own analysis of financial data and I'm curious how you managed to get all that data. Did you use some API, for example the [IEX API](https://iexcloud.io/docs/api), or something else? Any advice appreciated, thanks. – fillky Jan 05 '20 at 18:38

13 Answers

53

So databases are for situations where you have a large complicated schema that is constantly changing. You only have one "table" with a handful of simple numeric fields. I would do it this way:

Prepare a C/C++ struct to hold the record format:

struct StockPrice
{
    char            ticker_code[2];
    double          stock_price;
    struct timespec when;        // from <time.h>
    // ... open/high/low/close, volume, and the three extra integer columns
};

Then calculate N * sizeof(StockPrice), where N is the number of records. On a 64-bit system the total should only be a few hundred gigabytes, and will fit on a $50 HDD.

Then truncate a file to that size and mmap it into memory (on Linux; on Windows use CreateFileMapping):

// the real calls (needs <fcntl.h>, <unistd.h>, <sys/mman.h>); error handling omitted
int fd = open("my.data", O_RDWR | O_CREAT, 0644);
ftruncate(fd, N * sizeof(StockPrice));
void* p = mmap(NULL, N * sizeof(StockPrice),
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Cast the mmapped pointer to StockPrice*, and make a pass over your data filling out the array. Unmap the file, and now you will have your data in one big binary array in a file that can be mapped again later.

StockPrice* stocks = (StockPrice*) p;
for (size_t i = 0; i < N; i++)
{
    stocks[i] = ParseNextStock(stock_indata_file);
}
munmap(p, N * sizeof(StockPrice));
close(fd);

You can now mmap it again read-only from any program and your data will be readily available:

file = open("my.data", READ_ONLY);
StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);

// do stuff with stocks;

So now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will deal with swapping the data to/from disk transparently so it will be insanely fast.

If you expect to have a certain access pattern (for example, contiguous dates), it is best to sort the array in that order so it will hit the disk sequentially.
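
For the question's typical query ("prices of AAPL between two timestamps"), a sorted array means you don't even need a separate index: two binary searches give you a contiguous slice of the file, and the kernel only pages in that slice. Here is a minimal sketch (not part of the original answer), assuming the StockPrice layout above and that each symbol's records form a contiguous, time-sorted run of the array, with begin/end being the bounds of that run:

#include <algorithm>
#include <ctime>
#include <utility>

// Find all records with from <= when < to inside one symbol's
// time-sorted run [begin, end) of the mmapped array.
std::pair<const StockPrice*, const StockPrice*>
range_query(const StockPrice* begin, const StockPrice* end,
            time_t from, time_t to)
{
    auto before = [](const StockPrice& s, time_t t) { return s.when.tv_sec < t; };
    const StockPrice* lo = std::lower_bound(begin, end, from, before);
    const StockPrice* hi = std::lower_bound(lo,    end, to,   before);
    return {lo, hi};
}

Only the pages backing [lo, hi) ever get touched, so a narrow time range stays fast even on a cold cache.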

Andrew Tomazos
  • 12
    Spend a few hundred to put it on SSD instead of hard disk. Random reads are about a hundred times faster. Or spend 10K on ram. Another hundred times faster – Stephan Eggermont Apr 01 '12 at 07:47
  • 1
    @Andrew Tomazos thanks dude, this one is "the" answer – Pavneet_Singh Sep 17 '16 at 17:08
  • 1
    sizeof(StockPrice) would be char[4] = 4 bytes, int = 4 bytes, short = 2 bytes, 5 × float = 20 bytes, 3 × int = 12 bytes, so around 42 bytes; times 7.3 billion rows that's about 306.6 billion bytes = ~285.5 GB of memory ... good luck with that – ZagNut Sep 21 '16 at 14:21
  • 3
    @ZagNut: If your implication is that you need 300GB of physical memory, then that is not correct - mmap doesn't copy the entire thing into memory, it pages it in/out as needed (in the same fashion as the swap file). – Andrew Tomazos Feb 04 '20 at 08:47
39

I have a dataset of 1 minute data of 1000 stocks [...] most (99.9%) of the time I will perform only read requests.

Storing time-based numerical data once and reading it many times is a use case termed a "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events, etc.

This question was asked in 2012, and since then, several database engines have been developing features specifically for managing time series. I've had great results with InfluxDB, which is open source, written in Go, and MIT-licensed.

InfluxDB has been specifically optimized to store and query time series data. Much more so than Cassandra, which is often touted as great for storing time series:

InfluxDB vs Cassandra query speed

Optimizing for time series involves certain tradeoffs. For example:

Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.

Pro: Restricting access to updates allows for increased query and write performance

Con: Update functionality is significantly restricted

In open sourced benchmarks,

InfluxDB outperformed MongoDB in all three tests with 27x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.

InfluxDB vs. MongoDB on-disk storage requirements and compression

Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume>, with InfluxDB you can store just that, then query easily. Say, for the time range from the question:

SELECT open, close FROM market_data WHERE symbol = 'AAPL' AND time > '2012-04-12 12:15' AND time < '2012-04-13 12:52'

There are no IDs, no keys, and no joins to make. You can do a lot of interesting aggregations. You don't have to vertically partition the table as with PostgreSQL, or contort your schema into arrays of seconds as with MongoDB. Also, InfluxDB compresses really well, while PostgreSQL won't be able to perform any compression on the type of data you have.

Dan Dascalescu
  • 1
    There are other time series databases such as kdb+, m3db or [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics/), which can easily manage trillions of rows on a single node under production workload. See https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/CaseStudies#wixcom – valyala Dec 05 '20 at 21:24
  • @valyala: VictoriaMetrics looks great! I've had [numerous issues with InfluxDB](https://pinboard.in/u:dandv/t:InfluxDB/t:against) since I posted this; though [VictoriaMetrics doesn't support updates at all](https://github.com/VictoriaMetrics/VictoriaMetrics#data-updates) :-( – Dan Dascalescu Jan 07 '21 at 11:30
29

Tell us about the queries, and your hardware environment.

I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.

Update

Okay, why?

First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database because

  • My experience with similar systems suggests the access will either be big sequential (computing some kind of time series analysis) or very very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which either will take lots of time or lots of space.

  • If you're doing what are effectively big runs against lots of data in an OLAP world, however, a column-oriented approach might be best.

  • If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because

    • you can better exploit parallelism on relatively small commodity hardware.
    • you can also better implement high reliability and redundancy
    • many of those problems lend themselves naturally to the MapReduce paradigm.

But the fact is, until we know about your workload, it's impossible to say anything definitive.

Charlie Martin
  • 7
    What advantage does "NoSQL" offer here? Why not a *single large table in a traditional RDBMS*? (With correct indexes, etc.) Everyone goes "NoSQL", "NoSQL", "NoSQL", but... *why*? –  Mar 22 '12 at 01:45
  • 5
    I have to say that my suggestion would also be a NoSQL approach using Apache Accumulo (that's personal preference). The dataset's small (for Accumulo) and the type of queries required seem perfectly suited to it using its distributed iterator stack. – Binary Nerd Mar 22 '12 at 04:57
  • Thanks for the expanded answer. I can +1 that. –  Mar 22 '12 at 05:02
  • 2
    Sometimes some of the comments here just confuse me. '-1 for use of a database where it doesn't make sense?' The whole answer argues *against* a traditional database. – Charlie Martin Nov 21 '12 at 19:49
18

Okay, so this is somewhat away from the other answers, but... it feels to me like if you have the data in a file system (one stock per file, perhaps) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you'll know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format) and you're away.

I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you'd need to balance large blobs (fewer records, but probably reading more data than you need each time) with small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).

Next you add caching - I'd suggest giving different servers different stocks to handle for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the "load on demand" part and just load all the files on start-up. That would simplify things, at the cost of slower start-up (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).

Note that you don't need to store the stock symbol, date or minute for each record - because they're implicit in the file you're loading and the position within the file. You should also consider what accuracy you need for each value, and how to store that efficiently - you've given 6SF in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer value will be) and use masking/shifting to get it back to three integers. You'll need to know what scale to use, of course - which you could probably encode in the spare 4 bits, if you can't make it constant.

You haven't said what the other three integer columns are like, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That's only ~110GB for the whole database, which isn't really very much...
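
To make the bit-twiddling concrete, here is a small C++-flavoured sketch of the packing described above: three values that each fit in 20 bits, plus a 4-bit scale code, in one 64-bit word. The layout and names are illustrative assumptions, not something prescribed by the answer.

#include <cstdint>

// Assumed layout: [scale:4][a:20][b:20][c:20] in a single 64-bit word.
inline uint64_t pack(uint32_t a, uint32_t b, uint32_t c, uint32_t scale)
{
    return (uint64_t(scale) << 60) | (uint64_t(a) << 40)
         | (uint64_t(b) << 20) |  uint64_t(c);
}

inline void unpack(uint64_t word, uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& scale)
{
    scale = uint32_t(word >> 60) & 0xF;
    a     = uint32_t(word >> 40) & 0xFFFFF;
    b     = uint32_t(word >> 20) & 0xFFFFF;
    c     = uint32_t(word)       & 0xFFFFF;
}

For instance, 43.7444 at 6 significant figures becomes 437444 (comfortably under 2^20 = 1,048,576), with the scale code recording "divide by 10^4".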

EDIT: The other thing to consider is that presumably the stock doesn't change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28GB of data in your files... which sounds a lot smaller than you were probably originally thinking. Having that much data in memory is very reasonable.

EDIT: I think I've missed out the explanation of why this approach is a good fit here: you've got a very predictable aspect for a large part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you're removing a whole bunch of work. It's a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 up to the length of the array allows for quick access and more efficient storage.

Jon Skeet
  • Again this depends on how he is using the data. If his query is to pull particular data across the board (stock-symbol-wise), then this would incur reading every file and having specific date encodings to pull the correct data from each. Or if he wants the best performing stock per week, then that would be a nightmare with this kind of setup, having to read all records, sort and compare. Without such information, we can only guess that this is for fixed storage - maybe as a bulk DW that will feed a reporting DW at some point (ETL source). – Wolf5370 Mar 30 '12 at 06:01
  • 2
    @Wolf5370: Yes, we certainly need to know what the queries will be, but we have at least some indication from the question: 'Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52'. It would be nice to know what the *other* queries would be, as well as relative frequencies and performance requirements. – Jon Skeet Mar 30 '12 at 06:04
  • @JonSkeet it really depends on the workload, but I've got some domain knowledge of this kind of system, and it's rarely just "select one stock over one range": it's much more often "select stocks in this portfolio over this range, compute β then try this list of possible stocks and see what β is then." That's why it drives you toward something OLAP-like. – Charlie Martin Mar 31 '12 at 09:29
  • 2
    @CharlieMartin: Well I was just going by what the question states. However, if you can basically get it all in memory (across a few servers) then it's still pretty easy - ask each server for the relevant stocks in the portfolio, then put the results together. I think my point about using the known aspects of the data (once per minute, but not on weekends or overnight) is still useful in terms of significantly reducing the difficulty of getting it all in memory. – Jon Skeet Mar 31 '12 at 09:34
  • This discussion reminds me of Fred Brooks' quote, "Representation is the essence of programming" and the related problems in Bentley's 'Programming Pearls'. – C S Dec 28 '19 at 01:57
14

It is my understanding that HDF5 was designed specifically with the time-series storage of stock data as one potential application. Fellow stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes, physics.
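
For a feel of what this looks like, here is a minimal sketch using the HDF5 C API to store one symbol's minute bars as a one-dimensional dataset of a compound type. The file name, dataset-per-symbol layout, and record fields are assumptions for illustration, not something prescribed by HDF5 or the linked answers.

#include "hdf5.h"

typedef struct { int date, minute; double open, high, low, close, volume; } Bar;

int main(void)
{
    Bar bars[2] = { {20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7},
                    {20041208, 939, 43.7444, 43.7600, 43.740, 43.7500, 12000.0} };

    /* Describe the record layout to HDF5 as a compound type. */
    hid_t type = H5Tcreate(H5T_COMPOUND, sizeof(Bar));
    H5Tinsert(type, "date",   HOFFSET(Bar, date),   H5T_NATIVE_INT);
    H5Tinsert(type, "minute", HOFFSET(Bar, minute), H5T_NATIVE_INT);
    H5Tinsert(type, "open",   HOFFSET(Bar, open),   H5T_NATIVE_DOUBLE);
    H5Tinsert(type, "high",   HOFFSET(Bar, high),   H5T_NATIVE_DOUBLE);
    H5Tinsert(type, "low",    HOFFSET(Bar, low),    H5T_NATIVE_DOUBLE);
    H5Tinsert(type, "close",  HOFFSET(Bar, close),  H5T_NATIVE_DOUBLE);
    H5Tinsert(type, "volume", HOFFSET(Bar, volume), H5T_NATIVE_DOUBLE);

    /* One dataset per symbol, e.g. "/XX". */
    hsize_t dims[1] = {2};
    hid_t file  = H5Fcreate("market.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/XX", type, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, type, H5S_ALL, H5S_ALL, H5P_DEFAULT, bars);

    H5Dclose(dset); H5Sclose(space); H5Tclose(type); H5Fclose(file);
    return 0;
}

For the full 7.3-billion-row dataset you would presumably also enable chunking and compression on the dataset creation property list, which HDF5 supports natively.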

sarnold
  • 2
    +1 for a specific solution. I do, however love SQL DQL (for the most part) and the flexibility it affords... not sure what is required with HDF5 to move out of a "hierarchical view". –  Mar 22 '12 at 01:48
4

Here is an attempt to create a Market Data Server on top of the Microsoft SQL Server 2012 database, which should be good for OLAP analysis. It's a free open-source project:

http://github.com/kriasoft/market-data

Konstantin Tarkus
  • Yeah. Not sure if that particular project is applicable, but I would definitely suggest the OP consider OLAP or a Data Warehousing fact table structure; both approaches (sometimes used together) are designed to address this kind of data with very large numbers of rows. It really depends on what kind of analysis they intend to perform though. – AaronLS Aug 20 '14 at 21:37
4

First, there aren't 365 trading days in the year: with holidays and 52 weekends (104 days), it's roughly 250 days, times the actual hours per day the market is open, as someone already said. Also, using the symbol as the primary key is not a good idea, since symbols change; use a numeric k_equity_id together with the symbol (char), since symbols can look like A or GAC-DB-B.TO. Your estimate of 7.3 billion rows is therefore vastly over-calculated, since it's only about 1.7 million rows per symbol for 14 years. In your data tables of price info, you then have:

k_equity_id k_date k_minute

and for the EOD table (that will be viewed 1000x over the other data)

k_equity_id k_date

Second, don't store your by-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a PnF or line chart over a year-long period has zero interest in the by-the-minute information.

Richard
3

Let me recommend that you take a look at Apache Solr, which I think would be ideal for your particular problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,

"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"

would translate to something like:

?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z]

Assuming "stock" is the stock name and "date" is a "DateField" created from the "date" and "minute" columns of your input data on indexing. Solr is incredibly flexible and I really can't say enough good things about it. So, for example, if you needed to maintain the fields in the original data, you can probably find a way to dynamically create the "DateField" as part of the query (or filter).

aliasmrchips
  • You can also use Amazon EC2 to set up your solr instance... http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/ – aliasmrchips Mar 28 '12 at 23:10
  • 3
    SOLR works great for searching, but you still need to store the data somewhere, in order to populate the indices. – Mike Purcell Mar 29 '12 at 02:39
  • True. I assume that Victor P has the data somewhere and it will need to be indexed. This will require additional resources... However, all of the proposed approaches do as well. – aliasmrchips Mar 31 '12 at 01:44
  • @aliasmrchips: I think the [InfluxDB approach](http://stackoverflow.com/questions/9815234/how-to-store-7-3-billion-rows-of-market-data-optimized-to-be-read/39475372#39475372) does better - it both stores efficiently (high throughput, 80x better compression than Mongo), and queries easily. – Dan Dascalescu Sep 13 '16 at 18:26
3

I think any major RDBMS would handle this. At the atomic level, one table with correct partitioning seems reasonable (partitioned based on your data usage, if fixed - this is likely to be either symbol or date).

You can also look into building aggregated tables for faster access above the atomic level. For example, if your data is stored at the day level, but you often query it at the week or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done through a cached view (various names across DB solutions - but basically it's a view on the atomic data that, once run, is cached/hardened into a fixed temp table which is queried for subsequent matching queries; it can be dropped at intervals to free up memory/disk space).

I guess we could help you more with some idea as to the data usage.

Wolf5370
3

You should compare the slow solutions with a simple optimized in-memory model. Uncompressed, it fits in a 256 GB RAM server. A snapshot fits in 32 KB, and you just index it positionally on datetime and stock. Then you can make specialized snapshots, as the open of one often equals the close of the previous.
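
A tiny sketch of what "index it positionally" can mean here: if records are laid out one per (trading minute, stock) in a fixed order, the offset of any record is pure arithmetic and no index structure is needed at all. The layout below is an assumption for illustration, not part of the original answer.

#include <cstddef>

// Assumed layout: records ordered by trading day, then minute of the
// session, then stock, with fixed counts of stocks and session minutes.
inline std::size_t record_index(std::size_t day, std::size_t minute_of_session,
                                std::size_t stock,
                                std::size_t minutes_per_session,
                                std::size_t num_stocks)
{
    return (day * minutes_per_session + minute_of_session) * num_stocks + stock;
}

One minute's snapshot of all 1000 stocks is then a contiguous block of num_stocks records, which is presumably where the "snapshot fits in 32 KB" figure comes from (roughly 32 bytes per stock times 1000 stocks).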

[edit] Why do you think it makes sense to use a database at all (rdbms or nosql)? This data doesn't change, and it fits in memory. That is not a use case where a dbms can add value.

Stephan Eggermont
  • Actually, there are several reasons, not the least being that if you have 256 GB of memory it'd be nice if there was some room for temp space, the operating system, and so on. Then there are issues like checkpointing, logging, and fault tolerance -- once you start computing any intermediate results you're back to needing to manage storage. I agree that an RDBMS isn't the best choice -- but something smarter than "load the big array into memory" is absolutely needed. – Charlie Martin Mar 31 '12 at 09:33
  • checkpointing, logging and fault tolerance is extremely simple for near-static data. It sounds like an ideal fit for a prevayler style solution – Stephan Eggermont Apr 01 '12 at 13:06
  • Again, without better knowledge of the application it isn't possible to say for sure, but in general, the application isn't as static as you think, because you want to maintain result sets and because you are doing costly calculations with, again, checkpointing and precomputed partial results. – Charlie Martin Apr 02 '12 at 13:35
2

You will want the data stored in a columnar table / database. Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows for columnar tables. These are extremely efficient for SELECTing from very large datasets. They are also efficient at importing large datasets.

A free columnar database is MonetDB.

ZagNut
2

If you have the hardware, I recommend MySQL Cluster. You get the MySQL/RDBMS interface you are so familiar with, and you get fast and parallel writes. Reads will be slower than regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads due to the way MySQL Cluster and the NDB storage engine works.

Make sure that you have enough MySQL Cluster machines and enough memory/RAM for each of those though - MySQL Cluster is a heavily memory-oriented database architecture.

Or Redis, if you don't mind a key-value / NoSQL interface to your reads/writes. Make sure that Redis has enough memory - it's super-fast for reads and writes, and you can do basic queries with it (non-RDBMS though), but it is also an in-memory database.

Like others have said, knowing more about the queries you will be running will help.

Suman
1

If your use case is to simply read rows without aggregation, you can use an Aerospike cluster. It's an in-memory database with file-system support for persistence. It's also SSD-optimized.

If your use case needs aggregated data, go for a MongoDB cluster with date-range sharding. You can group data year-wise into shards.

Gaurav Kumar