
I'm working for a trader who wants to build a relatively large archive of stock quotes for further analysis. We have access to an information provider, and the data always have the same structure: ticker name, timestamp, and OHLC, so it fits a relational-database approach. So far I've made a working prototype using Ruby on Rails (Ruby 2.6, Rails 6.0) and PostgreSQL. It can fetch the data from the provider pretty fast, but storing millions of records in the database is very slow. I switched from ActiveRecord style to plain SQL, which made it about 3 times faster, but it still takes a very long time to store even a small portion of the data we need. So I need to improve the performance of the prototype somehow, but I'm not sure what to do exactly. I could use something like Redis, but I need to store the data in persistent storage anyway. What should I do?

chernish2
  • Your question is extremely broad. We don't know your hardware environment, the schema, or your expertise, so any answer that is detailed would result in writing a book. We can only give you broad answers in return, which is not the SO way. Please see "[ask]", "[Stack Overflow question checklist](https://meta.stackoverflow.com/questions/260648)" and _all_ their linked pages. – the Tin Man May 10 '20 at 05:40

2 Answers


This question will probably be flagged, but I'll give you an answer anyway:

ActiveRecord isn't necessarily what's slow here; ActiveModel is.

ActiveModel is what maps your result set and database types into Ruby-friendly objects, and what takes your data (string-y things) and converts it into database types for inserting/updating.

If you're rendering JSON, make sure you're using a C-extension renderer such as Oj (https://github.com/ohler55/oj) or similar; it makes a huge difference for larger datasets and responses.
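As a minimal sketch of wiring that up (assuming the oj gem is already in your Gemfile), an initializer is usually all it takes:

# config/initializers/oj.rb
require "oj"

# Swap Rails' default JSON encoding for Oj's optimized C implementation.
Oj.optimize_rails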

If you're comfortable with the result set being an array of hashes you loop through, you can avoid hitting ActiveModel with connection#exec_query:

# exec_query skips ActiveModel instantiation; #to_a returns plain hashes keyed by column name.
stocks_query = Stock.all
results = Stock.connection.exec_query(stocks_query.to_sql).to_a

results.each do |result|
  puts result # each result is a plain Hash, not a Stock instance
end

If you're inserting millions of rows, or even updating millions of rows, the right tool for the job is raw SQL. But that gets unwieldy, and you like Ruby, so you can use something like https://github.com/zdennis/activerecord-import, which lets you write Ruby and takes care of the rest.
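A rough sketch with activerecord-import, assuming the gem is installed and a Stock model; the column names and the fetched_quotes collection are placeholders for whatever your provider returns:

columns = [:ticker, :quoted_at, :open, :high, :low, :close]

rows = fetched_quotes.map do |q|
  [q[:ticker], q[:timestamp], q[:open], q[:high], q[:low], q[:close]]
end

# One multi-row INSERT per batch instead of one INSERT per record;
# validate: false skips ActiveModel validations for extra speed.
rows.each_slice(10_000) do |batch|
  Stock.import columns, batch, validate: false
end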

Tens of millions of records isn't that many. PostgreSQL is probably your best bet without introducing a ton of unnecessary complexity. You just need to know the bottlenecks of communicating with the database layer with Rails, and how to work around them. Make sure that your database is indexed correctly as well. https://github.com/plentz/lol_dba is a good place to start if you're unsure.

Josh Brody

Most DBMSes support some form of bulk insert that you can leverage; better yet, you can use PostgreSQL's COPY command, which is optimized for loading large numbers of rows. This, of course, means writing raw SQL.
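As a hedged sketch of driving COPY from Ruby (assuming the pg gem; the table name, column names, and fetched_quotes are placeholders), you can stream rows through the raw connection:

# Grab the underlying PG::Connection from ActiveRecord.
conn = ActiveRecord::Base.connection.raw_connection

# COPY ... FROM STDIN streams all rows in one command, bypassing per-row INSERT overhead.
conn.copy_data "COPY stocks (ticker, quoted_at, open, high, low, close) FROM STDIN WITH (FORMAT csv)" do
  fetched_quotes.each do |q|
    # Naive CSV formatting for illustration; quote/escape values properly in real code.
    conn.put_copy_data "#{q[:ticker]},#{q[:timestamp]},#{q[:open]},#{q[:high]},#{q[:low]},#{q[:close]}\n"
  end
end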

If for some reason you don't want to use COPY and you'd rather INSERT your rows, it's smart to avoid per-row commits: disable auto-commit on the client (or simply wrap the whole load in one explicit transaction) and commit only once, after all rows have been inserted, using multi-row VALUES lists -- something along the lines of:

BEGIN;

INSERT INTO table (column1, column2, …)
VALUES
    (value1, value2, …),
    (value1, value2, …),
    …;

COMMIT;
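Since the question mentions Rails 6, the same pattern (one transaction, multi-row INSERTs in batches) can also be written from Ruby with ActiveRecord's insert_all, which skips validations and callbacks. A sketch with placeholder model and column names:

# One transaction around the whole load; insert_all builds one multi-row INSERT per batch.
ActiveRecord::Base.transaction do
  fetched_quotes.each_slice(10_000) do |batch|
    Stock.insert_all(
      batch.map do |q|
        { ticker: q[:ticker], quoted_at: q[:timestamp],
          open: q[:open], high: q[:high], low: q[:low], close: q[:close] }
      end
    )
  end
end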

The idea here is to drop all your indexes and foreign key constraints (if you have them) before the load and put them back once you're done. This should speed things up considerably.
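A small sketch of that step from Ruby, using ActiveRecord's schema methods (the table, column, and index names here are made up for illustration):

conn = ActiveRecord::Base.connection

# Drop the index before the bulk load...
conn.remove_index :stocks, column: [:ticker, :quoted_at]

# ...run the COPY / bulk INSERT here...

# ...then recreate it once the data is loaded.
conn.add_index :stocks, [:ticker, :quoted_at]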

Also, if you can (and it's safe to do so), get the data ready in a file first; I'm not sure how else you'd guarantee the next row is ready the moment the database is ready for it without complicating things.

PostgreSQL has a guide on populating a database for exactly this kind of bulk loading. There's more info in these answers too.

emi