
So my app exports an 11.5 MB CSV file, and in doing so it uses basically all of the server's RAM, which never gets freed.

The data for the CSV is taken from the DB, and in the case mentioned above the whole thing is being exported.

I am using Ruby 2.4.1's standard CSV library in the following fashion:

export_helper.rb:

module ExportHelper
  class ReportGenerator
    def full_report
      CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
        data = Model.scope1(param).scope2(param).includes(:model1, :model2)
        data.each do |item|
          file << [
            item.method1,
            item.method2,
            item.method3
          ]
        end
        # repeat for other models - approx. 5 other similar loops
      end
    end
  end
end

and then in the controller:

generator = ExportHelper::ReportGenerator.new
generator.full_report
respond_to do |format|
  format.csv do
    send_file(
      "#{Rails.root}/full_report.csv",
      filename: 'full_report.csv',
      type: :csv,
      disposition: :attachment
    )
  end
end

After a single request the Puma processes occupy 55% of the server's RAM and stay that way until the server eventually runs out of memory completely.

For instance, in this article generating a million-line, 75 MB CSV file required only 1 MB of RAM. But there was no DB querying involved.

The server has 1015 MB of RAM plus 400 MB of swap.

So my questions are:

  • What exactly consumes so much memory? Is it the CSV generation or the communication with the DB?
  • Am I doing something wrong and missing a memory leak? Or is it just how the library works?
  • Is there a way to free up the memory without restarting the Puma workers?

Thanks in advance!

mityakoval

3 Answers


Instead of `each` you should be using `find_each`, which is made specifically for cases like this: it instantiates the models in batches (1,000 by default) and releases each batch for garbage collection, whereas `each` instantiates all of them at once.

CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  Model.scope1(param).find_each do |item|
    file << [
      item.method1
    ]
  end
end
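
If the default batches of 1,000 records are still too large for your instance, `find_each` also accepts a `batch_size` option. A minimal sketch, reusing the scope from above; smaller batches lower peak memory at the cost of more queries:

CSV.open('full_report.csv', 'wb', encoding: 'UTF-8') do |file|
  # Load only 200 records at a time instead of the default 1,000;
  # each finished batch becomes eligible for garbage collection.
  Model.scope1(param).find_each(batch_size: 200) do |item|
    file << [
      item.method1
    ]
  end
end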

Furthermore, you should stream the CSV to the browser instead of building the whole file in memory or on disk first:

format.csv do
  headers["Content-Type"] = "text/csv"
  headers["Content-disposition"] = "attachment; filename=\"full_report.csv\""

  # streaming_headers
  # nginx doc: Setting this to "no" will allow unbuffered responses suitable for Comet and HTTP streaming applications
  headers['X-Accel-Buffering'] = 'no'
  headers["Cache-Control"] ||= "no-cache"

  # Rack::ETag 2.2.x no longer respects 'Cache-Control'
  # https://github.com/rack/rack/commit/0371c69a0850e1b21448df96698e2926359f17fe#diff-1bc61e69628f29acd74010b83f44d041
  headers["Last-Modified"] = Time.current.httpdate

  headers.delete("Content-Length")
  response.status = 200

  header = ['Method 1', 'Method 2']
  csv_options = { col_sep: ";" }

  csv_enumerator = Enumerator.new do |y|
    y << CSV::Row.new(header, header).to_s(csv_options)
    Model.scope1(param).find_each do |item|
      y << CSV::Row.new(header, [item.method1, item.method2]).to_s(csv_options)
    end
  end

  # setting the body to an enumerator, rails will iterate this enumerator
  self.response_body = csv_enumerator
end
eikes
    Thanks for your response! Using `find_each` helped to reduce memory load from 55% to 20%. However, it never gets released. So roughly 4 exports like this would exhaust the whole RAM. Is there a solution to this? – mityakoval Oct 16 '18 at 12:38
  • @mityakoval use a lower value for the [`batch_size`](https://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each) configuration option. This will help with memory usage, but note that it will have a negative impact on performance. – smefju Nov 08 '19 at 08:17

Apart from using `find_each`, you should try running the ReportGenerator code in a background job with ActiveJob. Since background jobs run in separate processes, memory is released back to the OS when they are killed.

So you could try something like this:

  • A user requests some report (CSV, PDF, Excel)
  • Some controller enqueues a ReportGeneratorJob (see the sketch below), and a confirmation is displayed to the user
  • The job is performed and an email is sent with the download link/file.
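
A minimal sketch of such a job, reusing the generator from the question; the `ReportMailer` and its `full_report_ready` method are hypothetical placeholders for the notification step:

class ReportGeneratorJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    # Reuse the generator from the question; the file is written to disk
    # inside the worker process, not the web process.
    ExportHelper::ReportGenerator.new.full_report
    # Hypothetical mailer that sends the download link/file to the user.
    ReportMailer.full_report_ready(user).deliver_later
  end
end

# In the controller: enqueue the job and render the confirmation right away.
ReportGeneratorJob.perform_later(current_user.id)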
R. Sierra
  • Yeah, I also thought about it. But unfortunately, it has to be a full-duplex request. – mityakoval Oct 17 '18 at 10:17
  • @mityakoval, if you need to do it synchronously, use ActiveJob anyway but instead of calling `perform_later` call `perform_now` with a high priority queue. You will have some additional delay but the report would still be created on a separate thread. – R. Sierra Oct 17 '18 at 12:59
  • Wow! Didn't think of this option. Thanks a lot! – mityakoval Oct 17 '18 at 14:18
  • Tried this approach, and unfortunately the job is executed in the main thread :_( – mityakoval Oct 18 '18 at 10:57

Beware though: you can easily improve the ActiveRecord side, but when the response is sent through Rails, it can all end up in a memory buffer in the Response object: https://github.com/rails/rails/blob/master/actionpack/lib/action_dispatch/http/response.rb#L110

You also need to make use of the live-streaming feature to pass the data to the client directly without buffering: https://guides.rubyonrails.org/action_controller_overview.html#live-streaming-of-arbitrary-data
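
A minimal sketch of that approach, assuming the `Model` scopes and `param` from the question; the controller and action names are placeholders:

require 'csv'

class ReportsController < ApplicationController
  # ActionController::Live writes to the socket as data is produced
  # instead of buffering the whole body in the Response object.
  include ActionController::Live

  def full_report
    response.headers['Content-Type'] = 'text/csv'
    response.headers['Content-Disposition'] = 'attachment; filename="full_report.csv"'

    Model.scope1(param).find_each do |item|
      response.stream.write CSV.generate_line([item.method1, item.method2])
    end
  ensure
    # Always close the stream, or the connection is left open.
    response.stream.close
  end
end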

lzap