5

The requirement is as follows:

We get a huge dataset from the database (> 1 billion records) and need to export it to a CSV or Excel file.

The current implementation uses CSV.generate:

    CSV.generate(headers: true) do |csv|
      csv << header
      @obj.find_each do |c|
        arr = [c.id, c.name, c.soon]
        csv << arr
      end
    end

and sends the output to:

    Zip::File.open(file, Zip::File::CREATE) do |zip|
      zip.get_output_stream("test.#{@format}") { |f| f.puts(convert_to_csv) }
    end

All of this runs in delayed jobs. It works well when there are fewer than about 20,000 records, but as the row count grows we start running into memory issues.

What I was thinking is to chunk the records into pieces, say splitting 1 million rows into 50 files of 20,000 rows each (1,000,000 / 20,000), e.g. csv1.csv, csv2.csv, ..., csv50.csv, and then either concatenating them into a single file or zipping all the files together (whichever is faster).

Can anyone give me an idea of how to start on this?
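For context, this is roughly the chunk-and-zip flow I have in mind (just a sketch, assuming the rubyzip gem and find_in_batches with 20,000-row chunks; file names and paths are placeholders):

    require 'csv'
    require 'zip'

    Zip::File.open("export.zip", Zip::File::CREATE) do |zip|
      part = 0
      @obj.find_in_batches(batch_size: 20_000) do |batch|
        part += 1
        path = "tmp/csv#{part}.csv"        # assumes a tmp/ directory exists
        CSV.open(path, "w") do |csv|
          csv << header
          batch.each { |c| csv << [c.id, c.name, c.soon] }
        end
        zip.add("csv#{part}.csv", path)    # each finished chunk goes into the archive
      end
    end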

Kunal Vashist
  • Can I ask where the CSV data will be used? I think Excel supports a max of 1 million rows, see https://stackoverflow.com/questions/23757991/maximum-number-of-rows-of-csv-data-in-excel-sheet. For this much data, you might be better off with some kind of database dump format. – lacostenycoder May 29 '19 at 14:02
  • 1
    One of the first things I would consider in this case is to get rid of unnecessary ORM overhead and use lower level queries to DB - there is not that much point in instantiating of billion of AR entities just to fetch 3 properties from them - using raw dataset will reduce memory usage and increase performance notably... – Konstantin Strukov May 29 '19 at 14:07
  • 1
    _Sidenote:_ `1million/20000` is 50, not 5 btw. Also, [`SELECT INTO OUTFILE`](https://dev.mysql.com/doc/refman/8.0/en/select.html). – Aleksei Matiushkin May 29 '19 at 14:12
  • @AlekseiMatiushkin Didn't know about that feature, thanks! Here is a similar one for Postgres: https://stackoverflow.com/questions/1517635/save-pl-pgsql-output-from-postgresql-to-a-csv-file – Martin Zinovsky May 29 '19 at 15:37

3 Answers

2

Taking a look at the source for CSV.generate gives me the impression that the CSV data is kept in memory while the contents are being accumulated. That seems like a good target for optimization, especially if you see memory usage scaling linearly with the data set. Since your data is pretty simple, could you skip CSV and go directly to File instead? You'd have a bit more control over when data is flushed out to disk.

File.open("my.csv") do |file|
  file.puts '"ID","Name","Soon"'
  @obj.find_each do |c|
    file.puts "\"#{c.id}\",\"#{c.name}\",\"#{c.soon}\""
    # flush if necessary
  end
end

You'd need to write to disk and then zip the results later with this approach.
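For the zipping step, something along these lines should work with the rubyzip gem you're already using (file names are placeholders):

    require 'zip'

    # Zip::File#add copies the already-written CSV from disk, so the whole
    # file never has to be loaded into memory at once.
    Zip::File.open("my.zip", Zip::File::CREATE) do |zip|
      zip.add("my.csv", "my.csv")
    end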

AndyV
1

Write to the CSV in chunks, batching the query and using pluck so only the needed columns are loaded. Something like:

    CSV.open("tmp.csv", "ab") do |csv|
      Model.in_batches(of: 10_000) do |batch|
        # pluck returns plain arrays of values, so no full AR objects are built
        batch.pluck(:id, :name).each { |row| csv << row }
      end
    end
B Seven
  • The OP is using find_each which, I'm sure you know, uses find_in_batches and then iterates over the batches and yields the results. The point of find_each is precisely avoiding memory issues, and I've used it quite successfully for larger data sets than the OP suggests are problematic. It's less "obviously ActiveRecord" than you simply assert. – AndyV May 30 '19 at 00:54
  • How can this improve performance, since find_each also iterates in batches (of 1,000)? – Kunal Vashist Jun 03 '19 at 17:59
  • Because the CSV falls out of scope and does not use a large amount of memory. In the OP's example, the entire CSV is held in memory. – B Seven Jun 03 '19 at 21:06
1

It would depend on whether

    arr = [c.id, c.name, c.soon]

needs to be calculated in Ruby, or whether you could rewrite it in SQL.

  • If you have to keep it in Ruby, you can try to avoid the ActiveRecord overhead and use a raw query instead (sketched after this list). You'd probably have to implement the chunk-wise processing yourself.
  • Otherwise, you could check out a database-native tool for CSV export. E.g., for MySQL that would be something like SELECT ... INTO OUTFILE or the mysql command-line client (also sketched below).
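To make both options a bit more concrete, here are rough sketches. Table and column names (objects, id, name, soon) are assumptions based on the question; adjust them to your schema.

Raw query, chunked with keyset pagination, writing plain arrays straight to CSV:

    require 'csv'

    conn       = ActiveRecord::Base.connection
    last_id    = 0
    batch_size = 20_000

    CSV.open("export.csv", "w") do |csv|
      csv << %w[ID Name Soon]
      loop do
        rows = conn.select_rows(
          "SELECT id, name, soon FROM objects " \
          "WHERE id > #{last_id.to_i} ORDER BY id LIMIT #{batch_size}"
        )
        break if rows.empty?
        rows.each { |row| csv << row }  # plain arrays, no AR objects instantiated
        last_id = rows.last.first       # keyset pagination instead of OFFSET
      end
    end

Database-native export for MySQL (requires the FILE privilege; note the file is written on the database server host, not the app server):

    # Let MySQL write the CSV itself; Ruby only issues the statement.
    ActiveRecord::Base.connection.execute(<<~'SQL')
      SELECT id, name, soon
      INTO OUTFILE '/tmp/export.csv'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      LINES TERMINATED BY '\n'
      FROM objects
    SQL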
claasz