
I am working with a 50 GB MySQL export file, and I am performing a series of scripting operations on it to convert it into a form that SQLite3 can load (I got the leads from here: script to convert mysql dump sql file into format that can be imported into sqlite3 db). I have not studied the structure of the MySQL dump; the data was obtained from a third party. I can see that it contains CREATE TABLE and INSERT INTO statements, but given the size it is hard to read through it manually and understand the structure. Piping the whole file through the conversion scripts does not work because of its size. A bash script that loads the file and then processes it line by line, such as

while read line
do
    <do something>
done < file

complains that it is Out of Memory.

So I tried to extract each line using awk or sed (both work), write the line to a file, and then pass that file through the list of Perl scripts. This is the awk command I am using:

$ awk -vvar="$x" 'NR==var{print;exit}' file > temp

where x holds the line number; temp is then sent through the Perl commands and the result is finally appended to the output file.

However, although it is fast initially, it quickly slows down, because awk has to scan past an ever-increasing number of lines from the start of the file to reach the requested one. There are about 40,000 lines in total.

Has anyone worked with something like this? Is there a faster way of doing this?

  • Why does the script have to iterate over an increasing number of lines? I guess you do not have 40000 files containing a single line each, but could you give some more details? – Håkon Hægland Apr 26 '14 at 22:47
  • The command $ sed -n '10{p;q}' file would read over the first 9 lines, give me the 10th line, and quit sed. As the number goes up to 40,000, it has to iterate over more lines to reach the specific one. – ame Apr 26 '14 at 22:50
  • Ok, I see.. Would it be an option to split the file into, for instance, 50 1Gb files first? And then do the processing on each chunk, and at last join the chunks again.. – Håkon Hægland Apr 26 '14 at 22:54
  • I am restricted with regard to the number of files I can have on the supercomputer, so although 50 is too much I could try to get a maximum number which will be better than 1. Thanks for the lead. – ame Apr 26 '14 at 23:00
  • Maybe you could also post the script you are using? (If it is big you could try uploading it to pastebin.com and give the link here).. – Håkon Hægland Apr 26 '14 at 23:22
  • This seems to be a question that many more people could help with if you gave examples of the MySQL export data and the SQLite3 loadable format that you want. I assume you want one line of SQLite data output for each line of MySQL read from the input? The solution you link to looks *horrible* to me, and I am certain that it could be written in a single Perl program. – Borodin Apr 27 '14 at 00:14
  • Are you doing `SELECT * FROM table INTO OUTFILE 'table.txt` in MySQL, and then `.import table.txt table` in SQLite? – Borodin Apr 27 '14 at 01:16
  • I haven't written the MySQL dump, and have not been able to follow the structure completely given the size of the file. The question has been updated to reflect this. Also, the suggestion to split the file up worked for the script, although sqlite3 still complains about the format. Thanks! – ame Apr 27 '14 at 15:54
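
For reference, here is a minimal Perl sketch of that splitting step, writing a fixed number of lines per chunk so the total number of chunk files stays small (untested; the chunk size and file names are hypothetical, and the standard split(1) utility would do the same job):

#!/usr/bin/env perl
use strict;
use warnings;

my $lines_per_chunk = 1000;   # roughly 40 chunks for a 40,000-line dump
my $chunk = 0;
my $fh;

while (<>)
{
    # Start a new chunk file every $lines_per_chunk lines.
    if (($. - 1) % $lines_per_chunk == 0)
    {
        close $fh if $fh;
        open $fh, ">", sprintf("chunk_%03d.sql", $chunk++)
            or die "Cannot open chunk file: $!";
    }
    print $fh $_;
}
close $fh if $fh;

Run as perl chunker.pl < dump.sql (names are hypothetical); each chunk can then be converted separately and the results concatenated.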

1 Answer


Simply process one line at a time:

while read -r line
do
    echo "$line" > temp
    # process temp with Perl, etc.
done < file

At least this won't exhibit the quadratic behaviour that your awk approach shows when reading the file. It reads the big file exactly once, which is optimal in Big-O terms (within a constant factor).

If, as you say, that causes problems in bash, then you should use Perl to read each line. With 40,000 lines in 50 GiB of data, you have about 1¼ MiB of data per line. That is unlikely to cause Perl any problems, though it might perhaps cause Bash problems. You can either revise the existing Perl to read one line at a time, or use a simple wrapper Perl script that does the job of the Bash script above.
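
For the first option, this is a minimal sketch of a streaming filter (untested; the script name is hypothetical, and the comment marks where the per-line conversions from the existing scripts would go):

#!/usr/bin/env perl
use strict;
use warnings;

# Read the dump one line at a time, so only one line (about 1¼ MiB here)
# is held in memory at any moment.
while (my $line = <>)
{
    # ...apply the per-line MySQL-to-SQLite conversions here
    # (placeholder for the existing transformation code)...
    print $line;
}

You would invoke it as, say, perl streaming.pl <file >output. The second option, keeping the existing script untouched and driving it from a wrapper, is shown below.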

wrapper.pl

Assuming your current processing script is called script.pl:

#!/usr/bin/env perl
use strict;
use warnings;

my $file = "temp";

while (<>)
{
    # Write the current line to a temporary file...
    open my $fh, ">", $file or die "Cannot open $file: $!";
    print $fh $_;
    close $fh or die "Cannot close $file: $!";
    # ...then run the existing processing script on that file.
    system "perl", "script.pl", $file;
}

Untested code

Invocation:

perl wrapper.pl <file >output
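
Bear in mind that this wrapper launches a new perl process for every line, which means about 40,000 invocations of script.pl for this dump, so revising the existing script to read one line at a time, as sketched above, also avoids that per-process start-up overhead.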
Jonathan Leffler