
Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 makes object-oriented programming very easy, an aspect that was very painful for me in Perl5.

My program has to read and store big files, such as whole genomes (up to 3 Gb and more, see Example 1 below) or tabulated data.

The first version of the code was written the Perl5 way, iterating line by line ("genome.fa".IO.lines). It was very slow, with an unacceptable execution time.
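The snippets below build objects of a small seq class that the original post doesn't show; a minimal definition consistent with how it is used would be:

# minimal holder class assumed by the snippets below (not shown in the original post)
my class seq {
  has Str $.id;   # record identifier, e.g. "2L"
  has Str $.seq;  # the concatenated sequence data
}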

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    my $id;
    my $s;

    for $!file.IO.lines -> $line {
      if $line ~~ /^\>/ {
        say $id;                 # debug output (prints the previous record's id)
        if $id.defined {
          %!seq{$id} = seq.new(id => $id, seq => $s);
        }
        my $l = $line;
        $l ~~ s:g/^\>//;         # strip the leading '>' from the header
        $id = $l;
        $s = "";
      }
      else {
        $s ~= $line;             # accumulate sequence lines
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);   # store the last record
  }
}


sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

So after a bit of RTFM, I switched to a slurp of the file followed by a split on \n, which I then parsed with a for loop. This way I managed to load the data in 2 minutes. Much better, but not enough. By cheating, I mean by removing as many \n as possible (Example 2), I decreased the execution time to 30 seconds. Quite good, but not totally satisfying, as this fasta format is not the most common.

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    my $id;
    my $s;

    say "Slurping ...";
    my $f = $!file.IO.slurp;

    say "Splitting file ...";
    my @lines = $f.split(/\n/);

    say "Parsing lines ...";
    for @lines -> $line {
      if $line !~~ /^\>/ {
        $s ~= $line;                 # accumulate sequence lines
      }
      else {
        say $id;                     # debug output (prints the previous record's id)
        if $id.defined {
          %!seq{$id} = seq.new(id => $id, seq => $s);
        }
        $id = $line;
        $id ~~ s:g/^\>//;            # strip the leading '>' from the header
        $s = "";
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);   # store the last record
  }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

So I RTFM'd again and discovered the magic of grammars. So, a new version, and an execution time of 45 seconds whatever the fasta format used. Not the fastest way, but more elegant and stable.

my grammar fastaGrammar {
  token TOP { <fasta>+ }

  token fasta   { <.ws><header><seq> }
  token header  { <sup><id>\n }
  token sup     { '>' }
  token id      { <[\d\w]>+ }
  token seq     { [<[ACGTNacgtn]>+\n]+ }
}

my class fastaActions {
  method TOP ($/) {
    my @seqArray;

    for $<fasta> -> $f {
      @seqArray.push: seq.new(id => $f.<header><id>.made, seq => $f<seq>.made);
    }
    make @seqArray;
  }

  method fasta ($/) { make ~$/; }
  method id    ($/) { make ~$/; }
  method seq   ($/) { make $/.subst("\n", "", :g); }   # drop the newlines inside the sequence
}

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {

    say "=> Slurping ...";
    my $f = $!file.IO.slurp;

    say "=> Grammaring ...";
    my @seqArray = fastaGrammar.parse($f, actions => fastaActions).made;

    say "=> Storing data ...";
    for @seqArray -> $s {
      %!seq{$s.id} = $s;
    }
  }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
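As a quick sanity check before pointing the grammar at a 3 Gb genome, it can be exercised on a small in-line string (a sketch, not part of the original post):

my $test = ">2L\nACGT\nNNNN\n>3R\nGGCC\n";
my @parsed = fastaGrammar.parse($test, actions => fastaActions).made.list;
say "{.id} => {.seq}" for @parsed;   # 2L => ACGTNNNN, 3R => GGCC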

I think I found a good solution to handle these kinds of big files, but performance is still below that of Perl5.

As a newbie in Perl6, I would be interested to know whether there are better ways to deal with big data, or whether there are limitations due to the Perl6 implementation. Specifically, two questions:

  • Are there other Perl6 mechanisms that I'm not yet aware of, or that are not yet documented, for storing huge data from a file (like my genomes)?
  • Have I reached the maximum performance of the current version of Perl6?

Thanks for reading!


Fasta Example 1:

>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
>3R
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...

Fasta Example 2:

>2L
GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT...            
>3R
TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCT...

EDIT: I applied the advice of @Christoph and @timotimo and tested with this code:

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    say "=> Slurping / Parsing / Storing ...";
    # each '>' starts a record; the first line of a record is the id,
    # the remaining lines are the sequence
    %!seq = slurp($!file, :enc<latin1>).split('>').skip(1).map: {
      .head => seq.new(id => .head, seq => .skip(1).join) given .split("\n").cache;
    }
  }
}


sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
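As an aside (not from the original post), the same split/skip/map pipeline applied to a tiny in-memory string makes the one-liner easier to follow:

my $data = ">2L\nGACAATGCAC\n>3R\nTAGGGAGAAA\n";

for $data.split('>').skip(1) -> $record {
    my @parts = $record.split("\n");   # array assignment caches the Seq, so no .cache needed
    say @parts.head, ' => ', @parts.skip(1).join;
}
# 2L => GACAATGCAC
# 3R => TAGGGAGAAA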

The program finished in 2.7 s, which is great! I also tried this code on the wheat genome (10 Gb): it finished in 35.2 s. Perl6 is not so slow after all!

Big thanks for the help!

  • You're already trying several different mechanisms. It will be tough to answer this without a whole lot more details of what you are trying to do. – Curt Tilmes Aug 24 '18 at 14:29
  • **Limits** [Larry on P6 speed in general and grammars in particular](https://www.youtube.com/watch?v=EwQoSZdEl2s&t=11m30s) from 11m30s thru 17m20s. (I'm pretty sure he hasn't ported Fates from STD to NQP.) **Please profile** See [P6 doc](https://docs.perl6.org/language/performance) & [beta of new profiler](https://wakelift.de/2018/08/15/the-first-public-release/) **Use P6+P5?** 1) P6 in P5 https://metacpan.org/pod/Inline::Perl6, 2) P5 in P6 https://github.com/niner/Inline-Perl5. **Improve your P5 OO?** Stevan designed P6 OO, then Moose for P5, now [Moxie](https://metacpan.org/pod/Moxie). – raiph Aug 24 '18 at 17:19
  • @CurtTilmes: I added some code snippets of the different versions of my fasta parsing. I hope it will be easier for people to understand what I'm trying to do. – Beuss Aug 24 '18 at 19:53
  • @raiph Thanks for the video, it was very instructive! I have thought about including Perl5 code to speed up the process, but because it's my first program in Perl6, I wanted to explore its capabilities before using that solution. I didn't have the time to read the whole manual, but I'm going to take a look at the profiler. – Beuss Aug 24 '18 at 19:57
  • Your first example can become much faster by dropping regex use down to zero; on my machine it goes from 40s down to 7.6s with these changes: `$file.IO.slurp.lines` instead of `$file.IO.lines`, `$line.starts-with(">")` instead of `$line ~~ /^>/`, `$l = $l.substr(1)` instead of `$l ~~ s:g/^\>//`. Also, drop the `say $id` from the loop. If you really need the output, maybe `put %!seq.keys.join("\n")` at the end of TWEAK or in a separate method. – timotimo Aug 25 '18 at 13:26 (see the sketch after these comments)
  • For anybody who wants to try the script, here's a one-liner that generates some example fasta data: `my $f = "genome.fa".IO.open(:w); my @ids = ("A".."Z").combinations(20).pick(*); while $f.tell < 300_000_000 { $f.put(">" ~ @ids.shift.join()); $f.put(.roll(80).join()) for ^(2..15).pick }` – timotimo Aug 25 '18 at 13:27
  • I'm not sure if it belongs in a comment, or in an answer of its own, but here's a blog post I wrote about this question: "Faster FASTA, please": https://wakelift.de/2018/08/31/faster-fasta-please/ – timotimo Aug 31 '18 at 21:18
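Putting together the changes timotimo suggests in his first comment, a regex-free TWEAK for the original line-by-line version might look roughly like this (a sketch; this exact code was not posted in the thread, and it slots into the first fasta class above):

  submethod TWEAK() {
    my $id;
    my $s;

    # slurp once, then iterate over the lines of the in-memory string
    for $!file.IO.slurp.lines -> $line {
      if $line.starts-with(">") {              # string method instead of a regex match
        %!seq{$id} = seq.new(id => $id, seq => $s) if $id.defined;
        $id = $line.substr(1);                 # drop the leading '>' without s///
        $s  = "";
      }
      else {
        $s ~= $line;
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);
  }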

1 Answer


One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.

As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.

I did not do any benchmarking, but what I'd try first is something like this:

my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
    .[0] => .[1..*].join given .split("\n");
}

As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:

my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
    # find the next header marker; stop when there is none left
    $pos = $data.index('>', $pos) // last;

    # key runs from just after '>' to the end of the line
    my $ks = $pos + 1;
    my $ke = $data.index("\n", $ks);

    # sequence runs from the next line to the next '>' (or end of file)
    my $ss = $ke + 1;
    my $se = $data.index('>', $ss) // $data.chars;

    my @lines;

    $pos = $ss;
    while $pos < $se {
        my $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos..^$end));
        $pos = $end + 1;
    }

    %seqs{$data.substr($ks..^$ke)} = @lines.join;
}

However, if the parts of the standard library used have seen some performance work, this might actually make things worse. In that case, the next step to take would be adding low-level type annotations such as str and int and replacing calls to routines such as .index with NQP builtins such as nqp::index.

If that's still too slow, you're out of luck and will need to switch languages, e.g. calling into Perl5 using Inline::Perl5 or into C using NativeCall.
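For illustration, here is a minimal sketch of the Inline::Perl5 route (assuming the module is installed; the embedded Perl5 snippet is just a placeholder, not a real fasta parser):

use Inline::Perl5;

# with Inline::Perl5 loaded, EVAL can run ordinary Perl5 code
my $sum = EVAL q{
    my $total = 0;
    $total += $_ for 1 .. 10;
    $total;
}, :lang<Perl5>;

say $sum;  # 55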


Note that @timotimo has done some performance measurements and written an article about it.

If my short version is the baseline, the imperative version improves performance by 2.4x.

He actually managed to squeeze a 3x improvement out of the short version by rewriting it to:

my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
    .head => .skip(1).join given .split("\n").cache;
}

Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17x. Given potential portability issues, writing such code is generally discouraged, but it may be necessary for now if you really need that level of performance:

use nqp;

my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;

my str @lines;

loop {
    $pos = nqp::index($data, '>', $pos);

    last if $pos < 0;   # nqp::index returns -1 when nothing is found

    my int $ks = $pos + 1;
    my int $ke = nqp::index($data, "\n", $ks);

    my int $ss = $ke + 1;
    my int $se = nqp::index($data, '>', $ss);

    if $se < 0 {
        $se = nqp::chars($data);
    }

    $pos = $ss;
    my int $end;

    while $pos < $se {
        $end = nqp::index($data, "\n", $pos);
        nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
        $pos = $end + 1;
    }

    nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
    nqp::setelems(@lines, 0);   # reuse the native string array for the next record
}
– Christoph
  • This is very helpful! But it makes me sad, because it seems to boil down to “Perl6 isn’t ready for prime time yet”. – Dan Bron Aug 25 '18 at 12:44
  • Your first answer can be made drastically faster by using `.skip(1)` instead of `[1..*]`, and `.head` and `.skip(1)` inside the loop body; additionally it requires the `.split("\n")` to be "enhanced" to a `.split("\n").cache` so the head and skip methods work on it. That took it from 47s down to 12s on my machine. I have some more ideas, more on that in later comments or maybe an answer of its own. – timotimo Aug 25 '18 at 13:10
  • A quick profile of the second code revealed that a big source of time spent is in the `..^` Range constructor operator. Using `$pos, $end - $pos` instead of `$pos ..^ $end` gets me from 16.2s down to 8.75s, so almost halved the time. – timotimo Aug 25 '18 at 13:38
  • Using native int for all the position numbers got the code down to pretty much exactly 8s; I'll try the nqp ops next. – timotimo Aug 25 '18 at 13:44 (a sketch combining these two tweaks appears after the comments)
  • FWIW, the file I measured performance with was only ~160 megabytes big, with 20 characters for the IDs and 230_230 sequences in total, with 1_957_826 lines of actual genome sequence data in total. – timotimo Aug 25 '18 at 14:01
  • Doesn't MoarVM automatically do something similar to assuming Latin1 encoding until input breaks that assumption? In other words, isn't the `:enc` largely or entirely redundant from a performance perspective for a file that is actually Latin1? – raiph Aug 25 '18 at 14:19
  • @raiph What it does is try to store data read from utf8 sources with 8 bits per grapheme, until something is encountered that doesn't fit, at which point it will change over to 32 bits per grapheme. I do believe that the utf8 decoder has seen more optimization effort put into it than the latin1 one; when I tried switching the encoding, it made barely any difference. – timotimo Aug 25 '18 at 14:43
  • Converting the version with native ints to use nqp ops (those aren't officially supported btw, code using these ops can spontaneously break when rakudo changes) gets the program finishing in just 2.9s, of which 0.34s is system time according to time, and the profiler estimates about 18% of the total time to be spent inside "slurp" itself. Doesn't sound terribly bad. – timotimo Aug 25 '18 at 14:47
  • @timotimo It seems that I still have a lot to learn! Thanks for your answers, I'm going to try your suggestions! – Beuss Aug 27 '18 at 09:56
  • @timotimo: I expanded my answer using the results from your blog post. – Christoph Sep 01 '18 at 17:43
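Combining the two tweaks from the comments above (two-argument substr instead of the ..^ Range constructor, and native int for the positions), the intermediate non-NQP version that timotimo measured at roughly 8s would look something like this (a sketch reconstructed from the comments, not code posted verbatim in the thread):

my %seqs;
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;

loop {
    $pos = $data.index('>', $pos) // last;

    my int $ks = $pos + 1;
    my int $ke = $data.index("\n", $ks);

    my int $ss = $ke + 1;
    my int $se = $data.index('>', $ss) // $data.chars;

    my @lines;

    $pos = $ss;
    while $pos < $se {
        my int $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos, $end - $pos));   # two-arg substr avoids a Range object
        $pos = $end + 1;
    }

    %seqs{$data.substr($ks, $ke - $ks)} = @lines.join;
}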