3

I'm trying to modify a perl script. Here is the part I am trying to modify:

Original:

        system ("tblastn -db $BLASTDB -query $TMP/prot$$.fa \\
             -word_size 6 -max_target_seqs 5 -seg yes -num_threads $THREADS -lcase_masking \\
             -outfmt \"7 sseqid sstart send sframe bitscore qseqid\"\\
             > $TMP/blast$$") && die "Can't run tblastn\n";

I am trying to replace the system ("tblastn.....") with the following:

system ("cat $TMP/prot$$.fa | parallel --block 50k --recstart '>' --pipe \\ tblastn -db $BLASTDB -query - -word_size 6 -outfmt \'7 sseqid sstart send sframe bitscore qseqid\' -max_target_seqs 5 -seg yes -lcase_masking > $TMP/blast$$") && die "Can't run tblastn\n";

This replaces the normal tblastx program with GNU parallel, which pipes the tblastx command. Running the above command in bash (replacing the temp inputs with actual files) works perfectly, but when the perl script tries executing it, the error log (for tblastx) says it terminated too soon, after sseqids. The same error happens if you run the same command without the escape characters in bash.

Because of this I'm assuming the error is due to the single quote around the "7 ssequids sstart..." is not being parsed properly. I'm not sure how to do nested quotes properly in perl. I thought I was doing it right since it works via bash but not via the perl script. I looked at alot of perl documentation and everything says that the escape character \ should work with quotes or double quotes., yet for my instance it doesn't work.

Can someone provide input on why the quotes are not being processed?

Blaze
  • 31
  • 3
  • I see you have slashed single quotes `\'` in your code. Did you try to remove the slashes, leaving just the single quotes? – Marco Bernardini May 06 '15 at 05:09
  • Yes, I tried it without escape characters in the modified script as well. Same error, the script ends at the "7" since it sees a space. – Blaze May 06 '15 at 05:15
  • 1
    Another debug tip: instead of directly use the `system` call, create a string with all the stuff and print it, so you can see what would be passed to system. I've seen tblastn is pretty complex: here http://www.ncbi.nlm.nih.gov/books/NBK279682/ double quotes are used for `outfmt`. – Marco Bernardini May 06 '15 at 05:24
  • 1
    You could use an alternate quote delimiter for the perl side of things, like `system( qq[cat $TMP/prot$$...] )` to save you having to escape the inner `"`s. – Jim Davis May 06 '15 at 06:07

2 Answers2

1

The problem here is almost certainly quote interpolation. Each time you 'shell out' you unwrap another layer of quotes. What it does inside the quotes is a question of whether you're doing a double quote " - it interpolates, or a single quote ' and it then treats it as a literal, before passing to the next shell.

See perlop for how perl does quoting. I would suggest you try assembling a command like this:

my $parallel = q{parallel --block 50k --recstart '>' --pipe};
my $outfmt = q{'7 sseqid sstart send sframe bitscore qseqid'};

print $parallel,"\n";
print $outfmt,"\n";

my $command = "cat $TMP/prot$$.fa | $parallel \\ tblastn -db $BLASTDB -query - -word_size 6 -outfmt $outfmt -max_target_seqs 5 -seg yes -lcase_masking > $TMP/blast$$";

print $command; 
system ( $command );

(obviously checking that your 'command' looks right before passing it to system)

But can I suggest a different approach? How about instead of embedding cat and parallel you can do this natively in perl.

I'm afaid I'm not entirely familiar with the command you're running, but it'd be something like:

#!/usr/bin/perl

use strict;
use warnings;

open( my $input, "<", "$TMP/prot$$.fa" ) or die $!;

my $fork_manager = Parallel::ForkManager->new($THREADS);

while ( my $line = <$input> ) {
    $fork_manager->start and next;
    chomp $line;
    system(
        "tblastn -db $BLASTDB -query $line \\
                 -word_size 6 -max_target_seqs 5 -seg yes  -lcase_masking \\
                 -outfmt \"7 sseqid sstart send sframe bitscore qseqid\"\\
                 > $TMP/blast$$"
    ) && die "Can't run tblastn\n";
    $fork_manager->finish;
}
close ( $input );

If output coalescing is desirable, I'd probably switch to using threading:

#!/usr/bin/perl

use strict;
use warnings;
use IPC::Open2;
use threads;
use Thread::Queue; 

my $num_threads = 8; 

my $work_q = Thread::Queue -> new(); 
my $results_q = Thread::Queue -> new(); 

sub worker {
    open2 ( my $blast_out, my $blast_in, "tblastn -db $BLASTDB -query - -word_size 6 -outfmt '7 sseqid sstart send sframe bitscore qseqid' -max_target_seqs 5 -seg yes -lcase_masking");
    while ( my $query = $work_q -> dequeue ) {
        print {$blast_in} $query;
        $results_q -> enqueue ( <$blast_out> ); #one line - you'll need something different for multi-line results.
    }
    close ( $blast_out );
    close ( $blast_in ); 
}

sub collate_results {
    open ( my $output, "$TMP/results.$$" ) or die $!; 
    while ( my $result = $results_q -> dequeue ) {
        print {$output} $result,"\n"; 
    }
    close ( $output ); 
}

my @workers; 
for (1..$num_threads) {
    push ( @workers, threads -> create ( \&worker ) ); 
}

my $collator = threads -> create ( \&collate_results ); 

open( my $input, "<", "$TMP/prot$$.fa" ) or die $!;
while ( my $line = <$input> ) {
    chomp $line;
    $work_q -> enqueue ( $line ); 
}
close ( $input );
$work_q -> end;

foreach my $thr ( @workers ) { 
    $thr -> join(); 
}
$results_q -> end;

$collator -> join; 

Now I appreciate that both these may look a little more convoluted and complicated. But they're more examples of how to extent perl into doing parallel, because by doing so you have a lot more scope and flexibility than you would by running perl, but 'shelling out' to do things.

Sobrique
  • 51,581
  • 6
  • 53
  • 97
1

Quoting is a bitch. Quoting twice is a bitch².

First check that what you think is being run is actually being run. Here a print STDERR can do wonders.

In your case I think this will solve it:

my $TMP = $ENV{'TMP'};
my $BLASTDB = $ENV{'BLASTDB'};
my $cmd = qq{cat $TMP/prot$$.fa | parallel --block 50k --recstart '>' --pipe tblastn -db $BLASTDB -query - -word_size 6 -outfmt \\''7 sseqid sstart send sframe bitscore qseqid'\\' -max_target_seqs 5 -seg yes -lcase_masking > $TMP/blast$$};
print STDERR $cmd,"\n"; # Remove this when it works.
system($cmd) && die "Can't run tblastn\n";

If you are going to read $TMP/blast$$ back in and delete it, you might do this instead:

my $TMP = $ENV{'TMP'};
my $BLASTDB = $ENV{'BLASTDB'};
open(my $fh, "-|", qq{cat $TMP/prot$$.fa | parallel --block 50k --recstart '>' --pipe tblastn -db $BLASTDB -query - -word_size 6 -outfmt \\''7 sseqid sstart send sframe bitscore qseqid'\\' -max_target_seqs 5 -seg yes -lcase_masking}) || die "Can't run tblastn\n";
while(<$fh>) { ... }
close $fh;

That will avoid creating a temporary file and if $TMP can be written to by an attacker, this also plugs a security hole. As an added bonus you will be getting data earlier since you will not have to wait until every job is complete.

Ole Tange
  • 27,221
  • 5
  • 71
  • 88