
Hi everyone. I'm trying to filter a big XML file (BLAST output) to keep only the <Iteration> nodes whose <Iteration_iter-num> values appear in a list that I read from a file. Here is a simplified example (the real Blast.xml has more than 80000 iterations):

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
   <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>3037</Iteration_iter-num>
      <Iteration_query-ID>Query_3037</Iteration_query-ID>

    </Iteration>
    <Iteration>
      <Iteration_iter-num>5673</Iteration_iter-num>
      <Iteration_query-ID>Query_5673</Iteration_query-ID>

    </Iteration>
    <Iteration>
      <Iteration_iter-num>11397</Iteration_iter-num>
      <Iteration_query-ID>Query_11397</Iteration_query-ID>

    </Iteration>
    <Iteration>
      <Iteration_iter-num>15739</Iteration_iter-num>
      <Iteration_query-ID>Query_15739</Iteration_query-ID>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

and I have a file with the iterations to keep (saved as keep_iter):

5673
11397

For a small-scale problem like this I managed to do the filtering with xmlstarlet, by first writing the comparison expression to a file (saved as filter):

Iteration_iter-num!=5673 and Iteration_iter-num!=11397
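
For reference, a filter string of this form can be generated from keep_iter rather than typed by hand. The following is only a minimal sketch, assuming one iteration number per line; the awk one-liner is an illustration and not necessarily how the file above was produced, and the printf line just recreates the sample keep_iter from this example.

```shell
# Sample keep_iter from the example above (one iteration number per line).
printf '5673\n11397\n' > keep_iter

# Turn each non-empty line into "Iteration_iter-num!=N" and join the
# pieces with " and " so the result can be used as an XPath predicate.
awk 'NF { printf "%s%s%s", sep, "Iteration_iter-num!=", $1; sep = " and " }
     END { print "" }' keep_iter > filter

cat filter
```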

This works like a charm with:

cat Blast.xml | xmlstarlet ed -d "/BlastOutput/BlastOutput_iterations/Iteration[`cat filter`]" > finalBlast.xml

Basically, I removed all the Iteration nodes whose number is not listed in keep_iter, obtaining:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
   <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>5673</Iteration_iter-num>
      <Iteration_query-ID>Query_5673</Iteration_query-ID>

    </Iteration>
    <Iteration>
      <Iteration_iter-num>11397</Iteration_iter-num>
      <Iteration_query-ID>Query_11397</Iteration_query-ID>

    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>

The problem is that my real keep_iter file has 20000 values. When I create the filter file from it and run the xmlstarlet command above, the resulting argument is, unsurprisingly, too long.

Any suggestions for filtering such a Blast.xml file so that it keeps only the Iteration nodes whose iteration number is listed in keep_iter (20k values)? I want to keep the original XML structure.

anpefi

1 Answer

For a file this large I would consider a streaming approach, for example with Perl's XML::Twig:

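#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

# Load the iteration numbers to keep into a hash for fast lookups.
my %keep;
open(my $keep_fh, '<', 'keep_iter') or die "Couldn't open keep_iter: $!";
while (my $line = <$keep_fh>) {
  chomp $line;
  $keep{$line} = 1;
}
close($keep_fh);

# Build trees only for <Iteration> elements; print everything else as-is.
my $t = XML::Twig->new(
  twig_roots => { 'Iteration' => \&process_iter },
  twig_print_outside_roots => 1,
  keep_spaces => 1,
);

$t->parsefile('Blast.xml');

# Called once for each complete <Iteration> element.
sub process_iter {
  my ($t, $iter) = @_;
  if ($keep{$iter->first_child_text('Iteration_iter-num')}) {
    $t->flush; # if it was in keep_iter, keep it
  } else {
    $t->purge; # otherwise don't
  }
}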
Ian Roberts
  • Really useful script! It is much simpler than I thought using Perl. My knowledge of Perl is very limited, which is why I was exploring a solution within bash and xmlstarlet. Now I will read about XML::Twig to fully understand the script. Thanks for the quick and useful answer. – anpefi Oct 08 '13 at 13:21
  • @anpefi it's a useful little module. The combination of twig_roots and twig_print_outside_roots means that it works its way through the document printing out anything that's not inside an `Iteration` element. When it finds an `<Iteration>` start tag it parses everything up to the matching closing tag into a tree structure and passes that to the specified handler function, where you can manipulate the tree and either `flush` it (if you want to print the result) or `purge` it (if you don't). Then it throws away that tree and starts on the next one, so you can handle huge documents efficiently. – Ian Roberts Oct 08 '13 at 14:11
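
For readers who want to stay in the shell, the same print-outside, buffer-inside, then flush-or-purge idea can be roughly imitated with awk. This is only a line-oriented sketch and not a real XML parser: it assumes each tag sits on its own line, as in the sample above, and the file names sample_blast.xml, sample_keep and sample_filtered.xml are hypothetical stand-ins. XML::Twig remains the robust option.

```shell
# Hypothetical sample files so the sketch runs on its own.
cat > sample_blast.xml <<'EOF'
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>3037</Iteration_iter-num>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>5673</Iteration_iter-num>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
EOF
printf '5673\n' > sample_keep

awk '
  # First file: load the iteration numbers to keep (must not be empty).
  NR == FNR { if (NF) keep[$1] = 1; next }

  # An <Iteration> start tag opens a new buffer.
  /<Iteration>/ { inside = 1; buf = ""; wanted = 0 }

  inside {
    buf = buf $0 "\n"
    # Pull the digits out of the iter-num line and look them up.
    if ($0 ~ /<Iteration_iter-num>/) {
      num = $0
      gsub(/[^0-9]/, "", num)
      if (num in keep) wanted = 1
    }
    # At the closing tag, print the buffer (flush) or drop it (purge).
    if (/<\/Iteration>/) {
      if (wanted) printf "%s", buf
      inside = 0
    }
    next
  }

  # Anything outside an Iteration block is passed through unchanged.
  { print }
' sample_keep sample_blast.xml > sample_filtered.xml

cat sample_filtered.xml
```

Because the keep list lives in an awk array and only one Iteration block is buffered at a time, the size of keep_iter never ends up on the command line.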