13

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.

kvantour
  • 20,742
  • 4
  • 38
  • 51
cucurbit
  • 987
  • 1
  • 10
  • 28

3 Answers3

19

An awk / gawk solution can be composed by three stages:

  1. Every time header is found these actions should be performed:

    • Print previous seqlen if exists.
    • Print tag.
    • Initialize seqlen.
  2. For the sequence lines we just need to accumulate totals.
  3. Finally at the END stage we print the remnant seqlen.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists 
         print seqlen
         }

         # pring the tag 
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa

A oneliner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

For the totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa
Juan Diego Godoy Robles
  • 12,742
  • 2
  • 36
  • 47
  • Yeah, this works. But also I need a "final line" with the total of sequences and the total length, following the example: 3 sequences, total length 127. (Sorry, in the question this is behind a # ) – cucurbit Jun 02 '14 at 10:57
  • Just minor formatting change. `awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}'` – Jotne Jun 02 '14 at 10:57
  • A very old topic, but it is just what I need! Is it also possible to adjust the one liner awk command that it prints the number of nucleotides on the same line as the header, and not on a new line, like: >header1 60 – Gravel May 17 '17 at 10:37
  • Sure @Gravel just place `; printf $0" " ;` instead of `; print ;` – Juan Diego Godoy Robles May 22 '17 at 14:29
  • this will include newline characters in the count – Brian Wiley Sep 08 '20 at 06:45
  • sorry maybe its counting carriage return. just insert `BEGIN{RS="\r\n"}` at the beginning – Brian Wiley Sep 08 '20 at 07:08
1

A quick way with any awk, would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

kvantour
  • 20,742
  • 4
  • 38
  • 51
0

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line, It's no longer a one-liner, so the downside is you'll have to save it as a script file.

It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.

BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}
Nick S
  • 349
  • 3
  • 14