
I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA

I'd like to convert it into this:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between each entry, so the file looks like this:

>accession1
ATGGCCCATGGGATCCTAGC

>accession2
GATATCCATGAAACGGCTTA

I'm an awk noob, but I took a shot at modifying the command. My guess was the if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:

awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta

However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:

{empty line} 
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Anyone have a solution to get my file in the correct format? Thanks!
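For context on the suspicion above: awk's `print` appends the output record separator (ORS, a newline by default) to whatever it prints, so `print "\n"` emits two newlines, whereas `printf "\n"` emits exactly one. Python's `print` behaves analogously, so the effect can be sketched there. This is only an analogy; it does not run the awk program, and the helper name `emitted` is made up for illustration.

```python
import io


def emitted(*args, **kwargs):
    """Return exactly what print() would write for the given arguments."""
    buf = io.StringIO()
    print(*args, file=buf, **kwargs)
    return buf.getvalue()


# Comparable to awk's `print "\n"`: the explicit newline plus the automatic one.
two_newlines = emitted("\n")

# Comparable to awk's `printf "\n"`: the automatic terminator is suppressed.
one_newline = emitted("\n", end="")

print(repr(two_newlines), repr(one_newline))
```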

Timur Shtatland
chimeric

9 Answers


This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.
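The same deferred-newline idea can be sketched in Python for readers more at home there. The function name `unwrap_lines` and the variable `pending` (which plays the role of `n`) are illustrative choices, not part of the awk program.

```python
def unwrap_lines(lines):
    """Yield output chunks that join wrapped sequence lines.

    A newline is "owed" after sequence data and is paid out either
    before the next header or at the end of the input.
    """
    pending = ""  # empty until sequence data has been seen
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Header: pay out any owed newline, then print the header line.
            yield pending + line + "\n"
            pending = ""
        else:
            # Sequence data: print without a line break, remember we owe one.
            yield line
            pending = "\n"
    yield pending  # final newline, if one is still owed


wrapped = [
    ">accession1\n", "ATGGCCCATG\n", "GGATCCTAGC\n",
    ">accession2\n", "GATATCCATG\n", "AAACGGCTTA\n",
]
result = "".join(unwrap_lines(wrapped))
print(result, end="")
```

Because `pending` starts out empty, nothing extra is printed before the first header, which is the Python counterpart of relying on awk's uninitialized variables.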

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide

Johnsyweb
  • if the input ends with line `>foobar`, the script will print an extra empty line. (it may not be the case, however). and `printf $0` is not safe if the line contains printf format string. – Kent Apr 06 '13 at 23:47
  • awesome, totally works. thanks!! just curious -- for the very first line of the fasta file, i would have expected it to throw an error when it reads `/^>/ { print n $0 }`, because n doesn't exist yet. however, it doesn't seem to care that n doesn't exist. why is this? – chimeric Apr 06 '13 at 23:53
  • @chimeric: I've added a note to address the "uninitialised" variable, I hope that helps. – Johnsyweb Apr 07 '13 at 03:16

The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file

Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file
Steve

Here is another awk one-liner that should work for your case:

awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
Kent

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

Create a label, a. If the current line is not the last line in the file, append the next line to the pattern space. If the pattern space doesn't start with the character >, perform the substitution s/\n\([^>]\)/\1/, which removes a newline that is followed by any character other than >. If the substitution was successful since the last input line was read, branch to label a. Otherwise, print up to the first embedded newline of the current pattern space. Finally, if the pattern space contains no newline, start a normal new cycle as if the d command was issued; otherwise, delete text in the pattern space up to the first newline and restart the cycle with the resultant pattern space, without reading a new line of input.
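For comparison, the substitution at the heart of this sed script can be approximated with a single regular expression in Python. This is a sketch, not a translation of the sed control flow: it assumes the whole file fits in memory, and the names `JOIN_WRAPPED` and `unwrap_text` are made up for illustration.

```python
import re

# A line that does not start with ">" and whose following line does not
# start with ">" either: capture the line and drop the newline that ends it.
JOIN_WRAPPED = re.compile(r"^([^>\n].*)\n(?=[^>])", re.MULTILINE)


def unwrap_text(text):
    """Join wrapped sequence lines in FASTA text held in a single string."""
    return JOIN_WRAPPED.sub(r"\1", text)


fasta = (
    ">accession1\nATGGCCCATG\nGGATCCTAGC\n"
    ">accession2\nGATATCCATG\nAAACGGCTTA\n"
)
unwrapped = unwrap_text(fasta)
print(unwrapped, end="")
```

The lookahead requires an actual character after the newline, so the final newline of the file is left in place, just as the sed substitution needs a character to match `[^>]`.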

Steve

You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files:

bioawk -c fastx '{ gsub(/\n/,"",$seq); print ">"$name; print $seq }' file.fasta

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

kvantour

Another variation :-)

awk '!/>/{printf( "%s", $0);next}
     NR>1{printf( "\n")} 
     END {printf"\n"}
     7' YourFile
NeronLeVelu

Use this Perl one-liner, which does all of the common reformatting that is necessary in this and similar cases: removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. Note that unlike some of the other answers, this properly handles leading and trailing whitespace/newlines in the file:

# Create the input for testing:

cat > test_unwrap_in.fa <<EOF

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

EOF

# Reformat with Perl:

perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' test_unwrap_in.fa > test_unwrap_out.fa

Output:

>seq1 with blanks
ACGTACGTACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.

chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
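For readers who prefer Python, roughly the same logic might look like the sketch below. The function name `unwrap_fasta` is hypothetical, and unlike the Perl one-liner it builds the result in memory instead of streaming line by line.

```python
def unwrap_fasta(text):
    """Join wrapped sequence lines and strip all whitespace from sequences.

    Header lines (those starting with ">") are kept as they are.
    """
    out = []
    seen_header = False  # counterpart of Perl's $n
    for line in text.splitlines():
        if line.startswith(">"):
            if seen_header:
                out.append("\n")  # terminate the previous sequence
            out.append(line + "\n")
            seen_header = True
        else:
            # Remove every blank inside the sequence; empty lines add nothing.
            out.append("".join(line.split()))
    out.append("\n")  # final newline after the last sequence
    return "".join(out)


test_input = (
    "\n>seq1 with blanks\nACGT ACGT ACGT\n"
    ">seq2 with newlines\nACGT\n\nACGT\n\nACGT\n\n"
    ">seq3 without blanks or newlines\nACGTACGTACGT\n\n"
)
unwrapped = unwrap_fasta(test_input)
print(unwrapped, end="")
```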

Timur Shtatland

Do not reinvent the wheel. If the goal is simply to remove newlines in a multi-line FASTA file (that is, to unwrap it), use one of the specialized bioinformatics tools, for example seqtk, like so:

seqtk seq -l 0 input_file

Example:

# Create the input for testing:

cat > test_unwrap_in.fa <<EOF

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGT

ACGT

ACGT

>seq3 without blanks or newlines
ACGTACGTACGT

EOF

# Unwrap lines:

seqtk seq -l 0 test_unwrap_in.fa > test_unwrap_out.fa

cat test_unwrap_out.fa

Output:

>seq1 with blanks
ACGT ACGT ACGT
>seq2 with newlines
ACGTACGTACGT
>seq3 without blanks or newlines
ACGTACGTACGT

To install seqtk, you can use, for example, conda install seqtk.

SEE ALSO:

seqtk usage:

seqtk seq

Usage:   seqtk seq [options] <in.fq>|<in.fa>

Options: ...
         -l INT    number of residues per line; 0 for 2^32-1 [0]
Timur Shtatland

There have been great responses so far.

Here is an efficient way to do this in Python:

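def read_fasta(fasta):
    """Return (headers, sequences) with each sequence joined onto one line."""
    with open(fasta, 'r') as fast:
        headers, sequences = [], []
        for line in fast:
            if line.startswith('>'):
                # Drop only the leading '>', keeping any '>' inside the header text.
                head = line[1:].strip()
                headers.append(head)
                sequences.append('')
            else:
                seq = line.strip()
                # Skip blank lines and any text that precedes the first header.
                if seq and sequences:
                    sequences[-1] += seq
    return (headers, sequences)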


def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')

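You can use the above functions to retrieve sequences/headers from a fasta file without line breaks, manipulate them, and write them back to a fasta file. In the example below, do_something is a placeholder for whatever processing you need, and note that the last line overwrites input.fasta: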

headers, sequences = read_fasta('input.fasta')
new_headers = do_something(headers)
new_sequences = do_something(sequences)
write_fasta(new_headers, new_sequences, 'input.fasta')
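If the files are large, a generator that yields one record at a time avoids holding every sequence in memory and avoids repeated string concatenation. The following is a sketch under that assumption; `iter_fasta` is a name invented here, not part of the functions above.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # collect pieces; join once per record
    if header is not None:
        yield header, "".join(chunks)


# Demonstration on an in-memory file; a handle from open() works the same way.
demo = io.StringIO(
    ">accession1\nATGGCCCATG\nGGATCCTAGC\n"
    ">accession2\nGATATCCATG\nAAACGGCTTA\n"
)
records = list(iter_fasta(demo))
for name, seq in records:
    print(f">{name}\n{seq}")
```

Collecting the pieces in a list and joining them once per record is a common Python idiom for building long strings.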
JafetGado