0

I have previously used awk commands to extract fasta sequence data based on a separate file of header IDs. However, these are not working for the specific example below.

Input fasta sequence file (seq.fasta)

>106677020 product=phosphatidylinositol 3-kinase catalytic subunit type 3-like
tttaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttgtatactcCCTTCAGTGGCTAAAACCTCAGCTACAGGCAGCGAATCAATGAACTGTAGAAAACCATGTTTAGTGCTAGTGGCAAGAACCCTATATGGTGTAAGTTTCAAGTCCAGGTTTTCACTACGCAATAATTTATCCATCAGAGTGATTATTTGGAGAATAAGTTGATCCTGCCTTAAATCATccccatttttgaaaattgctACATATTCTGTACCAGCTGTGGTTACAAATGTTAATCTTGCAGGCATTAATGCTGACTTGAAAAGTGTAGCTTTCTCAGGTATGATgccttttacatatatatttgaatcTAAAGGAAGCGGTAGTGGTTCGAAgtttgagaaattaattttaaatgtatcagTGTCTGCTAGCAATGCCTGAAGTctatccattttcttttttctattcccACTTTCTCTGGCAATAGTTTTCATCAATTTCACtaatttatcaacaaaattttgttgacgtttaataaaattttgacgaTTTTGCCATTCTGGTGGTCCATATTGTAAAGCTTGCAAAAACATCTTCATAACTTTAAGGTACATGCTGCGAGGTTTATTATCTTGCTTAGCTAAATCATGGTCTTCACATTCAGTTAGCAggtaccaataaaaataattggccAATGTACTGTTTTGACATGCCCTATGAATTAAGAAGGAAGCAAGGTCCATTGTTTTGGAATCTTCTACTGAACTCTCTGTTTTAATGTCAACTGATCCAGAAATactttcttcaataattttagtttgagACGATTTTTTGTAGGCTTCTGAAATATCCtcgaaattttcatatttcagagCTTGCACTAATTGTAAAAGATAAAGGAGTAAATCTTCATCTGGAGCCTTTTGTAGTCTGGTAACAGCGTACCTTCGAACTGATGGGTGGGTAAAATAAGGGCTTAATAACTGAAGGGCATCTTCTACTTCCATCGGTGACCACACATGTAACATATGTATCGCTTGTTTCACTTCACCACTTAGGCTCCAATTCACACATTTTAAGAACTTGGTCAAAGCTTTCTTCTGAGAGCTCAAATAAAATCTGAATTTCCACAGTAGATCTTGCTCTTCAGTAGTAAGTTGTTGCGTAGGAGGATAGGCTACTATCCGATTTAAAGTATCTCTAACTGTAGCATTAGGTTTTAAATCTTTATCTGATAATCCACTGCGCCAACTTCGAGCTAAATTATGATGTTTACTTTCCACTAAATTTTCTTGAAGTATTTCTGGGTCATGGACTGTTACAATATCAGGATGGGCATGGAATTGAAAAACTTCATCTCCATCTTGTTCAAACCAAACTATAGAATAGGGTGTCCCATTGACAGTAGCCTGAGGGAATTCAATCATTAGGtataaatattcagaaattcttttctctttatcatttattttctcaatttcacGGAAGGTCAACCTATCCAACCAATCAATGACAGGCATAAACCCATTTCTATGTTTCTTCGCCAATTTTGCAAGTCTTTGCATATTTTCATTACCTGTACCTGGAGTTTTTCCAGGCGTAGTACAATTTTCTGATCCATCAGCAACAACATCAGGCCACACTTTTAAATCCAACATACCTTGACGAAAAACATTATGCTTACCAAACAATGTAATTGAAGAACCTCCAACTGGTCTCATGGTAGATGGGCCAATACAATCATAAATGGTGATTGCCAATATTGCATTACGTGGCAGGTCAGAATACATTATAGGTAGTGTTAACCACTCGCTCCATGTCCAGCGATTTGTAAAATTCTTGTAAGAAGTCAG
>106677021 product=putative transferase CAF17 homolog%2C mitochondrial
AAAGTGTAtacttaatagttaattttaaagtagTAGACAATTTAAGCTAATCGTGATTTACTTGACATTTCTTGATTATAGCTTAAGTTGctctattgtttataatataattggatgaaagtaattatgatataataatttggaagattctgttaatattttgattttaaattaaggatAATGTTGAGATACCCAGTTTCATGTACATTTTTAGGCTATATCAAAAGAAGTATTTTGCTCAGTTGCAGATACAACCACAGTAAAACCGATTTTCGTTTGGAAGAACTCAATCATCGAAAAATTGTAAGATTGTCTGGAGAAGAATCTTCTAACTTTTTGCAAGGTTTGGTCACTAATGATGTCAATAATATCTCTTCTTCAATGTACAccatgtttttaaacaataggGGGCGGATTTTATTTGACTCGATTATCTACCCTGCTAAAGAGAAAGATACCTTCCTGTTGGAGTGTGATTCTCAAGCTATGCACCAATTAATCAAACATCTAAATATGTATaagctcagaaaaaaaattgcaatctCTCTAGCTTCTGAATTGAATGTATGGTGTATTTACAATCCTAAGCTTGTTGATAATTCAAATGAGGCAAAAGTTTCTTCTACCGAGACATTTGATATGAATGctgttgataaaaatttaatgattacacCCGACCCTCGAACTAATTTGTTAGGTTACCGTATTATTGCCAAAGAAGGTGATGAAATACCTAATTTGCCTAAAAGTGATTTGTATACACTATGCCGATATAAATTAGGCATTGGTGAAGGCATTGACGAACTTCTCTTCGAACAAAGTTTTCCTCTTGAAATGAACTGTGATTACCTTAATGGTGTATCTTTTAACAAGGGATGTTACATTGGACAAGAACTGACTGCAAGATCATTTCATACTGGTGTAATTAGGAAACGGTTAATGCCTCTTATTTTTGAGTCTGAGGCACTCGGTATTCCAATAAATACACCTATCGAAGATCCTAATATTACTAGAAAGTCACCCATCGGAAAAGTTAGGACTGTTAAAGGTGTGAATGGTATTGGGTTGATGCGTGTATCCGAAACAATAGAatctaaatctttaaaaataattaattttatggcaAGATCTTATATTCCAGGTTGGTGGCCTGAAGAAACTGTAGAGCAAAcatatgtcaaaataaaaaaataatttattgattatttgta
>106677022 product=uncharacterized LOC106677022
TACAGTTTAAATAGGAGGCAATCTAGTTCCAACGGTCGCAGTACCCCGCCTAGACAAACCCACACCGTTGCCAACATGAAATTCGCTATT
GTTTGCCTGTTGGCAGTCAGTGCAGTGAGCGCCTCTCGCTACAGGAGGTCCCTCGTCGGATGGCCGCTTGGTCTAGCTAGCCACGGAGCGGTCGCTGTAGGACTCTCCCATCCTGGAGCAGTGGCCGTTGGCCTGTCCCATCCAGGAGCAGTAGCCATTGGACCTTCCCACACCGGGTCTGTAGCTGTAGGACCATCACATACCGGATCCATTGCTGTCGGACCATCCCACACAGGATCGATCGCCGTTGGACCTTCCCATACTGGATCAATAGCTGTCGGACCATCCCATACCGGATCAGTAGCTATCGGACCATCTCACACCGGGGCTGTCGTCGCTCCAGGTGTGGTCTTAGCAGCTCCCGCCATTGCTGCACCCCTCATCGCTCCAGTGGCTCCAGCCCTTGCTTTTGGACCCCATGTTGGTCTCCTTGGACTTCATGGAATCCATGGTTAGCTGtctcaaattaattaacattaactaataaagtaaaattttatgacaAATATTCTGCCAAATCTGTTACGTTTGTCTTATGTACAAGtcttgtaaattttagtaaataaatataatcatgtaTCAGTACTACCCAATTATGACAAATACGCCAatataaacaatgta

Input ID file (id.txt)

106677020
106677022

Expected output (output.fasta)

>106677020 product=phosphatidylinositol 3-kinase catalytic subunit type 3-like
tttaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttgtatactcCCTTCAGTGGCTAAAACCTCAGCTACAGGCAGCGAATCAATGAACTGTAGAAAACCATGTTTAGTGCTAGTGGCAAGAACCCTATATGGTGTAAGTTTCAAGTCCAGGTTTTCACTACGCAATAATTTATCCATCAGAGTGATTATTTGGAGAATAAGTTGATCCTGCCTTAAATCATccccatttttgaaaattgctACATATTCTGTACCAGCTGTGGTTACAAATGTTAATCTTGCAGGCATTAATGCTGACTTGAAAAGTGTAGCTTTCTCAGGTATGATgccttttacatatatatttgaatcTAAAGGAAGCGGTAGTGGTTCGAAgtttgagaaattaattttaaatgtatcagTGTCTGCTAGCAATGCCTGAAGTctatccattttcttttttctattcccACTTTCTCTGGCAATAGTTTTCATCAATTTCACtaatttatcaacaaaattttgttgacgtttaataaaattttgacgaTTTTGCCATTCTGGTGGTCCATATTGTAAAGCTTGCAAAAACATCTTCATAACTTTAAGGTACATGCTGCGAGGTTTATTATCTTGCTTAGCTAAATCATGGTCTTCACATTCAGTTAGCAggtaccaataaaaataattggccAATGTACTGTTTTGACATGCCCTATGAATTAAGAAGGAAGCAAGGTCCATTGTTTTGGAATCTTCTACTGAACTCTCTGTTTTAATGTCAACTGATCCAGAAATactttcttcaataattttagtttgagACGATTTTTTGTAGGCTTCTGAAATATCCtcgaaattttcatatttcagagCTTGCACTAATTGTAAAAGATAAAGGAGTAAATCTTCATCTGGAGCCTTTTGTAGTCTGGTAACAGCGTACCTTCGAACTGATGGGTGGGTAAAATAAGGGCTTAATAACTGAAGGGCATCTTCTACTTCCATCGGTGACCACACATGTAACATATGTATCGCTTGTTTCACTTCACCACTTAGGCTCCAATTCACACATTTTAAGAACTTGGTCAAAGCTTTCTTCTGAGAGCTCAAATAAAATCTGAATTTCCACAGTAGATCTTGCTCTTCAGTAGTAAGTTGTTGCGTAGGAGGATAGGCTACTATCCGATTTAAAGTATCTCTAACTGTAGCATTAGGTTTTAAATCTTTATCTGATAATCCACTGCGCCAACTTCGAGCTAAATTATGATGTTTACTTTCCACTAAATTTTCTTGAAGTATTTCTGGGTCATGGACTGTTACAATATCAGGATGGGCATGGAATTGAAAAACTTCATCTCCATCTTGTTCAAACCAAACTATAGAATAGGGTGTCCCATTGACAGTAGCCTGAGGGAATTCAATCATTAGGtataaatattcagaaattcttttctctttatcatttattttctcaatttcacGGAAGGTCAACCTATCCAACCAATCAATGACAGGCATAAACCCATTTCTATGTTTCTTCGCCAATTTTGCAAGTCTTTGCATATTTTCATTACCTGTACCTGGAGTTTTTCCAGGCGTAGTACAATTTTCTGATCCATCAGCAACAACATCAGGCCACACTTTTAAATCCAACATACCTTGACGAAAAACATTATGCTTACCAAACAATGTAATTGAAGAACCTCCAACTGGTCTCATGGTAGATGGGCCAATACAATCATAAATGGTGATTGCCAATATTGCATTACGTGGCAGGTCAGAATACATTATAGGTAGTGTTAACCACTCGCTCCATGTCCAGCGATTTGTAAAATTCTTGTAAGAAGTCAG
>106677022 product=uncharacterized LOC106677022
TACAGTTTAAATAGGAGGCAATCTAGTTCCAACGGTCGCAGTACCCCGCCTAGACAAACCCACACCGTTGCCAACATGAAATTCGCTATT
GTTTGCCTGTTGGCAGTCAGTGCAGTGAGCGCCTCTCGCTACAGGAGGTCCCTCGTCGGATGGCCGCTTGGTCTAGCTAGCCACGGAGCGGTCGCTGTAGGACTCTCCCATCCTGGAGCAGTGGCCGTTGGCCTGTCCCATCCAGGAGCAGTAGCCATTGGACCTTCCCACACCGGGTCTGTAGCTGTAGGACCATCACATACCGGATCCATTGCTGTCGGACCATCCCACACAGGATCGATCGCCGTTGGACCTTCCCATACTGGATCAATAGCTGTCGGACCATCCCATACCGGATCAGTAGCTATCGGACCATCTCACACCGGGGCTGTCGTCGCTCCAGGTGTGGTCTTAGCAGCTCCCGCCATTGCTGCACCCCTCATCGCTCCAGTGGCTCCAGCCCTTGCTTTTGGACCCCATGTTGGTCTCCTTGGACTTCATGGAATCCATGGTTAGCTGtctcaaattaattaacattaactaataaagtaaaattttatgacaAATATTCTGCCAAATCTGTTACGTTTGTCTTATGTACAAGtcttgtaaattttagtaaataaatataatcatgtaTCAGTACTACCCAATTATGACAAATACGCCAatataaacaatgta

I have tried all of the following awk commands, including commands that assume sequence data, most of which come from other posts that have sought to do the same thing. The first command has always worked for me in the past, but it's not clear why it is not in this particular instance; all I get is an empty output file:

awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta > output.fasta
awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' id.txt seq.fasta > output.fasta
awk -F'>' 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' id.txt seq.fasta > output.fasta
  • `/^>/` and `f=($1 in ids)` will not work - your FS is `-F'>'`, so `$1` is everything between `>` and the next `>`. As there is only one `>` on the line, everything is `$1`. Don't trust me, test it out - `print $1`. – KamilCuk Oct 28 '19 at 12:42

1 Answers1

1

You were close! What you want to do, is to set the record separator RS to > before reading the second file, but after reading the first file.

Something like this:

awk 'NR==FNR{ids[$0]; next} ($1 in ids){ printf ">" $0 }' id.txt RS='>' seq.fasta

Because the > is "eaten up" by awk when splitting records, I need to print it explicitly with printf ">" later. Also, I don't print the newlines, as they stay in the buffer.

KamilCuk
  • 69,546
  • 5
  • 27
  • 60