
I have a file where I'm looking for a pattern "N" on only the even-numbered lines. When a line matches, I want to keep the context -- the odd-numbered line above it.

I understand how to keep the context using -A, -B, -C but the pattern "N" will also possibly match the odd-numbered lines, so the only way I can think of solving the problem is by separating the even and odd lines before using grep, thus removing the context.

Is there a way to do this without having to extract the line numbers that grep matches and then pulling those specific lines from the file after the fact? I suspect I might be able to do it with awk, but I'm not sure.

I'm trying to optimize code that I believe already works, because the files it will process are huge and a run takes hours.


I'm trying to find the DNA sequences that contain "N"s and put them in one file, and put the sequences that don't contain "N"s in another file. The ID lines can also contain "N"s, however. In the new files, I want each ID line to stay attached to its sequence, on the line directly above it.

Sample Input:

>100000|NODE_2_length_277_cov_4.245487
ATCTTTTAACCCCAAAAACTCAAGTATGTGAGCCAAGTGAACATAACTGCATAAATATCAGGCTCCAAAATAATCTACTGCTTGTTGTGTAGATATAGAGCACACAATTTCTTTTTTAAAGCCCTCCCTTTCACTCTCTCTATCCCACACCCAGAAAAACTCCTATTTAGAGAAAGCCACACCTATCACTAAGAGCAAACCAACCTTTCAAAAAAAAAAAAAAAACACATTAGGAGCAAACTGTTAGGAGCCATTCAAAACCAAAGGAAATGCCAAGACACACACACACACACACACACACAC
>100001|NODE_1_length_426_cov_11.427230
AAATATATAAAAAACCTGTGTTGTGACAACAGGTTGAGAAGTAATGAGAAAATGGACGAATTAGTTCAGGATGTCTCAAAGCAGATTTCTTTCCACTTAATCTCGATGTCCTACGAAAATGCTGACTTAGGTTGTAGTTTATGTTTCTTAGATTCCAATATTTTAAAATGGCCCTTGAAATTATATTAAAAAGCTCATGAACAAGTGCATAATCAATGATAAATGAATATTTATGGTTGAGATTTGGGAATTATTAATCAATATACCTCTATACTCTTGGCTCTCTTGAAGTTTAATTCAAGTGTATTTAATTAGATTCCTACCCCAAATCAACTTTAAGAAGGCTGCTTTTCTTCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCG
Community

2 Answers


With awk:

seq 10 | 
awk -v pattern='[26]' '
  FNR % 2 == 1 {odd = $0}
  FNR % 2 == 0 && $0 ~ pattern {print odd; print}
'

Output:

1
2
5
6

With your sample input:

awk '
  # odd line: remember the ID
  FNR % 2 == 1 {odd = $0}
  # even line: choose the output file by whether the sequence contains N
  FNR % 2 == 0 {
    if (/N/) 
      file = FILENAME ".with_N"
    else 
      file = FILENAME ".no_N"
    print odd > file
    print     > file
  }
' myfile
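
For a quick sanity check, here is a minimal sketch of the same split run on a tiny made-up file. The file name demo.fa, the temporary directory, and the two records are assumptions for illustration only; the ternary is just a compact way of writing the if/else above. Note that the first ID contains an "N" but its sequence does not, so that record should land in the .no_N file.

```shell
# Hypothetical demo data: two records, each an ID line followed by a sequence line.
tmpdir=$(mktemp -d)
f="$tmpdir/demo.fa"
printf '%s\n' '>id1|NODE_1' 'ACGT' '>id2|NODE_2' 'ACNNGT' > "$f"

awk '
  FNR % 2 == 1 { id = $0; next }                  # odd line: remember the ID, test nothing
  {
    out = FILENAME (/N/ ? ".with_N" : ".no_N")    # even line: pick the file by sequence content
    print id > out
    print    > out
  }
' "$f"

echo "with N:";    cat "$f.with_N"
echo "without N:"; cat "$f.no_N"
```

Because the ID line is only stored and never matched against /N/, an "N" in the ID has no effect on which file the record goes to.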
glenn jackman

Another solution, with fewer keystrokes:

awk '!(NR%2) && /N/ {print p; print}{p=$0}'

The !(NR%2) idiom selects the even-numbered lines. The second block saves every line in p unconditionally, which is harmless because p is printed only when an even-numbered line matches.
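
To see what the one-liner does, here is a sketch with the two rules spelled out and fed a hypothetical four-line input through a pipe (the records are made up for illustration). The first ID line contains an "N" but is odd-numbered, so it is never tested; only the second record should come out.

```shell
# Hypothetical input: odd lines are IDs, even lines are sequences.
result=$(printf '%s\n' '>N1' 'ACGT' '>id2' 'ANNT' |
  awk '
    !(NR % 2) && /N/ { print p; print }   # even line containing N: print the saved line, then this one
    { p = $0 }                            # runs for every line, after the test above
  ')
printf '%s\n' "$result"
```

On a real file you would pass the file name as the last argument instead of piping input in.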

karakfa
  • You can even get rid of ``$0 ~ ``. A performance note: put the cheapest, most decisive condition first in an ``&&`` test, as you did. Now the slightly more expensive ``/N/`` does not need to be evaluated for every line. – joepd Jul 15 '15 at 06:43
  • Thanks! On a small sample test run, it brought the run time of my original code down from ~5 sec to ~0.025 sec!! – Russell Miller Jul 15 '15 at 16:48