0

I am trying to remove certain columns from a text file on lines that match a string, while leaving the rest of the lines untouched.

Say I have a file (thousands of lines in reality):

10 12 a
USA John TGCAGG
USA John TGCATG
5 2 b
CAN Tom TGCACG
CAN Tom TGCAAC
....

And I want to create a new file that removes the 2nd column in lines that contain TGCA but leaves all other lines intact. I would like to see:

10 12 a
USA TGCAGG
USA TGCATG
5 2 b
CAN TGCACG
CAN TGCAAC

I can modify which columns print on lines that match, using a regexp to start awk or sed, but I can't get the other (unmodified) lines to print, or preserve the order of those lines.

Do I need to use an if statement in awk? I tried using next, but I don't think I have that right.

fedorqui 'SO stop harming'
LP_640

2 Answers

4

I would say:

$ awk '/TGCA/ {$2=$3; NF--} 1' file
10 12 a
USA TGCAGG
USA TGCATG
5 2 b
CAN TGCACG
CAN TGCAAC

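That is: when a line contains TGCA, copy the 3rd column into the 2nd and decrement NF, which drops the now-duplicated last field. The net effect is that the 2nd column is removed. The trailing 1 is an always-true condition, so every line, modified or not, is printed in its original order.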
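For lines with more than three fields, the same shift-and-shrink idea can be generalized. The following is only a minimal sketch, not part of the original answer: the function name `drop_col`, the variable `col`, and the four-column sample data are made up for illustration, and it assumes an awk (such as gawk or mawk) that rebuilds the record when `NF` is decreased.

```shell
# Hypothetical helper: remove field number $1 from lines containing TGCA,
# reading from standard input. Other lines pass through unchanged.
drop_col() {
    awk -v col="$1" '
        /TGCA/ {
            # Shift every field after col one position to the left...
            for (i = col; i < NF; i++) $i = $(i + 1)
            # ...then drop the duplicated last field.
            NF--
        }
        1   # always-true pattern: print every line, modified or not
    '
}

# Made-up sample with four columns on the matching lines.
sample=$(printf '%s\n' '10 12 a' 'USA John TGCAGG x1' 'CAN Tom TGCACG x2')

printf '%s\n' "$sample" | drop_col 2
```

With `drop_col 2` this keeps columns 1, 3 and 4. Note that awk rejoins modified lines with single spaces (the default `OFS`), so the original spacing on those lines is not preserved.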

fedorqui 'SO stop harming'
  • I don't quite follow how NF was used here or how to set the desired number of fields... What if there were 4 columns and I wanted to keep 3 and 4? Or let's say I wanted to keep columns 1, 3 and 4 (but not 2)? Is there a more general way to do this (or explain the code) without reducing fields? – LP_640 Apr 01 '15 at 18:46
  • It is quite tricky to delete a column in awk. To prevent getting trailing spaces I would go for something like this: [how to remove the first two columns in a file using shell (awk, sed, whatever)](http://stackoverflow.com/a/14715189/1983854) – fedorqui 'SO stop harming' Apr 01 '15 at 18:51
  • So if you can afford to just decrease `NF`, it is fine. If it has to be more complex, I would still do some kind of column movement and finally decrease `NF` anyway. – fedorqui 'SO stop harming' Apr 01 '15 at 18:56
  • Ok. This will print fields 3 and 4 on lines that match TGCA: `awk '/TGCA/{print substr($0, index($0, $3))}'`. But only those lines are printed. Any idea how to get the other lines as well? Is the 1 at the end of the awk routine the important piece of code? – LP_640 Apr 01 '15 at 19:16
  • @LP_640 `1` stands for "True", so it prints the line anyway. If you have already printed the line, you can for example skip it with `next`: `awk '/TGCA/{print substr($0, index($0, $3)); next} 1'`. – fedorqui 'SO stop harming' Apr 01 '15 at 20:11
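
To illustrate the role of `1` and `next` discussed in these comments, here is a small sketch comparing the `next` idiom with an explicit `if`/`else`. The sample data and variable names are invented, and `print $1, $3` is used instead of `substr` only to keep the three-column case from the question short; both variants should produce the same output.

```shell
# Made-up sample in the three-column layout from the question.
sample=$(printf '%s\n' '10 12 a' 'USA John TGCAGG' '5 2 b' 'CAN Tom TGCACG')

# Variant 1: print the modified line, then `next` skips the trailing `1`
# so the line is not printed a second time.
with_next=$(printf '%s\n' "$sample" | awk '/TGCA/ { print $1, $3; next } 1')

# Variant 2: the same logic written as an explicit if/else.
with_if=$(printf '%s\n' "$sample" | awk '{ if ($0 ~ /TGCA/) print $1, $3; else print }')

printf '%s\n' "$with_next"
```

So an `if` statement is not required: a pattern-action pair plus `next` expresses the same branch.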
2

With GNU sed:

sed '/TGCA/ s/\s\+\S*//' filename

This removes from lines that contain TGCA the first occurrence of one or more spaces followed by any number of non-spaces -- which is the second column and the space(s) preceding it.

For BSD sed, this has to be amended because it doesn't understand \s or \S (or \+ -- it is a bit painful). In that case,

sed '/TGCA/ s/[[:space:]]\{1,\}[^[:space:]]*//' filename

does the same.
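
As a possible shortcut, both GNU and BSD sed accept `-E` for extended regular expressions, which lets `+` replace `\{1,\}`. A sketch on made-up input (the variable names are illustrative only), compared against the BSD-compatible command above:

```shell
# Made-up sample in the three-column layout from the question.
sample=$(printf '%s\n' '10 12 a' 'USA John TGCAGG' '5 2 b' 'CAN Tom TGCACG')

# Basic regular expressions, as in the BSD-compatible command above.
bre=$(printf '%s\n' "$sample" | sed '/TGCA/ s/[[:space:]]\{1,\}[^[:space:]]*//')

# Extended regular expressions via -E: `+` means "one or more".
ere=$(printf '%s\n' "$sample" | sed -E '/TGCA/ s/[[:space:]]+[^[:space:]]*//')

printf '%s\n' "$ere"
```

Unlike the awk approach, sed leaves the rest of each line byte-for-byte intact, so any remaining spacing is preserved.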

Wintermute