filter out unrecognised fields using awk

Question

I have a CSV file where I expect values such as Y or N. People are adding comments or arbitrary entries such as NA? that I want to remove:

Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,

I can use gsub to remove things that I am anticipating such as:

$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y".

I tried some negative lookarounds but couldn't get them to work with the awk that I have, which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!

score 6 · Answer 1 · answered Jan 03 '19 at 17:02

awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

Accepting only Y/N (case-insensitive).
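
As a minimal sketch (not part of the answer above), the four-way alternation can be avoided by upper-casing the field before testing it. This assumes only POSIX awk's toupper(); the function name keep_yn and the sample line are made up for illustration.

```shell
# keep_yn is a hypothetical helper name; it reads stdin or the files passed as arguments.
keep_yn() {
  awk 'BEGIN { FS = OFS = "," }
       {
         # Upper-case a copy of each field for the test; the field itself keeps its case.
         for (i = 3; i <= NF; i++)
           if (toupper($i) !~ /^[YN]$/) $i = ""
         print
       }' "$@"
}

# Made-up sample line: y and n are kept, NA? and TBD are blanked.
printf 'Create,1,y,NA?,n,TBD\n' | keep_yn
```

With that sample line the result should be Create,1,y,,n, with the field count unchanged.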

zzevannn · Answer 2 · 2019-01-03T17:12:57.970

2

 awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'

This seems to do the trick. It loops from the 3rd field through the last, and any field that isn't Y is replaced with an empty string. Since we're modifying fields, OFS has to be set as well; otherwise awk rebuilds the record with spaces instead of commas.
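
The point about OFS can be seen in isolation with a small sketch. The sample line is made up; the only difference between the two commands is whether OFS is set.

```shell
line='Create,20055776,Y,NA?'

# FS only: assigning to $4 makes awk rebuild the record using the default OFS (a single space).
printf '%s\n' "$line" | awk 'BEGIN { FS = "," } { $4 = ""; print }'

# FS and OFS both set: the rebuilt record keeps its commas.
printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "," } { $4 = ""; print }'
```

The first command prints the fields separated by spaces, the second prints Create,20055776,Y, with the commas intact.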

$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,

$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' file.txt
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

If you wanted to accept "N" too, /^[YN]$/ would work.
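
If the accepted set is likely to change, one option is to pass it in with -v and build the bracket expression at run time. This is only a sketch: keep_only is a hypothetical helper name, and it assumes the accepted values are single characters that are not special inside a bracket expression (so no ], ^, - or backslash).

```shell
# keep_only SET [FILE...]: SET is the string of accepted single-character values, e.g. YN.
keep_only() {
  allowed=$1
  shift
  awk -v allowed="$allowed" '
    BEGIN { FS = OFS = ","; re = "^[" allowed "]$" }   # allowed=YN gives ^[YN]$
    {
      for (i = 3; i <= NF; i++)
        if ($i !~ re) $i = ""      # dynamic regex: the string in re is used as the pattern
      print
    }' "$@"
}

# Made-up sample line.
printf 'Create,20055779,N,Y,NA ?,TBD\n' | keep_only YN
```

Called with YN on that line it should print Create,20055779,N,Y,, and called with Y alone it would blank the N as well.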

edited Jan 03 '19 at 17:12

answered Jan 03 '19 at 16:59

zzevannn


I like the negative character class :) – tink Jan 03 '19 at 17:03
You are correct @Tomalak , thanks for pointing that out. Updated. – zzevannn Jan 03 '19 at 17:13
I am not sure it is a negative character set. `/^[Y]$/` just means `/^Y$/`, which works the same. If we wanted to accept Y or N it would be `/^[YN]$/`. – simbo1905 Jan 03 '19 at 21:22
My previous edit included a negative character class but had issues that another (since deleted) comment pointed out. – zzevannn Jan 03 '19 at 21:35

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'

Output:

Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

Update: there's no need for a regex if you simply want to test whether a field is "Y" or not.
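
The same plain string comparison extends to more than one accepted value by testing membership in an array. A minimal sketch, assuming a comma-separated list of accepted values; keep_listed is a hypothetical helper name and the sample line is made up.

```shell
# keep_listed LIST [FILE...]: LIST is a comma-separated list of accepted values, e.g. Y,N.
keep_listed() {
  list=$1
  shift
  awk -v list="$list" '
    BEGIN {
      FS = OFS = ","
      n = split(list, tmp, ",")
      for (k = 1; k <= n; k++) ok[tmp[k]] = 1   # accepted values become array keys
    }
    {
      for (i = 3; i <= NF; i++)
        if (!($i in ok)) $i = ""                # exact string match, no regex involved
      print
    }' "$@"
}

printf 'Create,20055776,Y,Yes,NA?,N,No\n' | keep_listed 'Y,N'
```

Here Yes and No are blanked because only the exact strings Y and N are keys; multi-character values such as Yes could be accepted by adding them to the list.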

However, if you do want to use a regex: zzevannn's and tink's answers already cover the regex condition well, so I'll show a batch replace by regex instead.

To be exact, and to make it more challenging, I created some boundary cases:

$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,

And the batch replace is:

$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
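Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,

The regex "(,)[^,]*[^Y,]+[^,]*" matches a comma followed by any field that contains at least one character other than Y, so only empty fields and fields made up solely of Y are left alone.
Note that I first saved $1 and deleted it together with the comma after it, and print it back at the end. That way the ID in the second field is no longer preceded by a comma, so the regex cannot match it.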
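
gensub is a GNU awk extension. For other awks, roughly the same substitution can be sketched with gsub by putting the comma back in the replacement text. This is an adaptation, not the command above; blank_non_y is a hypothetical helper name and the sample line is made up.

```shell
# blank_non_y [FILE...]: blank every field after the second that contains a non-Y character.
blank_non_y() {
  awk 'BEGIN { FS = OFS = "," }
       {
         first = $1
         sub(/^[^,]*,/, "")               # drop field 1 and its comma so the ID has no leading comma
         gsub(/,[^,]*[^Y,][^,]*/, ",")    # comma + field holding at least one non-Y character
         print first, $0
       }' "$@"
}

printf 'Create,20055779,,Y,YY,NA ?Y,TYBD,\n' | blank_non_y
```

Note that YY survives in this sketch, just as it would with the gensub pattern, because it contains no character other than Y. If only a single Y should be kept, the field loop from the first command in this answer is the simpler choice.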

NeronLeVelu · Answer 4 · 2019-01-04T15:59:30.753

0

sed solution

# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;t a' test.csv

# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1/;ta' test.csv

awk version of the same idea (it sidesteps sed's lack of an OR operator in basic regular expressions); the ID is saved and put back because the gsub blanks it too:

awk -F ',' '{ Idx=$2; gsub(/,[^,]*[^YN,][^,]*/, ","); sub(/,/, "," Idx); print }' test.csv

edited Jan 04 '19 at 15:59

answered Jan 04 '19 at 14:57

NeronLeVelu


A small change will do: `sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*,\)[^,]*[^Y,][^,]*/\1/;ta' test.csv` – Tiw Jan 04 '19 at 15:10
