grep: combine negative match with After (search for protein sequences that do not contain the word fragment)

Question

I have a fasta sequence and I want to filter all those out that have the word Fragment in the header.

I thought I could use grep with -A 1 (because the protein sequence is always on one line) with -i (in case fragment is not written capitalized) and use that with -v, but somehow inversing the result does not work as expected.

>tr|A0A534K2W8|A0A534K2W8_9EURY Epoxide hydrolase 1 (Fragment) OS=Euryarchaeota archaeon OX=2026739 GN=E6K10_05355 PE=4 SV=1 
MSNTPDFNRR...
>tr|A0A4S3JUN3|A0A4S3JUN3_9EURO AB hydrolase-1 domain-containing protein OS=Aspergillus tanneri OX=1220188 GN=ATNIH1004_010243 PE=4 SV=1
MRDKYTPATL...
>tr|B1AQP8|B1AQP8_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|B1AQP9|B1AQP9_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|Q6FGZ3|Q6FGZ3_HUMAN EPHX1 protein (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=2 SV=1
MWLEILLTSV...
>tr|A0A2G8L4U1|A0A2G8L4U1_STIJA Putative epoxide hydrolase 1-like OS=Stichopus japonicus OX=307972 GN=BSL78_07808 PE=4 SV=1
MVHGWPGSFY...

If I want to keep the sequences with Fragment, it is working fine

grep -i "fragment" -A 1 test.fasta                                             
>tr|A0A534K2W8|A0A534K2W8_9EURY Epoxide hydrolase 1 (Fragment) OS=Euryarchaeota archaeon OX=2026739 GN=E6K10_05355 PE=4 SV=1 
MSNTPDFNRR...
--
>tr|B1AQP8|B1AQP8_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|B1AQP9|B1AQP9_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|Q6FGZ3|Q6FGZ3_HUMAN EPHX1 protein (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=2 SV=1
MWLEILLTSV...

but if I want to inverse the match, this is the result.

grep -i "fragment" -A 1 -v test.fasta
MSNTPDFNRR...
>tr|A0A4S3JUN3|A0A4S3JUN3_9EURO AB hydrolase-1 domain-containing protein OS=Aspergillus tanneri OX=1220188 GN=ATNIH1004_010243 PE=4 SV=1
MRDKYTPATL...
>tr|B1AQP8|B1AQP8_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|B1AQP9|B1AQP9_HUMAN Epoxide hydrolase 1 (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=1 SV=1
MWLEILLTSV...
>tr|Q6FGZ3|Q6FGZ3_HUMAN EPHX1 protein (Fragment) OS=Homo sapiens OX=9606 GN=EPHX1 PE=2 SV=1
MWLEILLTSV...
>tr|A0A2G8L4U1|A0A2G8L4U1_STIJA Putative epoxide hydrolase 1-like OS=Stichopus japonicus OX=307972 GN=BSL78_07808 PE=4 SV=1
MVHGWPGSFY...

Any ideas where I go wrong?

Sundeep · Accepted Answer · 2020-06-18T11:49:38.410

The issue is that -v cannot be used with context switches. If you have GNU grep with PCRE, then you can use a complicated regex:

grep --no-group-separator -xiP -A 1 '>((?!fragment).)+'

Note the use --no-group-separator to avoid -- between different matches. -P enables PCRE. -x ensures whole line is matched. >((?!fragment).)+ ensures fragment is not present in lines starting with > (See Variable-length lookbehind-assertion alternatives for regular expressions for more explanation)

But, you are better off with awk for such cases:

# with GNU awk
awk -v IGNORECASE=1 '/^>/ && !/fragment/{f=2} f && f--'
# any awk
awk '/^>/ && tolower($0) !~ /fragment/{f=2} f && f--'

Here, f=2 is 1 greater than number of lines you need after the match. /^>/ && !/fragment/ will only match lines starting with > and NOT containing fragment

See also lines around matching regexp for more such examples.

grep: combine negative match with After (search for protein sequences that do not contain the word fragment)

1 Answers1