
I'm searching a String for patterns that start with ATG, end with TAG, TAA or TGA, and have a length that is a multiple of 3. ATG may appear only at the beginning, and TAG, TAA or TGA only at the end. That means:

From ATGTTGTGATGT extract ATGTTGTGA

From ATGATGTTGTGATGT extract ATGTTGTGA

Currently I'm using regex (ATG)([ATG]{3})+?(TAG|TAA|TGA).

For ATGATGTTGTGATGT this gets me the wrong result ATGATGTTGTGA. I've tried:

(^ATG)(!?=.*ATG)([ATG]{3})+?(TAG|TAA|TGA)
(^ATG)(!?=(ATG)+)([ATG]{3})+?(TAG|TAA|TGA)

How can I tell the regex to allow ATG only once, at the beginning, and never after that?

xtra

1 Answer


You may use

ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)


Details

  • ATG - an ATG substring
  • (?:(?!ATG)[ATG]{3})*? - a tempered greedy token: zero or more (as few as possible) sequences of 3 chars from the [ATG] character set, none of which is equal to ATG (the negative lookahead (?!ATG) checks this before each sequence is consumed)
  • (?:TAG|TAA|TGA) - either of the three alternatives defined in the non-capturing group: TAG, TAA or TGA.
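To make the effect of the lookahead concrete, here is a minimal sketch that runs the pattern from the question and the pattern above against the same input. The class name CodonPatternComparison and the helper firstMatch are made up for this illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodonPatternComparison {
    // Pattern from the question: codons after the leading ATG are unrestricted.
    static final Pattern ORIGINAL =
            Pattern.compile("(ATG)([ATG]{3})+?(TAG|TAA|TGA)");
    // Pattern from this answer: (?!ATG) is checked before each inner codon.
    static final Pattern TEMPERED =
            Pattern.compile("ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)");

    // Hypothetical helper: first match of the pattern in the input, or null if none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "ATGATGTTGTGATGT";
        System.out.println(firstMatch(ORIGINAL, input)); // ATGATGTTGTGA
        System.out.println(firstMatch(TEMPERED, input)); // ATGTTGTGA
    }
}
```

With the tempered pattern, the attempt starting at index 0 fails because the next codon is ATG, so the regex engine moves on and matches from the second ATG. One difference worth noting: *? also accepts a start codon followed directly by a stop codon (for example ATGTAA), while the +? in the question requires at least one codon in between. Change *? to +? if that case should be rejected.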

Java demo:

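import java.util.regex.Matcher;
import java.util.regex.Pattern;

String rx = "ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)";
// Three sample sequences in one string; the commas just separate them
String s = "ATGTTGTGATGT, ATGATGTTGTGATGT, ATGATGTTGTGATGT";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
// Print every non-overlapping match, left to right
while (matcher.find()) {
    System.out.println(matcher.group(0));
}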

Result:

ATGTTGTGA
ATGTTGTGA
ATGTTGTGA
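
If you also need to know where each match starts, the same loop can record matcher.start() alongside the matched text. This is only a sketch: the class name OrfFinder, the method findAll and the "offset:match" output format are assumptions made for the example, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrfFinder {
    // Same pattern as above, compiled once and reused.
    private static final Pattern ORF =
            Pattern.compile("ATG(?:(?!ATG)[ATG]{3})*?(?:TAG|TAA|TGA)");

    // Hypothetical helper: returns each match as "startIndex:matchedText".
    static List<String> findAll(String sequence) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = ORF.matcher(sequence);
        while (matcher.find()) {
            hits.add(matcher.start() + ":" + matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ATGTTGTGATGT"));    // [0:ATGTTGTGA]
        System.out.println(findAll("ATGATGTTGTGATGT")); // [3:ATGTTGTGA]
    }
}
```

The start index 3 for the second input shows that the match begins at the second ATG, not the first.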
Wiktor Stribiżew