Java regex. Extracting group from a text excluding specific char sequence. (It's working like backward matching)

Question

I have read similar questions but found no solution to my problem. I'm having trouble extracting a group from the following string:

    String str = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 &gt;gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  &gt;gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  &gt;gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] &gt;gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] &gt;gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] &gt;gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] &gt;gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  &gt;gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] &gt;gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  &gt;gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] &gt;gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] &gt;gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] &gt;gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";

What I want to do is parse the string, capturing a group with every character up to the first occurrence of "&gt;", resulting in the following string:

result = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c]";

I have tried the following regex pattern using the replaceAll(regex, replacement) method:

str = str.replaceAll("^(.+)&gt;.+", "$1");

Here "^(.+)&gt;.+" should match any characters up to the first occurrence of "&gt;", but the group "(.+)" extends up to the last occurrence of "&gt;".

Then the result is:

from: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 &gt;gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  &gt;gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  &gt;gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] &gt;gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] &gt;gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] &gt;gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] &gt;gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  &gt;gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] &gt;gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  &gt;gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] &gt;gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] &gt;gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] &gt;gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";
to: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 &gt;gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  &gt;gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  &gt;gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] &gt;gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] &gt;gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] &gt;gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] &gt;gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] &gt;gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  &gt;gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] &gt;gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  &gt;gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] &gt;gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] &gt;gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103]";

To get my result I would have to loop, checking str.contains("&gt;") and applying str.replaceAll("^(.+)&gt;.+", "$1") each time to strip the trailing sequence, as if the matching worked backward.

score 3 · Answer 1 · answered Jul 13 '14 at 15:36

You need to make the pattern do a non-greedy match by adding the ? quantifier after +:

^(.+?)&gt;.*$

Your Java code would be,

str = str.replaceAll("^(.+?)&gt;.*$", "$1");

This replaces the whole string with the first captured group.

Avinash Raj


score 3 · Accepted Answer · edited May 23 '17 at 11:59

The problem is that the .+ in your regex

^(.+)&gt;.+

is greedy, meaning (as you have discovered) that it consumes every instance of &gt; except the last. Changing this to reluctant

^(.+?)&gt;.+

is what you want: it reluctantly captures only up to (not including) the first &gt;.

Elements that are greedy capture as much as possible, as long as the overall regex can still match.
Elements that are reluctant capture as little as possible, as long as the overall regex can still match.
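
The difference between the two modes can be checked directly with java.util.regex. The following is a minimal sketch, not part of the original answer: the class name, the helper firstGroup, and the shortened sample string are hypothetical stand-ins for the data in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the question's string.
        String s = "/a> ref &gt;gi|1| first &gt;gi|2| second";

        // Greedy: group 1 runs up to the LAST "&gt;".
        System.out.println("[" + firstGroup("^(.+)&gt;.+", s) + "]");

        // Reluctant: group 1 stops before the FIRST "&gt;".
        System.out.println("[" + firstGroup("^(.+?)&gt;.+", s) + "]");
    }
}
```

The brackets in the output make it visible that the reluctant capture still ends with the space that precedes the first &gt;, so a trim() may be needed afterwards.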

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

Thanks for the feedback. This FAQ will clarify a lot for me for future reference. — daniel souza, Jul 13 '14 at 16:08

laune · Answer 3 · 2014-07-13T16:07:01.613

1

str = str.replaceAll("^(.+?)&gt;.+", "$1");

Non-greedy!

Alternatively, you could use

str = str.replaceAll("&gt;.*", "");

which should leave you with all characters up to the first &gt;.

Also

String[] parts = str.split("&gt;", 2);

would have been an option if you don't want to change str; parts[0] then holds the text before the first &gt;.
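
As a rough sketch of the split approach (the class and method names here are hypothetical, and the delimiter is passed through Pattern.quote only so the helper stays safe for delimiters that contain regex metacharacters):

```java
import java.util.regex.Pattern;

class SplitAtFirstEntity {
    // Returns the text before the first occurrence of delimiter,
    // or the whole input if the delimiter does not occur.
    static String beforeFirst(String input, String delimiter) {
        // Limit 2 means: split at most once, so the rest of the string stays in parts[1].
        String[] parts = input.split(Pattern.quote(delimiter), 2);
        return parts[0];
    }

    public static void main(String[] args) {
        String s = "/a> ref &gt;gi|1| first &gt;gi|2| second";
        System.out.println("[" + beforeFirst(s, "&gt;") + "]");
    }
}
```

The original string is left untouched, and the remainder after the first &gt; is still available as parts[1] if it is needed later.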

edited Jul 13 '14 at 16:07

answered Jul 13 '14 at 15:36


Hi @laune, thanks for your feedback. I'll use this alternative way. I'm owing some reply in drools community about something that I wanted to do. – daniel souza Jul 13 '14 at 16:21

Pshemo · Answer 4 · 2014-07-13T15:52:29.327

1

The + quantifier is greedy, so it tries to find the longest possible match. For example, .+b will match

abababcd
^^^^^^

instead of

abababcd
^^

If you want the quantifier to find the shortest possible match, make it reluctant by adding ? after it.

This time .+?b would match

abababcd
^^

So change your regex to ^(.+?)&gt;.+.

You can also use a simpler mechanism instead of regex, namely substring and indexOf, which can look like

//                     |substring from 0
//                     |      |till index of first "&gt;"
result = str.substring(0, str.indexOf("&gt;"));
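
One thing to watch with this approach: indexOf returns -1 when the marker is absent, and substring(0, -1) then throws a StringIndexOutOfBoundsException. A small guarded variant might look like the sketch below; the class and method names are hypothetical, and returning the whole input when the marker is missing is an assumption about the desired behavior.

```java
class SubstringBeforeEntity {
    // Returns the text before the first occurrence of marker.
    // If marker is absent, the input is returned unchanged (assumed behavior).
    static String beforeFirst(String input, String marker) {
        int idx = input.indexOf(marker);
        return idx < 0 ? input : input.substring(0, idx);
    }

    public static void main(String[] args) {
        String s = "/a> ref &gt;gi|1| first &gt;gi|2| second";
        // trim() drops the space that sits just before the first "&gt;".
        System.out.println("[" + beforeFirst(s, "&gt;").trim() + "]");
    }
}
```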

edited Jul 13 '14 at 15:52

answered Jul 13 '14 at 15:44


Thanks for the answer! – daniel souza Jul 13 '14 at 16:03

@danielsouza You are welcome. Since there are many correct answers you should probably pick one you liked the most and [accept it](http://meta.stackoverflow.com/a/5235/186652). – Pshemo Jul 13 '14 at 16:04

score 0 · Answer 5 · answered Jul 13 '14 at 15:40

Your problem is that .+ is greedy and should be made reluctant by adding a ?, but there is an even simpler solution:

str = str.replaceAll("&gt;.*", "");

Just match what you don't want and delete it (by replacing it with nothing).
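
A hedged variation on the same idea, in case the input can span several lines: by default . does not match line terminators in Java, so &gt;.* only removes text up to the end of each line. The sketch below adds the (?s) flag and a trim(); the class and method names are hypothetical, and whether multi-line input occurs at all is an assumption.

```java
class StripTail {
    // Deletes everything from the first "&gt;" to the end of the input,
    // then trims surrounding whitespace.
    static String stripFromFirstEntity(String input) {
        // (?s) enables DOTALL so '.' also matches line terminators.
        return input.replaceFirst("(?s)&gt;.*", "").trim();
    }

    public static void main(String[] args) {
        String s = "/a> ref &gt;gi|1| first\n&gt;gi|2| second";
        System.out.println("[" + stripFromFirstEntity(s) + "]");
    }
}
```

Because the match already runs to the end of the input, replaceFirst and replaceAll behave the same here.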

Bohemian

