0

I have read similar questions but have not found a solution to my problem. I'm having trouble extracting a group from the following string:

    String str = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] >gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] >gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";

What I want is to capture a group containing all characters up to the first occurrence of ">", which should give the following string:

result = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c]";

I have tried the following regex pattern using the replaceAll(regex, replacement) method:

str = str.replaceAll("^(.+)>.+", "$1");

Here "^(.+)>.+" is meant to match any characters up to the first occurrence of ">", but the group "^(.+)" instead runs up to the last occurrence of ">".

Then the result is:

from: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] >gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] >gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";
to: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] >gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103]";

To get my result this way, I would have to loop while str.contains(">") and call str.replaceAll("^(.+)>.+", "$1") on each pass, stripping one trailing segment at a time, like matching backwards.

daniel souza

5 Answers

3

You need to make the pattern do a non-greedy match by adding the ? quantifier after +:

^(.+?)>.*$

Your Java code would be:

str = str.replaceAll("^(.+?)>.*$", "$1");

This replaces the whole string with the first captured group.
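To see the difference side by side, here is a minimal sketch. The class name, method names, and the shortened sample string are made up for illustration. Note that the string literal in the question begins with "/a>", so applied to it verbatim the reluctant pattern would stop at that very first ">"; the sample below assumes that prefix is not part of the real input.

```java
public class FirstGroupDemo {
    // Greedy: .+ takes as much as it can, so the group ends at the LAST '>'.
    static String greedy(String s) {
        return s.replaceAll("^(.+)>.+", "$1");
    }

    // Reluctant: .+? takes as little as it can, so the group ends at the FIRST '>'.
    static String reluctant(String s) {
        return s.replaceAll("^(.+?)>.*$", "$1");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the string in the question.
        String sample = "ref|NP_010829.1| Irc4p [S288c] >gi|1| first >gi|2| second";
        System.out.println("[" + greedy(sample) + "]");
        System.out.println("[" + reluctant(sample) + "]");
    }
}
```

The captured text keeps the space that precedes the ">", so you may want to call trim() on the result.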

Avinash Raj
3

The problem is that the .+ in your regex

^(.+)>.+

is greedy, meaning (as you have discovered) that it consumes everything up to the last >, including every earlier >. Changing this to reluctant

^(.+?)>.+

is what you want: the group reluctantly captures only the text up to (but not including) the first >.

  • Elements that are greedy capture as much as possible, as long as the overall regex can still match.
  • Elements that are reluctant capture as little as possible, as long as the overall regex can still match.

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

Community
aliteralmind
1
str = str.replaceAll("^(.+?)>.+", "$1");

Non-greedy!

Alternatively, you could use

 str = str.replaceAll(">.*", "");

which should leave you with all characters up to the first >.

Also

String[] parts = str.split(">", 2);

would have been an option if you don't want to change str; parts[0] then holds everything before the first >.
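A small sketch of the two non-capturing alternatives, in case it helps to compare them. The class and method names are hypothetical; only replaceAll and split come from the answer itself.

```java
public class BeforeFirstMarker {
    // Delete the first '>' and everything after it on the same line.
    static String viaReplace(String s) {
        return s.replaceAll(">.*", "");
    }

    // Split into at most two pieces at the first '>' and keep the left piece.
    // With limit 2, later '>' characters stay inside the second piece.
    static String viaSplit(String s) {
        return s.split(">", 2)[0];
    }

    public static void main(String[] args) {
        String sample = "head >middle >tail"; // hypothetical input
        System.out.println("[" + viaReplace(sample) + "]");
        System.out.println("[" + viaSplit(sample) + "]");
    }
}
```

Both leave the input untouched when it contains no ">" at all, since there is then nothing to replace and nothing to split on.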

laune
  • Hi @laune, thanks for your feedback. I'll use this alternative way. I'm owing some reply in drools community about something that I wanted to do. – daniel souza Jul 13 '14 at 16:21
1

The + quantifier is greedy, so it tries to find the longest possible match. For example, .+b will match

abababcd
^^^^^^

instead of

abababcd
^^

If you want this quantifier to find the shortest possible match, make it reluctant by adding ? after it.

This time .+?b would match

abababcd
^^

So change your regex to ^(.+?)>.+.
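The abababcd example above can be checked with a few lines of Java. This is only a sketch; the class name and the firstMatch helper are invented for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(".+b", "abababcd"));  // greedy: prints ababab
        System.out.println(firstMatch(".+?b", "abababcd")); // reluctant: prints ab
    }
}
```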


You can also use a simpler mechanism instead of regex, namely substring and indexOf, which could look like

//                     |substring from 0
//                     |      |till index of first ">"
result = str.substring(0, str.indexOf(">"));
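One thing to watch for: indexOf returns -1 when the string has no ">", and substring(0, -1) then throws a StringIndexOutOfBoundsException. A guarded variant might look like the sketch below; the class and method names are hypothetical, and returning the input unchanged when the marker is missing is just one possible choice.

```java
public class SubstringBeforeMarker {
    // Returns the text before the first occurrence of marker,
    // or the whole string if marker does not occur at all.
    static String before(String s, char marker) {
        int idx = s.indexOf(marker);
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(before("head >middle >tail", '>')); // text before the first '>'
        System.out.println(before("no marker here", '>'));     // unchanged
    }
}
```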
Pshemo
0

Your problem is that .+ is greedy and should be made reluctant by adding a ?, but there is an even simpler solution:

str = str.replaceAll(">.*", "");

Just match what you don't want and delete it (by replacing it with nothing).
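A caveat if the input can span several lines: by default . does not match line terminators in Java, so ">.*" only deletes up to the end of each line. Adding the (?s) flag (DOTALL) makes it delete through the end of the string. The sketch below uses made-up class and method names and a two-line input chosen for illustration.

```java
public class StripAfterMarker {
    // Default mode: '.' stops at line terminators, so each line is trimmed separately.
    static String perLine(String s) {
        return s.replaceAll(">.*", "");
    }

    // DOTALL mode: '.' also matches line terminators, so everything
    // from the first '>' to the end of the string is removed.
    static String wholeString(String s) {
        return s.replaceAll("(?s)>.*", "");
    }

    public static void main(String[] args) {
        String twoLines = "a>b\nc>d"; // hypothetical multi-line input
        System.out.println(perLine(twoLines).equals("a\nc")); // true
        System.out.println(wholeString(twoLines).equals("a")); // true
    }
}
```

For a single-line string like the one in the question, both variants behave the same.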

Bohemian