Working on a Java regular expression that will match either "es" or "s" at the end of the string and return the substring without that suffix. Seems easy, but I can't get the 'e' to match with the expressions I'm trying.

Here's the output I should get:

"inches" -> "inch"

"meters" -> "meter"

"ounces" -> "ounc"

but with this regular expression:

Pattern.compile("(.+)(es|s)$", Pattern.CASE_INSENSITIVE);

I'm actually getting:

"inches" -> "inche"

After some research I discovered that the ".+" part of my search is too greedy, and changing it to this:

Pattern.compile("(.+?)(es|s)$", Pattern.CASE_INSENSITIVE);

fixes the problem. My question, though, is why did the 's' match at all? If the 'greedy' nature of the algorithm was the problem, shouldn't it have matched the whole string?

IcedDante
  • It could be matching line by line, not multiline. Did you only get one result back, or did you get "meter" and "ounce" as well? – Derek Sep 28 '15 at 15:23
  • It would seem that you are trying to parse the English language, which I do not think is a regular language. I think you would need to look at Natural Language Processing, unless you are dealing with a very small subset of words. – npinti Sep 28 '15 at 15:23
  • Have a look at [*Greedy vs. Reluctant vs. Possessive Quantifiers*](http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers). – Wiktor Stribiżew Sep 28 '15 at 15:27

2 Answers

When a quantifier matches greedily, it takes as much as it can while still letting the whole expression match. So the greedy .+ takes everything except the final s, because it cannot take that s too and still leave something for (es|s). When it matches non-greedily, it takes as little as possible while still letting the whole expression match. Therefore .+? takes everything except the 'es', because that is the least it can take with the rest of the string still matching (es|s)$.
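To see both behaviours side by side, here is a small sketch that runs the two patterns from the question against the three sample words. The class name `GreedyVsReluctant` and the helper `stem` are made up for this illustration; only `Pattern` and `Matcher` come from `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // The two patterns from the question; only the quantifier differs.
    static final Pattern GREEDY =
            Pattern.compile("(.+)(es|s)$", Pattern.CASE_INSENSITIVE);
    static final Pattern RELUCTANT =
            Pattern.compile("(.+?)(es|s)$", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns group 1 (the part before the suffix),
    // or the word unchanged if the pattern does not match.
    static String stem(Pattern pattern, String word) {
        Matcher m = pattern.matcher(word);
        return m.find() ? m.group(1) : word;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"inches", "meters", "ounces"}) {
            System.out.println(word
                    + " greedy=" + stem(GREEDY, word)
                    + " reluctant=" + stem(RELUCTANT, word));
        }
        // inches greedy=inche reluctant=inch
        // meters greedy=meter reluctant=meter
        // ounces greedy=ounce reluctant=ounc
    }
}
```

For "meters" both versions agree, because the character before the final s is not an e, so there is only one way to split the word.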

Jonah

Short answer

Greedy doesn't mean possessive. A greedy quantifier aims to consume as much as possible, but it gives characters back as soon as the rest of the pattern would otherwise fail to match.
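The difference shows up if you swap the greedy `.+` for Java's possessive `.++`, which never gives characters back. A minimal sketch (the class name `PossessiveDemo` and the helper `matches` are invented for this example):

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Hypothetical helper: does the regex find a match anywhere in the word?
    static boolean matches(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(word)
                .find();
    }

    public static void main(String[] args) {
        // Greedy: .+ grabs "inches", then backs off one character
        // so that (es|s) can match the final "s".
        System.out.println(matches("(.+)(es|s)$", "inches"));  // true

        // Possessive: .++ grabs everything to the end and refuses to
        // back off, so nothing is left for (es|s) and the match fails.
        System.out.println(matches("(.++)(es|s)$", "inches")); // false
    }
}
```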

Long answer

In regular expressions, quantifiers such as the Kleene star (*) and + are greedy by default: they try to take as much as possible, but no more than the overall match allows. Consider the regex:

(.+)(es|s)$

Here .+ aims to eat as much as possible. But you can only reach the end of the regex if you somehow manage to pass (es|s), which is only possible if the string ends with at least one s. If we align the regex with your string inches:

(.+)  (es|s)$
inche s

(spaces added). In other words, .+ takes inche and leaves only the final s for (es|s).

When you make it non-greedy, the .+? tries to give up eating as soon as possible. For the string inches, this is after the inch:

(.+?) (es|s)$
inch  es

It cannot give up earlier, because then the h would somehow have to match (es|s).
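The two search orders can also be imitated by hand. The sketch below is not how `java.util.regex` is implemented; it only mimics the order in which candidate splits are tried for this one pattern, anchored at the start of the word. The class `SplitOrder` and its methods are hypothetical names for this illustration.

```java
public class SplitOrder {
    // Stand-in for (es|s)$: the remainder must be exactly "es" or "s".
    static boolean suffixMatches(String rest) {
        return rest.equalsIgnoreCase("es") || rest.equalsIgnoreCase("s");
    }

    // Greedy order: start with the longest prefix and give back one
    // character at a time until the remainder matches the suffix.
    static String greedyPrefix(String word) {
        for (int len = word.length(); len >= 1; len--) {
            if (suffixMatches(word.substring(len))) {
                return word.substring(0, len);
            }
        }
        return null; // no split works
    }

    // Reluctant order: start with the shortest prefix and take one more
    // character at a time until the remainder matches the suffix.
    static String reluctantPrefix(String word) {
        for (int len = 1; len <= word.length(); len++) {
            if (suffixMatches(word.substring(len))) {
                return word.substring(0, len);
            }
        }
        return null; // no split works
    }

    public static void main(String[] args) {
        System.out.println(greedyPrefix("inches"));    // inche
        System.out.println(reluctantPrefix("inches")); // inch
    }
}
```

Both loops accept the first split that works; they only differ in which end they start from, which is why the greedy version settles on the lone s and the reluctant one on es.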

Willem Van Onsem