The POSIX standard states that for both ERE and BRE:
Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE "\(.*\).*" against "abcdef", the subexpression "(\1)" is "abcdef", and matching the BRE "\(a*\)*" against "bc", the subexpression "(\1)" is the null string.
My question: how should (a|ab)(c|bcd)(d*) match against "abcd"?
My reading of the standard above is that the subexpression (a|ab) should match the longest string that still lets the whole match be the leftmost, longest one, so it should match "ab". However, when I use GNU regex (through sed -E) to search for (a|ab)(c|bcd)(d*) in "abcd", I get the following for the first subexpression:
echo abcd | sed -E 's/(a|ab)(c|bcd)(d*)/\0, \1, \2, \3/'
abcd, a, bcd,
This example is from this page.
Here is C++ code using Boost.Regex with the regex::extended flag:
#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
    // Compile the pattern as a POSIX extended regular expression.
    boost::regex_constants::syntax_option_type regex_flags =
        boost::regex::extended;
    std::string text = "abcd";
    boost::regex expression("(a|ab)(c|bcd)(d*)", regex_flags);
    boost::smatch matches {};
    boost::regex_search(text, matches, expression);
    // matches[0] is the whole match; matches[1..3] are the subexpressions.
    for (const auto& match : matches) { std::cout << match << ", "; }
    std::cout << std::endl;
    return 0;
}
Compiling and running this produces:
abcd, ab, c, d,
Here, consistent with the standard, the subexpression (a|ab) matches the longest string it can, but with GNU regex it does not.