
Is there a subset of regex features that is available, with the same meaning, in all major grammars? For example, . seems to be available and to mean the same thing everywhere. I suspect *, +, ^, $ are like this as well.

A broader search tends to show comparisons of a few features of a few grammars with notes/caveats that this grammar is like that one, or derived from that one, etc. I know I can do the work, but I am asking if there is an existing reference to a subset like this.

To narrow this question down further (maybe), is there a subset such that expressions using that set would work the same in C++11 no matter which grammar-specifying parameter was passed to std::regex()?

Note to those who have voted to close this as a duplicate: the question you claim is a duplicate has no qualifiers next to several features that are not universal even within the subset of grammars supported by C++11. For example: *? (reluctant), *+ (possessive), () (capture groups), lookaheads (?=...), and perhaps others. Some of these caused an exception to be thrown just by adding them to a std::regex() pattern.
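To make the "exception just from the pattern" point concrete, here is a minimal sketch of how one could check whether a pattern is even accepted under a given grammar flag. The helper name `compiles_under` is made up for illustration; it relies only on the fact that the `std::regex` constructor throws `std::regex_error` when it rejects a pattern.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): returns true when
// `pattern` is accepted by std::regex under the given grammar flag, and
// false when the constructor rejects it by throwing std::regex_error.
bool compiles_under(const std::string& pattern,
                    std::regex_constants::syntax_option_type grammar) {
    try {
        std::regex re(pattern, grammar);  // throws on a malformed pattern
        (void)re;                         // only construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

For instance, `compiles_under("(?=a)b", std::regex_constants::ECMAScript)` should be true, since lookahead is part of the ECMAScript grammar, while an unterminated bracket such as `"[a"` is rejected. What the other five grammars do with a lookahead is exactly the kind of thing worth probing per library.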

Arbalest
  • I know there are some good references around here somewhere, but I couldn't find the ones I was thinking of. As a side note, some characters have the same meaning everywhere but not entirely the same set of features. One off the top of my head: `+` means one or more matches, but if placed after a quantifier (like `+`, `*`, `?`) it becomes a [possessive quantifier](http://www.regular-expressions.info/possessive.html), which isn't available in all regex flavors. – Sam Apr 10 '14 at 15:52
  • @Crayon - I am not at all asking what any particular expression means. Were you suggesting that the features without language tags are universal? – Arbalest Apr 10 '14 at 16:51
  • @Sam - I have found a few feature/spec grids but they have all been limited. I also thought surely there is something more extensive out there. Thanks for the point about some features having multiple meanings. – Arbalest Apr 10 '14 at 16:51
  • It's definitely an interesting question, I've come up blank. Do you have a specific reason for using a reference for this? I usually just try to use what I know, and correct for errors if something isn't supported. – Sam Apr 10 '14 at 16:54
  • @Arbalest yes, I thought that page might be useful because the ones without tags should be universal – Crayon Violent Apr 10 '14 at 17:06
  • @Sam - I have several reasons: 1) the idea that the first pass at creating an expression ought to be the most universal 2) quick interview questions that are not biased towards a specific environment 3) academic curiosity, and as a guide toward what one ought to master first and... 4) obsessive compulsive disorder! :-) – Arbalest Apr 10 '14 at 17:12
  • All good reasons :) I've favorited this in case you find something, as I am interested and it definitely would be good to always create universal expressions if possible. – Sam Apr 10 '14 at 17:14
  • @Arbalest This question is quite broad since every regex engine has its own features. Now it happens quite often that they share the same syntax. You will see differences in some intermediate/advanced features: think about lookbehinds, recursive patterns, balancing groups. Sometimes the syntax also differs slightly, as with named groups. Anyway, check that reference out. It should be a great start. We have added language tags to show which language supports a certain feature/meaning. You might also [check this out](http://www.regular-expressions.info/refflavors.html) – HamZa Apr 10 '14 at 22:07

2 Answers


It's a very good question for regex lexicologists. It happens to have a simple answer.

Consider the regexCRECABLE grammar which I whipped up for C++11 five minutes ago:

& [QUESTION MARK] matches single character
. [DOT] matches period character
{ [OPENING BRACE] indicates optional repetition, 3 or more times
) [CLOSING PARENTHESIS] matches the number 6

Example:

Subject: The mark of the devil is 666.

Matching pattern: &{){.

So, clearly, the answer is no.

Generalized proof:

Regex engine coders are people too. (No, really.) You never know when someone will come out with something that turns the standards upside-down.

zx81

As there are accumulating votes to close this question (not sure why) I am reporting what I have found even though I could do more -- such as repeating this with a different compiler/library combination.

This is what I found after testing with Visual Studio 2013 and trying all six grammars supported by C++11 (ECMAScript, POSIX basic, POSIX extended, awk, grep, and egrep).

These seem universal

[az]       set
[a-z]      range
[^a-z]     exclusion
^[a-z]     line begin
[a-z]$     line end
[a-z].     any char
[a-z]+     one or more of preceding
[a-z]*     zero or more of preceding
[[:digit:]] any/all POSIX character classes (inside a bracket expression)
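For anyone who wants to repeat this kind of check with another compiler/library combination, a harness along these lines might do. It is only a sketch: `ProbeResult` and `probe_all_grammars` are names invented here, and it reports, for each of the six grammar flags, whether the pattern compiled and whether `std::regex_search` found a match in the subject.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// One row per grammar: its name, whether the pattern compiled, and whether
// std::regex_search found a match somewhere in the subject.
struct ProbeResult {
    std::string grammar;
    bool compiled;
    bool matched;
};

// Hypothetical helper: tries `pattern` against `subject` under each of the
// six grammar flags defined by C++11.
std::vector<ProbeResult> probe_all_grammars(const std::string& pattern,
                                            const std::string& subject) {
    namespace rc = std::regex_constants;
    const std::vector<std::pair<std::string, rc::syntax_option_type>> grammars = {
        {"ECMAScript", rc::ECMAScript}, {"basic", rc::basic},
        {"extended", rc::extended},     {"awk", rc::awk},
        {"grep", rc::grep},             {"egrep", rc::egrep},
    };

    std::vector<ProbeResult> results;
    for (const auto& g : grammars) {
        ProbeResult row{g.first, false, false};
        try {
            std::regex re(pattern, g.second);
            row.compiled = true;
            row.matched = std::regex_search(subject, re);
        } catch (const std::regex_error&) {
            // A throw from the constructor leaves both flags false.
        }
        results.push_back(row);
    }
    return results;
}
```

Calling `probe_all_grammars("[a-z]", "abc")` and printing each row gives a quick per-grammar table. A row with `compiled == true` and `matched == false` means the pattern was accepted but interpreted differently, which is the quieter kind of incompatibility.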

Leaving POSIX basic and plain (non-"e") grep out of the mix makes these universal as well:

[a-z]?     zero or 1 of preceding
\b[a-z]    word boundary
a|z        OR
a{2}       repetition
a{2,3}     repetition
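The reason basic and grep drop out is that POSIX basic regular expressions treat an unescaped `|` and `{` as ordinary characters, so the pattern still compiles but means something else. A small sketch of that contrast follows; the helper `search_under` is a made-up name, and the behaviour described is what POSIX specifies, so a given library could in principle differ.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern`, compiled under `grammar`,
// matches somewhere in `subject`; false on no match or on a rejected pattern.
bool search_under(const std::string& pattern, const std::string& subject,
                  std::regex_constants::syntax_option_type grammar) {
    try {
        return std::regex_search(subject, std::regex(pattern, grammar));
    } catch (const std::regex_error&) {
        return false;
    }
}

// Under `extended`, "a|z" is an alternation and should match "z".
// Under `basic`, "|" is expected to be a literal, so "a|z" should only
// match text that contains the three characters a, |, z in sequence.
bool alternation_differs() {
    namespace rc = std::regex_constants;
    return search_under("a|z", "z", rc::extended) &&
           !search_under("a|z", "z", rc::basic);
}
```

The same check with `a{2}` against `"aa"` should show the interval working under `extended` and being read as the literal text `a{2}` under `basic`.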

There could be more but it looks like this question could be closed before I check out the rest.

Arbalest