-2

I want to use a regex pattern to match any multi-line text that starts with "abc" and ends with "xyz". I have two candidate patterns:

  1. ^abc(.|\n|\r)*xyz$
  2. ^abc[\s\S]*xyz$

Are the two equivalent to each other, apart from performance?

Which is better and why?

xmllmx
  • 1
    Now the second one, as it doesn't use a capturing group :p – Code Maniac Sep 01 '19 at 11:39
  • The two are different patterns – Naveen Sep 01 '19 at 11:40
  • You would be better off using the `singleline` flag if you want `.` to match newlines – Code Maniac Sep 01 '19 at 11:42
  • My regex engine has no singleline flag. – xmllmx Sep 01 '19 at 11:44
  • Knowing which engine you are using is very important -- please tell us. – MurrayW Sep 01 '19 at 11:44
  • I want to find an engine-independent answer. – xmllmx Sep 01 '19 at 11:46
  • 3
    There is no engine-independent answer to "which is better and why". By the rules of [tag:regex] tag: "Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool." – Amadan Sep 01 '19 at 11:49
  • I would still go with the second, as it is easier to read; my guess is that it will be faster than the first one, or at worst the same speed – Code Maniac Sep 01 '19 at 11:51
  • What do you mean by "better"? – Toto Sep 01 '19 at 11:51
  • @CodeManiac [partially duplicate question](https://stackoverflow.com/questions/4724588/using-alternation-or-character-class-for-single-character-matching) only answers the question about efficiency, not whether the two expressions are exactly equivalent. Voting to reopen. – joanis Sep 01 '19 at 13:20
  • 1
    @joanis there are two answers which clearly explain what is asked in the question: `tim's` answer talks about the difference between the two patterns and @Toto 's answer talks about performance – Code Maniac Sep 01 '19 at 13:35
  • 2
    The answer is: use neither, they are both bad. Use `.`, with the necessary option (usually, `s`, or `m` in Ruby) or without (as in all POSIX flavors). In JS ECMAScript legacy versions, there is `[^]`. `.` = good, `[\s\S]` = so-so, `(.|\n|\r)` = worst. – Wiktor Stribiżew Sep 01 '19 at 14:56
  • @CodeManiac I think you are mostly right, except for the fact that they don't address the specific nuances that apply to `.`, which is what OP here is really interested in. – joanis Sep 01 '19 at 15:35

2 Answers

1

Your best option is to tell your regex library/engine that the dot should match all characters, including line separators. Practically every regex implementation I know has this feature: it's usually a flag called DOTALL or SINGLELINE (MULTILINE in Ruby; in most other engines MULTILINE changes `^` and `$` instead), or an option called "dot matches newline" or something similar.

If that's not an option, then go for the second one you posted: character classes are more efficient than using the | operator.
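As a rough illustration, here is a minimal sketch in Python. The choice of engine is an assumption, since the question does not name one, and the sample `text` is made up. It shows the dot failing to span lines by default, the flag fixing that, and the character class as the fallback:

```python
import re

# Made-up sample input spanning several lines, with both \n and \r\n.
text = "abc\nline two\r\nxyz"

# Without a dot-all flag, '.' stops at '\n', so the pattern cannot span lines.
plain = re.search(r"^abc.*xyz$", text)

# With re.DOTALL, '.' matches every character, including line breaks.
dotall = re.search(r"^abc.*xyz$", text, re.DOTALL)

# Fallback when no such flag exists: the character class from pattern 2.
fallback = re.search(r"^abc[\s\S]*xyz$", text)

print(plain is None)                         # the unflagged dot does not match
print(dotall.group(0) == fallback.group(0))  # both span the whole text
```

Other engines spell the flag differently, so check your engine's documentation before relying on the name used here.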

Leo Aso
  • Are they two equivalent to each other except for performance? – xmllmx Sep 01 '19 at 11:48
  • 1
    In any engine where `.` matches everything except `\n` or except `\r` and `\n`, they are equivalent. And that is true of all the engines I know. – Leo Aso Sep 01 '19 at 11:57
1

For an 'engine independent answer' I would take your second option: ^abc[\s\S]*xyz$ simply because the character class should be more efficient (although this could theoretically be dependent on the engine) than the alternation group.

In other words, the regex engine in question should take fewer steps to match a result using ^abc[\s\S]*xyz$ than it would take using ^abc(.|\n|\r)*xyz$.
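Since the step count depends on the engine, the claim is worth measuring rather than assuming. The following is a minimal sketch in Python (an assumed engine; the test input and repetition count are arbitrary) that times both patterns and also shows one non-performance difference: pattern 1 introduces a capturing group, pattern 2 does not.

```python
import re
import timeit

alternation = re.compile(r"^abc(.|\n|\r)*xyz$")
char_class = re.compile(r"^abc[\s\S]*xyz$")

# Arbitrary multi-line input: "abc", 2000 CRLF-terminated lines, then "xyz".
text = "abc" + "some line\r\n" * 2000 + "xyz"

m1 = alternation.search(text)
m2 = char_class.search(text)

# In this engine both patterns accept the same text...
same_match = m1.group(0) == m2.group(0)

# ...but pattern 1 has one capturing group and pattern 2 has none.
group_counts = (alternation.groups, char_class.groups)

# Timings vary by machine and engine version; compare them, don't quote them.
t_alt = timeit.timeit(lambda: alternation.search(text), number=20)
t_cls = timeit.timeit(lambda: char_class.search(text), number=20)
print(f"alternation: {t_alt:.4f}s  character class: {t_cls:.4f}s")
```

In engines where `.` also excludes other line terminators (for example U+2028 in JavaScript), the two patterns are not equivalent in what they match, so the same comparison should be repeated in the engine you actually use.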

MurrayW