20

I have some useful regular expressions in Perl. Is there a simple way to translate them to .NET's dialect of regular expressions?

If not, is there a concise reference of differences?

Peter Mortensen
JoelFan

3 Answers

36

There is a big comparison table in http://www.regular-expressions.info/refflavors.html.


Most of the basic elements are the same. The differences are:

Minor differences:

  • Unicode escape sequences. In .NET it is \u200A; in Perl it is \x{200A}.
  • \v in .NET is just the vertical tab (U+000B); in Perl it stands for the "vertical whitespace" class, which is why Perl also has \V.
  • The conditional expression for named reference in .NET is (?(name)yes|no), but (?(<name>)yes|no) in Perl.
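
A minimal C# sketch of the .NET side of these three differences. The class and method names are hypothetical, and the quoted-word pattern is only an illustration of the conditional syntax, not something from the comparison table.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class MinorDifferences
{
    // Perl: /\x{200A}/  ->  .NET: \u200A (exactly four hex digits)
    public static bool ContainsHairSpace(string input) =>
        Regex.IsMatch(input, @"\u200A");

    // In .NET, \v matches only the vertical tab (U+000B), not a line feed.
    public static bool ContainsVerticalTab(string input) =>
        Regex.IsMatch(input, @"\v");

    // Perl: (?(<q>)yes|no)  ->  .NET: (?(q)yes|no)
    // Accepts a word that is either fully double-quoted or not quoted at all.
    public static bool IsConsistentlyQuoted(string input) =>
        Regex.IsMatch(input, @"^(?<q>"")?\w+(?(q)"")$");
}
```

For example, `MinorDifferences.ContainsVerticalTab("\n")` should be false in .NET, whereas Perl's \v would match the line feed.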

Some elements are Perl-only:

  • Possessive quantifiers (x?+, x*+, x++ etc). Use non-backtracking subexpression ((?>…)) instead.
  • Named unicode escape sequence \N{LATIN SMALL LETTER X}, \N{U+200A}.
  • Case folding and escaping
    • \l (lower case next char), \u (upper case next char).
    • \L (lower case), \U (upper case), \Q (quote meta characters) until \E.
  • Shorthand notation for Unicode property \pL and \PL. You have to include the braces in .NET e.g. \p{L}.
  • Odd things like \X, \C.
  • Special character classes like \v, \V, \h, \H, \N, \R
  • Backreference to a specific or previous group \g1, \g{-1}. You can only use absolute group index in .NET.
  • Named backreference \g{name}. Use \k<name> instead.
  • POSIX character class [[:alpha:]].
  • Branch-reset pattern (?|…)
  • \K. Use look-behind ((?<=…)) instead.
  • Code evaluation assertion (?{…}), postponed subexpression (??{…}).
  • Subexpression reference (recursive pattern) (?0), (?R), (?1), (?-1), (?+1), (?&name).
  • Some conditional-expression predicates are Perl-specific:
    • code (?{…})
    • recursive (R), (R1), (R&name)
    • define (DEFINE).
  • Special Backtracking Control Verbs (*VERB:ARG)
  • Python syntax
    • (?P<name>…). Use (?<name>…) instead.
    • (?P=name). Use \k<name> instead.
    • (?P>name). No equivalent in .NET.
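
A few of the substitutions suggested in this list, sketched in C#. The class, method names and sample patterns are hypothetical and chosen only to show the .NET spelling next to the Perl one.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; names and patterns are illustrative only.
public static class PerlToDotNet
{
    // Plain greedy quantifier: \w+ backtracks one character so \d can match.
    public static bool GreedyMatch(string input) =>
        Regex.IsMatch(input, @"^\w+\d$");

    // Perl: /^\w++\d$/ (possessive)  ->  .NET: atomic group (?>\w+)
    // The atomic group never gives characters back, so "abc1" fails here.
    public static bool AtomicMatch(string input) =>
        Regex.IsMatch(input, @"^(?>\w+)\d$");

    // Perl: (?P<word>\w+)\s+(?P=word) or \g{word}  ->  .NET: (?<word>) and \k<word>
    public static bool HasRepeatedWord(string input) =>
        Regex.IsMatch(input, @"\b(?<word>\w+)\s+\k<word>\b");

    // Perl: s/foo\Kbar/X/g  ->  .NET: look-behind (?<=foo)
    public static string ReplaceBarAfterFoo(string input) =>
        Regex.Replace(input, @"(?<=foo)bar", "X");

    // Perl: \pL  ->  .NET: \p{L} (the braces are mandatory)
    public static bool IsSingleLetter(string input) =>
        Regex.IsMatch(input, @"^\p{L}$");
}
```

`GreedyMatch("abc1")` succeeds while `AtomicMatch("abc1")` does not, which is the same difference a possessive quantifier makes in Perl.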

Some elements are .NET only:

  • Variable-length look-behind. In Perl, for positive look-behind, use \K instead.
  • Arbitrary regular expression in conditional expression (?(pattern)yes|no).
  • Character class subtraction [a-z-[d-w]]
  • Balancing Group (?<-name>…). This could be simulated with code evaluation assertion (?{…}) followed by a (?&name).
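
The .NET-only constructs can be tried with a short C# sketch like the one below. The class and method names are hypothetical, and the balanced-parentheses pattern is a commonly used idiom for balancing groups rather than anything specific to this answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; names and patterns are illustrative only.
public static class DotNetOnly
{
    // Character class subtraction: a-z minus d-w leaves a, b, c, x, y, z.
    public static bool UsesOnlyAToCAndXToZ(string input) =>
        Regex.IsMatch(input, @"^[a-z-[d-w]]+$");

    // Variable-length look-behind: digits preceded by '$' and any run of whitespace.
    public static string AmountAfterDollar(string input) =>
        Regex.Match(input, @"(?<=\$\s*)\d+").Value;

    // Balancing group: push on '(', pop on ')', and fail if anything is still open.
    public static bool HasBalancedParens(string input) =>
        Regex.IsMatch(input, @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");
}
```

In `HasBalancedParens`, `(?<-open>\))` can only match while the `open` stack is non-empty, and the final `(?(open)(?!))` rejects input that leaves an unmatched `(`.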


kennytm
  • 1
    Awesome, thanks... BTW, I was quite pleasantly surprised at how compatible the 2 dialects are... even look-around, etc. I also had not known that the Regex.Replace method in .NET supported replacing matched parenthesized subexpressions ($1, $2, etc.) like so: str = Regex.Replace(str, @"([a-z]+):(\d+)", m => m.Result("$1 -- $2")) which corresponds to the Perl: s/([a-z]+):(\d+)/$1 -- $2/g – JoelFan Aug 06 '10 at 15:03
  • "Some elements are Perl-only" does not mention character translation (like tr/tgca/acgt/), so it is supported, isn't it? – mbx Jun 27 '12 at 12:11
  • @mbx: I don't consider character translation as part of regex. – kennytm Jun 27 '12 at 13:50
  • 1
    Great answer. I'd like to add this: When you mix named and unnamed capture groups in a single regex, the _order_ in which they are referenced is different. In Perl, `perl -E "@captures = 'word1 word2 word3' =~ /(?\w+)\s+(\w+)\s+(\w+)/; foreach my $c (@captures){say $c}"` still comes out as `word1 word2 word3`, whereas in .NET regex it would come out as `word2 word3 word1` because unnamed groups are -unexpectedly- prioritized when the capture groups are ordered by the regex engine. This might have implications when complex regexes are translated from one language into the other. – knb Jan 26 '13 at 18:00
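
The .NET group numbering described in that comment can be checked with a small C# sketch. The group names `first`, `a`, `b` and `c` and the class name are hypothetical stand-ins chosen for this illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; group and method names are illustrative only.
public static class GroupOrder
{
    // One named group followed by two unnamed ones.
    // .NET numbers the unnamed groups first, then the named ones.
    public static string[] MixedGroups(string input)
    {
        Match m = Regex.Match(input, @"(?<first>\w+)\s+(\w+)\s+(\w+)");
        // Skip group 0, which is the whole match.
        return m.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray();
    }

    // Naming every group keeps left-to-right numbering.
    public static string[] AllNamedGroups(string input)
    {
        Match m = Regex.Match(input, @"(?<a>\w+)\s+(?<b>\w+)\s+(?<c>\w+)");
        return m.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray();
    }
}
```

When porting a Perl regex that mixes both kinds of group, naming every group or reading captures by name (`m.Groups["first"]`) avoids depending on the numeric order.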
3

.NET regular expressions were designed to be compatible with Perl 5 regexes. As such, Perl 5 regexes should just work in .NET.

You can translate some RegexOptions as follows:

[Flags]
public enum RegexOptions
{
  Compiled = 8,
  CultureInvariant = 0x200,
  ECMAScript = 0x100,
  ExplicitCapture = 4,
  IgnoreCase = 1,                 // i in Perl
  IgnorePatternWhitespace = 0x20, // x in Perl
  Multiline = 2,                  // m in Perl
  None = 0,
  RightToLeft = 0x40,
  Singleline = 0x10               // s in Perl
}
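
As a rough sketch of that mapping, a Perl pattern written with /imx could be ported as shown below. The pattern, class name and method names are hypothetical examples, not part of the RegexOptions API.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; the pattern and names are illustrative only.
public static class PerlFlags
{
    // Perl: /^ key \s* = \s* (\d+) $/imx
    private static readonly Regex WithOptions = new Regex(
        @"^ key \s* = \s* (\d+) $",
        RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // The same flags written inline, a form both Perl and .NET accept.
    private static readonly Regex WithInlineFlags =
        new Regex(@"(?imx)^ key \s* = \s* (\d+) $");

    // Returns the captured digits, or an empty string when nothing matches.
    public static string ValueViaOptions(string text) =>
        WithOptions.Match(text).Groups[1].Value;

    public static string ValueViaInlineFlags(string text) =>
        WithInlineFlags.Match(text).Groups[1].Value;
}
```

Inline flags keep the pattern text portable between the two dialects, while RegexOptions keeps the pattern itself shorter.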

Another tip is to use verbatim strings so that you don't need to escape all those escape characters in C#:

string badOnTheEyesRx    = "\\d{4}/\\d{2}/\\d{2}";
string easierOnTheEyesRx = @"\d{4}/\d{2}/\d{2}";
Jordão
  • 3
    @Eric: Neither is a superset of the other. – kennytm Aug 05 '10 at 18:06
  • 2
    No, .NET came after Perl 5, and copied its winning regex syntax. – Jordão Aug 05 '10 at 18:07
  • 1
    @KennyTM => what does .NET have that Perl does not? Perl has embedded code execution `(?{ code })` and `(??{ code })`, recursion into capture groups... – Eric Strom Aug 05 '10 at 18:14
  • 1
    @Eric: [Balancing groups](http://blog.stevenlevithan.com/archives/balancing-groups). – kennytm Aug 05 '10 at 18:17
  • @KennyTM => I haven't looked into them much, but it looks like that could be done with Perl's embedded code constructs. At any rate, calling .NET's regexes a superset of Perl's is just wrong. – Eric Strom Aug 05 '10 at 18:24
  • Oh yeah, code interpolation won't work in .NET, unless you do some ugly string concatenation or use `string.Format` to create your regexes. And the fact that they're not literals in .NET is also not ideal. – Jordão Aug 05 '10 at 18:24
  • 1
    @Eric: (1) That's cheating ;) (2) I did not call .NET a superset of Perl. I said they belong to different sets (insert Venn diagram). – kennytm Aug 05 '10 at 18:28
1

It really depends on the complexity of the regular expression; many will work the same out of the box.

Take a look at this .NET regex cheat sheet to see if an operator does what you expect it to do.

I don't know of any tool that automatically translates between RegEx dialects.

Oded
  • RegexBuddy can take a regex in a large variety of flavors and convert it to another - as long as the required functionality is supported by the target regex flavor. – Tim Pietzcker Aug 05 '10 at 20:08