I'm looking at someone else's regex... I can make out I'm dealing with a positive lookbehind, but I'm not sure what it's supposed to match: (?<=[^])\t{2,}|(?<=[>]).

I know [stuff] matches any character among s, t, u, and f. And I know [^stuff] matches any character not among those.

But what does [^] mean? I guess it could mean "anything not of length zero", i.e. "anything". But why wouldn't one just use some expansion on the simple . expression (to also capture newlines)?

Update:

Per Wiktor's comment, [^] alone isn't valid. But that still leaves me wondering what this thing is supposed to do...

To me, an intuitive reading is...

(?<=[^]) - look behind for whatever [^] matches

\t{2,} - then find two or more tabs

| - if there's not a match for that...

(?<=[>]) - ...look behind for a > character.

Where is my interpretation missing the mark?

Michael Crenshaw
  • It is an invalid pattern in the majority of regex flavors other than ECMAScript. It will throw an *`Unterminated [] set`* exception. To match any char, use `(?s:.)` (a `.` pattern with the `RegexOptions.Singleline` option). – Wiktor Stribiżew Oct 05 '17 at 14:27
  • @WiktorStribiżew thanks, you're right, the expression I gave wouldn't compile. I added more context from the source regex: `(?<=[^])\t{2,}|(?<=[>])`. – Michael Crenshaw Oct 05 '17 at 14:36
  • Yes, so the answer to "What does [^] match in C# regex?" is that it does not match anything since it is an invalid pattern. It is not even tried at all; it fails at the parsing stage. And `[^>]` is a negated character class that matches any char but `>`. – Wiktor Stribiżew Oct 05 '17 at 14:39
  • @WiktorStribiżew feel free to drop that in an answer, and I'll accept it. Any further guidance on my misinterpretation of the regex (updated above) would also be appreciated! – Michael Crenshaw Oct 05 '17 at 14:42
  • This actually does compile in LINQPad, because .NET interprets the `]` following the `^` as a literal, so this whole thing becomes a big character class of anything except `]`, `)`, `\t`, `{`, etc. until the final `]`, all wrapped up in a lookbehind. But of course, I don't think that's the intent. – p.s.w.g Oct 05 '17 at 14:44
  • @WiktorStribiżew I'm wondering... Wouldn't it (the regex-compiler) *try* to make sense out of it and make a character class of **all** the letters up to the final `]`, after the `>`? It seems to compile in ideone... – SamWhan Oct 05 '17 at 14:45
  • @p.s.w.g yeah, even if that's the intent, it's weird that the characters are ordered in a way that reads like regex. – Michael Crenshaw Oct 05 '17 at 14:47

1 Answer

[^] does not match anything because it is an invalid pattern: it is never tried against the input and fails at the parsing stage. [^>], on the other hand, is a valid negated character class that matches any char but >.

[^] is an invalid pattern in the majority of regex flavors other than ECMAScript. In .NET, it throws an "Unterminated [] set" exception.

To match any char, including a newline, use (?s:.), that is, a . pattern with the inline equivalent of the RegexOptions.Singleline option.
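A minimal sketch of both points, written in Python rather than C# purely for convenience (that choice is an assumption of this example, not part of the original question). As far as I know, Python's `re` module also rejects a bare `[^]` and supports the inline `(?s:.)` form; the helper name `compiles` is made up for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A bare [^] never closes: the ] right after ^ is read as a literal,
# so the parser reaches the end of the pattern without finding a closing ].
bare_negated_class_ok = compiles(r"[^]")

# (?s:.) turns on "dot matches newline" for this group only.
any_char = re.compile(r"(?s:.)")
matches_newline = any_char.fullmatch("\n") is not None

# Without the s flag, a plain dot does not match a newline.
plain_dot_matches_newline = re.fullmatch(r".", "\n") is not None

print(bare_negated_class_ok, matches_newline, plain_dot_matches_newline)
```

In .NET the equivalent check would be constructing a `Regex` and catching the parse exception; the Python version only illustrates the idea.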

The (?<=[^])\t{2,}|(?<=[>]) pattern is a single positive lookbehind that matches a location immediately preceded by a char matching [^])\t{2,}|(?<=[>], which is a negated character class matching any single char but ], ), tab, {, 2, ,, }, |, (, ?, <, =, [, >. All the chars from the [^ to the last ] are "negated" because the first ] after ^ is treated as a literal ] symbol.
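The parse described above can be checked with a small sketch. It uses Python's `re` module as a stand-in for .NET (an assumption of this example); to my knowledge Python also treats a `]` placed directly after `[^` as a literal, so the pattern compiles there into the same single lookbehind. The names `excluded` and `matches_after` are hypothetical.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"(?<=[^])\t{2,}|(?<=[>])")

# The characters the single negated class excludes, as listed above.
excluded = "])\t{2,}|(?<=[>"

def matches_after(ch: str) -> bool:
    """True if the zero-width lookbehind succeeds right after ch."""
    return pattern.search(ch) is not None

# Any char outside the class satisfies the lookbehind, e.g. S.
after_S = pattern.search("S")
print(after_S.span())  # position of the zero-width match

# None of the excluded chars should allow a match after them.
print([c for c in excluded if matches_after(c)])
```

The match is zero-width: nothing is consumed, and the regex only asserts what the preceding char is not.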

You may see the regex demo here, where it matches a location after S.

Basically, you need to always watch out for characters that are not word chars, and to play it safe, you may escape all non-word chars.

Inside a character class, there are only 4 chars that are "special":

^
]
\
-

If you want to avoid misunderstanding, always escape them.

If you want to show off before your boss/customer, note that you do not have to escape them if...

  • - : if it appears at the end/start of the character class, or between a char and a valid range/shorthand character class, and if it is not part of a character class subtraction construct
  • ] : if it appears right at the beginning of the character class AND it is not the only char in the character class
  • ^ : if it is not the first char in the positive character class.

And \ must always be escaped.
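The escaping rules above can be sketched as follows, again in Python's `re` module rather than .NET (an assumption of this example). Python has no character class subtraction, so that case is not shown; the variable names are made up for illustration.

```python
import re

# All four special chars escaped: the unambiguous form.
escaped = re.compile(r"[\^\]\\\-]")
escaped_hits = [c for c in "^]\\-a" if escaped.fullmatch(c)]

# The unescaped positions described in the list above.
hyphen_at_end = re.fullmatch(r"[a-]", "-") is not None    # - at the end
bracket_first = re.fullmatch(r"[]a]", "]") is not None    # ] first, not alone
caret_not_first = re.fullmatch(r"[a^]", "^") is not None  # ^ not first

print(escaped_hits, hyphen_at_end, bracket_first, caret_not_first)
```

The escaped form is longer but leaves no doubt about where the class ends, which is exactly what went wrong in the pattern from the question.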

Wiktor Stribiżew
  • Where are you seeing [^>] in the given pattern? If you're talking about a single character class that happens to contain a >, say so. – BoltClock Oct 05 '17 at 14:48
  • @BoltClock It is in the [comment](https://stackoverflow.com/questions/46588130/what-does-match-in-c-sharp-regex/46588503#comment80129827_46588130). Actually, I just wanted to supply an example of a valid simple negated character class. – Wiktor Stribiżew Oct 05 '17 at 14:51
  • @WiktorStribiżew I think you're looking at `[>]` rather than `[^>]`. – Michael Crenshaw Oct 05 '17 at 14:52