
I'm using the following regex in C# to match some input cases:

^
(?<entry>[#])?
(?(entry)(?<id>\w+))
(?<value>.*)
$

The regex is used with the option to ignore pattern whitespace.

My input looks as follows:

hello
#world
[xxx]

This all can be tested here: DEMO

My problem is that this regex will not match the last line. Why? What I'm trying to do is to check for an entry character. If it's there I force an identifier by \w+. The rest of the input should be captured in the last group.

This is a simplified regex and simplified input.

The problem can be fixed if I change the id regex to something like (?(entry)(?<id>\w+)|), (?(entry)(?<id>\w+))? or (?(entry)(?<id>\w+)?).
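As a minimal sketch of the first workaround, the conditional below carries an explicit, empty "no" branch. The class and the Describe helper are hypothetical names used only for illustration, and RegexOptions.Multiline is an assumption taken from the comments rather than from the question text. Each input is tested as a separate single-line string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitElseDemo
{
    // Workaround variant: (?(entry)yes|) with an explicit, empty "no" branch.
    // Multiline is an assumption based on the comments under the question.
    private static readonly Regex Pattern = new Regex(@"
        ^
        (?<entry>[#])?
        (?(entry)(?<id>\w+)|)
        (?<value>.*)
        $",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

    // Hypothetical helper: formats the groups as "entry|id|value",
    // or returns "<no match>" when the pattern does not match at all.
    public static string Describe(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success)
        {
            return "<no match>";
        }
        return m.Groups["entry"].Value + "|" + m.Groups["id"].Value + "|" + m.Groups["value"].Value;
    }

    public static void Main()
    {
        foreach (string line in new[] { "hello", "#world", "[xxx]" })
        {
            Console.WriteLine(line + " -> " + Describe(line));
        }
    }
}
```

With the empty alternative in place, a line without the entry character should fall through to the value group instead of failing the whole match.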

I'm trying to understand why the conditional group doesn't match as written in the original regex.

I'm comfortable with regex and know that this one can be simplified to ^(\#(?<id>\w+))?(?<value>.*)$ to match my needs. But the real regex contains two more optional groups:
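For reference, here is a small sketch of how the simplified pattern behaves on the three sample inputs, each tested as its own string. The class and the Describe helper are hypothetical names for illustration only; no regex options are set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimplifiedPatternDemo
{
    // The simplified pattern from the question; no conditional construct involved.
    private static readonly Regex Pattern =
        new Regex(@"^(\#(?<id>\w+))?(?<value>.*)$");

    // Hypothetical helper: returns "id|value". The id part is empty
    // when the optional group did not take part in the match.
    public static string Describe(string line)
    {
        Match m = Pattern.Match(line);
        if (!m.Success)
        {
            return "<no match>";
        }
        return m.Groups["id"].Value + "|" + m.Groups["value"].Value;
    }

    public static void Main()
    {
        // Prints:
        //   hello -> |hello
        //   #world -> world|
        //   [xxx] -> |[xxx]
        foreach (string line in new[] { "hello", "#world", "[xxx]" })
        {
            Console.WriteLine(line + " -> " + Describe(line));
        }
    }
}
```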

^
(?<entry>[#])?
(\?\:)?
(\(\?(?:\w+(?:-\w+)?|-\w+)\))?
(?(entry)(?<id>\w+))
(?<value>.*)
$

That's the reason why I'm trying to use a conditional match.

UPDATE 10/12/2018

I experimented a little more and found the following regex, which should match every input, even an empty one, but doesn't:

(?(a)a).*

DEMO
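To check this on a given runtime, a small probe like the following could be used. It is only a sketch: the class and the Matches helper are hypothetical names, and no expected output is given for the variant without a "no" branch, because that result is exactly what is in question here and may differ between .NET versions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalProbe
{
    // Hypothetical helper: true when the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // Print the runtime version, since the behaviour may be version dependent.
        Console.WriteLine("Runtime: " + Environment.Version);

        // The first pattern has no "no" branch; the second has an explicit empty one.
        string[] patterns = { "(?(a)a).*", "(?(a)a|).*" };
        string[] inputs = { "xxx", "axx", "" };

        foreach (string pattern in patterns)
        {
            foreach (string input in inputs)
            {
                Console.WriteLine(pattern + " on \"" + input + "\": " + Matches(pattern, input));
            }
        }
    }
}
```

Comparing the two rows for the input xxx shows directly whether the missing "no" branch changes the outcome on the machine at hand.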

I believe this is a bug in .NET regex and have reported it to Microsoft: see here for more information

Sebastian Schumann
  • @WiktorStribiżew Yes I know that this will fix my problem as I already said. But Why? There are [Balancing Group Definitions](https://stackoverflow.com/a/17004406/2729609) that are working without that _hack_. – Sebastian Schumann Oct 11 '18 at 06:49
  • I can actually reproduce this in C#. For some reason, it matches `foo`, but not `[foo]`. – 41686d6564 Oct 11 '18 at 06:53
  • Here's a [live C# example](https://rextester.com/EOP15999) _(demonstrating the problem)_. – 41686d6564 Oct 11 '18 at 07:15
  • @PoulBak Yes, the id group forces `\w+`, which does not match `[`. But this group should only be evaluated if the `entry` group has a capture. It has no capture for `[foo]`, so the id group shouldn't be evaluated and the whole string should be captured by the `(?<value>.*)` group. But it doesn't work that way. – Sebastian Schumann Oct 11 '18 at 07:46
  • @PoulBak Not true. Because it also matches `foo]` as you can see in the example in my previous comment. Also, `foo` or `foo]` is actually in `m.Groups["value"]`. – 41686d6564 Oct 11 '18 at 07:49
  • @WiktorStribiżew If you understand _why it happens_ please explain it. I don't understand it. I don't need to know what goes on "under the hood". I only want to understand _why_ this happens. – Sebastian Schumann Oct 11 '18 at 07:57
  • Sorry, I think I am close to solving it, let me dig a little deeper. – Wiktor Stribiżew Oct 11 '18 at 08:02
  • I tried to delete `(?(entry)(?<id>\w+))`. Now it matches, so I think it ALWAYS evaluates, even when `entry` is empty. – Poul Bak Oct 11 '18 at 08:15
  • Well, I am still far from having a clear understanding of the issue, but adding `(?\z.)?` after `^` fixes the issue. I understand that it is somehow related to the use of capturing groups inside the `then` part of a conditional construct. See [a related question](https://stackoverflow.com/q/38991092/3832970). – Wiktor Stribiżew Oct 11 '18 at 08:41
  • @WiktorStribiżew @AhmedAbdelhameed Removing the `Multiline` option will also fix the problem: [Demo](https://ideone.com/N92V14) – Sebastian Schumann Oct 11 '18 at 08:48
  • @WiktorStribiżew Don't put any more effort to that problem. I'm of the opinion that this is a bug in .net regex and reported a [bug](https://developercommunity.visualstudio.com/content/problem/355026/inconsistent-behaviour-of-regex-for-single-line-in.html) – Sebastian Schumann Oct 11 '18 at 10:09
  • I also think it is a bug. I could not find any good clues when checking the regex source code. – Wiktor Stribiżew Oct 11 '18 at 10:10
  • Just to add to the mystery: the regex `(?(i)i)` will match `i` if the text starts with `i` (match length: 1). The regex `(?(a)i)` will match an empty string if the text starts with `i` (match length: 0). The regex `(?(i)a)` will not match if the text starts with `i`. – Poul Bak Oct 14 '18 at 02:18
  • Not a bug...see my answer for an explanation. – ΩmegaMan Oct 21 '18 at 14:24

1 Answer


There is no error in the regex parser; the problem lies in the usage of the . wildcard specifier. The . specifier will consume all characters except, wait for it, the linefeed character \n. (See "the any character" in Character Classes in Regular Expressions.)

If you want your regex to work, you need to consume all characters including the linefeed, and that can be done by specifying the Singleline option. To paraphrase the documentation:

Singleline tells the parser to let . match all characters, including \n.


Why does it still fail outside Singleline mode, even though the other lines are consumed? Because the final match leaves the current position at the \n, and the only remaining option (as specified) is the .*, which, as mentioned, cannot consume it, so the parser stops. The $ also locks in the operations at this point.
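The difference between the two modes can be seen in a minimal sketch like the one below. The class and helper names are hypothetical; the input uses \r\n line endings, as in the example text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Hypothetical helper: returns the first match of ".*" under the given options.
    public static string FirstMatch(string input, RegexOptions options)
    {
        return Regex.Match(input, ".*", options).Value;
    }

    // Makes carriage returns and linefeeds visible in console output.
    public static string Escape(string text)
    {
        return text.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        string input = "hello\r\n#world\r\n[xxx]";

        // Without Singleline, . stops at \n but still consumes \r.
        // Prints: hello\r
        Console.WriteLine(Escape(FirstMatch(input, RegexOptions.None)));

        // With Singleline, . also matches \n, so .* consumes the whole input.
        // Prints: hello\r\n#world\r\n[xxx]
        Console.WriteLine(Escape(FirstMatch(input, RegexOptions.Singleline)));
    }
}
```

Note that the first result ends with \r, which is easy to miss when the value is printed unescaped.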


Let me demonstrate what is happening with a tool I have created which illustrates the issue. In the tool, the upper left corner shows the example text as we see it. Below that is what the parser sees, with the \r\n characters represented by ↵¶ respectively. That pane also shows the current matches, enclosed in yellow boxes. The middle box is the actual pattern, and the final right side box shows the match results in detail by listing out the return structures, also showing the white space as mentioned.

[Screenshot: what is matched before Singleline]

Notice the second match (at index 1) has world in the capture group id and \r as value.

I surmise your token processor isn't getting what you want in the proper groups, and because the successful match of value as \r is not actually visible, it is overlooked.

Let us turn on Singleline and see what happens.


Now everything is consumed, but there is a different problem. :-)

ΩmegaMan
  • Thank you for that explanation. I understand the problem with `\r`. In my cases I only have one line. If you understand that a little bit better than me, please explain why `(?(a)a).*` does not match for input `xxx` (no newlines or linefeeds; only three characters). [DEMO](http://regexstorm.net/tester?p=%28%3f%28a%29a%29.*&i=xxx). In my opinion `(?(a)a).*` has to match every input, even an empty one. This regex looks for an `a` with `?(a)` and consumes it. If there is no `a`, all input has to be captured by `.*`. But the regex doesn't match any strings that do not contain an `a`. Please explain. – Sebastian Schumann Oct 22 '18 at 04:47
  • The confusion lies in the fact that a conditional match `(?( ) )` when it fails, it stops the regex processing cold (as does any match failure). Since you only provide one condition `a`, the parser looks at the first character and does not find an `a`; fine. It then looks to see if the user has specified an or condition to do *when* the first condition failed....no. You did not provide an or condition, so the processing stops. If one adds an `or | condition`, it works as intended. Hence `(?(a)a|.*)` works like a charm. :-) Any failure condition within a match stops the match at that point. – ΩmegaMan Oct 22 '18 at 11:44
  • See [Condition Matching With An Expression](https://docs.microsoft.com/en-us/dotnet/standard/base-types/alternation-constructs-in-regular-expressions#Conditional_Expr) – ΩmegaMan Oct 22 '18 at 11:56
  • No: *and no is the optional pattern to match if expression is not matched* copied from the very first sentence of the link you provided. The _no_-part is optional! Btw. [Balancing group definitions](https://stackoverflow.com/a/17004406/2729609) are using the exact same trick without any _no_-part. The end of the expression is `(?(Open)(?!))`. And trust me, you don't need the anchors. I'm using balancing group definitions very often, and many of my regexes continue after that definition and are working fine. – Sebastian Schumann Oct 22 '18 at 12:05
  • Apples and oranges with balanced matching, that is a different animal. Even without the *optional* pattern you still have the failure on a missing `a`. What you imply is that a failure is optional and pattern matching should continue. Change the pattern to `a?.*` then if that is what you want. – ΩmegaMan Oct 22 '18 at 12:09
  • No I can't simplify my pattern to `a?.*` because there are other pattern parts that I removed here to avoid confusion. I'm not looking for a solution because I have one already as I said in my question. I'm currently using `(?(a)a|)`. I was looking for an explanation of that behaviour. – Sebastian Schumann Oct 22 '18 at 12:20
  • Sorry, I don't understand why I'm comparing apples and oranges. The only part that I mentioned was the conditional match in balancing group definitions. I don't see why `(?(Open)(?!))` would be specific to balancing group definitions. That is a conditional pattern as explained in your link. If I'm wrong here, please correct me. – Sebastian Schumann Oct 22 '18 at 12:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182264/discussion-between-megaman-and-vera-rind). – ΩmegaMan Oct 22 '18 at 12:24