Following this question:

https://stackoverflow.com/a/24591578/1329812

I am trying to use balanced matching to remove everything within brackets. In the linked example the brackets are "{{" and "}}", whereas my brackets would be "<![CDATA[" and "]]>".

I am having trouble modifying the [^{}] section of the regular expression in the accepted answer to the previous question to use my version of brackets instead. I have tried to modify [^{}] to (?!(<!\[CDATA\|\]\]>)).

I have simplified the problem to use 12 as the open bracket and 34 as the close bracket. The following returns "STST" as expected.

using System.Text.RegularExpressions;

Regex.Replace(
    "12T1212E343434STST12RING34", // input
    @"12(?!(12|34))*(((?<Open>12)(?!(12|34))*)+((?<Close-Open>34)(?!(12|34))*)+)*(?(Open)(?!))34", // pattern
    "" // replacement
);

However, it does not work if I replace 12 with "<!\[CDATA\[" and 34 with "\]\]>".

Finally, I would like to operate on the following CDATA Sample String:

"<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>"

should return

"stst"
Danny Rancher

1 Answer

Your current 12...34 regex is not right because the tempered greedy token it uses is "corrupt": (?!(12|34))* is missing the consuming part, the dot (.), so the lookahead is checked but no character is ever consumed.

You just need to keep the parts of such a regex in mind: 1) the leading delimiter pattern, 2) the trailing delimiter pattern, 3) the part in between, which should match anything that is neither 1 nor 2, and 4) the conditional construct that checks whether the "technical" group capture stack is empty.

So, the numeric regex can be fixed as

12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34

(regex demo) and the CDATA one will look like

<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>

See this regex demo

NOTE: If there can be newline symbols in the string input, use RegexOptions.Singleline option or the inline modifier version, (?s), at the pattern start.
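As a rough illustration, the two patterns above could be applied with Regex.Replace as in the C# sketch below. The class name BalancedStrip and the helper Strip are made up for this example. RegexOptions.Singleline is passed on the assumption that the input may contain newlines.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for this example only.
public static class BalancedStrip
{
    // Numeric demo pattern from the answer: "12" opens, "34" closes.
    public const string NumericPattern =
        @"12(?>(?!12|34).|(?<o>)12|(?<-o>)34)*(?(o)(?!))34";

    // CDATA pattern from the answer.
    public const string CdataPattern =
        @"<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>";

    // Removes every balanced (possibly nested) delimited block from the input.
    // Singleline lets "." match newlines as well.
    public static string Strip(string input, string pattern) =>
        Regex.Replace(input, pattern, "", RegexOptions.Singleline);

    public static void Main()
    {
        // Prints STST
        Console.WriteLine(Strip("12T1212E343434STST12RING34", NumericPattern));
        // Prints stst
        Console.WriteLine(Strip("<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>", CdataPattern));
    }
}
```

Both calls use the sample inputs from the question.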

Pattern details:

  • 12 - the leading delimiter pattern
  • (?> - start of the atomic group that will match what is neither leading nor trailing patterns, and will keep track of those delimiting substrings:
    • (?!12|34).| - match any char (if RegexOptions.Singleline option is used, even including a newline) but a char that is a starting point of the 12 or 34 sequences
    • (?<o>)12| - match 12 and increment the "o" group capture stack, or
    • (?<-o>)34 - match 34 and decrement the "o" group capture stack
  • )* - and repeat that (keep matching) zero or more occurrences of the patterns inside the atomic group
  • (?(o)(?!)) - the conditional construct that checks whether the "o" group capture stack is empty. If it is not empty, the match fails at this point, which triggers backtracking until a balanced number of leading/trailing delimiters is found.
  • 34 - the trailing delimiter pattern.
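The breakdown above can also be written into the pattern itself. The following sketch assumes RegexOptions.IgnorePatternWhitespace, which makes .NET ignore unescaped whitespace in the pattern and treat # as the start of a comment. The class name AnnotatedPattern is made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical class, named for this example only.
public static class AnnotatedPattern
{
    public static readonly Regex Numeric = new Regex(@"
        12               # leading delimiter
        (?>              # atomic group, repeated by the star after it
            (?!12|34).   # any char that does not start a delimiter
          | (?<o>)12     # nested opener: push a capture onto group o
          | (?<-o>)34    # nested closer: pop a capture from group o
        )*
        (?(o)(?!))       # fail here if group o still holds captures
        34               # trailing delimiter
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    public static void Main()
    {
        // Nested input: the whole outer block is matched.
        Console.WriteLine(Numeric.Match("12T1212E343434STST").Value); // 12T1212E343434
        // Unclosed outer opener: only the balanced inner block is matched.
        Console.WriteLine(Numeric.Match("12a12b34").Value);           // 12b34
    }
}
```

The second call shows the effect of the conditional: the first 12 has no closing 34, so the regex engine backtracks and the match starts at the inner 12 instead.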

Also, each [ in <![CDATA[ must be escaped, as [ is a special char outside a character class. The ] chars in ]]> do not have to be escaped, since ] is not special outside a character class in a .NET regex.

Wiktor Stribiżew
  • incredible! thank you for the further insight with the answer, the text does contain newlines so I am currently using the pattern "(?s)<!\[CDATA\[(?>(?!<!\[CDATA\[|]]>).|(?<o>)<!\[CDATA\[|(?<-o>)]]>)*(?(o)(?!))]]>" – Danny Rancher Sep 21 '17 at 12:45
  • @DannyRancher: Well, if there are no newlines, you may remove `(?s)`, the DOTALL/Singleline modifier that makes all `.` after it match LF symbols, too. – Wiktor Stribiżew Sep 21 '17 at 12:48
  • ok thanks, separately, I'm trying to simplify some larger strings into series of CDATA "brackets" what is the regex for identifying the CDATA brackets and only retaining them? `(?>(?!)`? – Danny Rancher Sep 21 '17 at 13:19
  • @DannyRancher That is something different. What is the sample test case? `<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>` and the output is... – Wiktor Stribiżew Sep 21 '17 at 13:21
  • using the sample test case in your comment, i am trying to generate `<![CDATA[<![CDATA[]]>]]><![CDATA[]]>`. – Danny Rancher Sep 21 '17 at 13:24
  • @DannyRancher [Here is my attempt](https://ideone.com/ZPrIzA). It counts all the nested CDATAs in Group "a", and then the matched number of nested CDATAs are created inside the callback method passed to `Regex.Replace`. – Wiktor Stribiżew Sep 21 '17 at 13:49
  • thank you, although is this not overcomplicated? Some inputs don't seem to be working with the regex created in your answer. I am just trying to simplify them by removing everything except the "brackets" text to identify the issue. Your solutions are very interesting to learn from however. – Danny Rancher Sep 21 '17 at 14:08
  • @DannyRancher: I can only suggest writing a parser if you do not like that regex approach. – Wiktor Stribiżew Sep 21 '17 at 14:10
  • hmm i understand. i think instead i will modify the C# code to join the Match.Values in a StringBuilder before returning. thanks anyway – Danny Rancher Sep 21 '17 at 14:18
  • is there some reason the regex provided as an answer would be unable to identify brackets which differ by a large number of characters? – Danny Rancher Sep 21 '17 at 14:42
  • @DannyRancher No idea, I do not think so. Please share the string. http://pastebin.com can help. – Wiktor Stribiżew Sep 21 '17 at 14:43
  • ok so the problem is that there exists an invalid closing bracket within the data within the legitimate brackets. so imagine the input `())` and we are left with the second `)` instead of removing the whole input. is there any way to make the regular expression greedy to select the second closing bracket? Going back to the simple numeric example this is the same as input `12TE34ST34STRING` and returning `STRING` – Danny Rancher Sep 21 '17 at 14:55
  • @DannyRancher That is input error. I doubt you can add to the pattern and expect consistent behavior with the true and these "invalid" cases. – Wiktor Stribiżew Sep 21 '17 at 15:37
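For the parser route suggested in the comments, a hand-written scan that tracks nesting depth is one option. The C# sketch below is not the regex from the answer but a different technique, and the names CdataScanner and StripBlocks are made up for this example. It assumes that a closing delimiter found outside any open block is ordinary text and is kept, and that an opener which is never closed swallows the rest of the input (unlike the regex, which leaves such text in place).

```csharp
using System;
using System.Text;

// Hypothetical class, named for this example only.
public static class CdataScanner
{
    // Keeps only the text found outside open/close delimited blocks.
    public static string StripBlocks(string input, string open, string close)
    {
        var kept = new StringBuilder();
        int depth = 0; // number of currently open blocks
        int i = 0;
        while (i < input.Length)
        {
            if (string.CompareOrdinal(input, i, open, 0, open.Length) == 0)
            {
                depth++;              // entering a (possibly nested) block
                i += open.Length;
            }
            else if (depth > 0 &&
                     string.CompareOrdinal(input, i, close, 0, close.Length) == 0)
            {
                depth--;              // leaving the innermost block
                i += close.Length;
            }
            else
            {
                // Assumption: a stray closer at depth 0 falls through to here
                // and is kept as plain text.
                if (depth == 0) kept.Append(input[i]);
                i++;
            }
        }
        return kept.ToString();
    }

    public static void Main()
    {
        // Prints stst
        Console.WriteLine(StripBlocks(
            "<![CDATA[t<![CDATA[e]]>]]>stst<![CDATA[ring]]>", "<![CDATA[", "]]>"));
        // Prints ST34STRING: the stray closer is kept as text.
        Console.WriteLine(StripBlocks("12TE34ST34STRING", "12", "34"));
    }
}
```

Changing how stray closers are handled, for example dropping them instead of keeping them, is a one-line policy change in the final branch. That is the kind of input-error handling that is hard to express in the pattern itself.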