I'm stuck on a regular expression.

I have an input string made of numbers separated by single-letter operators. It can contain further numbers and letters nested between parentheses.

A few examples:

26U(35O40) will be read as 26 and (35 or 40)
22X(34U(42O27)) will be read as 22 xor (34 and (42 or 27))
21O(24U27) will be read as 21 or (24 and 27)
20X10X15 will be read as 20 xor 10 xor 15

I have read that this can be done using balancing groups. However, I have tried a lot of regular expressions, and the closest is the following:

(?<ConditionId>\d+)(?<Operator>X|U|O)?(?<Open>\()(?<ConditionId>\d+)+(?<Operator>X|U|O)?(?<ConditionId>\d+)(?<-Open>\))

I have also thought that maybe I'm making this harder than it is and should just run the same regex several times: the first time for everything outside the parentheses, then again for the inner part, and again whenever the inner part matches. Something like this:

(?<ConditionId>\d+)?(?<Operator>U|O|X)?(?<Inner>(?:\().*(?:\)))

Suggestions or help?

Thanks in advance.

Edit 1: I don't have to validate the input, just parse it.

Edit 2: The reason behind this is that I need to identify each condition by its condition Id and then apply the operator against the other conditions in the input string, in the same order as they appear there. A more general example, to make it easier to understand, would be logic gates:

For a given input of 20X10X15, I have to identify the conditions by their conditionId, check whether each condition is valid, and apply the XOR operator to them, something like:

true X true X false = false;
false X false X true = true;
true X (false U true) = true

That is the reason I cannot group everything into a "ConditionId" group and "Operator" group.

Edit 3: This is also a valid example:

(23X10)U(30O(20X19)
Nekeniehl
  • Why have you settled on regex for solving this particular issue? You should be able to write a parser using regular string operations and a stack. – willaien May 22 '17 at 15:10
  • Hi, thanks for the suggestion. I went with regex because the conditions are actually a little more complicated than the ones I used in the example, and I thought regex would be the correct way, because sometimes the hardest is the easiest, and I didn't want to end up fighting with String operations (which can also be harder sometimes). Anyway, you are right, and I will definitely take a look at it if I can't solve the issue with regex, as it was my second option. – Nekeniehl May 22 '17 at 15:18
  • Do you need to actually *replace*? Not just *extract* parts of the strings? `"22X(34U(42O27))"` => `"22 xor (34 and (42 or 27))"` – Wiktor Stribiżew May 22 '17 at 15:25
  • Hi, thanks for your reply. I just need to extract them, so I can later identify a condition by its Id and the operation I have to apply against the other conditions. – Nekeniehl May 22 '17 at 15:29
  • Not sure what you need. See [this](http://regexstorm.net/tester?p=%5cd%2b%5bA-Z%5d%3f%28%3f%3a%5c%28%28%3f%3e%5b%5e%28%29%5d%2b%7c%28%3f%3co%3e%5c%28%29%7c%28%3f%3co%3e%5c%29%29*%29%28%3f%28o%29%28%3f!%29%29%5c%29%29%3f&i=26U%2835O40%29%0d%0a22X%2834U%2842O27%29%29%0d%0a21O%2824U27%29%0d%0a20X10X15) – Wiktor Stribiżew May 22 '17 at 15:41
  • You see, you [can write a pattern](http://regexstorm.net/tester?p=%5e%0d%0a%28%3f%3a%0d%0a%28%3f%3cnb%3e%5b0-9%5d%2b%29%0d%0a%7c%28%3f%3cop%3e%28%3f%3c%3d%5b0-9%29%5d%29%5bUOX%5d%28%3f%3d%5b0-9%28%5d%29%29%0d%0a%7c%28%3f%3cp%3e%28%3f%3c%3d%5e%7c%5bUOX%5d%29%5c%28%29%0d%0a%7c%28%3f%3c-p%3e%5c%29%28%3f%3d%24%7c%5bUOX%29%5d%29%29%0d%0a%29%2b%0d%0a%28%3f%28p%29%28%3f!%29%29%0d%0a%24&i=22X%2834U%2842O27%29%29&o=xm) that will validate that the input is correct. But interpreting the results will be *very hard*, if possible at all. The language is *very simple*, just write a parser by hand. – Lucas Trzesniewski May 22 '17 at 15:45
  • Hell, you could write a relatively easy translator to convert this and use `Microsoft.CodeAnalysis.CSharp.Scripting` like so: `var expression = "26U(35O40)".Replace("U","&").Replace("O", "|").Replace("X","^"); var value = (int)CSharpScript.EvaluateAsync(expression).Result;` Notably, you will definitely need to ensure that the value is trusted before following such a path, as it would open you up to remote code execution. – willaien May 22 '17 at 15:50
  • Is `22X(34U(42O27))` the most complex example you have? – ΩmegaMan May 22 '17 at 17:34
  • Also what tokens do you want? `22X` or `22` and an `X`? Providing just `22X` as one of the token(s) is easier. – ΩmegaMan May 22 '17 at 17:44
  • Hi, thanks for the reply. The tokens I need are 22 and X, and yes, it is one of the most complex. It could also start with parentheses: `(23X10)U(30O(20X19)` – Nekeniehl May 23 '17 at 07:23
  • @willaien that is an awesome idea, because it would solve the problem with the regular expressions and make it much easier to maintain in the future. I will test it right away to check whether it takes the parentheses into account. – Nekeniehl May 23 '17 at 07:27

2 Answers

If you use (\d+[A-Z]*[()]?)+ it will return one match on 22X(34U(42O27)), with these captures in Groups[1].Captures:

22X( 34U( 42O and 27)

That gives enough information to process the code.

On 20X10X15 the same capture group gives

20X 10X and 15
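
As a rough illustration of how these captures could be processed, here is a minimal sketch. It is written in Python rather than .NET, so `re.findall` stands in for `Groups[1].Captures`, and the helper names `split_captures` and `split_fields` are made up for this example. Note that with this token shape an operator that directly follows a closing parenthesis, as in `(23X10)U(30O(20X19)`, is not part of any capture.

```python
import re

# Same token shape as the pattern above: digits, optional operator letters,
# then an optional parenthesis.
TOKEN = re.compile(r"\d+[A-Z]*[()]?")

def split_captures(text):
    """Return the substrings the repeated group would capture (hypothetical helper)."""
    return TOKEN.findall(text)

def split_fields(capture):
    """Break one capture such as '34U(' into (number, operator, parenthesis)."""
    number, operator, paren = re.fullmatch(r"(\d+)([A-Z]*)([()]?)", capture).groups()
    return int(number), operator, paren

print(split_captures("22X(34U(42O27))"))  # ['22X(', '34U(', '42O', '27)']
print(split_captures("20X10X15"))         # ['20X', '10X', '15']
print([split_fields(c) for c in split_captures("20X10X15")])
# [(20, 'X', ''), (10, 'X', ''), (15, '', '')]
```

Each tuple keeps the condition Id next to the operator and parenthesis that follow it, so the original order is preserved.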

ΩmegaMan

Assuming your input is already valid, and you want to parse it, here is a rather simple regex to achieve that:

(?:
    (?<ConditionId>\d+)
    |
    (?<Operator>[XUO])
    |
    (?<Open>\()
    |
    (?<Group-Open>\))
)+

Working example - Regex Storm - switch to the table tab to see all captures.

The pattern captures:

  • Numbers into the $ConditionId group.
  • Operators into the $Operator group.
  • Sub expressions in parentheses into the $Group group (needs a better name?). For example, for the string 22X(34U(42O27)), it will have two captures: 42O27 and 34U(42O27).

Each capture contains the position of the matched string. The relations between a $Group and the $Operators, $ConditionIds and sub-$Groups it contains are expressed only through these positions.

The (?<Group-Open>) syntax is used when we reach a closing parenthesis, to capture everything since the corresponding opening parenthesis. This is explained in more detail here: What are regular expression Balancing Groups?
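
To make the push/pop behaviour concrete, here is a small sketch that emulates it with an explicit stack. It is Python, not .NET (Python's `re` has no balancing groups), and `group_captures` and `innermost_group` are hypothetical names used only for illustration. The second helper shows one way the positions might be used to find which group a number or operator belongs to.

```python
def group_captures(text):
    """Emulate the Open / Group-Open pair from the pattern above with a stack.

    Returns (start, substring) pairs in the order the closing parentheses are
    met, which is also the order the Group captures are listed in above.
    """
    open_positions = []   # plays the role of the Open capture stack
    groups = []
    for index, char in enumerate(text):
        if char == "(":
            open_positions.append(index)        # push on an opening parenthesis
        elif char == ")" and open_positions:
            start = open_positions.pop() + 1    # pop on the matching close
            groups.append((start, text[start:index]))
    return groups

def innermost_group(position, groups):
    """Return the smallest group whose span contains position, or None."""
    containing = [(start, body) for start, body in groups
                  if start <= position < start + len(body)]
    return min(containing, key=lambda group: len(group[1]), default=None)

groups = group_captures("22X(34U(42O27))")
print(groups)                       # [(8, '42O27'), (4, '34U(42O27)')]
print(innermost_group(10, groups))  # (8, '42O27'): the O at index 10
print(innermost_group(6, groups))   # (4, '34U(42O27)'): the U at index 6
print(innermost_group(0, groups))   # None: 22 is at the top level
```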

Kobi
  • Hi, thanks for the detailed answer, but this is what I tried at first, successfully; the problem is that everything gets mixed up later on. The reason I cannot group the conditions or the operators is that later on I have to identify the conditions by their Id and apply the operator against the other conditions in the same order as in the input. – Nekeniehl May 23 '17 at 07:10
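
For comparison, the hand-written parser suggested in the comments on the question keeps conditions and operators in input order by building a nested tree. Below is a minimal sketch under several assumptions: it is written in Python (the same structure should port to C#), operators are applied strictly left to right with no precedence, the input is not validated (as in Edit 1), and `tokenize`, `parse`, `evaluate` and the `conditions` dictionary are hypothetical names.

```python
import re

OPERATORS = {
    "U": lambda a, b: a and b,   # "and", per the question's examples
    "O": lambda a, b: a or b,    # "or"
    "X": lambda a, b: a != b,    # "xor"
}

def tokenize(text):
    """Split the input into numbers, operator letters and parentheses."""
    return re.findall(r"\d+|[UOX()]", text)

def parse(tokens):
    """Build a nested tree: an int leaf, or a (left, operator, right) tuple."""
    position = 0

    def operand():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            node = expression()
            position += 1        # skip the matching ")" without checking it
            return node
        return int(token)

    def expression():
        nonlocal position
        node = operand()
        # Assumption: operators are applied left to right, with no precedence.
        while position < len(tokens) and tokens[position] in OPERATORS:
            operator = tokens[position]
            position += 1
            node = (node, operator, operand())
        return node

    return expression()

def evaluate(node, conditions):
    """Evaluate the tree against a mapping of condition Id to bool."""
    if isinstance(node, int):
        return conditions[node]
    left, operator, right = node
    return OPERATORS[operator](evaluate(left, conditions),
                               evaluate(right, conditions))

tree = parse(tokenize("22X(34U(42O27))"))
print(tree)  # (22, 'X', (34, 'U', (42, 'O', 27)))
print(evaluate(parse(tokenize("20X10X15")), {20: True, 10: True, 15: False}))  # False
```

Because nothing is validated, the sketch also accepts `(23X10)U(30O(20X19)` as written, with its final closing parenthesis missing.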