I need to parse a bunch of legacy file-based data that looks like this:
(or
(if (eq ?SSD-enart_Cl:sName rueck1)
then
(or (eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb52)
)
)
(if (eq ?SSD-enart_Cl:sName rueck3)
then
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
)
(if
(eq ?SSD-enart_Cl:sName r-SSD-ck4)
then
(<> ?SSD_Cl:qty -1)
)
)
I need a wildcarded regex that will match and return sets of grouped parens that begin with <whitespace>(xxx<whitespace>....)<whitespace>, where xxx is the wildcarded string and <whitespace> is not a literal string, but any whitespace, most commonly a tab, space, or linefeed. And I need nested paren-groups within the match(es) to be ignored in terms of the match, but included as part of its outer match. A few scenarios/examples will make this very clear, and all examples are relative to the data shown above.
xxx = or, so the regex would look for <whitespace>(or<whitespace>....)<whitespace>
This should return a single match: the entire contents of the data inside (or ... ), specifically:
(if (eq ?SSD-enart_Cl:sName rueck1)
then
(or (eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb52)
)
)
(if (eq ?SSD-enart_Cl:sName rueck3)
then
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
)
(if
(eq ?SSD-enart_Cl:sName r-SSD-ck4)
then
(<> ?SSD_Cl:qty -1)
)
xxx = if, so the regex would look for <whitespace>(if<whitespace>....)<whitespace>
This should return exactly 3 matches:
Match #1:
(if (eq ?SSD-enart_Cl:sName rueck1)
then
(or (eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb52)
)
)
Match #2:
(if (eq ?SSD-enart_Cl:sName rueck3)
then
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
)
Match #3:
(if
(eq ?SSD-enart_Cl:sName r-SSD-ck4)
then
(<> ?SSD_Cl:qty -1)
)
Note: I don't strictly need the (if and ending ) included in the strings that come back in the match; just the contents therein. But it's fine either way--whatever is easier.
xxx = or, so the regex would look for <whitespace>(or<whitespace>....)<whitespace>
For this example, we need only look at one of the ors, because I will always be evaluating the string of a given if, not the whole string. So we can just look at the ors within the 2nd if, for example:
(if (eq ?SSD-enart_Cl:sName rueck3)
then
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
)
This should return exactly 1 match:
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
xxx = eq, so the regex would look for <whitespace>(eq<whitespace>....)<whitespace>
Again, I will always be drilling down (via C#, not regex) into the nesting, and will have, for example, the following or block within the 2nd if, not the whole string. So we can just look at the eqs within the or of the 2nd if, for example:
(or (eq ?SSD_Cl:sName sb38)
(eq ?SSD_Cl:sName sb405)
(eq ?SSD_Cl:sName sb43)
(eq ?SSD_Cl:sName sb455)
(eq ?SSD_Cl:sName sb48)
)
And I am expecting exactly 5 matches, one for each of those inner (eq...) groups.
Now that I have given examples, here are the principles that can be taken as absolutes:
- In all cases there will be paren-groupings and I need the regex to not try to match nested paren-groupings inside the match, but just the outer match. But the inner groupings should be returned as part of the outer match. Any inside nested parens should be seen by the regex as an ordinary string, and simply returned with the match, and not try to see those as matches. In summary, when regex finds, for example (if ... it needs to go find the closing ) for that (if, and ignore any parens inside.
- I need to be able to supply my own wildcard, programmatically, into the regex, which could be if, and, or, etc. The wildcard text will never be special chars, just regular lowercase letters, and in all cases will be preceded by an opening paren, which itself will be preceded by whitespace, and then will always have a closing paren. In between these matched outer parens, there will often be more parens which should be ignored for match purposes, but returned as an ordinary string as the contents of the match.
- Inner and outer matched-sets of parens will always be correct. There will never be more opening parens than closing, and vice versa, which would of course confuse the regex.
- A single regex expression should (I believe) be able to accommodate everything, and I will supply the wildcard text at runtime.
I have a regex in my app currently that does almost what I am requesting here, except that it matches sets of double-quotes instead of opening/closing parens:
public static MatchCollection GetQuotedStrings(string str) {
    Regex regex = new Regex("(\"([^\"]|\"\")*\")");
    return regex.Matches(str);
}
The function above brilliantly finds "endcap sets" of double quotes, even when there are frequently more double-quotes inside matched strings. What I need is similar: opening/closing grouped sets of parens, but always with a wildcard string adjacent to the opening paren. Unfortunately I am a regex beginner and can't figure out how to modify the above regex in a way that works.
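One direction that might fit here is .NET's balancing-group construct, which lets a pattern push on every opening paren and pop on every closing one. What follows is only a minimal sketch under the assumptions listed above (parens always balanced, keyword made of plain letters). The helper name GetGroups and the group names body and depth are made up for illustration and are not part of my app.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ParenGroups
{
    // Hypothetical helper: returns every outermost "(keyword ...)" group in str.
    // Nested paren groups are consumed as part of the outer match, not matched separately.
    public static MatchCollection GetGroups(string str, string keyword)
    {
        string pattern =
            // "(keyword" preceded by whitespace (or start of input) and followed by whitespace
            @"(?<!\S)\(" + Regex.Escape(keyword) + @"(?=\s)" +
            // body: plain text, or "(" pushing onto depth, or ")" popping from depth
            @"(?<body>(?>[^()]+|\((?<depth>)|\)(?<-depth>))*)" +
            // fail if any inner "(" is still open, then consume the closing ")"
            @"(?(depth)(?!))\)";
        return new Regex(pattern).Matches(str);
    }
}
```

With this sketch, each Match.Value is the whole (keyword ...) group, and match.Groups["body"].Value is the contents without the leading (keyword and the final closing paren (it starts with the whitespace that follows the keyword, so a Trim() may be wanted). Unlike the description above, the pattern does not insist on whitespace after the closing paren, which is an intentional relaxation so that a group at the very end of the string still matches.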
Edit
I fear my super-detailed write-up above is scaring people off. This is much simpler than it appears, so let me simplify it. I need a regex that matches all instances of (if...), where (if will always be preceded by whitespace, there will always be a closing ), and the ... represents lots of misc content.
The only tricky part is there will frequently be other (..) groupings inside the outer (if...) grouping, and those inner groupings need to be treated like a normal string and not matched by the regex. That's it.
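For comparison, the same simplified requirement can probably be met without a regex at all, by scanning the string and counting paren depth. This is a different technique from the regex I am asking for, shown only as a rough sketch. The class and method names (ParenScanner, FindGroups) are hypothetical, and it assumes the parens are always balanced, as stated above.

```csharp
using System;
using System.Collections.Generic;

public static class ParenScanner
{
    // Hypothetical helper: finds each outermost "(keyword ...)" group by counting depth.
    public static List<string> FindGroups(string text, string keyword)
    {
        var results = new List<string>();
        string opener = "(" + keyword;
        int i = 0;
        while (i < text.Length)
        {
            // A group starts here if "(keyword" sits at i, is preceded by
            // whitespace (or start of input), and is followed by whitespace.
            bool startsHere =
                string.CompareOrdinal(text, i, opener, 0, opener.Length) == 0
                && (i == 0 || char.IsWhiteSpace(text[i - 1]))
                && i + opener.Length < text.Length
                && char.IsWhiteSpace(text[i + opener.Length]);
            if (!startsHere) { i++; continue; }

            // Walk forward until the paren opened at i is closed again.
            int depth = 0, j = i;
            for (; j < text.Length; j++)
            {
                if (text[j] == '(') depth++;
                else if (text[j] == ')' && --depth == 0) break;
            }
            if (j == text.Length) break;            // unbalanced input: give up
            results.Add(text.Substring(i, j - i + 1));
            i = j + 1;                              // resume after the closing paren
        }
        return results;
    }
}
```

Because the scan resumes after each closing paren, inner groupings are never reported on their own; they simply come back as part of the outer string.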