I need to parse a bunch of legacy file-based data that looks like this:

(or


        (if         (eq ?SSD-enart_Cl:sName rueck1)

                then

                (or (eq ?SSD_Cl:sName sb405)
                    (eq ?SSD_Cl:sName sb455)
                    (eq ?SSD_Cl:sName sb52)
                )
        )



        (if         (eq ?SSD-enart_Cl:sName rueck3)

                then

                (or (eq ?SSD_Cl:sName sb38)
                    (eq ?SSD_Cl:sName sb405)
                    (eq ?SSD_Cl:sName sb43)
                    (eq ?SSD_Cl:sName sb455)
                    (eq ?SSD_Cl:sName sb48)
                )
        )



        (if     
                    (eq ?SSD-enart_Cl:sName r-SSD-ck4)
            

                then

                    (<> ?SSD_Cl:qty -1)
        )


)

I need a regex, parameterized by a keyword, that matches and returns balanced paren groups of the form <whitespace>(xxx<whitespace>....)<whitespace>, where xxx is the keyword supplied at runtime (what I call the wildcard below) and <whitespace> is not a literal string but any whitespace, most commonly a tab, space, or linefeed. Paren groups nested inside a match should not be matched separately, but should be included as part of the outer match. A few examples will make this clear; all of them refer to the data shown above.

  1. xxx = or, so the regex would look for <whitespace>(or<whitespace>....)<whitespace>

This should return a single match: the entire contents of the data inside (or ... ), specifically:

        (if         (eq ?SSD-enart_Cl:sName rueck1)

                then

                (or (eq ?SSD_Cl:sName sb405)
                    (eq ?SSD_Cl:sName sb455)
                    (eq ?SSD_Cl:sName sb52)
                )
        )



        (if         (eq ?SSD-enart_Cl:sName rueck3)

                then

                (or (eq ?SSD_Cl:sName sb38)
                    (eq ?SSD_Cl:sName sb405)
                    (eq ?SSD_Cl:sName sb43)
                    (eq ?SSD_Cl:sName sb455)
                    (eq ?SSD_Cl:sName sb48)
                )
        )



        (if     
                    (eq ?SSD-enart_Cl:sName r-SSD-ck4)
            

                then

                    (<> ?SSD_Cl:qty -1)
        )

  2. xxx = if, so the regex would look for <whitespace>(if<whitespace>....)<whitespace>

This should return exactly 3 matches:

Match #1:

(if         (eq ?SSD-enart_Cl:sName rueck1)
        then

        (or (eq ?SSD_Cl:sName sb405)
            (eq ?SSD_Cl:sName sb455)
            (eq ?SSD_Cl:sName sb52)
        )
)

Match #2:

(if         (eq ?SSD-enart_Cl:sName rueck3)

            then

            (or (eq ?SSD_Cl:sName sb38)
                (eq ?SSD_Cl:sName sb405)
                (eq ?SSD_Cl:sName sb43)
                (eq ?SSD_Cl:sName sb455)
                (eq ?SSD_Cl:sName sb48)
            )
)

Match #3:

(if     
            (eq ?SSD-enart_Cl:sName r-SSD-ck4)
            

            then

            (<> ?SSD_Cl:qty -1)
)

Note: I don't strictly need the (if and ending ) included in the strings that come back in the match; just the contents therein. But it's fine either way--whatever is easier.

  3. xxx = or, so the regex would look for <whitespace>(or<whitespace>....)<whitespace>

For this example, we need only look at one of the ors, because I will always be evaluating the string of a given if, not the whole string. So we can just look at the ors within the 2nd if, for example:

(if         (eq ?SSD-enart_Cl:sName rueck3)

            then

            (or (eq ?SSD_Cl:sName sb38)
                (eq ?SSD_Cl:sName sb405)
                (eq ?SSD_Cl:sName sb43)
                (eq ?SSD_Cl:sName sb455)
                (eq ?SSD_Cl:sName sb48)
            )
)

This should return exactly 1 match:

(or (eq ?SSD_Cl:sName sb38)
        (eq ?SSD_Cl:sName sb405)
        (eq ?SSD_Cl:sName sb43)
        (eq ?SSD_Cl:sName sb455)
        (eq ?SSD_Cl:sName sb48)
)

  4. xxx = eq, so the regex would look for <whitespace>(eq<whitespace>....)<whitespace>

Again, I will always be drilling down (via C#, not regex) into the nesting, so at this point I will have, for example, only the or block within the 2nd if, not the whole string. So we can just look at the eqs within the or of the 2nd if, for example:

(or (eq ?SSD_Cl:sName sb38)
    (eq ?SSD_Cl:sName sb405)
    (eq ?SSD_Cl:sName sb43)
    (eq ?SSD_Cl:sName sb455)
    (eq ?SSD_Cl:sName sb48)
)

And I am expecting exactly 5 matches, each of those inner (eq...).

Now that I have given examples, here are the principles that can be taken as absolutes:

  • In all cases there will be paren groupings, and the regex should match only the outer grouping, not the groupings nested inside it. The nested groupings should be treated as ordinary text and simply returned as part of the outer match. In summary, when the regex finds, for example, (if..., it needs to find the closing ) for that (if and ignore any parens in between.
  • I need to be able to supply my own wildcard, programmatically, into the regex, which could be if, and, or, etc. The wildcard text will never be special chars, just regular lowercase letters, and in all cases will be preceded by an opening paren, which itself will be preceded by whitespace, and then will always have a closing paren. In between these matched outer parens, there will often be more parens which should be ignored for match purposes, but returned as an ordinary string as the contents of the match.
  • Inner and outer matched-sets of parens will always be correct. There will never be more opening parens than closing, and vice versa, which would of course confuse the regex.
  • A single regex should (I believe) be able to accommodate everything, and I will supply the wildcard text at runtime.

I have a regex in my app currently that does almost what I am requesting here, except that it matches sets of double-quotes instead of opening/closing parens:

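// Requires: using System.Text.RegularExpressions;
// Matches double-quoted strings; a doubled quote ("") inside counts as part of the string.
public static MatchCollection GetQuotedStrings(string str) {
    Regex regex = new Regex("(\"([^\"]|\"\")*\")");
    return regex.Matches(str);
}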

The function above brilliantly finds "endcap sets" of double quotes, even when there are frequently more double-quotes inside matched strings. What I need is similar: opening/closing grouped sets of parens, but always with a wildcard string adjacent to the opening paren. Unfortunately I am a regex beginner and can't figure out how to modify the above regex in a way that works.
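For reference, here is a minimal sketch of how that quoted-string function behaves, assuming the inner quotes are doubled (`""`), which is what the pattern's second alternative handles. The class name `QuotedStringDemo` and the sample input are made up for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedStringDemo
{
    // Same pattern as GetQuotedStrings: an opening quote, then any run of
    // non-quote characters or doubled quotes (""), then a closing quote.
    public static MatchCollection GetQuotedStrings(string str)
    {
        return new Regex("(\"([^\"]|\"\")*\")").Matches(str);
    }

    public static void Main()
    {
        // Hypothetical input: the second quoted string contains doubled quotes.
        string input = "a \"one\" b \"say \"\"hi\"\" now\" c";
        foreach (Match m in GetQuotedStrings(input))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

This pattern works because a quote has a single delimiter character and cannot nest. Parens have distinct opening and closing characters and do nest, so the same approach cannot simply be adapted by swapping the characters.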

Edit

I fear my super-detailed write-up above is scaring people off. This is much simpler than it appears, so let me simplify it. I need a regex that matches all instances of (if...), where (if will always be preceded by whitespace, there will always be a closing ), and the ... represents lots of misc content.

The only tricky part is there will frequently be other (..) groupings inside the outer (if...) grouping, and those inner groupings need to be treated like a normal string and not matched by the regex. That's it.
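To make the requirement concrete, the desired behavior can be sketched without regex as a simple depth counter. This is only an illustration of the matching rule, not a proposed solution; the names `ParenScanner` and `FindGroups` are hypothetical, and the check for whitespace after the keyword is an assumption based on the examples above.

```csharp
using System;
using System.Collections.Generic;

public static class ParenScanner
{
    // Returns each balanced "(keyword ...)" group found while scanning left to
    // right. Groups nested inside a returned group are not reported separately.
    public static List<string> FindGroups(string text, string keyword)
    {
        var results = new List<string>();
        string opener = "(" + keyword;
        int i = 0;
        while (i < text.Length)
        {
            int start = text.IndexOf(opener, i, StringComparison.Ordinal);
            if (start < 0) break;

            // Assumption: the keyword is followed by whitespace, so that
            // "or" does not match something like "(order".
            int afterKeyword = start + opener.Length;
            if (afterKeyword >= text.Length || !char.IsWhiteSpace(text[afterKeyword]))
            {
                i = start + 1;
                continue;
            }

            // Walk forward, tracking depth, until the opening paren is closed.
            int depth = 0, end = -1;
            for (int j = start; j < text.Length; j++)
            {
                if (text[j] == '(') depth++;
                else if (text[j] == ')' && --depth == 0) { end = j; break; }
            }
            if (end < 0) break; // unbalanced input, which should never occur here

            results.Add(text.Substring(start, end - start + 1));
            i = end + 1; // resume after the group so nested groups are skipped
        }
        return results;
    }

    public static void Main()
    {
        string data = "(or (if (eq a b) then (or (eq c d))) (if (eq e f) then (<> q -1)))";
        foreach (string group in FindGroups(data, "if"))
        {
            Console.WriteLine(group);
        }
    }
}
```

With the shortened sample in `Main`, searching for `if` yields the two `(if ...)` groups, each with its inner parens intact, while searching for `or` on the same string yields the whole string as a single group.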

HerrimanCoder
  • This looks like an XY problem to me. What do you want to do with the matches? Filter a data set on those values? – madreflection Aug 13 '20 at 22:43
  • madre: with the matches I will then re-run the same regex but with a different wildcard, drilling down to the chunks I need, processing, DB inserts. – HerrimanCoder Aug 13 '20 at 22:46
  • Is what you're inserting being taken from a source set that you're filtering on these conditions? – madreflection Aug 13 '20 at 22:47
  • I think you'd be better off parsing it into an AST. You can go directly to `Expression` objects or use an intermediate representation to help translate from things like `?SSD_Cl:sName` to property accesses. Once you have the `Expression`, you can compile that to a delegate that operates on objects in memory. Subexpressions would be included in parsing to the AST so you wouldn't have to drill down at all. – madreflection Aug 13 '20 at 22:52
  • Nah...much rather use regex for this. – HerrimanCoder Aug 13 '20 at 23:02
  • While the .NET RegEx library can handle [nested capture groups](https://stackoverflow.com/questions/1313934/), you will probably be better off with a simple stack-based parser. I [wrote one for XML](https://stackoverflow.com/a/26248614/22437) but you could easily change the search characters from ` – Dour High Arch Aug 14 '20 at 18:07

1 Answer

This turned out to be the answer:

$@"\({wildcard}(?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)"

I pass the wildcard into the function at runtime and it's working great.
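A minimal sketch of how that pattern might be wrapped in a function follows. The names `BalancedParenRegex` and `GetGroups` are hypothetical, and two details are additions that are not part of the pattern above: `Regex.Escape` around the wildcard, and a `\s` after it so that `or` does not also match a longer word starting with `or`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedParenRegex
{
    // Finds balanced "(wildcard ...)" groups using .NET balancing groups:
    //   \((?<c>)    an inner "(" pushes onto the capture stack named c
    //   [^()]+      any run of non-paren text
    //   \)(?<-c>)   an inner ")" pops the stack (fails when the stack is empty)
    //   (?(c)(?!))  fail if any inner "(" is still unclosed
    public static MatchCollection GetGroups(string input, string wildcard)
    {
        // Regex.Escape and the \s after the wildcard are assumptions added here.
        string pattern =
            $@"\({Regex.Escape(wildcard)}\s(?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)";
        return Regex.Matches(input, pattern);
    }

    public static void Main()
    {
        string data = "(or (if (eq a b) then (or (eq c d))) (if (eq e f) then (<> q -1)))";

        Console.WriteLine(GetGroups(data, "or").Count); // 1: the whole outer group
        Console.WriteLine(GetGroups(data, "if").Count); // 2: one per (if ...) group

        foreach (Match m in GetGroups(data, "if"))
        {
            Console.WriteLine(m.Value);
        }
    }
}
```

Because regex matches do not overlap, a nested `(or ...)` inside an already matched outer `(or ...)` is not reported on its own, which fits the drill-down approach of re-running the function on each match with a different wildcard.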

I gleaned the solution from here:

Regular expression to match balanced parentheses

HerrimanCoder