2

I'm parsing a simple language (Excel formulas) for the functions contained within. A function name must start with any letter, followed by any number of letters/numbers, and ending with an open paren (no spaces in between). For example MyFunc(. The function can contain any arguments, including other functions and must end with a close paren ). Of course, math within parens is allowed =MyFunc((1+1)) and (1+1) shouldn't be detected as a function because it fails the function rule I've just described. My goal is to recognize the highest level function calls in a formula, identify the function name, extract the arguments. With the arguments, I can recursively look for other function calls.

Using this tutorial I hacked up the following regexes. None seem to do the trick. They both fail on test case pasted below.

This should work but completely fails:

(?<name>[a-z][a-z0-9]*\()(?<body>(?>[a-z][a-z0-9]*\((?<DEPTH>)|\)(?<-DEPTH>)|.?)*(?(DEPTH)(?!)))\)

This works for many test cases, but fails for test case below. I don't think it handles nested functions correctly- it just looks for open paren/close paren in the nesting:

(?<name>[a-z][a-z0-9]*\()(?<body>(?>\((?<DEPTH>)|\)(?<-DEPTH>)|.?)*(?(DEPTH)(?!)))\)

Here's the test that breaks them all:

=Date(Year(A$5),Month(A$5),1)-(Weekday(Date(Year(A$5),Month(A$5),1))-1)+{0;1;2;3;4;5}*7+{1,2,3,4,5,6,7}-1

This should be matched as:

Date(ARGUMENTS1)
Weekday(ARGUMENTS2)
Where ARGUMENTS2 = Date(Year(A$5),Month(A$5),1)

Instead it matches:

ARGUMENTS2 = Date(Year(A$5),Month(A$5),1)-1)

I am using .net RegEx which provides for external memory.

SFun28
  • 32,209
  • 43
  • 123
  • 233
  • 3
    This cannot be done with a regex as regex only work with [regular languages](http://en.wikipedia.org/wiki/Regular_language). You need to use a parser. Also, is this homework? – JoshD Oct 27 '10 at 00:16
  • I think this approach makes little sense. I would rather write a parser for this. – steinar Oct 27 '10 at 00:16
  • @JoshD - why isn't this a regular language? Which principle does it violate? I cannot tell. No, this isn't homework (can I get my -1 reversed? =). I'm a professional software engineer and the solution is needed to workaround a bug in Excel. I've spent many hours trying to get this to work before posting. – SFun28 Oct 27 '10 at 01:09
  • @steinar - I could always to that approach, but my regex is pretty darn close to working. I just need sometime to look at it and tell me where its failing. – SFun28 Oct 27 '10 at 01:10
  • I wouldn't be too sure that it's close to working. As JoshD mentions, this is just not the the tool for this set of problems. You would be much better off writing a rather simple parser, even by hand (as opposed to using YACC or something similar), the parentheses make that pretty simple. – steinar Oct 27 '10 at 01:13
  • @SFun28: I didn't do the -1. This language in a very simple way is a^nb^n where a is ( and b is ). Using the [pumping lemma](http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) it is simple to prove that this doesn't conform to a regular language. (Although, the RE you're using isn't regular either.... it can mimic a PDA... but it's still far from a parser.) – JoshD Oct 27 '10 at 01:21
  • @JoshD, @steinar - I see the confusion now...I think someone edited my post and deleted the part about my using .Net regex, which has memory, so I'm not sure if you two saw that. I don't think the average developer would find writing a simple parser easier than writing a RegEx. Speaking for myself, I'd have to dust off the cobwebs from my CS days at college. =) thanks for your input! – SFun28 Oct 27 '10 at 04:03
  • Yeah, I edited that out and replaced it with the `.net` tag. Some other regex flavors have recursive-matching support, but they all have wildly different syntax *and* semantics. Everything said in this thread applies to .NET alone. – Alan Moore Oct 27 '10 at 06:25

1 Answers1

4

This is well within the capabilities of .NET regexes. Here's a working demo:

using System;
using System.Text.RegularExpressions;

namespace Test
{
  class Test
  {
    public static void Main()
    {
      Regex r = new Regex(@"
        (?<name>[a-z][a-z0-9]*\()
          (?<body>
            (?>
               \((?<DEPTH>)
             |
               \)(?<-DEPTH>)
             |
               [^()]+
            )*
            (?(DEPTH)(?!))
          )
        \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

      string formula = @"=Date(Year(A$5),Month(A$5),1)-(Weekday(Date(Year((A$5+1)),Month(A$5),1))-1)+{0;1;2;3;4;5}*7+{1,2,3,4,5,6,7}-1";

      foreach (Match m in r.Matches(formula))
      {
        Console.WriteLine("{0}\n", m.Value);
      }
    }
  }
}

output:

Date(Year(A$5),Month(A$5),1)

Weekday(Date(Year((A$5+1)),Month(A$5),1))

The main problem with your regex was that you were including the function name as part of the recursive match--for example:

Name1(...Name2(...)...)

Any open-paren that wasn't preceded by name was not counted, because it was matched by the final alternative, |.?), and that threw off the balance with the close-parens. That also meant that you couldn't match formulas like =MyFunc((1+1)), which you mentioned in the text but didn't include in the example. (I threw in an extra set of parens to demonstrate.)

EDIT: Here's the version with support for non-significant, quoted parens:

  Regex r = new Regex(@"
    (?<name>[a-z][a-z0-9]*\()
      (?<body>
        (?>
           \((?<DEPTH>)
         |
           \)(?<-DEPTH>)
         |
           ""[^""]+""
         |
           [^()""]+
        )*
        (?(DEPTH)(?!))
      )
    \)", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Alan Moore
  • 68,531
  • 11
  • 88
  • 149
  • Ah! Thanks! I would not have figured that out myself. This is the right solution. Thank you for believing me that this could be done with Regex =) Ok...I'll throw this out there purely as extra-credit/absolutely not needed/still grateful that you unblocked me: is there a way to ignore parens within quotes? – SFun28 Oct 27 '10 at 03:51
  • last comment submitted before complete. So the idea is that =Func("Hel(o") parses fine. I'm throwing this out there because I have a feeling that someone will be asking this very question in the future. For my purposes, I can guarantee that no strings will have unbalanced parens. – SFun28 Oct 27 '10 at 03:54
  • @SFun28: All you need to do is add another alternative to match complete, quoted sections, and then exclude quotes as well as parens in the final, *everything else* alternative: `(?>\\((?)|\\)(?)|""[^""]+""|[^()""]+)*` – Alan Moore Oct 27 '10 at 06:04
  • As such? @"(?[a-z][a-z0-9]*\()(?(?>\((?)|\)(?)|""[^""]+""|[^()""]+)*(?(DEPTH)(?!)))\)" – SFun28 Oct 27 '10 at 14:08
  • That seems to break some tests (man, i can't press enter for newline in this textbox) – SFun28 Oct 27 '10 at 14:08
  • A gem of an answer, brilliant – smirkingman Aug 20 '13 at 10:23