
This may sound a little odd, but it would be extremely useful to me. Are there any regex implementations (any language, but preferably Java, JavaScript, C, or C++) that use an event-based model for matches?

I would like to be able to register a bunch of different regular expressions I am looking for in a string via an event-based model, feed the string through the regex engine, and just have the events fired off correctly. Does anything like this exist?

I realize this is bordering on the territory of a heavy-duty lexer/parser, but I would prefer to stay away from that if at all possible, as my search expressions would need to be completely dynamic.

Thanks

jdc0589
  • The usual approach would be to take an existing library and build a wrapper that raises events. Why write code for one very specific task when separation of concerns would keep your code modular and reusable for many tasks? – Jay Nov 30 '10 at 19:23
  • Sounds like you're talking about a sort of ad-hoc lexer. Instead of letting the regex engine scan for matches, you cycle through all the regexes, trying each one as if it were anchored at the current position. If none of them match, you bump ahead one position and start again. Does that sound right? – Alan Moore Nov 30 '10 at 22:52
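
The loop Alan Moore describes could be sketched in Java roughly as follows. This is only an illustration, not an existing library: `AnchoredScanner`, `MatchHandler`, `register` and `scan` are made-up names, and it assumes that the first registered expression to match at a position wins and that scanning resumes after the end of that match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredScanner {
    /** Hypothetical callback type, not part of any library. */
    interface MatchHandler {
        void onMatch(String text, int start, int end);
    }

    // LinkedHashMap keeps registration order, which doubles as priority here.
    private final Map<Pattern, MatchHandler> handlers = new LinkedHashMap<>();

    public void register(String regex, MatchHandler handler) {
        handlers.put(Pattern.compile(regex), handler);
    }

    public void scan(String input) {
        int pos = 0;
        while (pos < input.length()) {
            int advance = 0;
            for (Map.Entry<Pattern, MatchHandler> e : handlers.entrySet()) {
                // region() + lookingAt() tries the pattern as if anchored at pos.
                Matcher m = e.getKey().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {
                    e.getValue().onMatch(m.group(), m.start(), m.end());
                    advance = m.end() - pos; // assumption: skip past the match
                    break;
                }
            }
            // Nothing matched here (or only an empty match): bump ahead one position.
            pos += Math.max(advance, 1);
        }
    }
}
```

Note that a region hides the text before `pos` from lookbehind by default, so expressions that rely on preceding context may behave differently than with a plain `find()`.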

3 Answers


This is very easy to do in Perl regular expressions. All you do is insert your event callouts at the appropriate point in the pattern in the most straightforward manner imaginable.

First, imagine a pattern for pulling out decimal numbers from a string:

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

Let’s expand that out so we can insert our callouts:

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

For callouts, I’ll just print some debugging, but you could do anything you want:

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

Here’s the whole demo:

use 5.010;
use strict;

use utf8;

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

my $string = <<'END_OF_STRING';

    The Earth’s temperature varies between
    -89.2°C and 57.8°C, with a mean of 14°C.

    There are .25 quarts in 1 gallon.

    +10°F is -12.2°C.

END_OF_STRING

while ($string =~ /$rx2/gp) {
    print "Number: ${^MATCH}\n";
}

which when run produces this:

        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -89.2
        integer part
        internal decimal point
        optional fractional part
        success
Number: 57.8
        integer part
        success
Number: 14
        leading decimal point
        leading decimal point
        required fractional part
        success
Number: .25
        integer part
        success
Number: 1
        leading decimal point
        leading sign
        integer part
        success
Number: +10
        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -12.2
        leading decimal point

The unpaired "leading decimal point" lines come from the sentence-ending periods in the text: a callout fires as soon as the engine reaches it, even if the rest of that match attempt then fails.

You may want to arrange a more grammatical regular expression for maintainability. This also helps for when you want to make a recursive descent parser out of it. (Yes, of course you can do that: this is Perl, after all. :)

Look at the last solution in this answer for what I mean by grammatical regexes. I also have larger examples elsewhere here on SO.

But it sounds like you should look at the Regexp::Grammars module by Damian Conway, which was built for just this sort of thing. This question talks about it, and has a link to the module proper.

tchrist

You might want to check out PIRE - a very fast automata-based regexp engine, tuned to match zillions of lines of text against many regular expressions quickly. It's available in C and has some bindings.

GreyCat

It's really not something that's too hard to put together yourself if you can't find any existing library.

Something like this:

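import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Pattern;

public class RegexNotifier {
   // Listeners grouped by the Pattern instance they were registered against.
   private final Map<Pattern, List<RegexListener>> listeners = new HashMap<Pattern, List<RegexListener>>();

   public synchronized void register(Pattern pattern, RegexListener listener) {
      List<RegexListener> list = listeners.get(pattern);
      if (list == null) {
         list = new ArrayList<RegexListener>();
         listeners.put(pattern, list);
      }
      list.add(listener);
   }

   // Synchronized so the map is never iterated while register() is modifying it.
   public synchronized void process(String input) {
      for (Entry<Pattern, List<RegexListener>> entry : listeners.entrySet()) {
         // matches() succeeds only when the whole input matches the pattern;
         // use find() instead to fire on a match anywhere inside the input.
         if (entry.getKey().matcher(input).matches()) {
            for (RegexListener listener : entry.getValue()) {
               listener.stringMatched(input, entry.getKey());
            }
         }
      }
   }
}

interface RegexListener {
   public void stringMatched(String matched, Pattern pattern);
}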

The only shortcoming I see with this is that Pattern doesn't override hashCode() and equals(), so two Pattern instances compiled from the same expression are treated as different keys and end up with separate listener lists. Pattern.compile() does not cache anything, and every call returns a new instance, so either reuse the same Pattern object when registering or key the map by the pattern string and flags instead.
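
Since Pattern uses identity equality, one way to let equal expressions share a listener list is to key the map by the expression's source text and flags. A minimal sketch of that idea follows; `PatternRegistry`, `Listener`, `Slot`, `keyOf` and `distinctPatterns` are hypothetical names chosen for illustration, while `Pattern.pattern()` and `Pattern.flags()` are the standard accessors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class PatternRegistry {
    /** Hypothetical listener type mirroring RegexListener above. */
    interface Listener {
        void stringMatched(String matched, Pattern pattern);
    }

    // One compiled pattern plus everyone who registered an equal expression.
    private static final class Slot {
        final Pattern pattern;
        final List<Listener> listeners = new ArrayList<>();
        Slot(Pattern pattern) { this.pattern = pattern; }
    }

    private final Map<String, Slot> slots = new HashMap<>();

    // Two patterns count as equal when both source text and flags agree.
    private static String keyOf(Pattern p) {
        return p.flags() + ":" + p.pattern();
    }

    public synchronized void register(Pattern pattern, Listener listener) {
        slots.computeIfAbsent(keyOf(pattern), k -> new Slot(pattern))
             .listeners.add(listener);
    }

    public synchronized int distinctPatterns() {
        return slots.size();
    }

    public synchronized void process(String input) {
        for (Slot slot : slots.values()) {
            // Whole-input match, same semantics as the class above.
            if (slot.pattern.matcher(input).matches()) {
                for (Listener l : slot.listeners) {
                    l.stringMatched(input, slot.pattern);
                }
            }
        }
    }
}
```

With this, registering `Pattern.compile("a+")` twice evaluates the expression once per input and notifies both listeners.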

Mark Peters
  • I wrote a fairly advanced tool using this principle a while back. It is certainly usable, but I would prefer something that could work in a single pass. I may just end up modifying what I already have though. – jdc0589 Nov 30 '10 at 19:17
  • @jdc0589: Ah, I see. I thought you were just looking for a library to reverse the call structure. Yes, my way does nothing to optimize the performance. – Mark Peters Nov 30 '10 at 19:41
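
For the single-pass wish in the comments above, one possible approach is to join every registered expression into a single alternation of named groups and dispatch on whichever group matched. This is a rough sketch under several assumptions: `SinglePassNotifier` and the group names `p0`, `p1`, ... are invented here, the registered expressions must not define those group names or use numbered backreferences, and only the first alternative that matches at a position fires. java.util.regex is a backtracking engine, so this is one scan loop over the input rather than a true automaton.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassNotifier {
    private final List<String> sources = new ArrayList<>();
    private final List<Consumer<String>> handlers = new ArrayList<>();
    private Pattern combined; // rebuilt lazily after each register()

    public void register(String regex, Consumer<String> handler) {
        sources.add(regex);
        handlers.add(handler);
        combined = null; // invalidate the cached alternation
    }

    public void process(String input) {
        if (sources.isEmpty()) {
            return;
        }
        if (combined == null) {
            // Build (?<p0>...)|(?<p1>...)|... from the registered sources.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < sources.size(); i++) {
                if (i > 0) {
                    sb.append('|');
                }
                sb.append("(?<p").append(i).append('>')
                  .append(sources.get(i)).append(')');
            }
            combined = Pattern.compile(sb.toString());
        }
        Matcher m = combined.matcher(input);
        while (m.find()) {
            // Exactly one alternative took part in this match; find it.
            for (int i = 0; i < handlers.size(); i++) {
                String text = m.group("p" + i);
                if (text != null) {
                    handlers.get(i).accept(text);
                    break;
                }
            }
        }
    }
}
```

Because matches never overlap in this scheme, it suits token-like expressions better than expressions that are expected to fire on the same stretch of text.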