
In my work I have used approximate string matching algorithms, such as Damerau–Levenshtein distance, with great results to make my code less vulnerable to spelling mistakes.

Now I need to match strings against simple regular expressions such as `TV Schedule for \d\d (Jan|Feb|Mar|...)`. This means that the string `TV Schedule for 10 Jan` should return 0, while `T Schedule for 10. Jan` should return 2.

This could be done by generating all strings matching the regex (in this case 100x12) and finding the best match, but that doesn't seem practical.
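For illustration, a minimal sketch of that brute-force idea, using plain Levenshtein distance (no transpositions) and only three months for brevity:

```python
from itertools import product

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

MONTHS = ['Jan', 'Feb', 'Mar']  # ... and so on for the real pattern

def brute_force_distance(s):
    # Enumerate every string the regex can generate and keep the best match.
    candidates = ('TV Schedule for %02d %s' % (d, m)
                  for d, m in product(range(100), MONTHS))
    return min(levenshtein(s, c) for c in candidates)

print(brute_force_distance('TV Schedule for 10 Jan'))   # 0
print(brute_force_distance('T Schedule for 10. Jan'))   # 2
```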

Do you have any ideas on how to do this efficiently?

pnuts
Thomas Ahle

6 Answers


I found the TRE library, which seems to be able to do exactly this: fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution, though; no transposition. But I guess that works OK.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March
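The numbers prefixed to each line are the match costs printed by -s. The same search can also be run from Python through TRE's bindings; here is a rough sketch based on the article linked above (the `tre` module's API may differ between versions, and the `cost` attribute name is an assumption on my part):

```python
import tre  # Python bindings shipped with the TRE library

fz = tre.Fuzzyness(maxerr=3)  # allow up to 3 errors; the bound is arbitrary here
pt = tre.compile(r'^TV Schedule for \d\d (Jan|Feb|Mar)$', tre.EXTENDED)

for line in ['TV Schedule for 10Jan', 'T Schedule for 10 Jan 2010']:
    m = pt.search(line, fz)
    if m:
        # Assumption: the match object exposes the edit cost as `m.cost`.
        print('%s -> cost %s' % (line, m.cost))
```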

Thanks a lot for all your suggestions.

Barm
Thomas Ahle
  • In Python, @mihail-stanculescu's suggestion to use the `regex` package is simpler, and it seems full-featured. – Herbert May 19 '21 at 14:55

Also see: the newer Python regex module (Oct '14); search for "fuzzy" inside its documentation.

If you're not a Python guy (neither am I), you could compile your code to C (exe/dll). Then you would be able to use your DLL even from good old VB6 (and the like).

Other libraries to choose from:

  • TRE/agrep (classic, good, old and fast; search for 'agrep performance'), but you need to write POSIX-compatible regexes (search for 'regular expressions info posix'). Of course, all libraries/examples using TRE have this limitation (search for 'hackerboss approximate regex matching in python'). For massive data: search for 'A fast CUDA implementation of agrep algorithm'.
  • FREJ (Java) - some (more) limitations (for instance, no lookahead/lookbehind)
  • fuzzy-wuzzy (Python-based) - worth looking at, not tested...

Search also for:

  • 'Comparison_of_regular_expression_engines'
  • 'regular-expressions.info tools'

(Sorry for not being able to post real links.)


I just use the regex module: 'Alternative regular expression module, to replace re.' It provides the familiarity of re but includes options for fuzzy matching, along with several other improvements over re.

For Windows binaries, see this resource.
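A rough sketch of what this looks like for the question's pattern (the fuzzy syntax and fuzzy_counts are part of the regex module; the error bound of 3 is just a value picked for the example):

```python
import regex  # pip install regex

# {e<=3} allows up to 3 insertions, deletions or substitutions inside the
# group; BESTMATCH asks for the match with the fewest errors, not the first.
pattern = r'(?:TV Schedule for \d\d (?:Jan|Feb|Mar)){e<=3}'

for s in ['TV Schedule for 10 Jan', 'T Schedule for 10. Jan']:
    m = regex.fullmatch(pattern, s, flags=regex.BESTMATCH)
    if m:
        # fuzzy_counts is a (substitutions, insertions, deletions) tuple.
        print(s, '->', sum(m.fuzzy_counts))  # prints 0 and 2 respectively
```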

David C

Here is a resource on the question you are asking. It is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, tuned for a particular language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.
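As a purely hypothetical illustration of the first idea (the confusion table and helper below are made up for this example, not part of any library):

```python
import re

# Each literal character is widened to a class of characters it is easily
# confused with; everything else is matched exactly.
CONFUSABLE = {'0': '0Oo', 'O': '0Oo', 'o': '0Oo', '1': '1lI', 'l': '1lI', 'I': '1lI'}

def widen(literal):
    """Build a regex from a literal string that tolerates the confusions above."""
    return ''.join('[%s]' % re.escape(CONFUSABLE[c]) if c in CONFUSABLE
                   else re.escape(c)
                   for c in literal)

pattern = re.compile(widen('TV Schedule for 10 Jan'))
print(bool(pattern.fullmatch('TV Schedule for lO Jan')))  # True: 1->l and 0->O tolerated
```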

bmargulies
  • The first one seems to be on standard approximate string matching? The second one seems to be on fuzzy lookups in a dictionary. That could probably be used thinking of the regex as a 'fictionary dictionary'? – Thomas Ahle Feb 28 '10 at 21:55

Have you considered using a lexer?

I've never actually used one so I can't be much help, but it sounds like it fits!

Paul Creasey
  • I think lexers are more for tokenizing than matching. If I start splitting my string, I won't be able to recognize characters moved from one token to another. – Thomas Ahle Feb 28 '10 at 17:31
  • You may have to define your problem as a lexing/parsing problem, rather than as a simple regular expression. Then you could use Levenshtein distance on the individual tokens. – Avi Feb 28 '10 at 17:36
  • I see. But the lexer link you've sent seems quite deterministic. What if instead of `TV Schedule for 10 Jan` I get `TV Schedule for Jan 10`? That should have a distance of 2, since two characters have been transposed. Maybe the lexer could identify substrings looking like numbers or months, but then `TV Schedule forJan 10` or `TV Schedule for 10 Jan 2010` would give problems... – Thomas Ahle Mar 04 '10 at 10:15

I started to implement a Java tool called prex for approximate regular expression matching. The tool determines how far a string s is from matching a regular expression r, i.e. how many insertions, deletions and substitutions on s are required at a minimum (the minimum cost) such that the resulting string s' is accepted by r. If you are interested, you can check out the code from https://github.com/julianthome/prex. I would be very happy to get some feedback. Note that the approach is still a bit slow, but I am currently incorporating some heuristics to improve its performance.

Julian