I'm trying to search for certain words in a sentence (using PHP). These words might be split up by spaces, for whatever reason (for example, 'alpha betical' instead of 'alphabetical'). I compare each space-delimited group of characters in the sentence against a regular expression separately, for reasons. Therefore I cannot match 'alpha betical' to 'alphabetical', because 'alpha' and 'betical' are tested separately. 'alpha' does partially match the regular expression ('alphabetical'), though; if 'betical' were appended, it would match.

I need something like Java's Matcher.hitEnd(). (Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher. When this method returns true, then it is possible that more input would have changed the result of the last search.) This question asks the same thing, plus a little more, but has no appropriate answer. I also found this question, which was answered, but the solution only works for Java (the method mentioned at the start of this paragraph), not PHP.

Basically, if I'm matching 'alpha' against '/alphabetical/', I want something to tell me that it matches at least part of the regular expression. (I am aware that in this case I could switch them around and match 'alphabetical' against '/^alpha/', but in my actual use the regular expression '/alphabetical/' would be a little more complex and therefore not suitable for the switch. Imagine something like '/[Aa]lpha-?betical(ly)?|[Ll]exicographical(ly)?/'.)

I know that regular expressions don't match partially: there is either a match or no match. Is there a way to get what I want, or do I have to approach my problem in an entirely different way?

  • If you're going for partial matches, would your expressions match almost anything? As with your `alphabetical` example, you would want it to partially match the letter `a`, right? I think a better approach would be something along the lines of stripping all whitespace from the input, then searching for the full string. – Mr. Llama Jun 18 '14 at 21:49
  • Ok. I get you. I updated my answer with another idea. Consider implementing a wildcard string function, like SQL's "LIKE 'alpha%'", that returns matches based on the length of the match. It is easy enough to specify a minimum length/score to qualify as a match. – codenheim Jun 18 '14 at 22:35

2 Answers

A regex either matches or it doesn't. It is a finite automaton that either reaches an accepting state or not. There are surely automata out there that can exit the graph at any node and return a "score", but they are non-standard.

You can add boolean logic by matching multiple regexes, or by adding lookahead or lookbehind.

Why not just write your regex to make whitespace optional?

  /a\s*l\s*p\s*h\s*a\s*b\s*e\s*t\s*i\s*c\s*a\s*l/

matches all sorts of combinations:

  alpha betical
  al p habet i cal
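As a rough sketch of how the whitespace-optional pattern could be generated and used in PHP: the helper name spacedPattern() and the sample sentence below are made up for illustration, and the code assumes plain single-byte letters. With PREG_OFFSET_CAPTURE, preg_match() reports the start offset of the match, and the length of the matched text gives the end, so the matched span can be edited in place.

```php
<?php
// Hypothetical helper (not a PHP built-in): allow optional whitespace
// between every pair of characters of a literal word.
function spacedPattern(string $word): string
{
    $chars  = str_split($word);                          // assumes single-byte characters
    $quoted = array_map(fn($c) => preg_quote($c, '/'), $chars);
    return '/' . implode('\s*', $quoted) . '/i';         // e.g. /a\s*l\s*p\s*h\s*a.../i
}

$sentence = 'Sort it in alpha betical order';            // made-up sample input
$pattern  = spacedPattern('alphabetical');

$start  = null;
$edited = $sentence;
if (preg_match($pattern, $sentence, $m, PREG_OFFSET_CAPTURE)) {
    [$text, $start] = $m[0];                             // matched text and its byte offset
    $length = strlen($text);                             // spaces inside the match are included
    // Rewrite only the matched span of the original sentence.
    $edited = substr_replace($sentence, '[' . $text . ']', $start, $length);
}
```

Because the offset and the matched text both come from the original sentence, the end of the match is simply the offset plus strlen() of the matched text, even when the match contains extra spaces.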

If you are familiar with wildcard / prefix matching (such as SQL's LIKE function), it is pretty easy to implement. Would that suffice?

Consider a simple implementation of a string scan algorithm that doesn't use regex at all, but searches out and returns matches sorted by score, where score is the length of the match, and you can even specify a minimum score.

Example:

FindLike(haystack: s, needle: "alphabetical", minlen:5);

It should be straightforward to write a case-insensitive function that scans a string iteratively, using the search string as a prefix match. Once you match the initial character, advance both string indexes until one string ends or the characters mismatch. Then return, or add the matched substring to a results list and continue scanning.
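A minimal sketch of such a scan in PHP might look like the following. The function name findLike() and its result format are assumptions taken from the pseudo-call above, not an existing library function, and the code assumes single-byte (ASCII) text.

```php
<?php
// Hypothetical findLike(): returns every position in $haystack where a prefix
// of $needle matches case-insensitively, longest matches first.
function findLike(string $haystack, string $needle, int $minLen = 1): array
{
    $h = strtolower($haystack);
    $n = strtolower($needle);
    $hLen = strlen($h);
    $nLen = strlen($n);
    $results = [];

    for ($i = 0; $i < $hLen; $i++) {
        // Advance both indexes until one string ends or the characters differ.
        $len = 0;
        while ($len < $nLen && $i + $len < $hLen && $h[$i + $len] === $n[$len]) {
            $len++;
        }
        if ($len >= $minLen) {
            $results[] = [
                'offset' => $i,
                'length' => $len,                        // the "score"
                'text'   => substr($haystack, $i, $len), // original casing preserved
            ];
        }
    }

    // Sort by score, best match first.
    usort($results, fn($a, $b) => $b['length'] <=> $a['length']);
    return $results;
}

$matches = findLike('The Alphabet is in alpha order', 'alphabetical', 5);
```

Each result carries the offset and length of the matched span, which is what is needed to edit the original sentence afterwards.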

That said, you might be interested in fuzzy logic or fuzzy matching or approximate matching.

http://laurikari.net/tre/about/

Fuzzy Regular Expressions

codenheim
  • Just knowing whether these matches occur is not enough for me; I want to edit the part of the original sentence where the match takes place. I could try to do this with the PREG_OFFSET_CAPTURE flag, but that only gives me the starting position of the match, not the end. I also filter on more than just spaces, such as .,:;'"| and mask characters like @ to a, 3 to e etc. In the end, I cannot reliably tell how many characters the match consists of if I match the entire sentence in one go. – user3754023 Jun 18 '14 at 22:16
  • The editing is easy enough once you define your approach. Regex libraries all have capture groups and replace options. I'm just not sure which approach does what you want. Also, don't restrict your thinking to regex. – codenheim Jun 18 '14 at 22:41
  • If I understand correctly, this idea in its current form would not be sufficient. I really can't afford to mismatch or overlook some matches. It does offer a whole new way of solving my problem though. This might just be what I'm looking for. – user3754023 Jun 18 '14 at 23:13
  • Sure, then ignore the fuzzy matching option and consider the first two options. – codenheim Jun 19 '14 at 01:18

Your question is vast, and this answer focuses on this part:

If I'm matching 'alpha' to '/alphabetical/', I want something to tell me that it at least matches a part of the regular expression.

Two Options

There are several ways to do this. Whichever way you choose, you will need to build the patterns programmatically.

A General Option

Here is a general way that I like to use because it is straightforward. It is a series of optional lookaheads that look further and further down the string. Inside each lookahead is a capturing group.

^(?=(a))?(?=(al))?(?=(alp))?(?=(alph))?(?=(alpha))?(?=(alphab))?(?=(alphabe))?(?=(alphabet))?(?=(alphabeti))?(?=(alphabetic))?(?=(alphabetica))?(?=(alphabetical))?(?=(alphabetical$))?

The highest-numbered capture group that is set tells you how far we matched. For instance, for alpha, (?=(alpha)) would succeed, and Group 5 would be set (as well as groups 1 through 4).

This works in PCRE. In some engines you would need to wrap the lookarounds like so: (?:(?=(a)))? And in some engines it wouldn't work at all.
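Since the pattern has to be built programmatically, here is one possible sketch in PHP. The helper names prefixLookaheadPattern() and matchedPrefixLength() are invented for this example. The sketch uses the wrapped form (?:(?=(...)))? so that it does not depend on how the engine treats a quantified lookahead, it leaves out the final end-of-string group, and it only handles a literal word made of single-byte characters.

```php
<?php
// Hypothetical helper: builds ^(?:(?=(a)))?(?:(?=(al)))?... for a literal word.
function prefixLookaheadPattern(string $word): string
{
    $parts = '';
    for ($i = 1, $n = strlen($word); $i <= $n; $i++) {
        $prefix = preg_quote(substr($word, 0, $i), '/');
        $parts .= '(?:(?=(' . $prefix . ')))?';          // optional lookahead, one group per prefix
    }
    return '/^' . $parts . '/';
}

// Returns how many leading characters of $word appear at the start of $input.
function matchedPrefixLength(string $word, string $input): int
{
    preg_match(prefixLookaheadPattern($word), $input, $m);
    // Walk down from the last group; the highest group that captured text
    // corresponds to the longest prefix that matched.
    for ($i = strlen($word); $i >= 1; $i--) {
        if (isset($m[$i]) && $m[$i] !== '') {
            return $i;
        }
    }
    return 0;
}
```

For example, matchedPrefixLength('alphabetical', 'alpha') inspects groups 12 down to 1 and stops at group 5.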

An Option for Mutually-Exclusive Tokens

Here is another way, suggested by @CasimirEtHippolyte elsewhere, which is beautifully compact. It works when tokens cannot "eat up" text that would have been matched by following tokens, which is the case here.

^(a(l(p(h(a(b(e(t(i(c(a(l($)?)?)?)?)?)?)?)?)?)?)?)?)?

You inspect which capture groups were set. The highest-numbered capture group that was set tells you how many letters were matched.
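The nested form can also be generated from the word. The sketch below is only an illustration: nestedPrefixPattern() and nestedPrefixLength() are hypothetical names, the innermost ($)? group is omitted, and each token is assumed to be a single literal single-byte character. Under those assumptions the length of the overall match equals the number of capture groups that were set, so the code reads the length from the full match.

```php
<?php
// Hypothetical helper: builds ^(a(l(p(...)?)?)?)? from a literal word.
function nestedPrefixPattern(string $word): string
{
    $pattern = '';
    // Build from the innermost (last) character outwards.
    for ($i = strlen($word) - 1; $i >= 0; $i--) {
        $pattern = '(' . preg_quote($word[$i], '/') . $pattern . ')?';
    }
    return '/^' . $pattern . '/';
}

// Returns how many leading characters of $word appear at the start of $input.
function nestedPrefixLength(string $word, string $input): int
{
    preg_match(nestedPrefixPattern($word), $input, $m);
    // $m[0] is the matched prefix; with one literal character per group,
    // its length is also the number of groups that were set.
    return strlen($m[0]);
}
```

Compared with the lookahead version, this pattern grows linearly with the word length rather than quadratically, at the cost of only working when tokens are mutually exclusive as described above.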

zx81