
This is a tricky question, and maybe in the end it has no solution (or not a reasonable one, at least). I'd like to have a Java-specific example, but if it can be done at all, I think I could adapt any example.

My goal is to find a way of knowing whether a string being read from an input stream could still match a given regular expression pattern. Or, in other words, to read the stream until we've got a string that definitely will not match the pattern, no matter how many characters you add to it.

A declaration for a minimalist simple method to achieve this could be something like:

boolean couldMatch(CharSequence charsSoFar, Pattern pattern);

Such a method would return true if charsSoFar could still match pattern once more characters are appended, or false if it has no chance of matching no matter what is appended.

To put a more concrete example, say we have a pattern for float numbers like "^([+-]?\\d*\\.?\\d*)$".

With such a pattern, couldMatch would return true for the following example values of charsSoFar:

"+"  
"-"  
"123"  
".24"  
"-1.04" 

And so on, because you can keep appending digits to all of these, and also a single dot to the first three.

On the other hand, all these examples derived from the previous ones should return false:

"+A"  
"-B"  
"123z"  
".24."  
"-1.04+" 

It's clear at first sight that these will never comply with the aforementioned pattern, no matter how many characters you add to them.

EDIT:

I'm adding my current non-regex approach to make things clearer.

First, I declare the following functional interface:

public interface Matcher {
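    /**
     * Returns the leading part of {@code source} that matches, if any.
     *
     * @param source the characters read so far
     * @return the matching prefix of {@code source}, or an empty sequence if nothing matches
     */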
    CharSequence match(CharSequence source);
}

Then, the previous function would be redefined as:

boolean couldMatch(CharSequence charsSoFar, Matcher matcher);

A draft matcher for floats could look like this (note that it does not support a leading + sign, only -):

public class FloatMatcher implements Matcher {
    @Override
    public CharSequence match(CharSequence source) {
        StringBuilder rtn = new StringBuilder();

        if (source.length() == 0)
            return "";

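        if ("0123456789-.".indexOf(source.charAt(0)) != -1 ) {
            rtn.append(source.charAt(0));
        } else {
            // An invalid first character means nothing can match.
            return "";
        }

        // A leading dot already counts as the decimal point.
        boolean gotDot = source.charAt(0) == '.';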
        for (int i = 1; i < source.length(); i++) {
            if (gotDot) {
                if ("0123456789".indexOf(source.charAt(i)) != -1) {
                    rtn.append(source.charAt(i));
                } else
                    return rtn.toString();
            } else if (".0123456789".indexOf(source.charAt(i)) != -1) {
                rtn.append(source.charAt(i));
                if (source.charAt(i) == '.')
                    gotDot = true;
            } else {
                return rtn.toString();
            }
        }
        return rtn.toString();
    }
}

The omitted body of the couldMatch method simply calls matcher.match() each time a new character is appended to the source parameter. It returns true as long as the returned CharSequence equals the source parameter, and false as soon as they differ (meaning that the last character added broke the match).
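A minimal sketch of how that omitted body and the surrounding read loop might look. The interface is renamed to `PrefixMatcher` here (an assumption, only to avoid a clash with `java.util.regex.Matcher`), and `DIGITS` and `readWhileMatching` are hypothetical names used for illustration, not part of the approach described above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Same shape as the Matcher interface above; renamed (assumption) to avoid a name clash. */
interface PrefixMatcher {
    CharSequence match(CharSequence source);
}

final class IncrementalMatchSketch {

    // Toy matcher for illustration only: accepts a leading run of ASCII digits.
    static final PrefixMatcher DIGITS = source -> {
        int i = 0;
        while (i < source.length() && source.charAt(i) >= '0' && source.charAt(i) <= '9') {
            i++;
        }
        return source.subSequence(0, i);
    };

    /** True while the matcher still accepts everything read so far. */
    static boolean couldMatch(CharSequence charsSoFar, PrefixMatcher matcher) {
        return matcher.match(charsSoFar).toString().contentEquals(charsSoFar);
    }

    /** Appends one character at a time and stops at the first one that breaks the match. */
    static String readWhileMatching(Reader in, PrefixMatcher matcher) {
        StringBuilder buf = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf.append((char) c);
                if (!couldMatch(buf, matcher)) {
                    buf.setLength(buf.length() - 1); // drop the offending character
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Reading stops at 'z', so only the digits before it are kept.
        System.out.println(readWhileMatching(new StringReader("123z45"), DIGITS));
    }
}
```

Note that this sketch re-runs the matcher on the whole buffer for every new character, which is quadratic in the input length; that is fine for short tokens but may matter for long ones.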

Fran Marzoa
  • That's what I think, but it's always sound to give Stackoverflow a chance. In such case, I would probably not use regular expressions at all, and instead create a parser interface so to write matchers programmatically. – Fran Marzoa Oct 30 '18 at 11:01
  • Indeed. :-) .... – T.J. Crowder Oct 30 '18 at 11:01
  • What do 'could be matched' and 'definitely unmatched' mean in your case? You can write a regex with more relaxed patterns than you have. The question is: what is the rule? – Serge Oct 30 '18 at 11:05
  • Lookaround may make it a rather hard problem. If you exclude that it should just be basic regex matching which you stop early (but you'll probably need to write the code for that yourself, or rather derive that from some existing implementation). – Bernhard Barker Oct 30 '18 at 11:05
  • It seems that all your matching examples involve adding characters only to the end - is this a requirement or can you add characters anywhere in the string? Or, more concretely, can you add characters to string `B` to match regex `AB`? – Bernhard Barker Oct 30 '18 at 11:08
  • @Dukeling Only at the end. It would be used to read from an input stream sequentially. – Fran Marzoa Oct 30 '18 at 11:18
  • Regular expressions typically match "longest leftmost", so you can run into a serious backtracking problem. For example, given a regular expression `^[a-z]+b` and the string `"abcdefghijklmnopqrstuvwxyz"`, you have to scan the entire string because it can potentially match. Only when you get to the end of the string do you discover that there is no final `b`. But the substring `ab` matched. – Jim Mischel Oct 30 '18 at 17:39
  • @JimMischel well, for your case of `^[a-z]+b` and `"abcdefghijklmnopqrstuvwxyz"`, appending a `b` makes it a match. In contrast, `"abcdefghijklmn1234567890"` will never be a match. This *can* be detected. – Holger Oct 30 '18 at 18:14
  • @T.J.Crowder well, with a name like `hitEnd()`, it’s easy to overlook. It might be worth noting that `java.util.Scanner` already does the job of checking this flag and reading more content. It’s funny that the `Scanner` class is perceived by so many as a beginner’s tool for reading from the console, when it actually is an advanced tool for scanning for regex matches in a lazily loaded stream or reader. – Holger Oct 30 '18 at 18:23
  • @Holger Yes, it can be detected. My point is that adding a non-matching character causes a mismatch, but a previous, shorter, string *did* match. I don't know what the OP's use case is, but it sounds to me like he'll have to handle the backtracking case. – Jim Mischel Oct 30 '18 at 18:26
  • @Holger - Nice one!! Re `Scanner`: If it were a beginner's tool, it would be an abject failure of one. Far too difficult for beginners to understand (as I'm sure you know from the questions here on SO)... :-) – T.J. Crowder Oct 30 '18 at 18:28
  • @JimMischel well, yes, that’s the difference between `matches` and `find`. [This comment](https://stackoverflow.com/questions/53062616/how-to-know-if-a-string-could-match-a-regular-expression-by-adding-more-characte?noredirect=1#comment93024559_53062616) suggests that the use case is a re-invention of `java.util.Scanner`. But to be fair, the fact that this is the right tool for the job is not easy to find. – Holger Oct 30 '18 at 18:34
  • @T.J.Crowder well, yes, it’s actually not a beginner’s tool. But I just threw `java.util.Scanner` at Google, to verify whether my impression is right. Lots of tutorials for beginners. Or [this Q&A](https://stackoverflow.com/q/11871520/2711488), with lots of examples of how to use it just for reading a line. Even its class documentation starts with a “reading from the console” example. It also discusses lots of misleading stuff, like the format of localized numbers, and hides a brief regex example somewhere in the middle. – Holger Oct 30 '18 at 18:42

2 Answers


You can do it as easily as

boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
    Matcher m = pattern.matcher(charsSoFar);
    return m.matches() || m.hitEnd();
}

If the sequence does not match and the engine did not reach the end of the input, it implies that there is a contradicting character before the end, which won’t go away when adding more characters at the end.

Or, as the documentation says:

Returns true if the end of input was hit by the search engine in the last match operation performed by this matcher.

When this method returns true, then it is possible that more input would have changed the result of the last search.

This is also used by the Scanner class internally, to determine whether it should load more data from the source stream for a matching operation.
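As a rough sketch of both ideas: a hand-written loop that keeps reading while the `matches() || hitEnd()` check holds, and the equivalent job delegated to `Scanner.findWithinHorizon`. The class and method names (`StreamMatchSketch`, `readWhileViable`, `scanAll`) are hypothetical. The stricter pattern in the `Scanner` part is an assumption made for the example, because a pattern that can match the empty string would keep returning empty matches without advancing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamMatchSketch {

    /** Reads until the buffer can no longer become a match and returns the viable prefix. */
    static String readWhileViable(Reader in, Pattern pattern) {
        StringBuilder buf = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf.append((char) c);
                Matcher m = pattern.matcher(buf);
                if (!m.matches() && !m.hitEnd()) {
                    buf.setLength(buf.length() - 1); // drop the character that ruled out a match
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    /** Lets Scanner load input lazily and collect every occurrence of the pattern. */
    static List<String> scanAll(Reader in, Pattern pattern) {
        List<String> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(in)) {
            String hit;
            // Horizon 0 means "search without bound"; null means no further match.
            while ((hit = scanner.findWithinHorizon(pattern, 0)) != null) {
                found.add(hit);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
        System.out.println(readWhileViable(new StringReader("-1.04+7"), fpNumber));

        // Stricter pattern (assumption): requires a dot and at least one digit, so it never matches "".
        Pattern strict = Pattern.compile("[+-]?\\d*\\.\\d+");
        System.out.println(scanAll(new StringReader("x=-1.04;y=.24"), strict));
    }
}
```

The prefix returned by `readWhileViable` is only guaranteed to be viable, not to be a complete match, so a final `matches()` check on the result may still be needed.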

Using the couldMatch method above with your sample data yields

Pattern fpNumber = Pattern.compile("[+-]?\\d*\\.?\\d*");
String[] positive = {"+", "-", "123", ".24", "-1.04" };
String[] negative = { "+A", "-B", "123z", ".24.", "-1.04+" };
for(String p: positive) {
    System.out.println("should accept more input: "+p
                      +", couldMatch: "+couldMatch(p, fpNumber));
}
for(String n: negative) {
    System.out.println("can never match at all: "+n
                      +", couldMatch: "+couldMatch(n, fpNumber));
}

Output:

should accept more input: +, couldMatch: true
should accept more input: -, couldMatch: true
should accept more input: 123, couldMatch: true
should accept more input: .24, couldMatch: true
should accept more input: -1.04, couldMatch: true
can never match at all: +A, couldMatch: false
can never match at all: -B, couldMatch: false
can never match at all: 123z, couldMatch: false
can never match at all: .24., couldMatch: false
can never match at all: -1.04+, couldMatch: false

Of course, a true result only means that more input might change the outcome; it doesn’t say anything about the chances of turning nonmatching content into a match. You could still construct patterns for which hitEnd() returns true although no additional characters could ever produce a match. However, for ordinary use cases like the floating-point number format, it’s a reasonable check.
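To illustrate that caveat, here is a small sketch with a deliberately contradictory pattern (chosen for this example, not taken from the question): after `a`, the next character must be `b` and at the same time must not be `b`, so nothing can ever match. The class name `HitEndCaveat` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HitEndCaveat {

    static boolean couldMatch(CharSequence charsSoFar, Pattern pattern) {
        Matcher m = pattern.matcher(charsSoFar);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        // The negative lookahead forbids exactly the character the pattern then requires.
        Pattern impossible = Pattern.compile("a(?!b)b");

        // The engine runs out of input while looking for 'b', so hitEnd() is true
        // and couldMatch reports true, although no continuation can ever match.
        System.out.println(couldMatch("a", impossible));

        // Once the next character is known, the contradiction is detected.
        System.out.println(couldMatch("ab", impossible));
        System.out.println(couldMatch("ax", impossible));
    }
}
```

In other words, a false result is definitive, while a true result is only "not ruled out yet".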

Holger
  • Genius! Thanks a million, man. Not only have you answered my question, but you've also given me a hint that I should probably be using the Scanner class in my stuff. – Fran Marzoa Oct 31 '18 at 10:40

I have no specific solution, but you might be able to do this with negations.

If you set up a blacklist of regex patterns that describe input which can definitely never match your pattern (e.g. + followed by a letter), you could check against these. If a blacklisted regex matches, you can abort.

Another idea is to use negative lookaheads (https://www.regular-expressions.info/lookaround.html)
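One possible reading of the blacklist idea, sketched for the float pattern from the question. The three blacklist patterns and the names `BlacklistSketch` and `FLOAT_BLACKLIST` are assumptions made up for this example; a different target pattern would need its own hand-written list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BlacklistSketch {

    // Hypothetical blacklist for "[+-]?\\d*\\.?\\d*": any hit means the input is beyond repair.
    static final List<Pattern> FLOAT_BLACKLIST = Arrays.asList(
            Pattern.compile("[^\\d.+-]"), // a character that can never appear in a float
            Pattern.compile(".[+-]"),     // a sign anywhere other than the first position
            Pattern.compile("\\..*\\.")   // a second decimal point
    );

    /** Returns false as soon as any blacklisted pattern is found in the input so far. */
    static boolean couldMatch(CharSequence charsSoFar, List<Pattern> blacklist) {
        for (Pattern forbidden : blacklist) {
            if (forbidden.matcher(charsSoFar).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"+", "-1.04", "+A", ".24.", "-1.04+"}) {
            System.out.println(s + " -> " + couldMatch(s, FLOAT_BLACKLIST));
        }
    }
}
```

The obvious drawback is that the blacklist has to be kept in sync with the target pattern by hand.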

dinosaur