I'm writing a simple debugging program that takes simple strings as input. A star in the string acts as a match-anything wildcard:

*.wav  // matches <anything>.wav
(*, a) // matches (<anything>, a)

I thought I would simply take that pattern, escape any regular expression special characters in it, replace any \\* back to .*, and then use a regular expression matcher.

But I can't find any Java function to escape a regular expression. The best match I could find is Pattern.quote, which, however, just puts \Q and \E at the beginning and end of the string.

Is there anything in Java that allows you to simply do that wildcard matching without you having to implement the algorithm from scratch?

Johannes Schaub - litb

6 Answers

Just escape everything - no harm will come of it.

    String input = "*.wav";
    String regex = ("\\Q" + input + "\\E").replace("*", "\\E.*\\Q");
    System.out.println(regex); // \Q\E.*\Q.wav\E
    System.out.println("abcd.wav".matches(regex)); // true

Or you can use character classes:

    String input = "*.wav";
    String regex = input.replaceAll(".", "[$0]").replace("[*]", ".*");
    System.out.println(regex); // .*[.][w][a][v]
    System.out.println("abcd.wav".matches(regex)); // true

It's easier to "escape" the characters by putting them in a character class, as almost all characters lose their special meaning there. The exceptions include ^ and \ ([^] and [\] do not compile), so this will work unless you're expecting unusual file names.
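
One caveat with the \Q...\E version is that an input which itself contains \E ends the quoting early. A minimal sketch that avoids this by splitting on the star and quoting each literal piece with Pattern.quote (the class and method names are illustrative, not from any library):

```java
import java.util.regex.Pattern;

class WildcardQuote {
    // Converts a wildcard pattern to a regex: each literal run is quoted
    // with Pattern.quote, and each '*' becomes ".*".
    static String wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        // The -1 limit keeps trailing empty strings, so a trailing '*' is not lost.
        String[] parts = wildcard.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) regex.append(".*");          // one ".*" per star
            if (!parts[i].isEmpty()) regex.append(Pattern.quote(parts[i]));
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = wildcardToRegex("*.wav");
        System.out.println(regex);                      // .*\Q.wav\E
        System.out.println("abcd.wav".matches(regex));  // true
    }
}
```

Pattern.quote takes care of any \E inside a literal piece, at the cost of a slightly longer helper than the one-liners above.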

Bohemian
  • Hmm, why didn't I think of that. It seems too easy. Thanks! – Johannes Schaub - litb Jun 21 '14 at 02:21
  • Hmm, unfortunately that doesn't seem to work. Java complains "Illegal/unsupported escape sequence near index 3 \f\o\o". Apparently it only allows to escape a limited set of characters: "It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language.". – Johannes Schaub - litb Jun 21 '14 at 02:30
  • Did you copy paste? This code runs without error. I can only assume that you coded `replaceAll()` instead of `replace()` for the second method call. Is that what happened? – Bohemian Jun 21 '14 at 03:53
  • I used your initial solution. The updated answer should work as well. – Johannes Schaub - litb Jun 21 '14 at 11:34
  • Wouldn't this break if the text being searched contains "\E"? You could use Pattern.quote(String) to avoid that. – twm May 01 '16 at 16:02
  • @twm Yes, but that isn't relevant to the question, which specifically excludes calling `Pattern.quote()` and wants only to have simple regex chars like dots treated as literals and the asterisk treated as a "wildcard" (`.*` in regex). – Bohemian May 01 '16 at 20:39
  • Ah, sorry - I missed the part about excluding Pattern.quote(). – twm May 01 '16 at 21:45
  • You probably want to replace ? with \w, which is commonly used in wildcards – AvrDragon Sep 08 '16 at 14:59

Using A Simple Regex

One of this method's benefits is that we can easily add tokens besides * (see Adding Tokens at the bottom).

Search: [^*]+|(\*)

  • The left side of the | matches any chars that are not a star
  • The right side captures all stars to Group 1
  • If Group 1 is empty: replace with \Q + Match + \E
  • If Group 1 is set: replace with .*

Here is some working code (see the output of the online demo).

Input: audio*2012*.wav

Output: \Qaudio\E.*\Q2012\E.*\Q.wav\E

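import java.util.regex.Matcher;
import java.util.regex.Pattern;

String subject = "audio*2012*.wav";
Pattern regex = Pattern.compile("[^*]+|(\\*)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only when the match is a star
    if (m.group(1) != null) m.appendReplacement(b, ".*");
    // quoteReplacement keeps '$' and '\' in the input from being read as replacement syntax
    else m.appendReplacement(b, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println(replaced);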

Adding Tokens

Suppose we also want to convert the wildcard ?, which stands for a single character, to a dot. We just add a capture group to the regex, and exclude it from the matchall on the left:

Search: [^*?]+|(\*)|(\?)

In the replace function we then add something like:

else if(m.group(2) != null) m.appendReplacement(b, "."); 
zx81
  • This looks best so far. Waiting for someone to perhaps find a simpler solution still. Thanks! – Johannes Schaub - litb Jun 21 '14 at 02:42
  • What I like is that if you want to add the single-character `?` token used in wildcard matching, it's a piece of cake: `[^*?]+|(\*)|(\?)`, then in the replace function we add `if(m.group(2) != null) m.appendReplacement(b, ".");` (as the dot is a single character) – zx81 Jun 21 '14 at 02:44
  • Doesn't `wildcardSpec.replaceAll("[^*]+", "\\\\Q$0\\\\E").replaceAll("\\*+", ".*")` work as well? – Johannes Schaub - litb Jun 21 '14 at 02:51
  • Sure, if you want to chain two replacements. Just giving you something that's easy to read and especially to maintain in case your requirements change. It's a variation of a technique explained in detail here: [Regex-matching or replacing... except when...](http://stackoverflow.com/q/23589174/) I had a lot of fun with that answer btw :) Ah, also, added a section about Adding Tokens at the bottom of the answer. – zx81 Jun 21 '14 at 02:55
  • Thanks Johannes. :) See you next time. – zx81 Jun 21 '14 at 03:08
  • Looking for something that mimics Lucene's semantics, which includes the ability to escape wildcard characters, so that `Wal*mart` would become `\QWal\E.*\Qmart\E` but `Wal\*mart` would become either `\QWal*mart\E` or `\QWal\E\*\Qmart\E` – Paul Jackson Nov 03 '16 at 15:33

There is a small utility method in the Apache Commons-IO library: org.apache.commons.io.FilenameUtils#wildcardMatch(), which you can use without the intricacies of regular expressions.

The API documentation can be found at: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#wildcardMatch(java.lang.String,%20java.lang.String)

Marek Gregor

You can also use the Quotation escape characters: \\Q and \\E - everything between them is treated as literal and not considered to be part of the regex to be evaluated. Thus this code should work:

    String input = "*.wav";
    String regex = "\\Q" + input.replace("*", "\\E.*?\\Q") + "\\E";

    // regex = "\\Q\\E.*?\\Q.wav\\E"

Note that, depending on how you want your wildcard to behave, it might be better to match * only against word characters using \w.
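
To illustrate that last point, here is a minimal sketch (the class and method names are made up for the example) that maps * to \w* instead of .*?. With String.matches, which has to match the whole input, .* and .*? accept exactly the same strings, so the choice of what the wildcard may span matters more than greediness:

```java
import java.util.regex.Pattern;

class WordWildcard {
    // Maps '*' to "\w*" so the wildcard only spans letters, digits and underscores.
    // Like the snippet above, this assumes the input does not itself contain "\E".
    static String toRegex(String wildcard) {
        return "\\Q" + wildcard.replace("*", "\\E\\w*\\Q") + "\\E";
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(toRegex("*.wav"));
        System.out.println(p.matcher("abcd.wav").matches());   // true
        System.out.println(p.matcher("ab cd.wav").matches());  // false: a space is not a word character
    }
}
```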

Matt Coubrough

Regex While Accommodating A DOS/Windows Path

Implementing the Quotation escape characters \Q and \E is probably the best approach. However, since a backslash is typically used as a DOS/Windows file separator, a "\E" sequence within the path could affect the pairing of \Q and \E. While accounting for the * and ? wildcard tokens, this backslash situation could be addressed in this manner:

Search: [^*?\\]+|(\*)|(\?)|(\\)

Two new lines would be added in the replace function of the "Using A Simple Regex" example to accommodate the new search pattern. The code would still be "Linux-friendly". As a method, it could be written like this:

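import java.util.regex.Matcher;
import java.util.regex.Pattern;

public String wildcardToRegex(String wildcardStr) {
    Pattern regex = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)");
    Matcher m = regex.matcher(wildcardStr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        if (m.group(1) != null) m.appendReplacement(sb, ".*");       // * -> any run of characters
        else if (m.group(2) != null) m.appendReplacement(sb, ".");   // ? -> exactly one character
        // backslash: emit \\ (an escaped backslash) outside any \Q...\E run
        else if (m.group(3) != null) m.appendReplacement(sb, "\\\\\\\\");
        // quoteReplacement keeps '$' in the input from being read as a group reference
        else m.appendReplacement(sb, Matcher.quoteReplacement("\\Q" + m.group(0) + "\\E"));
    }
    m.appendTail(sb);
    return sb.toString();
}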

Code to demonstrate the implementation of this method could be written like this:

String s = "C:\\Temp\\Extra\\audio??2012*.wav";
System.out.println("Input: "+s);
System.out.println("Output: "+wildcardToRegex(s));

These would be the generated results:

Input: C:\Temp\Extra\audio??2012*.wav
Output: \QC:\E\\\QTemp\E\\\QExtra\E\\\Qaudio\E..\Q2012\E.*\Q.wav\E
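
To see why the backslash needs its own alternative, here is a small self-contained check that compares plain \Q...\E wrapping with the search pattern above. The class and method names are made up for the example, and it builds the result with StringBuilder.append rather than appendReplacement, which works here because the alternation matches every character of the input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashDemo {
    // Plain \Q...\E wrapping: goes wrong when the input itself contains "\E".
    static String naive(String wildcard) {
        return "\\Q" + wildcard.replace("*", "\\E.*\\Q") + "\\E";
    }

    // Same search pattern as above; backslashes are emitted as \\ outside the quoted runs.
    static String withBackslashGroup(String wildcard) {
        Matcher m = Pattern.compile("[^*?\\\\]+|(\\*)|(\\?)|(\\\\)").matcher(wildcard);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) sb.append(".*");
            else if (m.group(2) != null) sb.append(".");
            else if (m.group(3) != null) sb.append("\\\\");
            else sb.append("\\Q").append(m.group(0)).append("\\E");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wildcard = "C:\\Extra\\*.wav";   // the "\E" in "\Extra" is the troublemaker
        String path = "C:\\Extra\\song.wav";
        System.out.println(path.matches(naive(wildcard)));              // false
        System.out.println(path.matches(withBackslashGroup(wildcard))); // true
    }
}
```

In the naive version the \E of \Extra closes the quote right after "C:", so the rest of the path is read as ordinary regex syntax and the match silently fails.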
J. Hanney

Lucene has classes that provide this capability, with additional support for backslash as an escape character. ? matches a single character, * matches 0 or more characters, \ escapes the following character. Supports Unicode code points. It is supposed to be fast, but I haven't tested that.

CharacterRunAutomaton characterRunAutomaton;
boolean matches;
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Walmart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // false
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal*mart")));
matches = characterRunAutomaton.run("Walmart"); // true
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // true
matches = characterRunAutomaton.run("Waldomart"); // true
characterRunAutomaton = new CharacterRunAutomaton(WildcardQuery.toAutomaton(new Term("", "Wal\\*mart")));
matches = characterRunAutomaton.run("Walmart"); // false
matches = characterRunAutomaton.run("Wal*mart"); // true
matches = characterRunAutomaton.run("Wal\\*mart"); // false
matches = characterRunAutomaton.run("Waldomart"); // false
Paul Jackson