
With reference to the question "String.replaceAll single backslashes with double backslashes":

I wrote a test program and found that the result is true in both cases, whether I escape the backslash or not. I am not sure of the reason, but I can think of two possibilities: either \t is a recognized Java String escape sequence (try \s and the compiler complains), or \t is taken as a literal tab in the regex.

Is there any general guideline about escaping regex in Java? I think using two backslashes is the correct approach.

I would still like to know your opinions.

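public class TestDeleteMe {

  public static void main(String[] args) {
    System.out.println(System.currentTimeMillis());

    // tab between a and b, written as an escape so it cannot turn into spaces
    String str1 = "a\tb";

    // pattern - a and b with any number of spaces or tabs between
    // "\\t": the regex engine receives backslash + t and treats it as a tab
    System.out.println("matches = " + str1.matches("^a[ \\t]*b$"));
    // "\t": the compiler inserts a tab character before the regex engine runs
    System.out.println("matches = " + str1.matches("^a[ \t]*b$"));
  }
}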
RuntimeException

4 Answers


Escape sequences are interpreted twice: first by the Java compiler, and then by the regex engine. When the Java compiler sees two backslashes in a string literal, it replaces them with a single backslash. When a t follows a single backslash, the compiler replaces the pair with a tab character; when a t follows a double backslash, the compiler leaves the t alone. Because the two backslashes have become one, the regex engine then sees \t and interprets it as a tab.

I think it is cleaner to let the regex engine interpret \t as a tab (i.e. write "\\t" in Java), because you then see the expression in its intended form during debugging, logging, etc. If you print a Pattern built from "\t", you will see a raw tab character in the middle of your regular expression and may mistake it for other whitespace. Patterns built from "\\t" do not have this problem: they show \t with a single backslash, telling you exactly which kind of whitespace they match.
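A minimal sketch of the two levels, in case it helps to see them side by side. The class name `EscapeLevels` and the helper `visible` are made up for this illustration; only `String` and `java.util.regex.Pattern` come from the JDK.

```java
import java.util.regex.Pattern;

final class EscapeLevels {

    // Source "\\t": after compilation the string holds a backslash followed by 't'.
    static final String ESCAPED_FOR_REGEX = "^a[ \\t]*b$";

    // Source "\t": after compilation the string holds one raw tab character.
    static final String LITERAL_TAB = "^a[ \t]*b$";

    /** Hypothetical helper: makes raw tab characters visible in printed output. */
    static String visible(String s) {
        return s.replace("\t", "<TAB>");
    }

    public static void main(String[] args) {
        // The compiled strings differ by one character.
        System.out.println(ESCAPED_FOR_REGEX.length()); // 10
        System.out.println(LITERAL_TAB.length());       // 9

        // What a log line or debugger would show for each pattern.
        System.out.println(visible(Pattern.compile(ESCAPED_FOR_REGEX).pattern())); // ^a[ \t]*b$
        System.out.println(visible(Pattern.compile(LITERAL_TAB).pattern()));       // ^a[ <TAB>]*b$

        // The regex engine treats both the same way when matching.
        System.out.println("a\tb".matches(ESCAPED_FOR_REGEX)); // true
        System.out.println("a\tb".matches(LITERAL_TAB));       // true
    }
}
```

Without the `visible` helper, the second pattern would print with an invisible tab, which is the readability problem described above.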

Sergey Kalinichenko
  • Thanks. Now I understand that the regex engine accepts both `[ \t]` (\t after the space) and `[ ]` (a literal tab after the space) and processes them the same way. Am I right in saying this? `[ \t]` looks more understandable, though, so I must use `[ \\t]` in Java. – RuntimeException Feb 02 '12 at 14:05
  • @SatishMotwani "must" is too strong a word, but letting `\\t` flow to the regexp is a good practice. – Sergey Kalinichenko Feb 02 '12 at 14:07

Yes, there is a general guideline about escaping. Escape sequences in your Java source are replaced by the Java compiler (or possibly some preprocessor). The compiler complains about any escape sequence it does not know, e.g. \s (Java 15 and later accept \s as an escape for a space, so there the literal compiles but no longer means the regex character class). When you write a String literal for a regex pattern, the compiler processes the literal as usual and replaces every escape sequence with the corresponding character. Then, when the program runs, the Pattern class compiles the input String, that is, it evaluates escape sequences a second time. The Pattern class knows \s as a character class and can therefore compile a pattern containing it. However, you need to protect \s from the Java compiler, which does not know it as a regex construct. To do so, you escape the backslash, resulting in \\s.

In short, every backslash meant for the regex engine has to be doubled in the Java string literal, as in \\s. If you want to match a backslash, the correct Java literal is "\\\\": the Java compiler turns it into \\, which the Pattern compiler recognizes as an escaped backslash character.
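As a hedged sketch of that rule, including the single-to-double backslash replacement from the question referenced at the top: the class name `DoubleEscapeDemo` and the method `doubleBackslashes` are invented for this example, while `Pattern.compile` and `Matcher.replaceAll` are standard JDK calls. Note that the replacement string is a third place where a backslash is special, so it needs its own doubling.

```java
import java.util.regex.Pattern;

final class DoubleEscapeDemo {

    // Regex \d+ (one or more digits): the backslash is doubled in Java source.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Regex \\ (one literal backslash): two regex backslashes, each doubled in source.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    /** Hypothetical helper: replaces every single backslash with two backslashes. */
    static String doubleBackslashes(String input) {
        // A backslash is also special in a replacement string, so two literal
        // backslashes need four at runtime, which is eight in Java source.
        return BACKSLASH.matcher(input).replaceAll("\\\\\\\\");
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\file42.txt";           // runtime value: C:\temp\file42.txt
        System.out.println(path);                       // C:\temp\file42.txt
        System.out.println(DIGITS.matcher(path).find()); // true
        System.out.println(doubleBackslashes(path));    // C:\\temp\\file42.txt
    }
}
```

When no regex features are needed, the non-regex `String.replace("\\", "\\\\")` gives the same result with only the compiler-level escaping to think about.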

Michael Schmeißer
  • Thanks. I understand. So you need to write your `String` in Java so that the Pattern engine gets what it expects. I think I will have to be very careful when writing regex in Java in future. – RuntimeException Feb 02 '12 at 13:57

The first form \\t will be expanded to a tab char by the Pattern class.

The second form \t will be expanded to a tab char by Java before it builds a pattern.

In the end, you get a tab char either way.

tim_yates
  • This is correct, the *"I believe"* is not necessary. The `"\\t"` translates to `"\t"` in the Java string, which translates to a tab character in the regex engine. The `"\t"` translates to a tab character in the Java string, which remains unchanged in the regex. – Tomalak Feb 02 '12 at 13:53

With org.apache.commons.lang3.StringEscapeUtils.unescapeJava(...), you can unescape most of the common special characters as well as Unicode escapes (it converts Unicode escape sequences into the actual readable characters).

  • I was trying to escape `\*` and `\d` (to get asterisks and digits) with 2 or 4 backslashes with no luck. When I used 4 backslashes and `StringEscapeUtils.unescapeJava`, it worked! This saved my sanity; thank you – skia.heliou Oct 25 '18 at 13:09