
For an input String id, I want to apply the four steps below:

  1. Remove every character that is not a lowercase letter, a digit, "-", "_", or ".".
  2. If "." occurs several times in a row, replace the run with a single "." (e.g. he......llo -> he.llo).
  3. If the String starts with ".", remove that ".".
  4. If the String ends with ".", remove that ".".

And here are my four lines of code:

id = id.replaceAll("[^" + "a-z" + "0-9" + "-" + "_" + "." + "]", "");
id = id.replaceAll(".{2,}",".");
id = id.replaceAll("^.","");
id = id.replaceAll(".$","");

I found that rule 2 returns "." (e.g. he...llo -> .) and that rules 3 and 4 remove characters that are not ".".

So I fixed the code like this:

id = id.replaceAll("[^" + "a-z" + "0-9" + "-" + "_" + "." + "]", "");
id = id.replaceAll("\\.{2,}",".");
id = id.replaceAll("\\^.","");
id = id.replaceAll("\\.$","");

And it works fine, but I don't understand why. Does a regular expression need "\" twice before a character? If so, why does rule 1 work fine without it? Can someone give me a specific answer? Finally, can I code rules 3 and 4 in a single call, for example with something like &&?

Kioni
  • `.` has special meaning in the context of regex. Thus if we want to match the literal dot, we have to escape it. A single backslash is not sufficient since this is the escape character for java-strings, and we need a literal backslash in the regex-`String`, which we get by using `\\`. Same goes for `^` and `$`. --- A remark: I imagine `id = id.replaceAll("\\^.","");` should be `id = id.replaceAll("^\\.","");` – Turing85 Sep 12 '20 at 08:48

1 Answer

  • . in a regular expression means "match any single character"
  • \. in a regular expression means "match a single dot/period/full-stop character". A different way to write this would be [.], which has the same end result, but is semantically different (I'm not sure if this has a negative impact on the generated code to match the expression)
  • [abc.] in a regular expression means "match a single character that must be 'a' or 'b' or 'c' or '.'" ([^…] inverts the meaning: match any character that is not listed). Inside a character class the dot is literal, which is why rule 1 works without escaping it. Attention: - has special meaning in a character class, so make sure you always put it first or last if you want to match the hyphen character specifically.
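
The three cases above can be checked directly. This is only a small sketch; the class name DotDemo and the helper mark are made up for illustration and simply replace every match with "#" so the matches become visible:

```java
public class DotDemo {
    // Hypothetical helper: replaces every match of the regex with "#".
    static String mark(String input, String regex) {
        return input.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        // Unescaped dot: matches any character, so all three are replaced.
        System.out.println(mark("a.b", "."));    // ###
        // Escaped dot: matches only the literal dot.
        System.out.println(mark("a.b", "\\."));  // a#b
        // Dot inside a character class: also literal.
        System.out.println(mark("a.b", "[.]"));  // a#b
        // Negated class with the hyphen placed last: only 'D' is matched.
        System.out.println(mark("a-b_c.D", "[^a-z0-9_.-]")); // a-b_c.#
    }
}
```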

As for why the backslash has to be duplicated: Java itself uses the backslash to escape characters in a string. To get a literal backslash as part of the string, you have to escape the backslash itself: "\\" is a string containing a single backslash character ("\" is a syntax error in Java, because the backslash escapes the following quotation mark, i.e. the string is never terminated).
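
A short sketch of the two escaping layers, using the inputs from the question. The class and method names (BackslashDemo, foldDots, foldAnything) are hypothetical and only there to make the example runnable:

```java
public class BackslashDemo {
    // Java source "\\.{2,}" is the regex \.{2,} : two or more literal dots.
    static String foldDots(String s) {
        return s.replaceAll("\\.{2,}", ".");
    }

    // Java source ".{2,}" is the regex .{2,} : two or more of ANY character.
    static String foldAnything(String s) {
        return s.replaceAll(".{2,}", ".");
    }

    public static void main(String[] args) {
        String regex = "\\.";                          // backslash + dot
        System.out.println(regex.length());            // 2
        System.out.println(foldDots("he...llo"));      // he.llo
        System.out.println(foldAnything("he...llo"));  // .
        // "\\^." is the regex \^. : a literal caret followed by any character,
        // so a leading dot is left alone.
        System.out.println(".abc".replaceAll("\\^.", "")); // .abc
        // "^\\." is the regex ^\. : a literal dot at the start of the input.
        System.out.println(".abc".replaceAll("^\\.", "")); // abc
    }
}
```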

To reduce your logic down to two replaceAll calls, I would suggest changing the order of your calls and then joining your expressions as alternatives with the | operator:

id = id.replaceAll("\\.+", ".") // fold all dots
        .replaceAll("[^a-z0-9_.-]|^\\.|\\.$", "");
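
For comparison, here is a sketch that keeps the rule order from the question and joins only rules 3 and 4 with |. It uses three calls instead of two. The point of the trade-off is that removing invalid characters first means dots that only become adjacent, or only reach the start or end, after the removal are still handled. The class name IdNormalizer and the method normalize are assumptions for illustration, not part of the original code:

```java
public class IdNormalizer {
    // Hypothetical helper applying the four rules in the question's order.
    static String normalize(String id) {
        return id.replaceAll("[^a-z0-9_.-]", "")  // rule 1: drop invalid characters
                 .replaceAll("\\.{2,}", ".")      // rule 2: fold runs of dots
                 .replaceAll("^\\.|\\.$", "");    // rules 3 and 4: trim one leading/trailing dot
    }

    public static void main(String[] args) {
        System.out.println(normalize("..he...llo.")); // he.llo
        System.out.println(normalize("he.A.llo"));    // he.llo
        System.out.println(normalize("A.hello"));     // hello
    }
}
```
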
knittl