0

Can anyone explain why

text.replaceAll("\\W|\\d|\\s+", " ");

and

text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");

are different? The first example does not collapse runs of more than one space into a single space, but the second example does.

Helosze

5 Answers

1

The String.replaceAll method scans the string only once, and \W already matches everything that \s matches. That is why the \s+ branch never gets a chance to match in your first code (the first branch on the left wins).

In the second code, the whole resulting string is scanned a second time with \s+.
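As a minimal sketch of the difference, the two calls from the question can be run side by side. The class name, method names, and sample input here are made up for illustration; only the two patterns come from the question.

```java
class ReplaceAllPasses {
    // Single pass: at every whitespace character the \W branch matches first,
    // so each space is replaced by a space and runs are never collapsed.
    static String onePass(String text) {
        return text.replaceAll("\\W|\\d|\\s+", " ");
    }

    // Two passes: the second call scans the output of the first one,
    // so \s+ sees whole runs of spaces and folds each run into one.
    static String twoPasses(String text) {
        return text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "foo   bar42baz"; // hypothetical sample input
        System.out.println("[" + onePass(text) + "]");   // [foo   bar  baz]
        System.out.println("[" + twoPasses(text) + "]"); // [foo bar baz]
    }
}
```

Note that the second pass also collapses the two spaces that replaced the digits "42", which the single-pass version cannot do.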

Casimir et Hippolyte
  • Thanks, but when \s+ is the first one, it still doesn't work. – Helosze Jan 11 '17 at 17:34
  • @Helosze: obviously, since the `\W` and digit characters have not yet been replaced with spaces when `\s+` is tried. To obtain the same result in one pass, use `[\\W\\d]+` – Casimir et Hippolyte Jan 11 '17 at 17:38
  • OK, but the situation is text[space][space][space]text. Shouldn't those 3 spaces be changed into one space by \s+? – Helosze Jan 11 '17 at 17:41
  • @Helosze: No, as I explained in my answer, `\\W` will match each space (one by one) and since the `\\W` branch succeeds, the `\\s+` branch is never tested. *(the first on the left wins)* – Casimir et Hippolyte Jan 11 '17 at 17:43
  • If you put `\\s+` in the first branch, this branch will succeed and your three spaces are replaced with a single space. But there is still an important difference from your second code. In your second code, the part `text.replaceAll("\\W|\\d", " ")` creates a new string containing new space characters, which are then matched by the second part `.replaceAll("\\s+", " ")` – Casimir et Hippolyte Jan 11 '17 at 17:49
  • And it doesn't work when \s+ is the first one, and I don't know why. :) But never mind, [\W\d]+ is what I was looking for. Thanks. – Helosze Jan 11 '17 at 17:52
  • @Helosze: I don't have the time now, but I will try to add a better and more detailed explanation to my answer later. – Casimir et Hippolyte Jan 11 '17 at 18:01
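The point discussed in these comments can be checked with a small sketch. It assumes the reordered pattern `\\s+|\\W|\\d` is what "putting `\s+` first" means; the class name, method names, and inputs are hypothetical.

```java
class WhitespaceFirst {
    // \s+ is now the first branch, so a run of existing spaces collapses,
    // but a digit or punctuation character next to a space is still
    // replaced separately, which leaves adjacent spaces in the output.
    static String whitespaceFirst(String text) {
        return text.replaceAll("\\s+|\\W|\\d", " ");
    }

    // The one-pass alternative suggested in the comments.
    static String classPlus(String text) {
        return text.replaceAll("[\\W\\d]+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + whitespaceFirst("a   b") + "]"); // [a b]
        System.out.println("[" + whitespaceFirst("a1 b") + "]");  // [a  b]
        System.out.println("[" + classPlus("a1 b") + "]");        // [a b]
    }
}
```

In "a1 b" the digit and the space are two separate matches, each replaced by its own space, so the output still contains two spaces. The newly written spaces are never rescanned.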
1

Because in the first example \W takes each space (thus \s+ does not) and replaces it with a space. That still happens in the second example, but \s+ now acts separately, after \W|\d, and folds each run of spaces into a single space character.

Try text.replaceAll("[\\W\\d\\s]+"," ")

1

Your first example: \W|\d|\s+ matches:

  • one non-word character (\W)
  • OR one digit character (\d)
  • OR one-or-more spaces (\s+)

The OR tries its branches from left to right and stops at the first one that matches, so each ' ' matches the \W branch and gets replaced by a single ' '.

Perhaps you want (\W|\d|\s)+, in which the whole group is repeated. However here the \s is redundant, since it's included in \W.

For single characters, it's usually simpler to use a character class rather than |:

[\W\d]+.
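
A short sketch, with a hypothetical class name and sample input, suggesting that the repeated group and the character class give the same result, and that whitespace already matches \W:

```java
class GroupVsClass {
    // Repeated group: the whole alternation is repeated, so a mixed run of
    // punctuation, digits and whitespace is consumed as one match.
    static String withGroup(String text) {
        return text.replaceAll("(\\W|\\d|\\s)+", " ");
    }

    // Character class: same set of characters, without the alternation.
    static String withClass(String text) {
        return text.replaceAll("[\\W\\d]+", " ");
    }

    public static void main(String[] args) {
        String text = "x, 12\ty"; // hypothetical sample input
        System.out.println("[" + withGroup(text) + "]"); // [x y]
        System.out.println("[" + withClass(text) + "]"); // [x y]
        // Whitespace characters are non-word characters, so \s adds nothing.
        System.out.println(" ".matches("\\W"));  // true
        System.out.println("\t".matches("\\W")); // true
    }
}
```
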
slim
0

REGEXP:

\W <= [^a-zA-Z0-9_] (non-word characters, which include whitespace)
\d <= digits
\s+ {
\s <= whitespace
+ <= 1 or more...
}

Example: (+)

\w+ <= [a-zA-Z0-9_] (1 or more)
\d+ <= digits (1 or more)

Result: for "\w+"

hello123 => hello123

Result: for "\d+"

hello123 => 123

Result: for "\w+\d+"

hello123 => hello123

Enjoy.

-1

\W means any non-word character ([^a-zA-Z0-9_]), which includes white-space.

Therefore in your first pattern, the \s+ part is redundant: \W has already matched each single white-space character and replaced it with " ", so \s+ never gets a run of white-space to collapse. The replaceAll method in Java scans the string only once.
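
One way to see this is to list the individual matches with `java.util.regex.Matcher`. This is only an illustrative sketch; the class name, the `spans` helper, and the input are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Returns every match of regex in text as "start-end" (end exclusive).
    static List<String> spans(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a   b"; // 'a', three spaces, 'b'
        // Three one-character matches: \W wins at each space.
        System.out.println(spans("\\W|\\d|\\s+", text)); // [1-2, 2-3, 3-4]
        // One three-character match: \s+ alone takes the whole run.
        System.out.println(spans("\\s+", text));         // [1-4]
    }
}
```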

Tom Lord