Can anyone explain why
text.replaceAll("\\W|\\d|\\s+", " ");
and
text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");
are different? The first example doesn't collapse runs of more than one space, while the second example does.
The String.replaceAll method parses the string only once, and \W already contains \s. That is why the branch \s+ is never tested in your first code (the first branch on the left wins). In the second code, the whole string is parsed another time with \s+.
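To see the difference concretely, here is a minimal sketch that runs both variants on the same input. The class name, method names, and sample text are made up for illustration; only the two replaceAll calls come from the question.

```java
public class ReplaceAllDemo {
    // One pass: at a space, \W matches first, so the \s+ branch is never reached.
    static String singlePass(String text) {
        return text.replaceAll("\\W|\\d|\\s+", " ");
    }

    // Two passes: the second call sees the runs of spaces left by the first.
    static String twoPass(String text) {
        return text.replaceAll("\\W|\\d", " ").replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "foo,  bar1baz"; // a comma, two spaces, one digit
        System.out.println("[" + singlePass(text) + "]"); // [foo   bar baz]
        System.out.println("[" + twoPass(text) + "]");    // [foo bar baz]
    }
}
```

In the single-pass output the comma and the two spaces each become one space, so three spaces remain in a row.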
Because in the first example \W matches each space (so \s+ never gets the chance) and replaces it with a space. That still happens in the second example, but \s+ now runs separately, after \W|\d, and folds runs of spaces into a single space character.
Try text.replaceAll("[\\W\\d\\s]+", " ")
Your first example, \W|\d|\s+, matches: (\W), or (\d), or (\s+). It's a lazy OR (the leftmost alternative that matches wins), so each ' ' matches the \W and gets replaced by a ' '.
Perhaps you want (\W|\d|\s)+, in which the whole group is repeated. However, here the \s is redundant, since it's included in \W.
For single characters, it's usually simpler to use a character class rather than |: [\W\d]+.
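As a rough check, the repeated group and the character class should give the same result. This is only a sketch; the class name, method names, and sample input are hypothetical.

```java
public class CollapseDemo {
    // Character class: any run of non-word characters or digits becomes one space.
    static String withCharClass(String text) {
        return text.replaceAll("[\\W\\d]+", " ");
    }

    // Repeated group: same effect, with the redundant \s left in.
    static String withGroup(String text) {
        return text.replaceAll("(\\W|\\d|\\s)+", " ");
    }

    public static void main(String[] args) {
        String text = "foo,  bar1baz";
        System.out.println("[" + withCharClass(text) + "]"); // [foo bar baz]
        System.out.println("[" + withGroup(text) + "]");     // [foo bar baz]
    }
}
```

Because the + applies to the whole class or group, a mixed run such as a comma followed by two spaces is replaced by a single space in one pass.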
REGEXP:
\W <= [^a-zA-Z0-9_] (non-word characters, which includes whitespace)
\d <= digits [0-9]
\s+ {
\s <= whitespace
+ <= 1 or more...
}
Example: (+)
\w+ <= [a-zA-Z0-9_] (1 or more word characters)
\d+ <= digits (1 or more)
Result: for "\w+"
hello123 => hello123 (digits are word characters too)
Result: for "\d+"
hello123 => 123
Result: for "\w+\d+"
hello123 => hello123
Enjoy.
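If you want to check what each shorthand actually matches, you can ask java.util.regex for the first match. Keep in mind that \w covers digits as well as letters and the underscore, so \w+ on hello123 matches the whole string. The class and helper names below are made up for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShorthandDemo {
    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\w+", "hello123"));      // hello123
        System.out.println(firstMatch("\\d+", "hello123"));      // 123
        System.out.println(firstMatch("\\w+\\d+", "hello123"));  // hello123
        // \W is the complement of \w, so a space counts as a non-word character.
        System.out.println("[" + firstMatch("\\W", "hello 123") + "]"); // [ ]
    }
}
```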
\W means any non-word character ([^a-zA-Z0-9_]), which includes white-space.
Therefore in your first pattern, the \s+ part is redundant: \W already matches each single white-space character and replaces it with " ", so \s+ never gets to match a run. The replaceAll method in Java parses the string only once.