How to use word boundaries with stringr?

Question

I am puzzled by this simple behavior:

> str_detect('the U.S. have been', regex('\\bu\\.s\\.',ignore_case = TRUE))
[1] TRUE
> str_detect('the U.S. have been', regex('\\bu\\.s\\.\\b',ignore_case = TRUE))
[1] FALSE

Why does the matching fail in the second case? Isn't there a word boundary right before "have"?

Thanks!

That is because a word boundary meaning depends on the context. Here, `\b` after `\.` requires the next char to be a word char. It is common knowledge in regex. — Wiktor Stribiżew, May 19 '21 at 00:04
well not so common apparently :D. this is pretty subtle I think. thanks for downvote? — ℕʘʘḆḽḘ, May 19 '21 at 00:06
There are a lot of such regex questions. I just supplied two. — Wiktor Stribiżew, May 19 '21 at 00:07
@WiktorStribiżew would you agree that the solution is `regex("\\bu\\.s\\.[$\\s]")` then? — ℕʘʘḆḽḘ, May 19 '21 at 00:08
You need to match either a whitespace or end of string, and that is `regex("\\bu\\.s\\.(?:\\s|$)")`. `[$\s]` matches a `$` or whitespace. — Wiktor Stribiżew, May 19 '21 at 00:09
interesting. I am pretty sure I had seen examples where using $ within brackets would mean end-of-string. But here it does not work you are right. so `?:$` is lookahead, right? — ℕʘʘḆḽḘ, May 19 '21 at 00:13
A non-capturing group. Please [study when you have spare time](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). And `$` within brackets *NEVER* and nowhere means end-of-string. — Wiktor Stribiżew, May 19 '21 at 00:13
ah I see. I used brackets instead of regular parenthesis. So the solution can be `regex("\\bu\\.s\\.($|\\s)")`. See, all of this is very useful I think! thanks — ℕʘʘḆḽḘ, May 19 '21 at 00:16

score 2 · Accepted Answer · answered May 18 '21 at 23:50

We can use `\\s` in place of the trailing `\\b`:

 str_detect('the U.S. have been', regex('\\bu\\.s\\.\\s',ignore_case = TRUE))

answered May 18 '21 at 23:50

akrun


yes but what about the string "the u.s." ? do you know what the issue is? – ℕʘʘḆḽḘ May 18 '21 at 23:51
it matches only where the one side is letter, digit or underscore. You can check [here](https://www.rexegg.com/regex-boundaries.html) – akrun May 18 '21 at 23:55

score 2 · Answer 2 · answered May 18 '21 at 23:57

Try running the following to see the issue:

str_view_all('the U.S. have been', regex('\\b', ignore_case = TRUE))

`\b` matches word boundaries: positions where a word character (alphabetic letters, marks, decimal numbers and the underscore) sits next to a non-word character or the start or end of the string. Here, the transition from `S` to `.` is a word boundary, since `.` is not a word character. The transition from the final `.` to the space is not, because neither of them is a word character. So your second pattern doesn't match: the final `.` is not immediately followed by a word boundary.
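
As a small illustration of the rule above, here is a sketch that assumes stringr is installed; the short test strings and variable names are made up for the example and are not from the question.

```r
library(stringr)

# \b needs a word character on exactly one side of the position.
after_s     <- str_detect("S. have", "S\\b")    # "S" (word) then "." (non-word)
dot_space   <- str_detect("S. have", "\\.\\b")  # "." then " ": both non-word
dot_letter  <- str_detect("S.have", "\\.\\b")   # "." (non-word) then "h" (word)
dot_at_end  <- str_detect("S.", "\\.\\b")       # "." then end of string

print(c(after_s, dot_space, dot_letter, dot_at_end))
# TRUE FALSE TRUE FALSE
```

Only the cases with a word character on one side of the position report a boundary, which is why a `\\b` placed after the final `\\.` cannot match before a space or at the end of the string.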

answered May 18 '21 at 23:57

Calum You


thanks for clarifying this. So I imagine the right regex here is `str_detect('the U.S. have been', regex('\\bu\\.s\\.[$\\s]',ignore_case = TRUE))` – ℕʘʘḆḽḘ May 18 '21 at 23:59
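
For comparison, a minimal sketch (assuming stringr is installed; the variable names are illustrative) that runs the bracketed pattern suggested in this comment next to the non-capturing group `(?:\\s|$)` from the comments on the question. Inside brackets `$` is a literal dollar sign, so the character class cannot match at the end of the string.

```r
library(stringr)

inputs <- c("the U.S. have been", "the u.s.")

# Character class: matches a literal "$" or a whitespace character
class_pat <- regex("\\bu\\.s\\.[$\\s]", ignore_case = TRUE)

# Non-capturing group: matches a whitespace character OR the end of the string
group_pat <- regex("\\bu\\.s\\.(?:\\s|$)", ignore_case = TRUE)

class_hits <- str_detect(inputs, class_pat)
group_hits <- str_detect(inputs, group_pat)

print(class_hits)  # TRUE FALSE
print(group_hits)  # TRUE TRUE
```

The second input ends right after `u.s.`, so only the grouped alternative, where `$` keeps its meaning as an anchor, detects it.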
