
I am puzzled by this simple behavior:

> str_detect('the U.S. have been', regex('\\bu\\.s\\.',ignore_case = TRUE))
[1] TRUE
> str_detect('the U.S. have been', regex('\\bu\\.s\\.\\b',ignore_case = TRUE))
[1] FALSE

Why does the matching fail in the second case? Isn't there a word boundary right before "have"?

Thanks!

ℕʘʘḆḽḘ

2 Answers


We can use \\s in place of the trailing \\b:

 str_detect('the U.S. have been', regex('\\bu\\.s\\.\\s',ignore_case = TRUE))
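
One caveat, offered as a hedged sketch rather than part of the answer above: `\\s` needs an actual whitespace character, so the pattern would not match when "U.S." is the last thing in the string. Assuming that case matters, an alternation with `$` is one way to cover it. The helper name `has_us` and the second test string are made up for illustration.

```r
library(stringr)

# Hypothetical helper: "U.S." followed by whitespace or by the end of the string.
has_us <- function(s) {
  str_detect(s, regex("\\bu\\.s\\.(?:\\s|$)", ignore_case = TRUE))
}

mid_sentence <- has_us("the U.S. have been")  # "U.S." followed by a space
end_of_text  <- has_us("visiting the U.S.")   # "U.S." followed by end of string

# For comparison: the \\s-only pattern on the end-of-string case.
plain_s_only <- str_detect("visiting the U.S.",
                           regex("\\bu\\.s\\.\\s", ignore_case = TRUE))

print(c(mid_sentence, end_of_text, plain_s_only))
```

Here the `\\s`-only pattern misses the end-of-string case, while the alternation matches both.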
akrun

Try running the following to see the issue:

str_view_all('the U.S. have been', regex('\\b', ignore_case = TRUE))

\b matches word boundaries: positions where a word character (alphabetic letters, marks and decimal numbers) sits next to a non-word character, or next to the start or end of the string. Here, the transition from S to . is a word boundary, since . is not a word character. The transition from . to the following space is not, since neither of them is a word character. So your second pattern doesn't match: the . after S is not immediately followed by a word boundary.
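
A small check of that reasoning, as a sketch that assumes stringr's default regex settings: moving `\\b` to either side of the final dot shows where the boundary actually sits. The variable names are chosen for illustration only.

```r
library(stringr)

x <- "the U.S. have been"

# "S" then \b then ".": a word character next to a non-word character,
# so the boundary sits between them and the pattern matches.
boundary_before_dot <- str_detect(x, "S\\b\\.")

# "S." then \b: the dot is followed by a space, two non-word characters,
# so there is no boundary at that position and the pattern fails.
boundary_after_dot <- str_detect(x, "S\\.\\b")

print(c(boundary_before_dot, boundary_after_dot))
```

The first check comes back TRUE and the second FALSE, matching the two results in the question.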

Calum You
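  • Thanks for clarifying this. So I imagine the right regex here is `str_detect('the U.S. have been', regex('\\bu\\.s\\.(?:\\s|$)',ignore_case = TRUE))`, since `$` inside a character class like `[$\\s]` would be a literal dollar sign rather than the end-of-string anchor. – ℕʘʘḆḽḘ May 18 '21 at 23:59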