Troubles with regexp in R: Match word surrounded by whitespace or start/end of string

Question

I want to count how many words from a dictionary appear in a string, counting a word only when it is surrounded by whitespace or sits at the start or end of the string.

I'm using this answer like this:

library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")
stri_count_regex(testStr, "(^|\\s+)dutch|brown(\\s+|$)")

This returns `3 1 0 1 1 0 0`, but I'm expecting `3 1 0 0 0 0 0`. The problem is that it also counts "dutchAA" and "AAbrown", which I don't want.

I'm a bit puzzled about this, as this regular expression works fine when I run it on RegExr.

Maybe `stri_count_regex(testStr, "\\b(dutch|brown)\\b")`? Not sure of the difference. Can you post the link to the RegExr you used? – rawr, Mar 08 '17 at 17:00

score 2 · Accepted Answer · answered Mar 08 '17 at 17:05

Try the following regex:

(?:\b|\s+)(?:dutch|brown)(?:\s+|\b)

regex demo

library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")
stri_count_regex(testStr, "(?:\\b|\\s+)(?:dutch|brown)(?:\\s+|\\b)")  # 3 1 0 0 0 0 0
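
For context on why the pattern in the question over-counts: alternation (`|`) binds more loosely than concatenation, so the original pattern reads as "`(^|\s+)dutch`" or "`brown(\s+|$)`", and each branch checks only one side of the word. The sketch below compares three variants on the same `testStr`. The grouped and lookaround patterns are not part of the answer above; they are illustrative alternatives, and the variable names are made up for this example.

```r
library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")

# Original pattern: `|` splits it into "(^|\s+)dutch" OR "brown(\s+|$)",
# so "dutchAA" and "AAbrown" each satisfy one branch.
original <- stri_count_regex(testStr, "(^|\\s+)dutch|brown(\\s+|$)")
# 3 1 0 1 1 0 0

# Grouping the words fixes the precedence, but the trailing \s+ consumes the
# space that the next word would need in front of it, so adjacent words are
# under-counted.
grouped <- stri_count_regex(testStr, "(^|\\s+)(dutch|brown)(\\s+|$)")
# 2 1 0 0 0 0 0

# Illustrative alternative (an assumption, not from the answer): zero-width
# lookarounds require "no non-whitespace character" on either side and
# consume nothing, so neighbouring words can share a single space.
lookaround <- stri_count_regex(testStr, "(?<!\\S)(?:dutch|brown)(?!\\S)")
# 3 1 0 0 0 0 0
```

The answer's pattern avoids the under-counting for the same reason the lookarounds do: `\b` is zero-width, so a match does not have to consume the separating space.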

answered Mar 08 '17 at 17:05

m87


Just a little additional context: `(?:pattern)` is a "[non-capturing group](http://stackoverflow.com/a/3513858/143319)", and `\\b` is a word boundary - it matches on the start or end of a word without actually matching any characters from the word. – Matt Parker Mar 08 '17 at 17:22
What's the motivation for the non-capturing groups here, anyway? – Matt Parker Mar 08 '17 at 17:23
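
As a rough illustration of the two comments above: a group being capturing or non-capturing changes only what is remembered, not what matches, so the count stays the same. Separately, because `\b` also fires next to punctuation, it is looser than "surrounded by whitespace". The extra test strings and the lookaround pattern below are assumptions added for this sketch, not part of the original thread.

```r
library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")

# Same counts with non-capturing groups (the answer) and with a plain
# capturing group between word boundaries (the pattern from rawr's comment).
answer_counts  <- stri_count_regex(testStr, "(?:\\b|\\s+)(?:dutch|brown)(?:\\s+|\\b)")
comment_counts <- stri_count_regex(testStr, "\\b(dutch|brown)\\b")
# both: 3 1 0 0 0 0 0

# Hypothetical inputs where a word touches punctuation instead of whitespace.
punct <- c("dutch-brown", "dutch, brown", "(dutch)")

# \b sees a boundary between a letter and "-", ",", "(" or ")".
boundary_counts <- stri_count_regex(punct, "(?:\\b|\\s+)(?:dutch|brown)(?:\\s+|\\b)")
# 2 2 1

# A whitespace-only check (illustrative lookaround variant) rejects those.
strict_counts <- stri_count_regex(punct, "(?<!\\S)(?:dutch|brown)(?!\\S)")
# 0 1 0
```

Which behaviour is wanted depends on whether punctuation-adjacent words should count; the question as written asks for whitespace or string edges only.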
