
I want to count how many words from a dictionary appear in a string, counting a word only if it is surrounded by whitespace or sits at the start or end of the string.

I'm using this answer like this:

library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")
stri_count_regex(testStr, "(^|\\s+)dutch|brown(\\s+|$)")

This returns 3 1 0 1 1 0 0, but I expect 3 1 0 0 0 0 0. The problem is that it also counts "dutchAA" and "AAbrown", which I don't want.

I'm a bit puzzled about this, as this regular expression works fine when I run it on RegExr.

ulima2_
  • Maybe `stri_count_regex(testStr, "\\b(dutch|brown)\\b")`? I'm not sure what the difference is. Can you post the link to the RegExr you used? – rawr Mar 08 '17 at 17:00

1 Answer


Try using the following regex:

(?:\b|\s+)(?:dutch|brown)(?:\s+|\b)


r

library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")
stri_count_regex(testStr, "(?:\\b|\\s+)(?:dutch|brown)(?:\\s+|\\b)")  # 3 1 0 0 0 0 0
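
A likely reason the pattern in the question over-counts is that `|` binds more loosely than everything around it, so `(^|\\s+)dutch|brown(\\s+|$)` is read as "`(^|\\s+)dutch` or `brown(\\s+|$)`". The sketch below splits the pattern into those two halves to show where the extra hits come from. It also shows a whitespace-only variant built on lookarounds, which is my own suggestion rather than part of the answer above. The names `left`, `right` and `strict` are just illustrative.

```r
library(stringi)

testStr <- c("dutch dutch brown", "brown ", "AAdutch", "dutchAA", "AAbrown",
             "brownAA", "hello")

# The question's pattern is effectively these two independent alternatives:
left  <- stri_count_regex(testStr, "(^|\\s+)dutch")  # nothing is checked after "dutch"
right <- stri_count_regex(testStr, "brown(\\s+|$)")  # nothing is checked before "brown"

left          # 2 0 0 1 0 0 0  ("dutchAA" slips through)
right         # 1 1 0 0 1 0 0  ("AAbrown" slips through)
left + right  # 3 1 0 1 1 0 0, the result reported in the question

# Whitespace-only variant (an assumption about the intent, not the answer's regex):
# (?<!\S) = not preceded by a non-space, (?!\S) = not followed by a non-space.
# Lookarounds consume no characters, so neighbouring words can share a space.
strict <- stri_count_regex(testStr, "(?<!\\S)(?:dutch|brown)(?!\\S)")
strict        # 3 1 0 0 0 0 0
```

Because the lookarounds are zero-width, the single space in "dutch dutch" can serve as the boundary for both words, which a plain `(^|\\s+)(dutch|brown)(\\s+|$)` cannot do.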
m87
  • Just a little additional context: `(?:pattern)` is a "[non-capturing group](http://stackoverflow.com/a/3513858/143319)", and `\\b` is a word boundary - it matches on the start or end of a word without actually matching any characters from the word. – Matt Parker Mar 08 '17 at 17:22
  • What's the motivation for the non-capturing groups here, anyway? – Matt Parker Mar 08 '17 at 17:23
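
To make the two comments above concrete, here is a small sketch. It assumes that "surrounded by whitespace" should exclude punctuation, which is where `\\b` and a whitespace test differ; the input vector `x` and the variable names are made up for illustration. It also suggests that the non-capturing groups only provide grouping here, since swapping them for capturing groups leaves the counts unchanged.

```r
library(stringi)

# Hypothetical inputs: punctuation next to a word is a word boundary for \b,
# but it is not whitespace.
x <- c("dutch, brown", "dutch-brown", "dutch brown")

# Pattern from the answer, with non-capturing groups
with_b <- stri_count_regex(x, "(?:\\b|\\s+)(?:dutch|brown)(?:\\s+|\\b)")
with_b     # 2 2 2

# Same pattern with capturing groups: identical counts
capturing <- stri_count_regex(x, "(\\b|\\s+)(dutch|brown)(\\s+|\\b)")
identical(with_b, capturing)  # TRUE

# Whitespace-only lookarounds: "dutch," and "dutch-brown" are no longer counted
ws_only <- stri_count_regex(x, "(?<!\\S)(?:dutch|brown)(?!\\S)")
ws_only    # 1 0 2
```

If the dictionary words never touch punctuation in the real data, both approaches should give the same counts; the difference only matters for inputs like the first two above.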