\\b represents a word boundary. I don't understand why this operator has different effects depending on the character that follows. Example:

test1 <- 'aland islands'
test2 <- 'åland islands'

regex1 <- "[å|a]land islands"
regex2 <- "\\b[å|a]land islands"

grepl(regex1, test1, perl = TRUE)
[1] TRUE
grepl(regex2, test1, perl = TRUE)
[1] TRUE

grepl(regex1, test2, perl = TRUE)
[1] TRUE
grepl(regex2, test2, perl = TRUE)
[1] FALSE

This only seems to be an issue when perl = TRUE:

grepl(regex1, test2, perl = FALSE)
[1] TRUE
grepl(regex2, test2, perl = FALSE)
[1] TRUE

Unfortunately, in my application, I absolutely need to keep perl=TRUE.

Vincent
  • I can't confirm: `grepl("\\b[å|a]land islands", "åland islands", perl = TRUE)` returns `TRUE` for me. – Maurits Evers Jul 23 '18 at 01:26
  • @MauritsEvers you are using Windows? – wp78de Jul 23 '18 at 01:32
  • @wp78de I'm on MacOS and Linux. – Maurits Evers Jul 23 '18 at 01:36
  • I can't reproduce the error. I am on Windows. – Onyambu Jul 23 '18 at 02:19
  • As @wp78de points out in their answer, the nasty thing about this behavior is that it is inconsistent across platforms. When I saw you couldn't reproduce, I tried different things: The weird behavior arises in the Docker environment I usually use, but not when I run R straight from terminal on osx. This is annoying. – Vincent Jul 23 '18 at 02:31

1 Answer

This is a (known) glitch in R's regex subsystem. It is related to the character encoding of the input and to the system locale / build properties.
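Since the behavior depends on encoding and locale, it can help to look at both before debugging the pattern itself. The following is only a diagnostic sketch: the variable names are illustrative, and the subject string is written with a `\u` escape (an assumption made here so that the encoding of the source file does not matter).

```r
# "åland islands", written with a Unicode escape so that the string is
# marked as UTF-8 regardless of how this script file is saved
test2 <- "\u00e5land islands"

# Character-type locale of the current session
locale_ctype <- Sys.getlocale("LC_CTYPE")

# Does R consider the session locale to be UTF-8?
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])

# Declared encoding of the string and its underlying bytes
declared_encoding <- Encoding(test2)
subject_bytes <- charToRaw(test2)

cat("LC_CTYPE:         ", locale_ctype, "\n")
cat("UTF-8 locale:     ", is_utf8_locale, "\n")
cat("Declared encoding:", declared_encoding, "\n")
cat("Bytes:            ", format(subject_bytes), "\n")
```

If the output differs between two machines (for example a desktop session and a Docker container), that difference is a likely candidate for why `\\b` behaves differently.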

The R documentation on grep states (highlighting added):

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Only gsub and gregexpr are mentioned here, but grepl seems to be affected as well.

Possible solutions

  • use R's default (TRE) regex engine with perl = FALSE, as you already discovered.
  • stick with the PCRE regex engine and add the (*UCP) flag (Unicode Character Properties) to the pattern. It makes PCRE classify characters by their Unicode properties, so non-ASCII letters and digits such as å count as word characters and \b can match in front of them:

    Code Sample:

    options(encoding = "UTF-8")
    
    test1 <- 'aland islands'
    test2 <- 'åland islands'
    regex1 <- "[å|a]land islands"
    regex2 <- "(*UCP)\\b[å|a]land islands"    
    grepl(regex1, test1, perl = TRUE)
    #[1] TRUE
    grepl(regex2, test1, perl = TRUE)
    #[1] TRUE
    grepl(regex1, test2, perl = TRUE)
    #[1] TRUE
    grepl(regex2, test2, perl = TRUE)
    #[1] TRUE
    grepl(regex1, test2, perl = FALSE)
    #[1] TRUE
    grepl(regex2, test2, perl = FALSE)
    #[1] FALSE
    

    Online Demo

    Notes:

    • The 6th test, grepl(regex2, test2, perl = FALSE), fails because (*UCP) is PCRE-specific syntax and regex2 is run through TRE there.

    • The (*UCP) flag does not work if the PCRE library used by R was built without Unicode support (may be the case in some environments, e.g. some minimal Cloud/Docker installations).
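Whether the PCRE build behind R supports Unicode properties can be checked from within R. The sketch below uses `pcre_config()` from base R (available in recent R versions); `match_with_boundary()` is a hypothetical helper written for this example, not an R function, and its fallback to TRE is only one possible choice.

```r
# pcre_config() reports how the PCRE library used by R was built
cfg <- pcre_config()
has_ucp <- isTRUE(cfg[["Unicode properties"]])

# Hypothetical helper: prepend (*UCP) when PCRE supports it,
# otherwise fall back to the default TRE engine without the flag
match_with_boundary <- function(pattern, x) {
  if (has_ucp) {
    grepl(paste0("(*UCP)", pattern), x, perl = TRUE)
  } else {
    grepl(pattern, x, perl = FALSE)
  }
}

print(cfg)
print(match_with_boundary("\\b[\u00e5|a]land islands", "aland islands"))
print(match_with_boundary("\\b[\u00e5|a]land islands", "\u00e5land islands"))
```

If perl = TRUE is a hard requirement, the fallback branch could instead stop with an informative error so that a missing Unicode-enabled PCRE is noticed early.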


What's really annoying is that R's behavior is inconsistent across platforms:

  • Works as expected on current 64-bit Windows (10)
  • May work on current Linux distros

Test your original code with these online R environments:

  • tutorialspoint or
  • Ideone

    Only test case 4 is FALSE: grepl(regex2, test2, perl = TRUE)
    (Running R 3.3/3.4 on Linux?)

  • JDoodle

    Test case 4 and 6 are FALSE (Running R 3.3-3.5 on Linux?)
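To compare environments more systematically, the test cases from the question can be run as one table together with the R and PCRE versions. This is a rough sketch; the object names (`patterns`, `subjects`, `cases`) are made up for the example, and the results are intentionally not annotated because they differ between platforms.

```r
# Subjects and patterns from the question, with å written as \u00e5
subjects <- c(test1 = "aland islands", test2 = "\u00e5land islands")
patterns <- c(regex1 = "[\u00e5|a]land islands",
              regex2 = "\\b[\u00e5|a]land islands")

# All combinations of pattern, subject and regex engine
cases <- expand.grid(pattern = names(patterns),
                     subject = names(subjects),
                     perl = c(TRUE, FALSE),
                     stringsAsFactors = FALSE)

# Run grepl() once per row
cases$match <- mapply(
  function(p, s, use_perl) grepl(patterns[[p]], subjects[[s]], perl = use_perl),
  cases$pattern, cases$subject, cases$perl,
  USE.NAMES = FALSE
)

# Version information to report alongside the results
cat(R.version.string, "\n")
cat("PCRE:", extSoftVersion()[["PCRE"]], "\n")
print(cases)
```

Posting this output next to a bug report or question makes it easier for others to see which environment produces which result.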


Further readings:

wp78de
  • This is an **excellent** answer. Thanks so much for taking the time to write this; I really appreciate it. I will read all those links and be better informed next time I run into a similar issue. – Vincent Jul 23 '18 at 02:32