R regular expression: using \\b with 'Å' vs. 'A' characters

Question

\\b represents a word boundary. I don't understand why this operator has different effects depending on the character that follows. Example:

test1 <- 'aland islands'
test2 <- 'åland islands'

regex1 <- "[å|a]land islands"
regex2 <- "\\b[å|a]land islands"

grepl(regex1, test1, perl = TRUE)
[1] TRUE
grepl(regex2, test1, perl = TRUE)
[1] TRUE

grepl(regex1, test2, perl = TRUE)
[1] TRUE
grepl(regex2, test2, perl = TRUE)
[1] FALSE

This only seems to be an issue when perl = TRUE:

grepl(regex1, test2, perl = FALSE)
[1] TRUE
grepl(regex2, test2, perl = FALSE)
[1] TRUE

Unfortunately, in my application, I absolutely need to keep perl=TRUE.

I can't confirm: `grepl("\\b[å|a]land islands", "åland islands", perl = TRUE)` returns `TRUE` for me. — Maurits Evers, Jul 23 '18 at 01:26
As @wp78de points out in their answer, the nasty thing about this behavior is that it is inconsistent across platforms. When I saw you couldn't reproduce, I tried different things: The weird behavior arises in the Docker environment I usually use, but not when I run R straight from terminal on osx. This is annoying. — Vincent, Jul 23 '18 at 02:31

wp78de · Accepted Answer · 2019-01-08T21:51:08.760

This is a (known) glitch in R's regex subsystem. It is related to the character encoding of the input and to the system locale / build properties.

The R documentation on grep states (highlighting added):

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Only gsub and gregexpr are mentioned here, but grepl seems to be affected as well.
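Because the outcome depends on the input encoding and the locale, it can help to inspect both on the machine where the match fails. The following is a minimal diagnostic sketch, not part of the original answer; the variable names `x` and `info` are illustrative, and the values printed will differ between systems.

```r
# Write the test string with a \u escape so the source file encoding
# does not influence the result.
x <- "\u00e5land islands"   # 'åland islands'

# Collect the settings that affect how perl = TRUE treats non-ASCII input.
info <- list(
  string_encoding = Encoding(x),                # declared encoding of the string
  ctype_locale    = Sys.getlocale("LC_CTYPE"),  # character-type locale in use
  utf8_locale     = l10n_info()[["UTF-8"]]      # is the session in a UTF-8 locale?
)
print(info)
```

Comparing this output between an environment where the match works and one where it fails (for example a desktop session versus a Docker container) may show which setting differs.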

Possible solutions

- Use R's default regex engine (TRE) with perl = FALSE, as you already discovered.
- Stick with the PCRE engine and add the (*UCP) flag (Unicode Character Properties). It changes the matching behavior so that Unicode letters and digits count as word characters, which lets \\b find the boundary in front of 'å':

Code Sample:
```
options(encoding = "UTF-8")

test1 <- 'aland islands'
test2 <- 'åland islands'
regex1 <- "[å|a]land islands"
regex2 <- "(*UCP)\\b[å|a]land islands"

# Tests 1-2: ASCII input, PCRE
grepl(regex1, test1, perl = TRUE)
#[1] TRUE
grepl(regex2, test1, perl = TRUE)
#[1] TRUE
# Tests 3-4: non-ASCII input, PCRE
grepl(regex1, test2, perl = TRUE)
#[1] TRUE
grepl(regex2, test2, perl = TRUE)
#[1] TRUE
# Tests 5-6: non-ASCII input, TRE
grepl(regex1, test2, perl = FALSE)
#[1] TRUE
grepl(regex2, test2, perl = FALSE)
#[1] FALSE
```
Online Demo

Notes:
- The 6th test, grepl(regex2, test2, perl = FALSE), returns FALSE because it runs the (*UCP) pattern through TRE, and (*UCP) is a PCRE-specific flag.
- The *UCP flag does not work if R is not installed with Unicode support for PCRE (may be the case in some environments, e.g. some minimal Cloud/Docker installations).
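Whether the PCRE library behind R was built with Unicode property support can be checked from within R. The sketch below assumes a reasonably recent R version in which `pcre_config()` is available; the fallback to perl = FALSE and the names `caps`, `has_ucp` and `res` are illustrative choices, not part of the original answer.

```r
# Report the capabilities of the PCRE library this R build uses.
caps <- pcre_config()
print(caps[c("UTF-8", "Unicode properties")])

# Only use (*UCP) when Unicode property support is compiled in.
has_ucp <- isTRUE(caps[["Unicode properties"]])

x <- "\u00e5land islands"   # 'åland islands'
if (has_ucp) {
  res <- grepl("(*UCP)\\b[\u00e5a]land islands", x, perl = TRUE)
} else {
  # Fall back to the default TRE engine, which does not understand (*UCP).
  res <- grepl("\\b[\u00e5a]land islands", x, perl = FALSE)
}
print(res)
```

If "Unicode properties" is reported as FALSE, the (*UCP) workaround is not an option in that environment, and a differently built R or PCRE would be needed to keep perl = TRUE.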

What's really annoying is that R's behavior is inconsistent across platforms:

- Works as expected on current 64-bit Windows (10)
- May work on current Linux distros

Test your original code with these online R environments:

- tutorialspoint or Ideone: only test case 4, grepl(regex2, test2, perl = TRUE), is FALSE (running R 3.3/3.4 on Linux?)
- JDoodle: test cases 4 and 6 are FALSE (running R 3.3-3.5 on Linux?)
