How to build a regex for a whole word match for all languages in cpp?

Question

Does the \b word-boundary assertion work in C++ for all languages, or only for the Latin alphabet?

If not - how would one match a whole word such as "תפוח"?

Specifically, I thought about something like [^\s]תפוח[$\s], but I'm not sure whether ^ is interpreted as negation or as start of string here...

I'm using the PCRE library.

Regex is not a C++ feature. So it depends on which library you are using for regex. Text encoding will be more important. — Almo, Apr 17 '14 at 13:03
Inside a character class, `$` is a literal `$` and `^` a literal `^`, unless it's the first character in which case this becomes a negative character class: you'll need `(?:\s|^)` — Robin, Apr 17 '14 at 13:04
Word segmentation is a huge problem in languages such as Chinese, Japanese, Sanskrit, and others; it is not solvable via regular expressions, and only semi-solvable using other methods. — Amadan, Apr 17 '14 at 13:05
@Almo What about `std::regex`? It's part of the standard library (but only since C++11, so your compiler may not support it yet). — James Kanze, Apr 17 '14 at 13:37
@Robin your comment helped a lot. And I also found this cool site http://regexr.com/ which clarified the meaning of this pattern. — Uri, Apr 23 '14 at 11:47
I'm glad I could help. My personal favorite is http://regex101.com/, but whatever floats your regex boat :) FYI if you need to push deeper, there are lots of resources in this StackOverflow FAQ: http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean — Robin, Apr 23 '14 at 12:05

score 0 · Answer 1 · answered Apr 17 '14 at 13:17

You don't say which regex engine you are using. In any case, you might consider Boost.Regex, because it has a wrapper that can be used with the ICU library for handling Unicode.

The documentation for this says you can:

Create regular expressions that support various Unicode data properties, including character classification.

This implies \b and \B should work with any encoding supported by ICU.

In the 'standards' section for Unicode compliance it says:

1.4 Simple Word Boundaries: Conforming: non-spacing marks are included in the set of word characters.
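If pulling in ICU is not an option, the whitespace-delimited approach suggested in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `std::regex` rather than PCRE so that it needs nothing beyond the standard library, it treats UTF-8 text as opaque bytes, it assumes words are separated by ASCII whitespace, and `contains_whole_word` is a hypothetical helper name, not part of any library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from PCRE, Boost or ICU): returns true if `word`
// occurs in `text` delimited by whitespace or by the string edges.
// Assumptions: both strings are UTF-8, `word` is non-empty and contains
// no regex metacharacters, and words are separated by ASCII whitespace.
bool contains_whole_word(const std::string& text, const std::string& word) {
    assert(!word.empty());
    // (?:^|\s) : start of string, or one whitespace character before the word
    // (?=\s|$) : lookahead for whitespace or end of string after the word
    const std::regex pattern("(?:^|\\s)" + word + "(?=\\s|$)");
    return std::regex_search(text, pattern);
}
```

With UTF-8 input, `contains_whole_word("אני אוכל תפוח", "תפוח")` should find the word, while a longer word that merely starts with the same letters should not match. The same pattern syntax is also accepted by PCRE. Note the limitation: a word followed directly by punctuation is not treated as a whole word here, and this does nothing for languages that do not separate words with spaces.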
