Does the `\b` word-boundary regex work in C++ for all languages, or only for the Latin alphabet?

If not - how would one match a whole word such as "תפוח"?

Specifically, I thought about something like `[^\s]תפוח[$\s]`, but I'm not sure whether `^` is interpreted as negation or as start of string here...

I'm using the PCRE library.

Uri
  • Regex is not a C++ feature. So it depends on which library you are using for regex. Text encoding will be more important. – Almo Apr 17 '14 at 13:03
  • Inside a character class, `$` is a literal `$` and `^` a literal `^`, unless it's the first character in which case this becomes a negative character class: you'll need `(?:\s|^)` – Robin Apr 17 '14 at 13:04
  • Word segmentation is a huge problem in languages such as Chinese, Japanese, Sanskrit, and others; it is not solvable via regular expressions, and only semi-solvable using other methods. – Amadan Apr 17 '14 at 13:05
  • @Almo What about `std::regex`? It's part of the standard library (but only since C++11, so your compiler may not support it yet). – James Kanze Apr 17 '14 at 13:37
  • @Robin your comment helped a lot. And I also found this cool site http://regexr.com/ which clarified the meaning of this pattern. – Uri Apr 23 '14 at 11:47
  • I'm glad I could help. My personal favorite is http://regex101.com/, but whatever floats your regex boat :) FYI if you need to push deeper, there are lots of resources in this StackOverflow FAQ: http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean – Robin Apr 23 '14 at 12:05
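
Robin's point about `^` and `$` inside a character class can be sketched as follows. This is only an illustration under stated assumptions: it uses `std::regex` (mentioned in the comments) rather than PCRE, it treats the text as raw UTF-8 bytes, it only recognises ASCII whitespace as a delimiter, and it assumes the word contains no regex metacharacters. The helper names `contains_whole_word` and `matches_original_pattern` are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `word` occurs in `text` delimited by ASCII
// whitespace or by the start/end of the string.
// Assumes UTF-8 bytes in both arguments and no regex metacharacters in `word`.
bool contains_whole_word(const std::string& text, const std::string& word) {
    // Alternation instead of a character class, so ^ and $ act as anchors.
    const std::regex pattern("(?:^|\\s)" + word + "(?:\\s|$)");
    return std::regex_search(text, pattern);
}

// The pattern from the question, kept for comparison.
// [^\s] is a negated class (one non-whitespace character is required before
// the word) and [$\s] accepts a literal '$' or whitespace after it.
bool matches_original_pattern(const std::string& text, const std::string& word) {
    const std::regex pattern("[^\\s]" + word + "[$\\s]");
    return std::regex_search(text, pattern);
}
```

With these helpers, `contains_whole_word("אני אוהב תפוח אדום", "תפוח")` should be true and `contains_whole_word("תפוחים", "תפוח")` false, while the question's pattern rejects `" תפוח "` because the character before the word is whitespace. Punctuation directly next to the word is not treated as a delimiter in this sketch.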

1 Answer

You don't say which regex engine you are using. Either way, you might consider Boost.Regex, because it has a wrapper that can be used with the ICU library for handling Unicode.

The documentation for this says you can:

Create regular expressions that support various Unicode data properties, including character classification.

This implies that `\b` and `\B` should work with any encoding supported by ICU.

In the 'standards' section for Unicode compliance it says:

1.4 Simple Word Boundaries: Conforming: non-spacing marks are included in the set of word characters.

harmic