Definition of word boundary \b
Defining a word boundary in words is imprecise. Let me define the word boundary using look-ahead, look-behind, and the shorthand word character class \w.
A word boundary \b
is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
- Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
- Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
(Note how similar this is to the expansion of XOR into conjunction and disjunction)
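The equivalence can be checked empirically. The following is a minimal sketch in Python, chosen here only for illustration (its `re` module supports both `\b` and fixed-width look-behind); the helper name `match_positions` and the sample string are assumptions made for this example.

```python
import re

# The lookaround expansion of \b given above.
WORD_BOUNDARY = r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))"


def match_positions(pattern, text):
    """Return every index at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]


sample = "hi hello# world#"  # sample input for this sketch

builtin_positions = match_positions(r"\b", sample)
expanded_positions = match_positions(WORD_BOUNDARY, sample)

print(builtin_positions)   # [0, 2, 3, 8, 10, 15]
print(expanded_positions)  # [0, 2, 3, 8, 10, 15]
```

Both patterns report the same positions: each one sits between a word character and something that is not a word character (or an edge of the string).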
A non-word boundary \B
is equivalent to:
(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))
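Which means:
- Right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string), and right ahead, we cannot find a word character either (either the character is not a word character, or it is the end of the string).
OR
- Right behind, there is a word character, and right ahead, there is also a word character.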
(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
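A similar check works for \B, and it also shows that \b and \B are complementary: every position in a non-empty string is matched by exactly one of them. This is a sketch in Python, used only for illustration; the helper name `match_positions` and the sample string are assumptions of the example.

```python
import re

# The lookaround expansion of \B given above.
NON_WORD_BOUNDARY = r"(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))"


def match_positions(pattern, text):
    """Return every index at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]


sample = "hi hello# world#"  # sample input for this sketch

boundaries = match_positions(r"\b", sample)
non_boundaries = match_positions(r"\B", sample)
expanded_non_boundaries = match_positions(NON_WORD_BOUNDARY, sample)

print(non_boundaries)           # [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
print(expanded_non_boundaries)  # [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]

# Together, \b and \B cover each of the len(sample) + 1 positions exactly once.
all_positions = sorted(boundaries + non_boundaries)
print(all_positions == list(range(len(sample) + 1)))  # True
```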
Definition of word character \w
Since the definition of \b and \B depends on the definition of \w (see note 1), you need to consult the documentation of your specific regex flavor to know exactly what \w matches.
1 Most regex flavors define \b based on \w. Java [Point 9] is an exception: in default mode, \w is ASCII-only while \b is partially Unicode-aware.
In JavaScript, it would be [A-Za-z0-9_]
in default mode.
In .NET, \w by default matches [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}], and it behaves the same as in JavaScript if the ECMAScript option is specified. The only thing you have to know about this list is that space (ASCII 32, U+0020) does not belong to any of these categories.
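To illustrate how the meaning of \b follows the meaning of \w, here is a small sketch in Python (not one of the flavors discussed above, so its rules are Python's own). By default, Python's \w is Unicode-aware for `str` patterns, and the `re.ASCII` flag restricts it to [a-zA-Z0-9_]. The sample word is an assumption of the example.

```python
import re

# "café", written with an escape so that é is a single code point (U+00E9).
text = "caf\u00e9"

# What \w matches under each mode.
unicode_word_chars = re.findall(r"\w", text)          # default: Unicode-aware
ascii_word_chars = re.findall(r"\w", text, re.ASCII)  # ASCII-only \w

# Where \b matches under each mode.
unicode_boundaries = [m.start() for m in re.finditer(r"\b", text)]
ascii_boundaries = [m.start() for m in re.finditer(r"\b", text, re.ASCII)]

print(unicode_word_chars)  # ['c', 'a', 'f', 'é']
print(ascii_word_chars)    # ['c', 'a', 'f']
print(unicode_boundaries)  # [0, 4]
print(ascii_boundaries)    # [0, 3]
```

With the ASCII-only definition, é stops being a word character, so the boundary moves from the end of the string to the position between f and é.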
Answer to the question
With the definition above, answering the question becomes easy:
"hi hello# world#"
In hello#, the character after # is a space (U+0020, in the Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in the Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.
In world#, what follows # is the end of the string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
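The two cases can be reproduced with a short sketch. Python is used only for illustration, and the patterns `#\B` and `#\b` are assumptions made for this example rather than the pattern from the question.

```python
import re

text = "hi hello# world#"

# Start index of every "#" that is followed by a non-word boundary / a word boundary.
hash_then_non_boundary = [m.start() for m in re.finditer(r"#\B", text)]
hash_then_boundary = [m.start() for m in re.finditer(r"#\b", text)]

print(hash_then_non_boundary)  # [8, 15]
print(hash_then_boundary)      # []

# Splitting \B into its two branches shows which one does the work here.
first_branch = [m.start() for m in re.finditer(r"#(?<!\w)(?!\w)", text)]
second_branch = [m.start() for m in re.finditer(r"#(?<=\w)(?=\w)", text)]

print(first_branch)   # [8, 15]
print(second_branch)  # []
```

Both # characters are found through the (?<!\w)(?!\w) branch, and neither position counts as a word boundary.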
Addendum
Alan Moore gives quite a good summary in the comment:
I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b
matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b
only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.