1

I want to match a word that ends with #, as in:

hi hello# world#

I tried using a word boundary:

\b\w+#\b

and it doesn't match. I thought \b is a non-word boundary, but that doesn't seem to be so in this case.


Surprisingly

\b\w+#\B

matches!

So why does \B work here and not \b? And why doesn't \b work in this case?


NOTE: Yes, we can use \b\w+#(?=\s|$), but I want to know why \B works in this case.

Anirudha

3 Answers

6

Definition of word boundary \b

Defining a word boundary in words is imprecise. Let me define it with look-ahead, look-behind, and the shorthand word character class \w.

A word boundary \b is equivalent to:

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

Which means:

  • Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).

    OR

  • Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).

(Note how similar this is to the expansion of XOR into conjunction and disjunction)
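
As a quick sanity check of that expansion, here is a minimal sketch in Python's `re` module. Python is my choice for illustration and is not one of the flavors discussed in this answer, and the helper name `match_positions` is made up. It lists every position where the built-in \b matches in the question's string and compares that with the look-around version.

```python
import re

# Look-around expansion of \b, copied from the definition above.
WORD_BOUNDARY = r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))"


def match_positions(pattern, text):
    """Return every index at which a zero-width pattern matches (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]


text = "hi hello# world#"
builtin = match_positions(r"\b", text)
expanded = match_positions(WORD_BOUNDARY, text)

print(builtin)              # [0, 2, 3, 8, 10, 15]
print(builtin == expanded)  # True
```

Note that positions 9 (between # and the space) and 16 (end of string, right after #) are missing from the list.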

A non-word boundary \B is equivalent to:

(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))

Which means:

  • Right ahead and right behind, we cannot find any word character. Note that the empty string is considered a non-word boundary under this definition.

    OR

  • Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. cannot occur at the beginning or the end of a non-empty string.

(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
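
The same kind of check works for \B. This is again a sketch in Python's `re` (my assumption, not a flavor named in this answer), run on a non-empty string only, with an illustrative helper name.

```python
import re

# Look-around expansion of \B, copied from the definition above.
NON_WORD_BOUNDARY = r"(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))"


def match_positions(pattern, text):
    """Return every index at which a zero-width pattern matches (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]


text = "hi hello# world#"
builtin_non_boundary = match_positions(r"\B", text)
expanded_non_boundary = match_positions(NON_WORD_BOUNDARY, text)

print(builtin_non_boundary)  # [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
print(builtin_non_boundary == expanded_non_boundary)  # True
```

Positions 9 and 16, the two spots right after each #, show up here, which is exactly where the pattern from the question needs \B to succeed.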

Definition of word character \w

Since the definitions of \b and \B depend on the definition of \w [1], you need to consult your regex flavor's documentation to know exactly what \w matches.

[1] Most regex flavors define \b in terms of \w. The exception is Java [Point 9], where, in default mode, \w is ASCII-only and \b is partially Unicode-aware.

  • In JavaScript, it would be [A-Za-z0-9_] in default mode.

  • In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it will behave the same as in JavaScript if the ECMAScript option is specified. Regarding the characters in the Pc category, you only have to know that space (ASCII 32) is not among them.
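
To see how much the meaning of \w can shift with the mode, here is a small sketch in Python, used only as an illustration since it is not one of the flavors listed above. Python's `re` treats \w as Unicode-aware for `str` patterns by default and restricts it to [a-zA-Z0-9_] under the `re.ASCII` flag. The sample text is made up.

```python
import re

# "naïve café_1", written with escapes so the accented letters are unambiguous.
sample = "na\u00efve caf\u00e9_1"

unicode_words = re.findall(r"\w+", sample)           # default: Unicode-aware \w
ascii_words = re.findall(r"\w+", sample, re.ASCII)   # \w limited to [a-zA-Z0-9_]

print(unicode_words)  # ['naïve', 'café_1']
print(ascii_words)    # ['na', 've', 'caf', '_1']
```

Because \b and \B are built on \w, the boundary positions move along with it.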

Answer to the question

With the definition above, answering the question becomes easy:

"hi hello# world#"

In hello#, the character after # is a space (U+0020, in the Zs category), which is not a word character, and # itself is not a word character either (in Unicode, it is in the Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) applies in this case.

In world#, the # is followed by the end of the string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) applies in this case as well.
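
The three patterns from the question can be run side by side to confirm this. A minimal sketch, assuming Python's `re` (the question does not say which flavor it uses):

```python
import re

text = "hi hello# world#"

with_b = re.findall(r"\b\w+#\b", text)                 # \b after '#'
with_big_b = re.findall(r"\b\w+#\B", text)             # \B after '#'
with_lookahead = re.findall(r"\b\w+#(?=\s|$)", text)   # alternative from the question

print(with_b)          # []
print(with_big_b)      # ['hello#', 'world#']
print(with_lookahead)  # ['hello#', 'world#']
```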

Addendum

Alan Moore gives quite a good summary in the comment:

I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.

Community
nhahtdh
  • I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say `\b` matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character *before* the current position and the character *after* the current position. Thus, `\b` only indicates that the current position *could* be a word boundary. It's up to you to make sure the characters on either side are what they should be. – Alan Moore May 18 '13 at 15:30
5

The pound # symbol is not considered a "word boundary".

\b\w+#\b doesn't work because \w+# is not considered a word, so it will not match world#.
\b\w+6\b, on the other hand, is, so it will match world6.

"Word Characters" are defined by: [A-Za-z0-9_].

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
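
Here is a rough sketch of those two cases, assuming Python's `re` (the flavor is my assumption). The third input, world#x, is an example I made up to show that the trailing \b only cares about the characters on either side of it:

```python
import re

pattern_hash = re.compile(r"\b\w+#\b")
pattern_six = re.compile(r"\b\w+6\b")

six_match = pattern_six.search("world6")          # '6' is a word character, then end of string
hash_match = pattern_hash.search("world#")        # '#' is not a word character, then end of string
hash_x_match = pattern_hash.search("world#x")     # '#' is followed by the word character 'x'

print(six_match.group())     # world6
print(hash_match)            # None
print(hash_x_match.group())  # world#
```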

http://www.regular-expressions.info/wordboundaries.html

Ayman Safadi
  • yes indeed..but `\b.+?\b` seems to match any word containing non word character too – Anirudha May 18 '13 at 10:56
  • So `\b` is not a character that the RegEx matches, it is an [anchor](http://www.regular-expressions.info/anchors.html). In other words your RegEx isn't matching word or non-word characters, it's matching `w+#` in your first example and `.+?` (anything) in your second. You're using the `\b` anchor to describe the "surroundings" of your match. – Ayman Safadi May 18 '13 at 11:13
  • 1
    `The pound # symbol is not considered a "word boundary".` Word boundary is not defined by a single character. It is defined by 2 characters. `"Word Characters" are defined by: [A-Za-z0-9_].` Depends on which language you are using. If we are talking about .NET, then it will include Unicode characters. – nhahtdh May 18 '13 at 12:30
1

The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.
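
One way to make that concrete is to test a position by hand, looking only at the character on each side. This is a minimal sketch in Python; the function names are made up for illustration, and \w here follows Python's default Unicode-aware definition.

```python
import re


def is_word_char(ch):
    """True if the single character ch is matched by \\w."""
    return re.fullmatch(r"\w", ch) is not None


def is_word_boundary(text, pos):
    """True if pos sits between a word character and a non-word character (or a string edge)."""
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after  # exactly one side is a word character


text = "hello# world"
between_o_and_hash = is_word_boundary(text, 5)       # 'o' | '#'
between_hash_and_space = is_word_boundary(text, 6)   # '#' | ' '

print(between_o_and_hash)      # True
print(between_hash_and_space)  # False
```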

tom