I have the following regular expression for capturing positive & negative time offsets.

\b(?<sign>[\-\+]?)(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5]\d)\b

It matches fine but the leading sign doesn't appear in the capture group. Am I formatting it wrong? You can see the effect here https://regex101.com/r/CQxL8q/1/

paddyb

2 Answers

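That is because of the first \b. A word boundary cannot match between the start of the string (or of a line) and a - or +, because there is no word char on either side of that position. The match therefore starts after the sign, where \b does match, and the optional sign group captures an empty string.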
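The effect can be reproduced outside regex101. The sketch below is a minimal Python illustration, not part of the original answer. Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the pattern is adapted for that; otherwise it is the one from the question.

```python
import re

# Pattern from the question, with (?<name>...) rewritten as (?P<name>...)
# because that is the named-group syntax Python's re module accepts.
ORIGINAL = re.compile(
    r"\b(?P<sign>[\-\+]?)(?P<hours>2[1-3]|[01][0-9]|[1-9]):(?P<minutes>[0-5]\d)\b"
)

match = ORIGINAL.search("-13:21")
whole = match.group(0)       # text covered by the overall match
sign = match.group("sign")   # text captured by the sign group

print(repr(whole))  # '13:21'  (the match starts after the minus sign)
print(repr(sign))   # ''       (the optional group matched nothing)
```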

You need to move the word boundary after the optional sign group:

(?<sign>[-+]?)\b(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5][0-9])\b
              ^^

See the regex demo.

Now, since the char following the word boundary is a digit (a word char), the word boundary works as intended: it fails every match where the first digit is preceded by another word char.
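As a quick check of this behaviour, here is a small Python sketch using the pattern from this answer. The named groups are rewritten as (?P<name>...) for Python's re module, and the sample strings are made up for illustration.

```python
import re

# Pattern from this answer: the word boundary now sits after the sign group.
MOVED = re.compile(
    r"(?P<sign>[-+]?)\b(?P<hours>2[1-3]|[01][0-9]|[1-9]):(?P<minutes>[0-5][0-9])\b"
)

negative = MOVED.search("-13:21")
positive = MOVED.search("+05:30")
glued = MOVED.search("a13:21")  # digit preceded by a word char

print(repr(negative.group("sign")))  # '-'
print(repr(positive.group("sign")))  # '+'
print(glued)                         # None
```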

Wiktor Stribiżew

The word boundary anchor (\b) matches at the transition from a word character (letter, digit or underscore) to a non-word character, or vice versa. There is no such transition before the sign in -13:21.

The word boundary anchor can stay between the sign and the hours to avoid matching inside expressions that merely look like a time (65401:23), but it cannot prevent a match in 654:01:23 or 654-01:23.
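The three inputs mentioned above can be tried directly. This is only an illustrative Python sketch: the helper name first_match is hypothetical, and the named groups use Python's (?P<name>...) spelling.

```python
import re

# Word boundary placed between the optional sign and the hours.
OFFSET = re.compile(
    r"(?P<sign>[-+]?)\b(?P<hours>2[0-3]|[01][0-9]|[0-9]):(?P<minutes>[0-5][0-9])\b"
)

def first_match(text):
    """Return the first matched substring, or None when nothing matches."""
    m = OFFSET.search(text)
    return m.group(0) if m else None

print(first_match("65401:23"))   # None
print(first_match("654:01:23"))  # 01:23
print(first_match("654-01:23"))  # -01:23
```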

As a side note, [\-\+] is just a convoluted way to write [-+]. + does not have any special meaning inside a character class, so there is no need to escape it. - is a special character inside a character class, but not when it is the first or the last character (i.e. [- or -]).

Another remark: you use both [0-9] and \d in your regex. They denote the same thing¹ but, for readability, it's recommended to stick to only one convention. Since other character classes that contain only digits are used, I would use [0-9] and not \d.

There are also bugs in the regex fragment for hours: 2[1-3]|[01][0-9]|[1-9] does not match 0 (although it matches 00), and it does not match 20.
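The hours fragment can be tested in isolation. The sketch below uses Python's re.fullmatch so that each candidate must be matched in its entirety. The corrected fragment shown for comparison widens 2[1-3] to 2[0-3] and [1-9] to [0-9].

```python
import re

BUGGY_HOURS = r"2[1-3]|[01][0-9]|[1-9]"
FIXED_HOURS = r"2[0-3]|[01][0-9]|[0-9]"

def accepts(fragment, text):
    """True when the fragment matches the whole of text."""
    return re.fullmatch(fragment, text) is not None

for hours in ("0", "00", "20", "24"):
    print(hours, accepts(BUGGY_HOURS, hours), accepts(FIXED_HOURS, hours))
# 0 False True
# 00 True True
# 20 False True
# 24 False False
```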

Given all the above corrections and improvements, the regex should be:

(?<sign>[-+]?)\b(?<hours>2[0-3]|[01][0-9]|[0-9]):(?<minutes>[0-5][0-9])\b
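A minimal usage sketch in Python, assuming the pattern above with the named groups rewritten as (?P<name>...), which is the spelling Python's re module expects. The function name parse_offset is hypothetical and only serves the illustration.

```python
import re

OFFSET = re.compile(
    r"(?P<sign>[-+]?)\b(?P<hours>2[0-3]|[01][0-9]|[0-9]):(?P<minutes>[0-5][0-9])\b"
)

def parse_offset(text):
    """Return (sign, hours, minutes) for the first offset in text, or None."""
    m = OFFSET.search(text)
    if m is None:
        return None
    return m.group("sign"), int(m.group("hours")), int(m.group("minutes"))

print(parse_offset("-20:15"))  # ('-', 20, 15)
print(parse_offset("+9:05"))   # ('+', 9, 5)
print(parse_offset("0:00"))    # ('', 0, 0)
print(parse_offset("24:00"))   # None
```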

¹ \d is the same as [0-9] when the Unicode flag is not set. When Unicode matching is enabled, \d also matches digits from non-Latin scripts in many regex flavors.
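The difference is easy to observe in Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \d to [0-9]. The sketch below uses the Arabic-Indic digit three (U+0663) as a sample; other flavors may behave differently.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

unicode_d = re.fullmatch(r"\d", arabic_three) is not None
bracket = re.fullmatch(r"[0-9]", arabic_three) is not None
ascii_d = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(unicode_d)  # True
print(bracket)    # False
print(ascii_d)    # False
```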

axiac
  • Thanks. Some nice tips there – paddyb Sep 28 '17 at 10:38
  • @paddyb: Just FYI: [`\d` is not always matching the same chars as `[0-9]`](https://stackoverflow.com/a/16621778/3832970). It is also true if we use the pattern in Python 3, or with a unicode modifier in PHP, Java, Python 2. – Wiktor Stribiżew Sep 28 '17 at 10:46