^(?:([0-9]+):)??(?:([0-9]+):)?([0-9]+)(?:[.,]([0-9]+))?[^0-9]* $

This is a regular expression that my professor wrote for me in RStudio to extract Olympic results from a website. Can someone explain in some detail what each part of the regular expression does and how the parts work together?

Some examples of results that this regular expression is used for are:

  • 3:49:03
  • 1:21:08
  • 49,03
  • 3:42,02

Thank you for all the help in advance.

Giuseppe Romagnuolo
ProgFacts

2 Answers

Let's start with a legend of the syntax used:

  • ^ is beginning of string
  • (?:pattern) non capturing group
  • ?? 0 or 1 time, non-greedy (the regex first tries to skip the preceding item)
  • [^0-9] anything except 0-9; a caret as the first character inside square brackets negates the set in the brackets
  • + one or more
  • * 0 or more
  • $ end of string
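
The difference between the greedy ? and the non-greedy ?? is easiest to see in what gets captured. A minimal R sketch, assuming a simplified version of the pattern (digits and colons only, not the full expression from the question) and the made-up input "3:42":

```r
# Simplified, hypothetical patterns: both have two optional "digits:" groups.
# The first group is lazy (??) in one pattern and greedy (?) in the other.
lazy   <- "^(?:([0-9]+):)??(?:([0-9]+):)?([0-9]+)$"
greedy <- "^(?:([0-9]+):)?(?:([0-9]+):)?([0-9]+)$"

x <- "3:42"

# regexec() reports the full match followed by each captured group;
# perl = TRUE selects the PCRE engine.
lazy_groups   <- regmatches(x, regexec(lazy, x, perl = TRUE))[[1]]
greedy_groups <- regmatches(x, regexec(greedy, x, perl = TRUE))[[1]]

print(lazy_groups)    # the "3" lands in the second capture group
print(greedy_groups)  # the "3" lands in the first capture group
```

With ?? the engine first tries to skip the group, so a single leading number ends up in the second slot; the first slot is only filled when the rest of the string cannot be matched otherwise.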

So now let's analyse what you have:

^(?:([0-9]+):)??(?:([0-9]+):)?([0-9]+)(?:[.,]([0-9]+))?[^0-9]* $

  1. ^ Start of string
  2. (?:pattern:) a non-capturing group, (?: ... ), containing a pattern followed by a colon
    • ([0-9]+) one or more digits 0-9, captured (group 1)
  3. ?? the preceding group occurs 0 or 1 time, non-greedy
  4. (?:pattern:) a non-capturing group containing a pattern followed by a colon
    • ([0-9]+) one or more digits 0-9, captured (group 2)
  5. ? the preceding group occurs 0 or 1 time, greedy
  6. ([0-9]+) one or more digits 0-9, captured (group 3)
  7. (?:[.,]pattern) a non-capturing group containing a dot or comma followed by a pattern
    • ([0-9]+) one or more digits 0-9, captured (group 4)
  8. ? the preceding group occurs 0 or 1 time, greedy
  9. [^0-9]* any characters other than the digits 0-9, matched 0 or more times
  10. a literal space
  11. $ end of string
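
To see how the pieces work together, the pattern can be run against the examples from the question. This is only a sketch in base R: the column names (hours, minutes, seconds, fraction) are my interpretation of the four capture groups, and each input ends with a space because the pattern as posted requires one before the end of the string.

```r
# Pattern copied from the question, including the literal space before $.
pattern <- "^(?:([0-9]+):)??(?:([0-9]+):)?([0-9]+)(?:[.,]([0-9]+))?[^0-9]* $"

# Example results from the question; the trailing space is an assumption
# about how the scraped strings look.
results <- c("3:49:03 ", "1:21:08 ", "49,03 ", "3:42,02 ")

# One character vector per input: full match, then the four capture groups.
m <- regmatches(results, regexec(pattern, results, perl = TRUE))

# Stack the vectors into a matrix; the column names are illustrative only.
parts <- do.call(rbind, m)
colnames(parts) <- c("match", "hours", "minutes", "seconds", "fraction")
print(parts)
```

Groups that did not take part in a match come back as empty strings, so "49,03 " fills only the seconds and fraction columns, and "3:42,02 " puts the 3 into minutes rather than hours because the first group is non-greedy.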
Giuseppe Romagnuolo
  • ^ Start of the string
  • ([0-9]+) One or more of the digits 0-9
  • (x)? x occurs at most once (0 or 1 time)
  • $ End of the string

Why he bothered to anchor the pattern to the start and end of the string is beyond me, as is why he used so many ? quantifiers.

I probably would have written it like this:

(([0-9]+)([:,.]?))*([0-9]+)

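That is, (([0-9]+)([:,.]?)) matches at least one digit followed by an optional separator (one of : , or .). This group is repeated any number of times (0, 1, ..., n) and is followed by at least one more digit. The pattern therefore also matches numbers without separators, such as 12. To require at least one separator, replacing the * with a + is not enough while the separator stays optional; the ? after [:,.] has to go as well, for example ([0-9]+[:,.])+[0-9]+.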

With stringr, extraction would look like this:

library(stringr)
str_extract(pattern = '(([0-9]+)([:,.]?))*([0-9]+)', string= 'hello, this is a time 02:04,34 in a sentence')

Output would be "02:04,34"
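
One caveat about requiring a separator: since [:,.]? is optional, swapping the * for a + still lets a plain number like 12 through. A small base R sketch of the difference; the stricter pattern and the sample strings are my own illustration, not part of the original suggestion:

```r
# Variant with + but an optional separator, and a stricter variant
# (hypothetical) where every repeated group must end in a separator.
optional_sep <- "(([0-9]+)([:,.]?))+([0-9]+)"
required_sep <- "([0-9]+[:,.])+[0-9]+"

# Return the first match in x, or character(0) when nothing matches.
extract_first <- function(p, x) regmatches(x, regexpr(p, x, perl = TRUE))

time_text  <- "hello, this is a time 02:04,34 in a sentence"
plain_text <- "lap 12 done"

print(extract_first(optional_sep, plain_text))  # still finds the plain number
print(extract_first(required_sep, plain_text))  # no match for a plain number
print(extract_first(required_sep, time_text))   # the time is still extracted
```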

scrimau