
I am having a hard time creating a regex (in Python 3.6) with which I could parse a datetime string, following these rules:

  • Date is always in the form of YYYYMMDD, where YYYY is any year between 2000 and 2099 (inclusive), so that becomes 20yyMMDD
  • Time is always in the form of HHMMSS
  • Date is always before time, as YYYYMMDDHHMMSS
  • Date and time can be separated with no character or any non-numeric character
    • YYYYMMDDHHMMSS - Ok
    • YYYYMMDD-HHMMSS - Ok
    • YYYYMMDD HHMMSS - Ok
    • YYYYMMDD1HHMMSS - Not accepted
  • There can be any characters in front or at the end, except the character "touching" the date string must be non-numeric
    • (YYYYMMDDHHMMSS) - Ok
    • 123-YYYYMMDDHHMMSS)123 - Ok
    • abc1YYYYMMDDHHMMSS - Not accepted

I know the basics of regex and have read many SO answers (I found "Regex: match everything but" and "Regex, every non-alphanumeric character except white space or colon", among others, pretty useful), but I just cannot figure out a regex that passes all of my test cases.

I need two groups for the actual date and time parsing, that is `(20[\d]{6})([\d]{6})`. Then I added support for the additional characters, `.*(20[\d]{6})[^\d]?([\d]{6}).*`, which works fine until there is a numeric character in front, at the end, or in the middle: the string should not match, but it does. So I started adding different things at the front or the back, for example `(?<![\d])`, `.*[^\d]?`, `[^\d]?.*`, ... but unfortunately my regex knowledge runs out quickly, and the pattern becomes a mess that I do not understand and that does not work properly.

I made some test strings (each with the desired results) and a simple test function:

import datetime
import re
from typing import List, Optional, Tuple

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)

# The expected value is None when the string must not match.
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]

for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))

But I just cannot get all of the test strings to pass. For example:

a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10

Thank you for any ideas.

SherylHohman
Bojan P.

1 Answer


It seems that you need to check the general pattern with the regex while validating actual date time values with the appropriate Python methods.
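To illustrate that split of responsibilities, here is a minimal sketch of the validation half. The helper name `to_datetime` is hypothetical and not part of the question's code. It shows that a 14-digit string can have the right shape and still not be a real date, which `datetime.datetime.strptime` reports as a `ValueError`.

```python
import datetime
from typing import Optional


def to_datetime(digits: str) -> Optional[datetime.datetime]:
    """Parse a YYYYMMDDHHMMSS digit string; return None if the values are invalid."""
    try:
        return datetime.datetime.strptime(digits, "%Y%m%d%H%M%S")
    except ValueError:
        # The string had the right shape but an impossible value.
        return None


print(to_datetime("20171217091011"))  # a real date and time
print(to_datetime("20170230091011"))  # 30 February does not exist, so None
```

A regex that tried to reject 30 February by itself would quickly become unreadable, so leaving that part to `strptime` keeps the pattern short.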

So, you may fix the code by using the following regex with `re.search` instead of `re.match` (`re.match` only matches at the start of the string):

r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'

See the regex demo

Details

  • (?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
  • 20 - a 20 substring
  • \d{6} - any 6 digits
  • \D? - 1 or 0 non-digit chars
  • \d{6} - any 6 digits
  • (?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.
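As a sketch of how the pieces above might fit together, the snippet below wraps the date part and the time part in capture groups and uses `re.search`, so the match may start anywhere in the string. The names `DATETIME_RE` and `parse_datetime` are made up for this illustration, and the sample strings are taken from the question's test list.

```python
import datetime
import re
from typing import Optional

# Same pattern as above, with capture groups around the date and time parts.
DATETIME_RE = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def parse_datetime(text: str) -> Optional[datetime.datetime]:
    """Return the first datetime found in text, or None if nothing valid matches."""
    match = DATETIME_RE.search(text)
    if match is None:
        return None
    try:
        # Joining the groups drops the optional separator matched by \D?.
        return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        # Right shape, but not a real date or time.
        return None


print(parse_datetime("123(20171217-091011)456"))         # 2017-12-17 09:10:11
print(parse_datetime("20171218-20171217091011-070809"))  # 2017-12-17 09:10:11
print(parse_datetime("201712170091011"))                 # None
```

In the second sample the first candidate, `20171218-201712`, is rejected by the lookahead because a digit follows it, so the search moves on to the full timestamp after the hyphen.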
Wiktor Stribiżew
  • The only change I made to your solution was to add two groups, `20\d{6}` -> `(20\d{6})` and `\d{6}` -> `(\d{6})` which I then parse with `"".join(match.groups())`. I could use `match.string` without the groups but I don't know how to get rid of `\D?` match in the middle. And I replaced `re.match` with `re.search` as you suggested (otherwise it does not work). – Bojan P. Dec 17 '17 at 17:10
  • @BojanP. Since it is impossible to match discontinuous texts within a single regex matching operation, joining the two group values is the right way to go. Also, just note that `re.match` only matches the pattern at the start of the string, thus, you have to use `re.search` whenever the match position is unknown. – Wiktor Stribiżew Dec 17 '17 at 18:23
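The two points in these comments can be checked with a short sketch. The variable names are chosen for this example only, and the pattern is the grouped variant described in the first comment.

```python
import re

pattern = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")

# re.match anchors at index 0, re.search scans the whole string.
prefixed = "a-20171217091011 b"
print(pattern.match(prefixed))            # None
print(pattern.search(prefixed).group(0))  # 20171217091011

# The whole match keeps the separator; the joined groups do not.
separated = pattern.search("20171217-091011")
print(separated.group(0))            # 20171217-091011
print("".join(separated.groups()))   # 20171217091011
```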