I am having a hard time creating a regex (in Python 3.6) with which I could parse a datetime string, following these rules:
- Date is always in the form of YYYYMMDD, where YYYY is any year between 2000 and 2099 (inclusive), so that becomes 20yyMMDD
- Time is always in the form of HHMMSS
- Date is always before time, as YYYYMMDDHHMMSS
- Date and time can be separated with no character or any non-numeric character
  - YYYYMMDDHHMMSS: OK
  - YYYYMMDD-HHMMSS: OK
  - YYYYMMDD HHMMSS: OK
  - YYYYMMDD1HHMMSS: Not accepted
- There can be any characters in front or at the end, except the character "touching" the date string must be non-numeric
  - (YYYYMMDDHHMMSS): OK
  - 123-YYYYMMDDHHMMSS)123: OK
  - abc1YYYYMMDDHHMMSS: Not accepted
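For illustration, the rules above could be written down roughly as follows. This is only a sketch under stated assumptions, not a confirmed solution: the names DATETIME_RE and parse_first_datetime are hypothetical, and it assumes that "touching" digits can be rejected with a negative lookbehind `(?<!\d)` and a negative lookahead `(?!\d)`, with `\D?` standing for the optional single non-numeric separator.

```python
import datetime
import re
from typing import Optional

# Hypothetical name. (?<!\d) and (?!\d) are assumed to express the rule that
# no digit may touch the timestamp; \D? is the optional one-character separator.
DATETIME_RE = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def parse_first_datetime(text: str) -> Optional[datetime.datetime]:
    """Return the first candidate in text that parses as a datetime, else None."""
    # finditer scans the whole string, unlike re.match, which only tries
    # to match at the very beginning of the string.
    for match in DATETIME_RE.finditer(text):
        try:
            return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            continue  # Right shape, but not a real date/time; try the next candidate.
    return None


print(parse_first_datetime("123(20171217-091011)456"))  # 2017-12-17 09:10:11
print(parse_first_datetime("abc120171217091011"))       # None
```

The lookarounds are zero-width, so they check the neighbouring characters without consuming them, which is why nothing like `.*` is needed around the two groups.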
I know the basics of regex and have read many SO answers (I found "Regex: match everything but" and "Regex, every non-alphanumeric character except white space or colon", among others, pretty useful), but I just cannot figure out a regex that passes all of my test cases.
I need two groups for the actual date and time parsing, that is (20[\d]{6})([\d]{6}). Then I added support for the additional characters: .*(20[\d]{6})[^\d]?([\d]{6}).* This works fine until there is a numeric character in front, at the end, or in the middle: then it should not match, but it does. So I started adding different things at the front or the back, for example (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge soon runs out, and the pattern becomes a mess that I do not understand and that does not work properly.
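A small experiment may help show why the `.*` and `[^\d]?` additions do not reject digits: both can match nothing at all, or (for `.*`) a digit, so they never forbid anything. The sketch below compares that loose pattern with a lookaround-based variant; the variable names `loose` and `strict` are made up for this comparison, and the strict pattern is an assumption about what the rules require rather than a verified answer.

```python
import re

# The attempt described above: optional pieces that can also match "nothing".
loose = re.compile(r".*(20\d{6})\D?(\d{6}).*")
# Assumed alternative: zero-width checks that a digit is NOT adjacent.
strict = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")

for text in ("201712170910110", "020171217091011", "a20171217-091011b"):
    # Prints the string, then whether each pattern finds a timestamp in it.
    print(text, bool(loose.match(text)), bool(strict.search(text)))
```

In this comparison the first two strings, which have a digit touching the timestamp, are accepted by the loose pattern and rejected by the strict one, while the third is accepted by both.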
I made some test strings (each with the desired results) and a simple test function:
import datetime
import re
from typing import List, Optional, Tuple

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)
# The expected value is None when the string must not match.
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]
for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))
But I just cannot get all of the test strings to pass. For example, with the regex above I get:
a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10
Thank you for any ideas.