I am trying to match a string to a regex in Python 3.x, and then use the same regex to match a bunch of sample strings. I have a catalog of regexes (grok) which I use to match the custom string that I have.
I am using the regex library and not the default re.
PROBLEM
Consider a simple log line -
"[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG: duration: 0.016 ms statement: SELECT id FROM location WHERE name = 'enigma->/var/log/messages' AND server_id = '1'"
Now, my Python code browses through a catalog (a dict) of regexes that I have, and finds all the regexes which match this string.
Example,
catalog.txt
'TIMESTAMP_ISO8601': '(?>\d\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?',
'HTTPDATE': '(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])\/\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b\/(?>\d\d){1,2}:(?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9]) (?:[+-]?(?:[0-9]+))',
'IPV4': '(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])',
'QUOTEDSTRING': '(?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))'
I read this file and store it as a dict in my Python code. Now, I start searching the string for matches -
import regex, json
catalog = {
'HTTPDATE': '(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])/\\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\\b/(?>\\d\\d){1,2}:(?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9]) (?:[+-]?(?:[0-9]+))',
'TIMESTAMP_ISO8601': '(?>\\d\\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?',
'IPV4': '(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])'}
input_string = "[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG: duration: 0.016 ms statement: SELECT id FROM location WHERE name = 'enigma->/var/log/messages' AND server_id = '1'"
matches_dict = {}
for key in catalog.keys():
    try:
        res = regex.search(catalog.get(key), input_string)
        if res and len(res) == 1:
            matches_dict[key] = res
    except Exception as e:
        print(e)
print(matches_dict)
This prints -
>>> {'HTTPDATE': <regex.Match object;'IPV4': <regex.Match object; span=(29, 41),match='192.168.2.10'>}'
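Because the dictionary holds Match objects, the printed result is hard to read. A minimal sketch of reducing each match to its text and span, assuming only the IPV4 entry from the catalog above and a shortened copy of the log line; the helper name `summarize_matches` is hypothetical, and the fallback to the standard `re` module is only there so the snippet also runs where `regex` is not installed:

```python
try:
    import regex as rx  # third-party library used in the question
except ImportError:
    import re as rx  # fallback; this particular pattern is also valid for re

# Only the IPV4 entry from the catalog above, to keep the sketch short.
catalog = {
    'IPV4': '(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])',
}

# Shortened copy of the log line from the question.
input_string = "[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG: duration: 0.016 ms"


def summarize_matches(catalog, text):
    """Return {name: (matched_text, span)} for every pattern that matches."""
    summary = {}
    for name, pattern in catalog.items():
        res = rx.search(pattern, text)
        if res:
            summary[name] = (res.group(0), res.span())
    return summary


summary = summarize_matches(catalog, input_string)
print(summary)  # {'IPV4': ('192.168.2.10', (29, 41))}
```

Storing plain strings and tuples instead of Match objects also makes the result easy to compare or serialize.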
However, if I try to validate this regex on https://regex101.com, or anywhere else, I get an error: https://regex101.com/r/GClNoc/1
In all of the flavors (PERL, Python, etc.) there is some error.
I am not sure why the patterns aren't working on regex101.com. Can someone please help me identify whether there's anything wrong with them?
The regex library doesn't seem to have a problem extracting the data.
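One way to narrow this down might be to check which catalog entries the standard `re` module accepts, since a pattern that compiles only under `regex` may rely on syntax that other engines reject. A minimal sketch; the helper name `check_with_re` and the two sample patterns are hypothetical and stand in for the real catalog:

```python
import re


def check_with_re(catalog):
    """Return {name: None if the pattern compiles with re, else the error text}."""
    report = {}
    for name, pattern in catalog.items():
        try:
            re.compile(pattern)
            report[name] = None
        except re.error as exc:
            report[name] = str(exc)
    return report


# Hypothetical stand-ins for catalog entries, not taken from the real catalog.
sample = {
    'DIGITS': r'(?<![0-9])[0-9]+(?![0-9])',
    'BROKEN': r'(?:[0-9]+',  # unbalanced parenthesis, rejected by re
}

report = check_with_re(sample)
for name, err in report.items():
    print(name, 'OK' if err is None else 'error: ' + err)
```

Running the same check over the full catalog would show which entries, if any, depend on features that the installed version of `re` does not support.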
Thanks!