
I am trying to match a string to a regex in Python 3.x, and then use the same regex to match a bunch of sample strings. I have a catalog of regexes (grok) which I use to match the custom string that I have.

I am using the regex library and not the default re.

PROBLEM

Consider a simple log line -

"[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG: duration: 0.016 ms statement: SELECT id FROM location WHERE name = 'enigma->/var/log/messages' AND server_id = '1'"

Now, my Python code browses through a catalog (a dict of regexes that I have) and finds all the regexes which match this string.

Example,

catalog.txt


'TIMESTAMP_ISO8601': '(?>\d\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?',

'HTTPDATE': '(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])\/\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b\/(?>\d\d){1,2}:(?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9]) (?:[+-]?(?:[0-9]+))',

'IPV4': '(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])',


'QUOTEDSTRING': '(?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))'

I read this file and store it as a dict in my python code. Now, I start searching the string for matches -

import regex

catalog = {
    'HTTPDATE': '(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])/\\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\\b/(?>\\d\\d){1,2}:(?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9]) (?:[+-]?(?:[0-9]+))', 
    'TIMESTAMP_ISO8601': '(?>\\d\\d){1,2}-(?:0?[1-9]|1[0-2])-(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])[T ](?:2[0123]|[01]?[0-9]):?(?:[0-5][0-9])(?::?(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))?(?:Z|[+-](?:2[0123]|[01]?[0-9])(?::?(?:[0-5][0-9])))?', 
    'IPV4': '(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])'}

input_string = "[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG:  duration: 0.016 ms  statement: SELECT id FROM location WHERE name = 'enigma->/var/log/messages' AND server_id = '1'"

matches_dict = {}

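for key, pattern in catalog.items():
    try:
        res = regex.search(pattern, input_string)

        # Keep only patterns that matched and define no capture groups.
        if res and len(res) == 1:
            matches_dict[key] = res

    except Exception as e:
        # Report a pattern that fails and continue with the next one.
        print(e)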

print(matches_dict)

This prints -

>>> {'HTTPDATE': <regex.Match object; span=(1, 27), match='17/Dec/2005:02:40:45 -0500'>, 'IPV4': <regex.Match object; span=(29, 41), match='192.168.2.10'>}

However, if I try to validate this regex on https://regex101.com, or anywhere else, I get an error - https://regex101.com/r/GClNoc/1

In all of the flavors (Perl, Python, etc.) there is some error.

I am not sure why the regexes aren't working on regex101.com. Can someone please help me identify whether there's anything wrong with them?

The regex library doesn't seem to have a problem extracting the data.

Thanks!

Adhish Thite
  • There is no issue. You test with a *string literal* (the way you see it in the console), while you must test with a **literal string pattern**, and **use the appropriate delimiter** (if you test with a PCRE regex tester). [**Look here**](https://regex101.com/r/GClNoc/3). – Wiktor Stribiżew Sep 25 '19 at 22:06
  • Thanks. That's perfect. Is there a way I can make this regex work with Python too? I tried all the delimiters for Python, but it doesn't seem to work. – Adhish Thite Sep 26 '19 at 18:37
  • Use [this one](https://regex101.com/r/GClNoc/4). – Wiktor Stribiżew Sep 26 '19 at 18:38
  • That's great. It's all in the detail, for sure. – Adhish Thite Sep 26 '19 at 18:42
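
To illustrate the point made in the comments, here is a minimal sketch of the difference between a Python string literal and the pattern it denotes. The shortened pattern and the variable names are made up for illustration (they are not the full HTTPDATE entry), and the standard `re` module is used instead of `regex` only so that the snippet runs without a third-party install.

```python
import re

# The same pattern written two ways. The first mirrors how the dict in the
# question spells it (escaped backslashes); the second is a raw string.
# Hypothetical, shortened pattern: not the full HTTPDATE regex.
escaped = '\\b(?:Dec|Jan)\\b/(?:\\d\\d){1,2}:'
raw = r'\b(?:Dec|Jan)\b/(?:\d\d){1,2}:'

# Both literals denote the identical pattern text.
assert escaped == raw

# print() shows the pattern as the regex engine receives it, with single
# backslashes. This is the text to paste into an online tester.
print(escaped)

# repr() shows the string-literal form, with doubled backslashes.
print(repr(escaped))

log_line = "[17/Dec/2005:02:40:45 -0500] 192.168.2.10:ossecdb LOG"

# The real pattern matches the log line.
match = re.search(escaped, log_line)
print(match.group(0))

# Treating the doubled-backslash text as the pattern changes its meaning:
# each \\ now demands a literal backslash, so nothing in the log line matches.
doubled = repr(escaped)[1:-1]
print(re.search(doubled, log_line))
```

So the text to test is what `print()` shows, not what the console echoes for the dict. With the PCRE flavor on regex101 the `/` delimiter needs attention as well, because the pattern contains unescaped `/` characters; either escape them or pick another delimiter.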
