-1

I am trying to find a regular expression that satisfies the following requirements.

It should treat each run of spaces as a separator until two colons have been passed. After that, it should keep treating spaces as separators until a third colon is found. This third colon should itself act as a separator, but the spaces immediately before and after it should not count as separators of their own. Once this special colon has been found, nothing further should be treated as a separator, whether it is a space or a colon.

2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController             : z rest as async texting: json, special character, spacses.....

I would like the separators here to be identified as follows (each separator shown as X):

2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....


2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....

Exactly 8 separators are found here.

Any ideas how to do this with a regular expression?

My current approach, shown in the updates below, does not work.

Update:

(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s

That's my current syntax to identify the separators.

Update 2:
Regex above is faulty.

Update 3:

(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})

This is the current regex. The intended approach is to call

import pandas as pd

# file_name is the path to the log file; the raw string keeps the regex backslashes intact.
df = pd.read_csv(file_name,
                 sep=r"(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
                 names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
                 engine='python',
                 nrows=100)

This could later be extended to Dask, which would give me the chance to parse multiple log files into one dataframe.

The last column, line, is not identified correctly, for reasons I do not yet understand.

Jason Aller
  • Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Jun 12 '20 at 14:53
  • Hm. Maybe you are right. I am familiar with regex and use it very often, but this one is a bit tricky for me. Anyway, I will read your links. I often use groups, capturing, and positive and negative lookahead and lookbehind. I will edit the question later. – Peter Ebelsberger Jun 12 '20 at 15:02
  • Just add the pattern you tried, explaining what problem you have with it. – Wiktor Stribiżew Jun 12 '20 at 15:03
  • In English, `:` is called "colon". "Doublepoint" might be confusing for some readers. It's often better to use the punctuation symbol directly rather than a name. – rici Jun 12 '20 at 15:17
  • Also, I think counting `:` is probably a distraction. I don't know for sure what your log syntax is, but maybe you just want to find the first eight space-separated components, and then everything after the next `:`. – rici Jun 12 '20 at 15:20
  • @rici you are right, spelling updated in original post. Regarding your comment: that's more or less true, but there is a nondeterministic number of spaces before the 3rd colon. – Peter Ebelsberger Jun 12 '20 at 17:39
  • @PeterEbelsberger Your question is very unclear right now. What is the tool, what is the method/function you are using with the pattern? Why not just match and extract the necessary details using capturing groups, like in https://regex101.com/r/julKEF/1? – Wiktor Stribiżew Jun 12 '20 at 19:51
  • @Peter: That doesn't matter unless it is possible for one of the initial fields to be missing. `str.split(None, 8)` will return a list of maximum length 9. (I.e. 8+1). The first eight items in the list are the first eight fields in `str`, where fields are separated by any sequence of whitespace. The ninth element, if there is one, starts at the first non-whitespace character after the eighth field and extends to the end of `str`. Thus, if `str` were your example line, `str.split(None, 8)[8]` would be `": z rest as async texting: json, special character, spacses....."`... – rici Jun 12 '20 at 21:57
  • ... and the string you're looking for might be `str.split(None, 8)[8][1:]` or it might be `str.split(None, 8)[8].split(None, 1)[1]`. Note that these expressions are imprecise because you actually have to deal with the possibility that there weren't enough fields in `str` to create the last element in the split list. So you need to put that into a `try` statement or check the size of the vector returned by `split` before trying to use it. But that's the basic idea, anyway. No regular expression necessary. – rici Jun 12 '20 at 22:01
  • @WiktorStribiżew Capturing the groups is the easy part and I could do this. Your solution would be acceptable if I were trying to use the generator pattern inside Python. As I would like to avoid additional code, I believe the built-in pandas parsing and its cluster version, Dask, would be a fine approach to get the log into frames for easier postprocessing later. Besides this, your solution is a good one. If I am not able to get the final line working, I will have to go this way... ;-( or parse it inside the frames... – Peter Ebelsberger Jun 14 '20 at 10:52
  • Since the `re` patterns cannot have lookbehind patterns of variable length, you can't do that the way you intended. Read the file in line by line and extract the fields with my regex, it will be least troublesome. – Wiktor Stribiżew Jun 14 '20 at 10:56
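
For readers following the comments above, here is a minimal sketch of both points: that Python's `re` rejects variable-length lookbehinds, and that a line can be taken apart with capturing groups instead of a separator regex. The pattern `LOG_LINE` and the helpers `parse` and `lookbehind_is_accepted` are hypothetical names made up for illustration; this is not the pattern from the regex101 link, and it assumes that none of the first eight fields contains whitespace.

```python
import re

# Assumed layout: eight whitespace-free fields, then optional spaces,
# a colon, at most one space, and the free-text remainder of the line.
LOG_LINE = re.compile(
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*:\s?(.*)"
)


def parse(logline):
    """Return the nine fields of a log line, or None if it does not match."""
    match = LOG_LINE.fullmatch(logline)
    return list(match.groups()) if match else None


def lookbehind_is_accepted(pattern):
    """Report whether the re module can compile the given pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


sample = (
    "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] "
    "3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController             "
    ": z rest as async texting: json, special character, spacses....."
)
fields = parse(sample)
```

With this sketch, `lookbehind_is_accepted(r"(?<=a+)b")` is false because the lookbehind has no fixed width, while a fixed-width one such as `(?<=ab)c` compiles. The trade-off is that matching happens line by line in Python code rather than inside `pd.read_csv`.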

2 Answers

0

If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.

The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:

def separate(logline):
    fields = logline.split(maxsplit=8) # 8 space-separated fields + the rest
    if len(fields) > 8:
        # Fix up the ninth field. Perhaps you want to remove the colon:
        fields[8] = fields[8][1:]
        # or perhaps you want the text starting at the first non-whitespace
        # character after the colon:
        #
        # if fields[8][0] == ':':
        #      fields[8] = fields[8].split(maxsplit=1)[1]
        #
        # etc.
    return fields

>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
...           + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
...           + " c.w.f.w.NiceController"
...           + "             : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
 '[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
 'c.w.f.w.NiceController',
 ' z rest as async texting: json, special character, spaces.....']
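
As a possible extension of the same idea, the split can be applied to every line of a file and paired with the column names from the question. This is only a sketch under the same assumptions as above; the generator name `records` and the in-memory sample file are made up for illustration, and lines with too few fields are simply skipped.

```python
import io

# Column names taken from the read_csv call in the question.
COLUMNS = ["date", "time", "level", "host", "template",
           "threadid", "logid", "classmethods", "line"]


def records(lines):
    """Yield one dict per well-formed log line; skip malformed lines."""
    for raw in lines:
        fields = raw.rstrip("\n").split(maxsplit=8)
        # A usable line has eight fields plus a remainder starting with ':'.
        if len(fields) < 9 or not fields[8].startswith(":"):
            continue
        # Drop the colon and the whitespace that follows it.
        fields[8] = fields[8][1:].lstrip()
        yield dict(zip(COLUMNS, fields))


# Stand-in for an open log file: one valid line and one malformed line.
log = io.StringIO(
    "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] "
    "3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController             "
    ": z rest as async texting: json, special character, spaces.....\n"
    "a malformed line\n"
)
rows = list(records(log))
```

A list of dicts like `rows` can be handed to `pandas.DataFrame(rows)` if a dataframe is the goal, so the regex separator is not strictly needed.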
rici
  • Thanks for your suggestion. I will update my question so that it is more specific. I would like to provide a regular expression to pandas separator so that dataframe will be parsed automatically. – Peter Ebelsberger Jun 13 '20 at 21:47
0

Solution

My problem can currently be solved with the following regular expression.

 (?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})

Minor adaptations may still be needed, but for now it works pretty well.