0

I have a quick question about regex. I have a file (testlog-date.log) that has lines like this:

# 2014-04-09 16:43:15,136|PID: 1371|INFO|Test.Controller.Root|Finished processing request        in   0.003355s for https://website/heartbeat

I'm looking to use regex to capture the PID and the time. So far I have this

import re

file_handler = open("testlog-20140409.log", "r")
for line in file_handler:
    var1 = re.findall(r'(\d+.\d+)s', line)
    print var1
file_handler.close()

So I'm able to print all the process times. The question is: how do I also capture the PID (and possibly other information) into my variable var1? I tried doing this:

var1 = re.findall(r'PID: (\d+) (\d+.\d+)s', line) 

It prints out empty structures.

Much appreciated, thanks!

Followup: My file is quite large. I'm thinking of storing all the data in one structure, sorting it by process time, and printing out the top 20. Any idea how I could do that properly?

Guagua
  • possible duplicate of [Reference - What does this regex mean?](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – Andy Apr 10 '14 at 00:34
  • @Andy thanks for pointing at that. I was looking at that post and tried multiple things but didn't do what I expected. The community here is nice enough to provide an answer within minutes with educational purposes as well so really grateful for that. – Guagua Apr 16 '14 at 21:54

3 Answers

2

Use the regex (.*)\|(PID: .*)\|(.*)\|(.*)\|(.*). Each pair of parentheses in the pattern denotes a separate capture group.

In [125]: text = '2014-04-09 16:43:15,136|PID: 1371|INFO|Test.Controller.Root|Finished processing request        in   0.003355s for https://website/heartbeat'
In [126]: pattern = re.compile(r'(.*)\|(PID: .*)\|(.*)\|(.*)\|(.*)')
In [127]: results = re.findall(pattern, text)
In [128]: results
Out[128]:
[('2014-04-09 16:43:15,136',
  'PID: 1371',
  'INFO',
  'Test.Controller.Root',
  'Finished processing request        in   0.003355s for https://website/heartbeat')]

So now you have a tuple with one element per group (timestamp, PID, log level, routine and the log message).
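As a small sketch of how that tuple might be used, the groups can be unpacked directly, and the PID digits and elapsed time pulled out of their fields. The variable names here are illustrative, and the time extraction assumes the "<seconds>s" form seen in the sample line:

```python
import re

text = ('2014-04-09 16:43:15,136|PID: 1371|INFO|Test.Controller.Root|'
        'Finished processing request        in   0.003355s for https://website/heartbeat')
pattern = re.compile(r'(.*)\|(PID: .*)\|(.*)\|(.*)\|(.*)')

# findall returns a list with one tuple per match; take the first match
timestamp, pid_field, level, routine, message = pattern.findall(text)[0]

# Keep only the number from the PID field ("PID: 1371" -> 1371)
pid = int(pid_field.split(':', 1)[1])

# Pull the elapsed time out of the message (assumes the "<seconds>s" form)
elapsed = float(re.search(r'(\d+\.\d+)s', message).group(1))
```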

EDIT

For large files, regexes can be time-consuming. Your log lines use '|' as the delimiter, so you can simply split each line on it.

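all_lines = []
with open("testlog-20140409.log") as f:
    for line in f:
        # Strip the trailing newline, then split on the '|' delimiter
        all_lines.append(line.rstrip("\n").split("|"))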

This stores the data as a list of lists:

[['2014-04-09 16:43:15,136','PID: 1371','INFO','Test.Controller.Root','Finished processing request        in   0.003355s for https://website/heartbeat'],
...,
...]

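To sort all_lines you can use the sorted() function and pass a key function that returns the first field of each sub-list. Note that x[0] is the timestamp, so this orders the lines chronologically; sorting by process time would need a key that extracts the time from the last field.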

sorted_lines = sorted(all_lines, key=lambda x: x[0])
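For the "top 20 by process time" part of the question, one possible approach is to parse the elapsed time out of the last field and let heapq.nlargest keep only the slowest entries, which avoids sorting the whole file. This is a sketch under assumptions: the helper names parse_line and slowest are made up for illustration, and the time is assumed to appear as "<seconds>s" in the last field.

```python
import heapq
import re

# Assumed format: the elapsed time appears as "<seconds>s" in the last field
TIME_RE = re.compile(r'(\d+\.\d+)s')

def parse_line(line):
    """Return (elapsed_seconds, fields) for a log line, or None if it has no time."""
    fields = line.rstrip('\n').split('|')
    match = TIME_RE.search(fields[-1])
    if match is None:
        return None
    return float(match.group(1)), fields

def slowest(lines, n=20):
    """Return the n entries with the largest elapsed time, slowest first."""
    parsed = (parse_line(line) for line in lines)
    return heapq.nlargest(n, (p for p in parsed if p is not None),
                          key=lambda p: p[0])
```

Since a file object is iterable line by line, it can be passed straight in, for example `with open("testlog-20140409.log") as f: top20 = slowest(f)`. Only the n best candidates are kept at any time, so the parsed lines are not all held in memory.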
shaktimaan
  • Ty for the fast reply! My file is quite large. I'm thinking of storing all the data into one structure and sort them using by process time, and print out the top 20. Any proper ways of doing this? – Guagua Apr 09 '14 at 21:33
  • @Guagua Updated my answer – shaktimaan Apr 09 '14 at 21:42
1

You should put .*? (non-greedy match for any chars) between the PID and time parts:

>>> import re
>>> s = "# 2014-04-09 16:43:15,136|PID: 1371|INFO|Test.Controller.Root|Finished processing request        in   0.003355s for https://website/heartbeat"
>>> re.findall(r'PID: (\d+).*?(\d+\.\d+)s', s)
[('1371', '0.003355')]
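To make the difference concrete, here is a side-by-side sketch of the original attempt and the non-greedy version on the sample line. The names strict and relaxed are illustrative, and the dot is escaped in both patterns so that it only matches a literal decimal point:

```python
import re

s = ("# 2014-04-09 16:43:15,136|PID: 1371|INFO|Test.Controller.Root|"
     "Finished processing request        in   0.003355s for https://website/heartbeat")

# Original attempt: requires exactly one space between the PID and the time,
# but the PID digits are followed by '|', so nothing matches
strict = re.findall(r'PID: (\d+) (\d+\.\d+)s', s)

# Non-greedy gap: skips as few characters as possible before the time
relaxed = re.findall(r'PID: (\d+).*?(\d+\.\d+)s', s)
```

Here strict is an empty list, which is the "empty structure" seen in the question, while relaxed holds one (pid, time) tuple.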

For a more generic approach see @shaktimaan's answer.

alecxe
0

You could use

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name.

It makes the code easier to read.

Also, for big files it is best to compile the regex once, up front, rather than on every line.

https://docs.python.org/2/library/re.html

Example in your case :

import re

# Compile once at module level so the same pattern object is reused for every line
PATTERN = re.compile(r"^#\s+(?P<date>[^\|]+)\|PID:\s*(?P<pid>[0-9]+)\|.*")

def searchData(line):
    result = PATTERN.search(line)
    if not result:
        # print "Nothing found in \"%s\"" % line.strip("\n")
        return None
    # Named groups are read back by name instead of by position
    date = result.group('date')
    pid = result.group('pid')
    return date, pid
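The same idea can be stretched to cover the process time as well. The following is a hypothetical extension of the pattern above, not part of the original answer: the group name elapsed, the names LINE_RE and parse, and the optional leading "#" are all assumptions made for illustration.

```python
import re

# Hypothetical extension: also capture the elapsed time as a named group
LINE_RE = re.compile(
    r"^#?\s*(?P<date>[^|]+)\|PID:\s*(?P<pid>[0-9]+)\|"
    r".*?(?P<elapsed>\d+\.\d+)s"
)

def parse(line):
    """Return a dict with 'date', 'pid' and 'elapsed', or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # groupdict() maps each group name to the text it matched
    record = match.groupdict()
    record['pid'] = int(record['pid'])
    record['elapsed'] = float(record['elapsed'])
    return record
```

Returning a dict keyed by group name means later code can sort or filter on record['elapsed'] without caring about the order of groups in the pattern.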
UnX