Locate multiple keywords in lines using Python

Question

I have a line like this:

20:28:26.684597 24:d5:6e:76:9s:10 (oui Unknown) > 45:83:r4:7u:9s:i2 (oui Unknown), ethertype 802.1Q (0x8100), length 78: vlan 64, p 0, ethertype IPv4, (tos 0x48, ttl 34, id 5643, offset 0, flags [none], proto TCP (6), length 60) 192.168.45.28.56982 > 172.68.54.28.webcache: Flags [S], cksum 0xg654 (correct), seq 576485934, win 65535, options [mss 1460,sackOK,TS val 2544789 ecr 0,wscale 0,eol], length 0

From this line I need to extract the ID value from "id 5643" and the trailing value (56982) from 192.168.45.28.56982. The keyword "id" and the address 192.168.45.28 are constant across lines.

I have written the script below. Please suggest a way to shorten it, since it involves multiple steps:

file = open('test.txt')
fi = file.readlines()

for line in fi:
    test = (line.split(","))
    for word2 in test:
        if "id" in word2:
            find2 = word2.split(" ")[-1]
            print("************", find2)
    for word in test:
        if "192.168.45.28" in word:
            find = word.split(".")
            print(find)
            for word1 in find:
                if ">" in word1:
                    find1 = word1.split(">")[0]
                    print(find1)

I just edited my question as per your suggestion. For such cases, is 'readlines' best suited, or is there a more efficient method available? — Zoro99, Mar 13 '16 at 10:18

dantiston · Answer 1 · 2016-03-14T04:07:07.367

You could use regular expressions:

import re

# This searches for the literal "id"
# followed by a space and one or more digits
idPattern = re.compile(r"id (\d+)")
# This searches for your IP followed by
# a dot and one or more digits
ipPattern = re.compile(r"192\.168\.45\.28\.(\d+)")

with open("test.txt", 'r') as data:
    for line in data:
        id = idPattern.findall(line)
        ip = ipPattern.findall(line)

See the Python regular expression docs
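For the follow-up request to keep the two values in variables for each line, a minimal sketch could look like the following. The helper name `extract`, the constants `ID_RE` and `PORT_RE`, and the `\b` word boundary are assumptions added for illustration, not part of the answer above; the sample string is an excerpt of the line from the question.

```python
import re

# Hypothetical names; the \b guards against matching "id" inside a longer word.
ID_RE = re.compile(r"\bid (\d+)")
PORT_RE = re.compile(r"192\.168\.45\.28\.(\d+)")

def extract(line):
    """Return (id1, ip1) as strings; a value is None if it is not found."""
    id_match = ID_RE.search(line)
    port_match = PORT_RE.search(line)
    id1 = id_match.group(1) if id_match else None
    ip1 = port_match.group(1) if port_match else None
    return id1, ip1

# Excerpt of the sample line from the question.
sample = ("(tos 0x48, ttl 34, id 5643, offset 0, flags [none], proto TCP (6), "
          "length 60) 192.168.45.28.56982 > 172.68.54.28.webcache: Flags [S]")

id1, ip1 = extract(sample)
print(id1, ip1)  # prints: 5643 56982
```

Calling `extract(line)` inside the `for line in data:` loop gives fresh `id1` and `ip1` values on every iteration, which can then be converted with `int()` or processed further.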

edited Mar 14 '16 at 04:07

answered Mar 13 '16 at 07:46

dantiston

Got the following error "AttributeError: 'set' object has no attribute 'extend'" // But I want values to be stored in variable id1 and ip1 for every line as I need to perform some more operations on them. Could you please suggest a code for that – Zoro99 Mar 13 '16 at 08:06
@dantiston Sure `set()` has extend? It's a list attribute. Didn't you mean `set.add()`? – jDo Mar 13 '16 at 08:15
@jDo you're right, I wrote and tested as a list and forgot to change extend when I switched to set. – dantiston Mar 14 '16 at 04:04
@Zoro99 I updated the code to store the results at each line. – dantiston Mar 14 '16 at 04:07

jDo · Answer 2 · 2016-03-13T09:55:28.890

Same approach as the others. It won't add empty lists to your results though, it compiles the regex for efficiency, it doesn't read the whole file into memory in one go and it doesn't use id as a variable name (it's a built-in function so best to avoid it). There can be duplicates in the output (I couldn't just assume that you wanted unique entries only).

import re

re_id = re.compile(r"id (\d+)")
re_ip = re.compile(r"192\.168\.45\.28\.(\d+)")

ids = []
ips = []

with open("test.txt", "r") as f:
    for line in f:
        id_res = re_id.findall(line)
        if any(id_res):
            ids.append(id_res[0])
        ip_res = re_ip.findall(line)
        if any(ip_res):
            ips.append(ip_res[0])
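As a variation on the same idea, here is a sketch that keeps every match on a line rather than only the first one. It relies on `list.extend` adding nothing when `findall` returns an empty list, so no explicit check is needed. The function name `collect` and the sample lines are hypothetical and used only for illustration.

```python
import re

RE_ID = re.compile(r"id (\d+)")
RE_PORT = re.compile(r"192\.168\.45\.28\.(\d+)")

def collect(lines):
    """Gather all captured values from any iterable of lines (file or list)."""
    ids, ports = [], []
    for line in lines:
        # extend() with an empty list is a no-op, so unmatched lines add nothing.
        ids.extend(RE_ID.findall(line))
        ports.extend(RE_PORT.findall(line))
    return ids, ports

# Made-up lines for demonstration; with a real file you would write:
#     with open("test.txt") as f:
#         ids, ports = collect(f)
demo = [
    "ttl 34, id 5643, offset 0) 192.168.45.28.56982 > 172.68.54.28.webcache",
    "a line without either keyword",
]
ids, ports = collect(demo)
print(ids, ports)  # prints: ['5643'] ['56982']
```

Because `collect` accepts any iterable, passing an open file object still reads one line at a time instead of loading the whole file.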

BramV · Accepted Answer · 2016-03-13T11:17:21.023

You can use a regex. Some more info here: https://docs.python.org/2/library/re.html

You could write it like this:

import re
file = open('test.txt')
fi = file.readlines()

for line in fi:
    match = re.match(r'.*id (\d+).*', line)
    if match:
        print("************ %s" % match.group(1))
    match = re.match(r'.*192\.168\.45\.28\.(\d+).*', line)
    if match:
        print(match.group(1))

**update**

As jDo pointed out, it is better to use findall, compile the regex up front, and avoid readlines, so you get something like this:

import re

re_id = re.compile(r"id (\d+)")
re_ip = re.compile(r"192\.168\.45\.28\.(\d+)")
with open("test.txt", "r") as f:
    for line in f:
        # findall returns a list of the captured groups (empty if no match)
        matches = re_id.findall(line)
        if matches:
            print("************ %s" % matches[0])
        matches = re_ip.findall(line)
        if matches:
            print(matches[0])
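To illustrate why the first version needed the leading `.*` while the updated one does not, here is a short sketch comparing `match`, `search`, and `findall` on an excerpt of the sample line. The variable names are chosen only for this example.

```python
import re

line = "ttl 34, id 5643, offset 0"
pattern = re.compile(r"id (\d+)")

# match() only succeeds if the pattern matches at the start of the string.
anchored = pattern.match(line)
# search() scans the whole string and returns the first match object.
found = pattern.search(line)
# findall() returns a plain list of the captured groups, not match objects.
all_ids = pattern.findall(line)

print(anchored)        # prints: None
print(found.group(1))  # prints: 5643
print(all_ids)         # prints: ['5643']
```

Since `findall` returns strings rather than match objects, its results are indexed (`all_ids[0]`) instead of being read with `.group(1)`.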

edited Mar 13 '16 at 11:17

answered Mar 13 '16 at 07:42

BramV

It didn't give any output, though the script executed fine – Zoro99 Mar 13 '16 at 08:08
I think the regex wasn't fully correct. I updated it. I quickly tested it here and it should work – BramV Mar 13 '16 at 08:14
You're reading the whole file into memory though. As someone pointed out [here](https://stackoverflow.com/questions/17246260/python-readlines-usage-and-efficient-practice-for-reading) *"The efficient way to use readlines() is to not use it. Ever."* Also, compile your regex for extra efficiency and use `findall` to search within strings rather than from the beginning (then you could do away with the asterisks) – jDo Mar 13 '16 at 08:47
You are right, but he only asked for shorter code, not for memory optimisation. – BramV Mar 13 '16 at 08:50
@BramV Well, I guess it's a matter of definition whether or not avoiding something you should almost never use can be called an "optimisation" :D – jDo Mar 13 '16 at 08:57
@BramV You should change the `readlines()` to `with open() as f: for line in f` as well. The next person who finds this post and decides to parse a log file half the size of their RAM will feel the effect if you don't. *"The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files."* - user katrielalex in [this post](https://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python) – jDo Mar 13 '16 at 09:29
I edited my post with my actual end goal... which is not working with "with open" ... please let me know what am I doing wrong here. Thanks for your help. – Zoro99 Mar 13 '16 at 09:52
