Coursera Course - Introduction of Data Science in Python Assignment 1

Question

I'm taking this course on Coursera, and I'm running into some issues with the first assignment. The task is to use regular expressions to extract certain values from a given file. The function should then output a dictionary containing these values:

example_dict = {"host":"146.204.224.152",
                "user_name":"feest6811",
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is just an excerpt of the file. For some reason, the link doesn't work unless it's opened directly from Coursera. I apologize in advance for the bad formatting. One thing I must point out is that in some cases, as you can see in the first example, there's no username; '-' is used instead.

159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149

This is what I have right now. However, the output is None. I guess there's something wrong with my pattern.

import re
def logs():
    
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # YOUR CODE HERE
        
        pattern = """ 
        (?P<host>\w*)
        (\d+\.\d+.\d+.\d+\ )
        (?P<user_name>\w*)
        (\ -\ [a-z]+[0-9]+\ )
        (?P<time>\w*)
        (\[(.*?)\])
        (?P<request>\w*)
        (".*")
        """
        for item in re.finditer(pattern,logdata,re.VERBOSE):
       
            print(item.groupdict())

All debugging details are there in the question, the question is on-topic. Just fixed the formatting. — Wiktor Stribiżew, Oct 20 '20 at 07:51

score 2 · Accepted Answer · answered Oct 19 '20 at 10:16

You can use the following expression:

(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "

See the regex demo. See the Python demo:

import re
logdata = r"""159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149"""
pattern = r'''
(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . followed by 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "
'''
for item in re.finditer(pattern,logdata,re.VERBOSE):
    print(item.groupdict())

Output:

{'host': '159.253.153.40', 'user_name': '-', 'time': '21/Jun/2019:15:46:10 -0700', 'request': 'POST /e-business HTTP/1.0'}
{'host': '136.195.158.6', 'user_name': 'feeney9464', 'time': '21/Jun/2019:15:46:11 -0700', 'request': 'HEAD /open-source/markets HTTP/2.0'}
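
Note that the `logs()` function in the question only prints and has no `return` statement, so calling it always evaluates to `None`. Below is a minimal sketch of how the pattern could be wrapped so the function returns the dictionaries instead. It assumes the assignment expects a list with one dictionary per log line; `parse_log_text` is a hypothetical helper name introduced here for illustration, and the file path is the one from the question.

```python
import re

# Pattern from the answer above, compiled once in verbose mode.
LOG_PATTERN = re.compile(r'''
(?P<host>\d+(?:\.\d+){3})   # dotted host address
\s+\S+\s+                   # skipped field (the first "-")
(?P<user_name>\S+)\s+\[     # user name, or "-" when absent
(?P<time>[^\]\[]*)\]\s+     # timestamp between the square brackets
"(?P<request>[^"]*)"        # request between the double quotes
''', re.VERBOSE)


def parse_log_text(logdata):
    """Hypothetical helper: return one dict per matching log line."""
    return [m.groupdict() for m in LOG_PATTERN.finditer(logdata)]


def logs(path="assets/logdata.txt"):
    # Path taken from the question; adjust it to your environment.
    with open(path, "r") as file:
        return parse_log_text(file.read())


sample = '159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845'
print(parse_log_text(sample))
```

Keeping the parsing in a function that takes a string makes it easy to try the pattern on a couple of lines without needing the Coursera file.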

Thank you so much!!! It worked!!! However, may I ask a question about your solution? It probably sounds stupid, but don't you need to include everything in the parentheses? For example, ("?P[^"]*"). Or are they the same? Also, could you please explain the meaning of "?:" in your regular expression? — BryantHsiung, Oct 19 '20 at 13:20
@BryantHsiung You can't use `("?P[^"]*")`, it is an invalid regex construct. See more about [non-capturing groups here](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). — Wiktor Stribiżew, Oct 19 '20 at 13:42
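
As a small illustration of the `?:` question in the comments: a group written as `(?:...)` groups part of the pattern (here so that `{3}` can repeat it) without storing what it matched. The sketch below compares the two forms on one host address from the log; the variable names are chosen only for this example.

```python
import re

ip = "159.253.153.40"

# Capturing: the repeated group is stored and keeps its last repetition.
capturing = re.match(r"(\d+)(\.\d+){3}", ip)

# Non-capturing: the repeated part is still matched, but not stored.
non_capturing = re.match(r"(\d+)(?:\.\d+){3}", ip)

print(capturing.groups())      # ('159', '.40')
print(non_capturing.groups())  # ('159',)
```

In the answer's pattern the non-capturing group keeps `groupdict()` and the numbered groups free of a stray `.40`-style fragment while still allowing the `{3}` repetition.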
