I'm looking for a clean way to extract some data from a string using regex and the Python re module. Each line of the string is of the form key = value. I'm only interested in certain keys, and in some strings those keys may be missing. I can think of a few ways to do this by iterating over the string line by line, or by using re.finditer(), but what I'd really like to do is use named groups and a single call to re.match(), ending up with a dictionary of groups from the .groupdict() method of the returned match object. I can do that with named groups when all the groups are present, but it seems that if I make the groups optional, they don't get matched even when present.

I'm probably missing something obvious, but is there a way to do this in a single regex or do I need a multistep process?

import re

# trying to extract 'type', 'count' and 'destinations'.
# string1 has all keys and a single re.match works
# string2 is missing 'count'... any suggestions?

string1 = """
Name: default
type = Route
status = 0
count = 5
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""

string2 = """
Name: default
type = Route
status = 0
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""


pattern = re.compile(r"(?s).*type = (?P<type>\S*).*count = (?P<count>\S*).*destinations = (?P<destinations>\S*)")

m1 = re.match(pattern,string1)
# m1.groupdict() == {'type': 'Route', 'count': '5', 'destinations': 'default'}

m2 = re.match(pattern,string2)
# m2 is None
user5219763
  • Does it have to be a regex? Seems like it would be easier to split the lines with something like [`splitlines`](https://docs.python.org/3/library/stdtypes.html#str.splitlines) and parse the resultant list into a dictionary. – excaza Aug 13 '18 at 15:38
  • There are definitely a lot of ways it could be done, with iteration and `splitlines`, or something like `re.finditer(r'(?P.*) = (?P.*)',string)`; I was really just interested in whether there's a concise way to do it with a single regex, to avoid extra work parsing it all into a dictionary. – user5219763 Aug 13 '18 at 15:44
  • Putting together and maintaining an extremely long and/or delicate regex seems like more work than at most 9 lines of parsing the same string into a dictionary. – excaza Aug 13 '18 at 15:47
  • It might end up being more complicated for sure, but I was thinking of a function to build up the required regex from a list of desired keys to keep it maintainable. Having spent a few hours trying to figure out the right regex, I kind of want to know, for the sake of knowing, whether it's possible at this point. – user5219763 Aug 13 '18 at 15:58
  • So, can all of these keys be missing, i.e. are they all optional? If yes, then it makes sense to use the solution like below. Or, you may use something like `dict(re.findall(r'(key1|key2|keyN)\s*=\s*(.+)',s))` – Wiktor Stribiżew Aug 20 '18 at 18:16
  • Well, a better regex would be `dict(re.findall(r'(?m)^(key1|key2|keyN)\s*=\s*(.+)',s))`. Why use `re.match` here? If you are after this kind of solution, I will post the code with dynamic regex building. – Wiktor Stribiżew Aug 20 '18 at 18:22

6 Answers

You could solve this with one line of simple regular expression.

>>> dict(re.findall(r'^(type|count|destinations) = (\S*)$', string1, re.MULTILINE))
{'count': '5', 'type': 'Route', 'destinations': 'default'}

>>> dict(re.findall(r'^(type|count|destinations) = (\S*)$', string2, re.MULTILINE))
{'type': 'Route', 'destinations': 'default'}
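
If the list of wanted keys changes often, the alternation can be generated from it rather than typed by hand. The following is only a minimal sketch: the helper name `extract_keys` and the `sample` string are made up for illustration, and it assumes every line of interest has exactly the `key = value` layout shown in the question.

```python
import re

def extract_keys(text, keys):
    """Return {key: value} for each wanted key found as 'key = value' on its own line."""
    # re.escape guards against keys that contain regex metacharacters
    alternation = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(r"^(%s) = (\S*)$" % alternation, re.MULTILINE)
    return dict(pattern.findall(text))

# hypothetical sample, shaped like string2 from the question (no 'count' line)
sample = """
Name: default
type = Route
status = 0
destinations = default
"""

result = extract_keys(sample, ["type", "count", "destinations"])
print(result)  # {'type': 'Route', 'destinations': 'default'}
```

Keys that are absent from the text are simply absent from the result, so no optional groups are needed.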
AnnieFromTaiwan

Check this out.

#python 3.5.2
import re

# trying to extract 'type', 'count' and 'destinations'.
# string1 has all keys and a single re.match works
# string2 is missing 'count'... any suggestions?

string1 = """
Name: default
type = Route
status = 0
count = 5
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""

string2 = """
Name: default
type = Route
status = 0
enabled = False
start_time = 18:00:00
end_time = 00:00:00
destinations = default
started = False
"""

pattern = re.compile(r"""(?mx)  # inline flags must come first in the pattern on Python 3.11+
\A
(?=(?:[\s\S]*?^\s*type\s*=\s*(?P<type>.*)$)?)
(?=(?:[\s\S]*?^\s*count\s*=\s*(?P<count>.*)$)?)
(?=(?:[\s\S]*?^\s*destinations\s*=\s*(?P<destinations>.*)$)?)
""")

m1 = re.match(pattern, string1)
print (m1.groupdict())

m2 = re.match(pattern, string2)
print (m2.groupdict())
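
The question's comments mention building the regex from a list of keys. One way that might look with the lookahead approach is sketched below. The helper name `build_lookahead_pattern` and the `sample` string are illustrative assumptions, not part of the answer above, and the sketch assumes each key is a valid Python identifier so that it can double as a group name.

```python
import re

def build_lookahead_pattern(keys):
    """Compile one regex with an optional, named lookahead per key."""
    parts = [
        # each lookahead starts from the beginning of the string, so the
        # order of the keys in the text does not matter
        r"(?=(?:[\s\S]*?^\s*%s\s*=\s*(?P<%s>.*)$)?)" % (re.escape(key), key)
        for key in keys
    ]
    return re.compile(r"\A" + "".join(parts), re.MULTILINE)

# hypothetical sample without a 'count' line
sample = """
Name: default
type = Route
status = 0
destinations = default
"""

m = build_lookahead_pattern(["type", "count", "destinations"]).match(sample)
print(m.groupdict())  # {'type': 'Route', 'count': None, 'destinations': 'default'}
```

Because every lookahead is optional, the match always succeeds, and missing keys show up as `None` in `groupdict()`.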

Andrei Odegov

You can use something similar to the following dictionary comprehension, which splits and filters the key-value pairs based on an input tuple of desired field names:

import re

def regexandgroup(instr: str, savekeys: tuple):
    exp = r'^(\w+)[ \t:=]+([\w:]+)$'  # raw string avoids invalid-escape warnings for \w
    match = re.findall(exp, instr, re.MULTILINE)

    return {group[0]: group[1] for group in match if group[0] in savekeys}

Which gives us:

>>> print(regexandgroup(string1, ('type', 'count', 'destinations')))
{'type': 'Route', 'count': '5', 'destinations': 'default'}

>>> print(regexandgroup(string2, ('type', 'count', 'destinations')))
{'type': 'Route', 'destinations': 'default'}
excaza
  • That's a really nice solution for extracting the values, thanks! I guess if savekeys was a set it'd be a little more efficient as the number of required keys grows. – user5219763 Aug 13 '18 at 16:38
  • @excaza: possibly worth noting that `[\s:=]+` will match a single newline character, so the pattern will match two consecutive lines with one word each. It might be better to use `[ \t:=]+`. (Also, `:` isn't in `\w`, but the time fields aren't in the list so I guess it doesn't matter.) – rici Aug 13 '18 at 18:46
  • @rici Shouldn't matter much in this context but I've incorporated the former, thanks! For the latter, `:` is already in the second capture group. – excaza Aug 13 '18 at 20:07
  • @excaza: ah, so it is... – rici Aug 13 '18 at 20:19

You didn't really specify if any field can be missing or if count is the only field that could be missing. However, this pattern will match all 3 cases that you suggested and it will store them in named capture groups.

type = (?P<type>\S*)|count = (?P<count>\d+)|destinations = (?P<destinations>\S*)

| just means or, so you're looking for type = ... OR count = ... OR destinations = ...
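
With an alternation like this, each individual match fills only one of the named groups, so a single `re.match()` call will not return all three values. A rough sketch of one way to use it from Python is to walk the matches with `re.finditer()` and merge the non-empty groups. The pattern is written with Python's `(?P<name>...)` syntax, and the function name `collect_groups` and the `sample` string are assumptions made for this example.

```python
import re

pattern = re.compile(
    r"type = (?P<type>\S*)|count = (?P<count>\d+)|destinations = (?P<destinations>\S*)"
)

def collect_groups(text):
    """Merge the named groups from every match into one dictionary."""
    found = {}
    for m in pattern.finditer(text):
        # each match sets exactly one group; the others are None
        found.update({k: v for k, v in m.groupdict().items() if v is not None})
    return found

# hypothetical sample shaped like the question's input
sample = "Name: default\ntype = Route\nstatus = 0\ncount = 5\ndestinations = default\n"
print(collect_groups(sample))  # {'type': 'Route', 'count': '5', 'destinations': 'default'}
```

Note that the pattern is not anchored to the start of a line, so a key such as `discount = 5` would also be picked up as `count`. Adding `^` to each alternative together with `re.MULTILINE` would tighten that.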

emsimpson92

Why not use pandas to do things all at once? The following uses the regex from @andrei-odegov

import pandas as pd


# create a Series object from your strings
s = pd.Series([string1, string2])

regex = r"""(?mx)  # inline flags must come first in the pattern on Python 3.11+
    \A
    (?=(?:[\s\S]*?^\s*type\s*=\s*(?P<type>.*)$)?)
    (?=(?:[\s\S]*?^\s*count\s*=\s*(?P<count>.*)$)?)
    (?=(?:[\s\S]*?^\s*destinations\s*=\s*(?P<destinations>.*)$)?)
"""

# return a DataFrame which contains your results
df = s.str.extract(regex, expand=True)

print(df)


    type count destinations
0  Route     5      default
1  Route   NaN      default
jeschwar

Just extract the key/value pairs, then you can either ignore the additional keys, or else add … if x.split(' = ')[0] in wanted_keys to filter them. Use setdefault if you want to fill in missing keys.

>>> dict(x.split(' = ') for x in string1.strip().splitlines()[1:])
{'status': '0', 'count': '5', 'started': 'False', 'start_time': '18:00:00', 'enabled': 'False', 'end_time': '00:00:00', 'type': 'Route', 'destinations': 'default'}
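
The filtering and `setdefault` steps mentioned above could look roughly like the sketch below. The names `parse_wanted`, `wanted_keys` and `sample` are chosen for illustration only. Unlike the one-liner, it skips any line that does not contain ` = ` instead of slicing off the first line, which is a small change in approach.

```python
def parse_wanted(text, wanted):
    """Split each 'key = value' line and keep only the wanted keys."""
    pairs = (
        line.split(" = ", 1)               # split once, in case the value contains ' = '
        for line in text.strip().splitlines()
        if " = " in line                   # skips lines such as 'Name: default'
    )
    return {key: value for key, value in pairs if key in wanted}

wanted_keys = ("type", "count", "destinations")

# hypothetical sample without a 'count' line
sample = """
Name: default
type = Route
status = 0
destinations = default
"""

parsed = parse_wanted(sample, wanted_keys)
for key in wanted_keys:
    parsed.setdefault(key, None)           # fill in keys that were missing from the text

print(parsed)  # {'type': 'Route', 'destinations': 'default', 'count': None}
```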
jhermann