3

I am writing a regular expression in Python to capture the contents of an SSI tag.

I want to parse the tag:

<!--#include file="/var/www/localhost/index.html" set="one" -->

into the following components:

  • Tag Function (ex: include, echo or set)
  • Name of attribute, found before the = sign
  • Value of attribute, found between the double quotes

The problem is that I am at a loss on how to grab these repeating groups, as name/value pairs may occur one or more times in a tag. I have spent hours on this.

Here is my current regex string:

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

It captures the include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)


I am using regex101.com to develop my regex.

Bhargav Rao
NuclearPeon
  • capture all the tags at once `((?:[a-z]=".*")+?) -->$` then parse it afterwards. Also your regex is needlessly escaped! – Adam Smith Jul 02 '14 at 20:29
  • @AdamSmith: That does not work for me. I get two groups when applying that regex: `group 0 : e="/tmp/index.html" set="one" -->`, `group 1: e="/tmp/index.html" set="one"` – NuclearPeon Jul 02 '14 at 20:32
  • why not use different patterns for each? It will make it much simpler. – Padraic Cunningham Jul 02 '14 at 21:39
  • @PadraicCunningham I thought about doing that, but I was hoping it could be done without. I didn't realize how much effort it would take for something that appeared trivial. – NuclearPeon Jul 02 '14 at 21:42
  • The patterns are quite simple individually and if you wanted to create a dict from the key,value pairs it would be very easily accomplished. – Padraic Cunningham Jul 02 '14 at 21:46

5 Answers

3

Grab everything that can be repeated, then parse them individually. This is probably a good use case for named groups, as well!

import re

data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''

result = re.match(pat, data)
result.groups()
('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')

Then iterate through it:

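g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split():  # split on whitespace (assumes no spaces inside values)
    key, value = keyvalue.split('=', 1)  # split on the first '=' only
    value = value.strip('"')  # drop the surrounding quotes
    # do something with them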
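
Splitting on whitespace assumes no attribute value contains a space. A minimal sketch that avoids that assumption by matching each pair with its own pattern; it does assume values never contain a double quote, and `ATTR_RE` and `parse_attrs` are illustrative names, not part of the answer above:

```python
import re

# Hypothetical helper pattern: one name="value" pair.
# Assumes a value never contains a double quote.
ATTR_RE = re.compile(r'([a-z]+)="([^"]*)"')

def parse_attrs(attr_string):
    """Return a dict mapping attribute names to their unquoted values."""
    # findall returns a list of (name, value) tuples, which dict() accepts.
    return dict(ATTR_RE.findall(attr_string))

attrs = parse_attrs('set="one" reset="two words"')
print(attrs)
```

The same call could be applied to the fourth group of the match above, e.g. `parse_attrs(result.group(4))`.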
Adam Smith
  • `kv = lambda x: x.split('=')` and `{key: val for key, val in [kv(x) for x in m.group(4).split()] }` gives me everything I need in a dictionary. Thanks! – NuclearPeon Jul 02 '14 at 21:30
  • @NuclearPeon skip the conflating lambda! just do `dict([x.split("=") for x in m.group(4).split()])` – Adam Smith Jul 02 '14 at 21:46
  • Thanks, I attempted to do it that way, but got a bunch of errors so I resigned to the lambda. This clears it right up! *Edit: I should have known better* – NuclearPeon Jul 02 '14 at 21:47
2

One way, using the third-party regex module (installed separately from PyPI):

#!/usr/bin/python

import regex

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    (?>
        \G(?<!^)
      |
        <!-- \# (?<function> [a-z]+ )
    )
    \s+
    (?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''

matches = regex.finditer(p, s)

for m in matches:
    if m.group("function"):
        print ("function: " + m.group("function"))
    print (" key:   " + m.group("key") + "\n value: " + m.group("val") + "\n")

The same idea with the built-in re module:

#!/usr/bin/python

import re

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    <!-- \# (?P<function> [a-z]+ )
    \s+
    (?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
    -->
'''

matches = re.finditer(p, s)

for m in matches:
    print ("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1):
            print (" value: " + param.group(1) + "\n")
        else:
            print (" key:   " + param.group())
Casimir et Hippolyte
1

I recommend against using a single regular expression to capture every item in a repeating group. Instead--and unfortunately, I don't know Python, so I'm answering it in the language I understand, which is Java--I recommend first extracting all attributes, and then looping through each item, like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllAttributesInTagWithRegexLoop  {
   public static final void main(String[] ignored)  {
      String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";

      Matcher m = Pattern.compile(
         "<!--#(include|echo|set) +(.*)-->").matcher(input);

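      // group() throws IllegalStateException unless the match succeeded
      if (!m.matches())  {
         throw new IllegalArgumentException("Not an SSI tag: " + input);
      }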

      String tagFunc = m.group(1);
      String allAttrs = m.group(2);

      System.out.println("Tag function: " + tagFunc);
      System.out.println("All attributes: " + allAttrs);

      m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
      while(m.find())  {
         System.out.println("name=\"" + m.group(1) + 
            "\", value=\"" + m.group(2) + "\"");
      }
   }
}

Output:

Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
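
Since the question is about Python, here is a rough translation of the same two-step idea using the built-in `re` module. It is only a sketch: the two patterns are copied from the Java code, while `TAG_RE`, `ATTR_RE`, and `parse_tag` are illustrative names.

```python
import re

# Same two patterns as the Java code, written for Python's re module.
TAG_RE = re.compile(r'<!--#(include|echo|set) +(.*)-->')
ATTR_RE = re.compile(r'(\w+)="([^"]+)"')

def parse_tag(text):
    """Return (tag function, list of (name, value) pairs), or None if no match."""
    # fullmatch mirrors Java's Matcher.matches(): the whole string must match.
    m = TAG_RE.fullmatch(text)
    if m is None:
        return None
    tag_func, all_attrs = m.groups()
    # Second pass: pull every name="value" pair out of the attribute string.
    return tag_func, ATTR_RE.findall(all_attrs)

tag_func, pairs = parse_tag(
    '<!--#include file="/var/www/localhost/index.html" set="one" -->')
print("Tag function:", tag_func)
for name, value in pairs:
    print('name="%s", value="%s"' % (name, value))
```

Wrapping `pairs` in `dict(...)` would give a name-to-value mapping, at the cost of dropping repeated names.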

Here's an answer that may be of interest: https://stackoverflow.com/a/23062553/2736496


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

aliteralmind
  • +1 for an answer in Java about a Python regex tested on PHP regex tester. – Casimir et Hippolyte Jul 02 '14 at 20:39
  • @CasimiretHippolyte: If you are referring to the webpage, regex101.com, then it does have the option to test regex in python which I have selected. – NuclearPeon Jul 02 '14 at 20:42
  • @aliteralmind: While I cannot use Java, I sincerely appreciate the effort in answering. I realize this question may be considered spam, seeing as there are many questions that ask variations of this. I've been reading various articles on regex, including the python regular expression docs (which I've read more than once). It's hard to wrap my head around. Thank you. – NuclearPeon Jul 02 '14 at 20:47
  • @AdamSmith Regarding jwz’s quip, it is true only insofar as a little knowledge being always a dangerous thing: ***“Perilous to us all are the devices of an art deeper than we possess ourselves.”*** – tchrist Jul 02 '14 at 20:56
  • @NuclearPeon: Glad to help. I just wanted to express the idea of iterating through the groups, as opposed to trying to do it in one big mega regex. Wrong language, but same concepts. – aliteralmind Jul 02 '14 at 21:13
  • I'm getting really good insight into what regex *should* be used for from this question. Fancy dynamic programming, not so much... – NuclearPeon Jul 02 '14 at 21:15
0

Unfortunately, Python's built-in re module does not support recursive regular expressions.
You can do this instead:

import re

string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
# Raw string; the closing quote now sits outside the "value" group
regexString = r'''<!--#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()  # assumes no whitespace inside values
for item in keyVal:
    key, val = item.split('=', 1)  # assignment, not "in"; split on the first '=' only
    val = val.strip('"')  # drop the surrounding quotes
    # You can now do whatever you want with the key=val pair
Chrispresso
0

The third-party regex library keeps every capture of a repeated group (the built-in re module keeps only the last one). This allows a simple solution with no separate for-loop to parse the groups afterwards.

import regex

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
    r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')

match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n  ')

This prints what you're after:

include
  file = /var/www/localhost/index.html
  set = one
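
For comparison, a small sketch of what the built-in `re` module does with the same repeated group: each named group holds only its last repetition, which is why solutions based on `re` need a second pass. The pattern is adapted to `re`'s `(?P<name>...)` syntax, and the variable names are illustrative.

```python
import re

s = '<!--#include file="/var/www/localhost/index.html" set="one" -->'

# Same shape as the regex-module pattern, using re's named-group syntax.
pat = re.compile(
    r'<!--#(?P<fun>[a-z]+)(?:\s+(?P<key>[a-z]+)\s*=\s*"(?P<val>[^"]*)")+')

m = pat.match(s)
# The group repeats twice, but only the final key/value pair is retained.
last_pair = (m.group('key'), m.group('val'))
print(m.group('fun'), last_pair)  # include ('set', 'one')
```

The earlier `file` pair is matched but not retrievable from the match object.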
codeMonkey