
I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:

<div id="anId" class="aClass">

I want to be able to get an array that looks like this:

["(full match)", "div", "id", "anId", "class", "aClass"]

Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.

var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));

Regex101: https://regex101.com/r/G0ncwF/2

Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)

Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me ("Python re.finditer match.groups() does not contain all groups from match" looked promising, but didn't seem to help me at all).

(note: I know there are better ways to get the attributes, I'm just experimenting with regex)

UPDATE:

I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.

var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");

So at least that's solved. But I still don't understand why it skips over every attribute but the last one.

And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.

Lemondoge

1 Answer

Playing along with your experimentation, you could do this: instead of scanning for what you want, scan for what you don't want, and then filter it out:

const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));

Output:

result: ["div","id","anId","class","aClass"]

Explanation:

  • regex /[<> ="]/ lists all chars you don't want
  • .split(regex) splits your text at the unwanted chars
  • .filter(Boolean) gets rid of the empty strings that the split leaves between adjacent unwanted chars
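
To see what each step contributes, here is a small sketch that prints the intermediate array before filtering. The variable names pieces and names are just for illustration:

```javascript
const html = '<div id="anId" class="aClass">';

// Raw split result: wherever two unwanted chars sit next to each other
// (for example `="` or `">`), split() leaves an empty string between them.
const pieces = html.split(/[<> ="]/);
console.log(JSON.stringify(pieces));

// filter(Boolean) keeps only truthy entries, which drops the empty strings.
const names = pieces.filter(Boolean);
console.log(JSON.stringify(names));
```

Note that an attribute with an empty value, like alt="", would also be dropped by the filter.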

Mind you, this has flaws. For example, it splits incorrectly on <div id="anId" class="aClass anotherClass">, i.e. whenever an attribute value contains a space. To support that, you could preprocess the HTML with another regex to escape spaces inside quotes, then postprocess with another regex to restore them...
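
A rough sketch of that escape/restore idea, assuming double-quoted values only and assuming the placeholder character never appears in the input. PLACEHOLDER is a made-up name, and "\u0000" is just one possible choice:

```javascript
const html = '<div id="anId" class="aClass anotherClass">';

// Hypothetical marker; assumed to never occur in the real input.
const PLACEHOLDER = "\u0000";

// Preprocess: inside every double-quoted value, swap spaces for the placeholder.
const escaped = html.replace(/"[^"]*"/g, (quoted) =>
  quoted.split(" ").join(PLACEHOLDER)
);

// Split and filter as before, then postprocess: turn placeholders back into spaces.
const result = escaped
  .split(/[<> ="]/)
  .filter(Boolean)
  .map((part) => part.split(PLACEHOLDER).join(" "));

console.log("result: " + JSON.stringify(result));
```

This only handles spaces. Values containing <, > or = would still be split in the wrong place, so it stays an experiment rather than a parser.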

Yes, an HTML parser is more reliable for this kind of task.

Peter Thoeny
  • This does work, but I'm mostly looking for a reason why my original regex didn't work (and maybe why it "forgets" that it's lazy) – Lemondoge Dec 04 '20 at 22:13
  • Ah, because you defined 3 capture groups and 1 non-capture group in your original regex, the match result array has length 4. If you want to get a longer array you need to add more capture groups, as in this one `//` returns `[ "
    ", "div", "id", "anId", "class", "aClass" ]`
    – Peter Thoeny Jan 31 '21 at 01:17
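
To illustrate the array-length point from the comment above: a capture group inside a repeated group keeps only its last iteration, so a single match has one slot per capture group no matter how many attributes there are. The sketch below shows that with the question's regex, then one possible way around it using matchAll. The second part is not the regex from the comment (which is missing); the names tagName, attrs and combined and both helper regexes are assumptions for illustration:

```javascript
const str = '<div id="anId" class="aClass">';

// The question's regex: full match + 3 capture groups = 4 entries.
// Groups 2 and 3 are overwritten on each repetition, so only "class"/"aClass" survive.
const single = str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/);
console.log(single.length); // 4
console.log(JSON.stringify(Array.from(single)));

// One alternative: grab the tag name once, then collect every attribute
// with a global regex, one match per name="value" pair.
const tagName = str.match(/<([^\s>]+)/)[1];
const attrs = [...str.matchAll(/(\S+?)="(.*?)"/g)].flatMap((m) => [m[1], m[2]]);

const combined = [single[0], tagName, ...attrs];
console.log(JSON.stringify(combined));
```

Here combined has the shape asked for in the question, because each attribute gets its own match instead of sharing one pair of capture groups.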