
I am trying to change an HTML page that has inline styles. I want to write a regex that captures the background URL together with its selector. Example:

<div>some html here</div>
<style>#some-selector {
  padding-top: 408px;
}
#some-selector .bg {
  background-image: url(www.some-url.com/some-image.jpg);
}
#some-selector {
  background-position: 43% 97%;
}

What I want to capture here is #some-selector .bg and www.some-url.com/some-image.jpg. Keep in mind that the HTML page is big, so the expression should be fast.

I came up with this expression: <style[\s\S]*?[>}\/\n](.*){[\s\S]*?background.*?url\((.*?)\) but it's not working correctly. I know that the first [\s\S]* should be greedy, but when I remove the ? I get <style[\s\S]*[>}\/\n](.*){[\s\S]*?background.*?url\((.*?)\), which works on small strings but causes catastrophic backtracking on the whole page. I've used regex101 to test it.

Any help is appreciated

Edit: here's an example https://regex101.com/r/ZMxOSz/1

  • Which tool or language are you using? It is not a good idea to parse CSS with regex; you should use a CSS parser instead. –  Jul 08 '20 at 19:08
  • I am using PHP, I think extracting CSS and parsing it will take much more time than just using regex, and in this case, every ms matters – Lamari Taha Jul 09 '20 at 17:46
  • Nooo... a parser is always the best choice, since regexes may backfire if the CSS is dynamic. Read [this](https://stackoverflow.com/a/4234491/7571182) as to why it is a bad idea. –  Jul 09 '20 at 18:20

1 Answer


Update
After a closer look, I offer 2 solutions that mitigate the backtracking issue to a reasonable degree.
Before looking at them, I want to point out that CSS syntax uses only a very few delimiters.
What defines the syntax is mostly the order and content of the characters allowed between those delimiters.

The cure for backtracking is to restrict the regex engine to a smaller set of
allowable characters at strategic positions, so a failed attempt cannot run past the next delimiter.
If you look at the CSS specification here -> https://www.w3.org/TR/CSS21/syndata.html
you'll notice that its tokens are defined with regular expressions.
That suggests CSS parsers are largely built from small, chopped-up pieces of regex.
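To make the point about restricting characters concrete, here is a minimal sketch in Python (the question is about PHP; Python is used here only for illustration). The sample CSS is modelled on the question's snippet, and the two patterns, named loose and tight, are simplified illustrations rather than the regexes offered further down.

```python
import re

# Sample CSS modelled on the snippet in the question.
css = """#some-selector {
  padding-top: 408px;
}
#some-selector .bg {
  background-image: url(www.some-url.com/some-image.jpg);
}
"""

# Unrestricted: [\s\S]*? is allowed to run across rule boundaries, so the
# captured selector can belong to an earlier rule than the url() it is paired with.
loose = re.compile(r"(.*)\{[\s\S]*?url\((.*?)\)")

# Restricted: [^{}] can never cross a brace, so the selector and the url()
# must sit in the same rule, and a failing attempt ends at the next brace.
tight = re.compile(r"([^{}]*)\{[^{}]*?url\(([^{}()]*)\)")

print(loose.search(css).group(1).strip())  # -> #some-selector
print(tight.search(css).group(1).strip())  # -> #some-selector .bg
```

The loose pattern pairs the url() with the wrong selector because nothing stops it from skipping over the first rule. The tight one cannot do that, and it also gives the engine far less text to retry on each failure.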

However, while it would be an interesting exercise to put it all into one
all-encompassing regex, I will decline that challenge, because there is
nothing in it for me.

Instead, I offer these 2 regexes tailored to your request.

First one:

  • Matches only the first url() block within the <style> element

<style[^>]*?>(?:[^{}:]*{[^{}]*?:[^{}()]*?})*?(?:([^{}:]*){[^{}]*?:\s*url\s*\(\s*([^{}()]*?)\s*\)\s*})

see -> https://regex101.com/r/2SNIks/1


Second one:

  • Matches all the url() blocks within the <style> element

(?:<style[^>]*?>|(?!^)\G)(?:(?:(?!</style)[^{}:])*{[^{}]*?:[^{}()]*?})*?(?:([^{}:]*){[^{}]*?:\s*url\s*\(\s*([^{}()]*?)\s*\)\s*})

see -> https://regex101.com/r/d8q6LH/1


For both regexes:

  • The selector is in group 1
  • The url is in group 2
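As a rough illustration of reading the two groups, here is a sketch in Python (again, only for illustration; the question is about PHP). Python's re module has no \G anchor, so this is not the second regex itself but a swapped-in two-step approach: first isolate each <style> body, then scan it rule by rule. The names STYLE, RULE and background_urls are hypothetical, the .hero rule is made up for the example, and two things are assumptions of mine: that the CSS contains no < character, and that a trailing ; after url(...) should be tolerated (hence [^{}]* before the closing brace).

```python
import re

# Step 1: grab each <style> body. Assumes the CSS itself contains no "<".
STYLE = re.compile(r"<style[^>]*>([^<]*)", re.IGNORECASE)

# Step 2: one rule whose first matching declaration value is url(...).
# Group 1 is the selector, group 2 the URL. [^{}]* before the closing
# brace is an added assumption so that "url(...);" still matches.
RULE = re.compile(
    r"([^{}:]*)\{[^{}]*?:\s*url\s*\(\s*([^{}()]*?)\s*\)[^{}]*\}"
)

def background_urls(html):
    """Return (selector, url) pairs for every url() rule in <style> blocks."""
    pairs = []
    for style in STYLE.finditer(html):
        for rule in RULE.finditer(style.group(1)):
            pairs.append((rule.group(1).strip(), rule.group(2)))
    return pairs

# Sample modelled on the question, plus one made-up rule (.hero).
html = """<div>some html here</div>
<style>#some-selector {
  padding-top: 408px;
}
#some-selector .bg {
  background-image: url(www.some-url.com/some-image.jpg);
}
#some-selector {
  background-position: 43% 97%;
}
.hero { background: url(img/hero.png) }
</style>"""

for selector, url in background_urls(html):
    print(selector, "->", url)
```

The selector group picks up surrounding whitespace (the newline after the previous rule's closing brace), which is why it is stripped before use. Like the regexes above, the sketch skips selectors that contain a colon, such as a:hover.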
  • Thanks for your time, but your regex does not work on a real-world example. As I said, the HTML page is big, I added – Lamari Taha Jul 09 '20 at 17:33
  • Yes, I see the problem. Added some alternatives that hopefully work better for you. –  Jul 09 '20 at 21:48
  • Thanks the second one is working, really wanna upvote your answer but don't have enough reputation, I also got downvoted for my question :') – Lamari Taha Jul 10 '20 at 16:06
  • I upvoted you. By 'second one' do you mean on this page ? –  Jul 10 '20 at 19:21
  • I mean the second expr in your answer – Lamari Taha Jul 13 '20 at 15:22
  • Oh, I upvoted your question, you should have enough reputation to both Accept this answer and upvote it as well. That is if it helped you out. –  Jul 13 '20 at 19:48
  • I accepted the question, but need 15 rep to upvote, only have 11 :( – Lamari Taha Jul 14 '20 at 18:15