
I'm having a hard time extracting a URL from a piece of text using Python

I got the text from the style attribute of a tag with Beautiful Soup. The text is always:

background:url(//somedomaine.com/annonces/103028/large.jpg) no-repeat center center

My goal is to extract "//somedomaine.com/annonces/103028/large.jpg", but I'm new to regex. I tried using the "$" modifier with "url", but it didn't help.

Souames

3 Answers


background:url\(([^\)]+)\)

This regex looks for the literal text background:url( and then captures everything up to the first ) it encounters.
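As a minimal sketch of how that pattern could be applied in Python (the variable names `style`, `pattern`, and `url` are illustrative and not part of the answer), assuming the style text from the question:

```python
import re

# The style text from the question.
style = "background:url(//somedomaine.com/annonces/103028/large.jpg) no-repeat center center"

# Same idea as the pattern above: [^)]+ stops at the first closing parenthesis.
pattern = re.compile(r"background:url\(([^)]+)\)")

# search() returns None when nothing matches, so guard before calling group().
match = pattern.search(style)
url = match.group(1) if match else None
print(url)  # //somedomaine.com/annonces/103028/large.jpg
```

Using `search` rather than `match` means the pattern would still be found if other declarations came before `background:` in the style attribute.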


Nick Reed
  • Couldn't you just do a non-greedy match there and the character group wouldn't be necessary? `\((.+?)\)`? – Green Cloak Guy Oct 13 '19 at 18:34
  • You absolutely could, and it would satisfy OP's requirements, too (hence my upvote on your answer). My answer was just force of habit - whenever HTML/CSS is involved, I often use negated character classes, since they'll match across lines. It improves flexibility, since HTML/CSS lets the tag opening and closing be on different lines. – Nick Reed Oct 13 '19 at 18:37
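The difference discussed in these comments can be sketched as follows. The multi-line input is a contrived assumption for illustration only (the question's text is always on one line), and the variable names are hypothetical:

```python
import re

# Contrived input with a line break inside the parentheses.
multiline = "background:url(//somedomaine.com/annonces/\n103028/large.jpg)"

# Non-greedy dot: "." does not match "\n" by default, so this finds nothing.
lazy = re.search(r"url\((.+?)\)", multiline)

# Negated character class: [^)] matches any character except ")", newlines included.
negated = re.search(r"url\(([^)]+)\)", multiline)

# The non-greedy form behaves the same once re.DOTALL lets "." match newlines.
lazy_dotall = re.search(r"url\((.+?)\)", multiline, re.DOTALL)

print(lazy)                       # None
print(repr(negated.group(1)))     # '//somedomaine.com/annonces/\n103028/large.jpg'
print(lazy_dotall.group(1) == negated.group(1))  # True
```

For the single-line text in the question, both forms capture the same URL.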

Here's an incredibly generic match:

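import re

text = "background:url(//somedomaine.com/annonces/103028/large.jpg) no-repeat center center"
# Greedy (.*) captures everything between "background:url(" and the fixed suffix.
regstr = r"background:url\((.*)\) no-repeat center center"

# re.match anchors at the start of the string.
x = re.match(regstr, text)
print(x.group(1))  # //somedomaine.com/annonces/103028/large.jpg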

The regex here is very straightforward: match the largest possible run of arbitrary characters (.*) surrounded by the given text ("background:url(" in front, ") no-repeat center center" behind).
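Because the pattern hard-codes the trailing text, it only matches when the style ends exactly that way. A small sketch of guarding against a failed match, where `extract_url` is a hypothetical helper name and the second input is an invented variation:

```python
import re

regstr = r"background:url\((.*)\) no-repeat center center"

def extract_url(text):
    """Return the captured URL, or None when the text does not fit the pattern."""
    m = re.match(regstr, text)
    return m.group(1) if m else None

same = extract_url(
    "background:url(//somedomaine.com/annonces/103028/large.jpg) no-repeat center center"
)
# Invented variation: different trailing keywords, so the fixed suffix no longer matches.
different = extract_url(
    "background:url(//somedomaine.com/annonces/103028/large.jpg) repeat top left"
)

print(same)       # //somedomaine.com/annonces/103028/large.jpg
print(different)  # None
```

Since the question says the text is always the same, the fixed suffix is fine there; the guard just avoids an AttributeError on `None` if that assumption ever breaks.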

Green Cloak Guy

If you want a non-regex solution that just searches for substrings:

url = text[text.find('url(') + 4: text.find(')')]

Not robust for text where ")" appears before "url(", or for URLs that themselves contain ")".
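One possible way to tighten this up without regex is `str.partition`, which only looks for the closing parenthesis after `url(`. This is a sketch, and `url_from_style` is a hypothetical helper name; it still cannot handle a URL that itself contains ")":

```python
def url_from_style(text):
    """Return the text between 'url(' and the next ')', or None if either is missing."""
    # partition() splits at the first occurrence and returns (before, separator, after).
    _, found, rest = text.partition("url(")
    if not found:
        return None
    url, closed, _ = rest.partition(")")
    return url if closed else None

text = "background:url(//somedomaine.com/annonces/103028/large.jpg) no-repeat center center"
print(url_from_style(text))  # //somedomaine.com/annonces/103028/large.jpg

# Invented input with a ")" before "url(": the slicing one-liner returns '' here.
tricky = "a) background:url(x.jpg)"
print(url_from_style(tricky))  # x.jpg
```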

modesitt