I have a regular expression to fetch some links from an HTML document:

((http://)(|up)(\.example\.com))*(/uploads/pp2p|/sites/default/files/[-_a-zA-Z0-9%/]+)\.(jpg|jpeg|gif|png)

What I intend is for each part to be matched only when it is present: match the http:// part if it exists, the up part if it exists, and example.com if it exists. The same applies to /uploads/pp2p and the other path. Finally, match the link only if it ends in one of the listed image formats. I expect to get a list of links like

links = ['http://up.example.com/uploads/pp2p/www.jpg', '/sites/default/files/.png', 'http://example.com/uploads/zzz.jpg']

The list goes on with different combinations of these parts. However, I am getting each result as a tuple, like

[('', '', '', '', '/sites/default/files/favicon', 'png'), ('', '', '', '', '/sites/default/files/logo_2', 'png')]

I don't want tuples; I want each match represented as a whole, with one complete link per list element. How can I avoid getting a tuple as the result of the regex match?

FreeMind

1 Answer

I am assuming you are getting images off a web page somewhere.

Here's a quick way to grab all the image src links using lxml.html:

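from lxml.html import parse
import re

# Parse the page and get the root <html> element
doc = parse('http://www.androidpolice.com').getroot()

# Collect the src of every <img>, skipping tags without one
# (cssselect() needs the separate cssselect package installed)
links = [img.get('src') for img in doc.cssselect('img') if img.get('src')]

# Keep only the links that mention androidpolice.com
img_list = []
for link in links:
    match = re.search(r".*androidpolice\.com.*", link)
    if match:
        img_list.append(match.group(0))

for img in img_list:
    print(img)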

Gives the output:

http://www.androidpolice.com/wp-content/themes/ap2/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.androidpolice.com%2Fwp-content%2Fuploads%2F2015%2F05%2Fnexus2cee_gamethumb_thumb1.png&h=128&zc=3
http://www.androidpolice.com/wp-content/themes/ap2/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.androidpolice.com%2Fwp-content%2Fuploads%2F2015%2F05%2Fnexus2cee_gamethumb_thumb1.png&w=150&h=75&f=8|8|8|8|8|8|8|8|8|8|8|8|8
http://www.androidpolice.com/wp-content/themes/ap2/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.androidpolice.com%2Fwp-content%2Fuploads%2F2014%2F06%2Fnexusae0_Google-Photos-icon-logo-150x150.png&h=128&zc=3

-----[output truncated]-----

Then you could apply a pattern such as (?:%2F)([\w-]+\.(?:png|jpg)) to extract the image file names (just an example, of course), e.g. nexus2cee_gamethumb_thumb1.png
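As a rough sketch of that last step, the snippet below applies the pattern to one URL copied from the output above. The variable names are only illustrative. Because the pattern has exactly one capturing group, re.findall returns plain strings here rather than tuples.

```python
import re

# One URL copied from the sample output above, split across lines for readability
url = (
    "http://www.androidpolice.com/wp-content/themes/ap2/ap_resize/ap_resize.php"
    "?src=http%3A%2F%2Fwww.androidpolice.com%2Fwp-content%2Fuploads%2F2015%2F05"
    "%2Fnexus2cee_gamethumb_thumb1.png&h=128&zc=3"
)

# (?:%2F) is non-capturing, so the file name is the only captured group
name_pattern = re.compile(r"(?:%2F)([\w-]+\.(?:png|jpg))")

names = name_pattern.findall(url)
print(names)  # ['nexus2cee_gamethumb_thumb1.png']
```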

Updated Code

Changed it to search only for androidpolice.com in each link. You can find more on using re module at 6.2. re — Regular expression operations.
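On the tuples in the question: assuming the results came from re.findall, that is its documented behaviour when a pattern contains more than one capturing group. A minimal sketch of two ways around it follows. The HTML string is made up for illustration, and the non-capturing pattern is just the question's pattern with every group rewritten as (?:...).

```python
import re

# Sample HTML invented for illustration; the file names reuse those in the question
html = (
    '<img src="/sites/default/files/favicon.png">'
    '<img src="http://up.example.com/sites/default/files/logo_2.png">'
)

# The question's pattern: every (...) captures, so findall returns tuples
capturing = (
    r"((http://)(|up)(\.example\.com))*"
    r"(/uploads/pp2p|/sites/default/files/[-_a-zA-Z0-9%/]+)\.(jpg|jpeg|gif|png)"
)

# Same pattern with each group made non-capturing, so findall returns whole matches
non_capturing = (
    r"(?:(?:http://)(?:|up)(?:\.example\.com))*"
    r"(?:/uploads/pp2p|/sites/default/files/[-_a-zA-Z0-9%/]+)\.(?:jpg|jpeg|gif|png)"
)

tuples = re.findall(capturing, html)
links = re.findall(non_capturing, html)

# Alternative: keep the capturing groups and take group(0), the entire match
links_via_finditer = [m.group(0) for m in re.finditer(capturing, html)]

print(tuples[0])  # ('', '', '', '', '/sites/default/files/favicon', 'png')
print(links)
```

Either way, each list element is one complete link; the finditer variant is handy if the individual groups are still needed elsewhere.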

flamusdiu