My code only returns an empty string, and I have no idea why.

import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse

    start = page.find('<a img=')
    end = page.find('>', start)

    img = page[start:end]

    return img

It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.

5 Answers

You should use a library for this and there are several out there, but to answer your question by changing the code you showed us...

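Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Because the text '<a img=' never appears in the page, find returns -1 and the slice comes out empty. Here is an example of an image tag: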

<img src="smiley.gif" alt="Smiley face" height="42" width="42">

What you should do is change your start = page.find('<a img=') line to start = page.find('<img '), and slice up to end+1 so the closing > is included, like so:

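import urllib2  # Python 2; in Python 3 use urllib.request

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse

    start = page.find('<img ')  # index of the first <img> tag, or -1 if none
    if start == -1:
        return ''  # no image tag on the page
    end = page.find('>', start)  # closing bracket of that tag

    img = page[start:end+1]  # +1 so the closing '>' is included
    return img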
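To see why the original search string always produces an empty result, here is a small offline sketch. It assumes Python 3 and a made-up HTML string (`page`) instead of a real download, and the helper name `slice_tag` is only illustrative:

```python
# Hypothetical sample page; no network access is needed for this demo.
page = '<html><body><img src="smiley.gif" alt="Smiley face"></body></html>'

# What the question's code does with the wrong marker:
start = page.find('<a img=')   # marker is absent, so find() returns -1
end = page.find('>', start)    # the search now starts at the last character
original = page[start:end]     # slice from the last character to itself: empty

def slice_tag(page, marker):
    """Return the first tag that starts with marker, or '' if there is none."""
    start = page.find(marker)
    if start == -1:
        return ''
    end = page.find('>', start)
    if end == -1:
        return ''
    return page[start:end + 1]  # +1 keeps the closing '>'

print(repr(original))
print(slice_tag(page, '<img '))
```

The first print shows the empty string the question describes; the second shows the whole tag once the marker is '<img '.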
bohney

Consider using BeautifulSoup to parse your HTML:

from BeautifulSoup import BeautifulSoup
import urllib
url  = 'http://www.google.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for img in soup.findAll('img'):
    print img['src']
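If installing a third-party parser is not an option, roughly the same loop can be sketched with Python 3's standard library. Note that this swaps BeautifulSoup for `html.parser.HTMLParser`; the class name `ImgSrcCollector` and the sample HTML are assumptions made for illustration:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

# Hypothetical HTML standing in for a downloaded page.
sample_html = ('<p><img src="smiley.gif" alt="Smiley face">'
               '<a href="/x"><img src="logo.png"/></a></p>')

collector = ImgSrcCollector()
collector.feed(sample_html)
for src in collector.sources:
    print(src)
```

Unlike the plain `find` approach, this visits every image on the page and skips tags that have no src attribute.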
tehmisvh

Article on screen scraping with Ruby: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/ It's not about scraping images, but it's a good article and may help.

Jake Sellers

Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:

Community
Achim
  • Knowing how to use regex is a useful skill, but it's not a "better option" for web scraping in any way whatsoever. – root Oct 17 '12 at 15:40

Some instructions that might be of help:

  1. Use Google Chrome. Hover over the image and right-click. Select "Inspect element". That opens a panel where you can see the HTML around the image.

  2. Use Beautiful Soup to parse the html:

    from BeautifulSoup import BeautifulSoup
    import urllib2

    request = urllib2.Request(url)  # url: address of the page to scrape
    response = urllib2.urlopen(request)
    html = response.read()
    soup = BeautifulSoup(html)
    imgs = soup.findAll("img")
    items = []
    for img in imgs:
        print img['src'] #print the image location
        items.append(img['src']) #store the locations for downloading later
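One thing to watch when storing locations for later download: a src value is often relative to the page it came from. A minimal sketch of resolving them, assuming Python 3 (`urllib.parse.urljoin`; in Python 2 the same function lives in `urlparse`), with a made-up page URL and made-up src values:

```python
from urllib.parse import urljoin

# Hypothetical page address and src values, standing in for scraped data.
page_url = 'http://www.example.com/gallery/index.html'
items = ['smiley.gif', '/static/logo.png', 'http://cdn.example.com/banner.jpg']

# Resolve every src against the page URL; absolute URLs pass through unchanged.
absolute = [urljoin(page_url, src) for src in items]

for location in absolute:
    print(location)
```

Each entry in `absolute` can then be handed to whatever download step you use.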
    
glglgl
martincho