Image scraping program in Python not functioning as intended

Question

My code only returns an empty string, and I have no idea why.

import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read() #Gives HTML to parse

    start = page.find('<a img=')
    end = page.find('>', start)

    img = page[start:end]

    return img

It would only return the first image it finds, so it's not a very good image scraper; that said, my primary goal right now is just to be able to find an image. I'm unable to.

bohney · Answer 1 · 2012-10-17T15:06:45.717

You should use a library for this, and there are several out there, but here is how to fix the code you showed us.

Your problem is that you are trying to find images, but images don't use the <a ...> tag. They use the <img ...> tag. Here is an example:

<img src="smiley.gif" alt="Smiley face" height="42" width="42">

Change your start = page.find('<a img=') line to start = page.find('<img '), and slice up to end+1 so the closing > is included, like so:

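import urllib2

def getImage(url):
    page = urllib2.urlopen(url)
    page = page.read()  # Gives HTML to parse

    start = page.find('<img ')   # index of the first <img ...> tag
    end = page.find('>', start)  # index of the '>' that closes that tag

    img = page[start:end+1]      # +1 so the closing '>' is included
    return img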
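
For what it's worth, here is a small sketch of why the original version came back empty. It is written for Python 3 (unlike the Python 2 code in this thread) and uses a made-up sample string instead of a real page, so the names and the HTML are assumptions for illustration only. The point is that str.find returns -1 when the substring is missing, and slicing from -1 then yields nothing.

```python
# Hypothetical sample page; any HTML that lacks the literal text '<a img='
# behaves the same way.
page = '<html><body><img src="smiley.gif" alt="Smiley face"></body></html>'

start = page.find('<a img=')  # substring is not present, so find() returns -1
end = page.find('>', start)   # a start of -1 means "search from the last character"

img = page[start:end]         # slice begins at the last character and ends before it
print(repr(start), repr(img))
```

Because the failure is silent, checking for start == -1 before slicing is a cheap way to notice that nothing was found.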

I just tried my suggested `getImage` function on http://yahoo.com and got this: `` — bohney, Oct 17 '12 at 15:03
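
Since the question mentions that this only returns the first image, here is a minimal sketch of the same str.find approach extended to collect every tag. It is Python 3, the helper name find_img_tags is hypothetical, and it works on a string you have already downloaded. It shares the limits of the approach above: it misses tags written as <IMG or with a newline after img, and it breaks if a > appears inside an attribute value.

```python
def find_img_tags(page):
    """Return every '<img ...>' tag found in page, in document order."""
    tags = []
    pos = 0
    while True:
        start = page.find('<img ', pos)
        if start == -1:          # no more tags
            break
        end = page.find('>', start)
        if end == -1:            # unterminated tag at the end of the page
            break
        tags.append(page[start:end + 1])
        pos = end + 1            # continue searching after this tag
    return tags


sample = '<p><img src="a.png"><a href="x"><img src="b.gif" alt="b"></a></p>'
print(find_img_tags(sample))
```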

score 2 · Answer 2 · answered Oct 17 '12 at 15:04

Consider using BeautifulSoup to parse your HTML:

from BeautifulSoup import BeautifulSoup
import urllib

url = 'http://www.google.com'
html = urllib.urlopen(url).read()  # fetch the raw HTML
soup = BeautifulSoup(html)

# Print the src attribute of every <img> tag on the page
for img in soup.findAll('img'):
    print img['src']
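
If installing a third-party package is not an option, the same idea can be sketched with Python 3's standard-library html.parser. To be clear, this is a swapped-in technique, not BeautifulSoup, and the class and function names (ImgSrcCollector, collect_img_sources) are hypothetical. It parses a string you have already fetched.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


def collect_img_sources(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


print(collect_img_sources('<IMG SRC="a.png"><img alt="x"><img src="b.gif"/>'))
```

Unlike the plain str.find version, this handles upper-case tags, self-closing tags, and images that have no src attribute at all.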

tehmisvh

score 0 · Answer 3 · answered Oct 17 '12 at 14:57

An article on screen scraping with Ruby: http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/ It's not about scraping images, but it's a good article and may help.

Jake Sellers

score 0 · Answer 4 · edited May 23 '17 at 11:56

Extracting the image information this way is not a good idea. There are several better options, depending on your knowledge and your motivation to learn something new:

http://scrapy.org/ is a very good framework for extracting data from web pages. As it looks like you're a beginner, it might be a bit of overkill.
Learn regular expressions to extract the information: http://docs.python.org/library/re.html and Learning Regular Expressions
Use http://www.crummy.com/software/BeautifulSoup/ to parse data from the result of page.read().
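
To make the regular-expression option concrete, here is a rough Python 3 sketch. The pattern and the name extract_img_sources are my own assumptions, not something from the linked documentation, and the pattern is fragile: it only handles quoted src values and would also pick up an attribute such as data-src if it comes first. Treat it as an illustration of the re module rather than a robust scraper.

```python
import re

# Matches an <img ...> tag and captures the quoted value of its src attribute.
IMG_SRC_RE = re.compile(
    r'''<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']''',
    re.IGNORECASE,
)


def extract_img_sources(page):
    """Return the src values of the <img> tags found in page."""
    return IMG_SRC_RE.findall(page)


sample = '<p><img src="smiley.gif" alt="Smiley face"><IMG class=\'x\' SRC=\'b.png\'></p>'
print(extract_img_sources(sample))
```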

answered Oct 17 '12 at 14:59

Achim

Knowing how to use regex is a useful skill, but it's not a "better option" for web scraping in any way whatsoever. – root Oct 17 '12 at 15:40

score 0 · Answer 5 · edited Oct 17 '12 at 15:10

Some instructions that might be of help:

Use Google Chrome. Hover over the image and right-click. Select "Inspect element". That will open a panel where you can see the HTML around the image.

Use Beautiful Soup to parse the html:

import urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://www.google.com'  # placeholder URL (assumption); use the page you want to scrape
request = urllib2.Request(url)
response = urllib2.urlopen(request)
html = response.read()
soup = BeautifulSoup(html)
imgs = soup.findAll("img")
items = []
for img in imgs:
    print img['src']  # print the image location
    items.append(img['src'])  # store the locations for downloading later
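
One thing to watch when storing locations for later download: src values are often relative to the page. A minimal Python 3 sketch of resolving them with the standard library's urllib.parse.urljoin follows. The helper name absolute_image_urls and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, sources):
    """Resolve each src value against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in sources]


page_url = 'http://example.com/gallery/index.html'  # hypothetical page
sources = ['smiley.gif', '/img/a.png', 'http://cdn.example.com/b.gif']
print(absolute_image_urls(page_url, sources))
```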
