Same function behaves differently on the same type of data

Question

I've probably spent too long on this already, but I can't work out why I'm getting a FileNotFoundError: [Errno 2] No such file or directory when the only difference I can see is the link. I'm using Beautiful Soup.

Objective: download an image and save it in a different folder. This works fine except for some .jpg files. I've tried different kinds of paths and stripping the file names, but the problem stays the same.

Test images:

http://img2.rtve.es/v/5437650?w=1600&preview=1573157283042.jpg  # does not work

http://img2.rtve.es/v/5437764?w=1600&preview=1573172584190.jpg  # works perfectly

Here is the function:

def get_thumbnail():
    '''
    Download image and place in the images folder
    '''
    soup = BeautifulSoup(r.text, 'html.parser')

    # Get thumbnail image
    for preview in soup.findAll(itemprop="image"):
        preview_thumb = preview['src'].split('//')[1]

    # Download image
    url = 'http://' + str(preview_thumb).strip()
    path_root = Path(__file__).resolve().parents[1]
    img_dir = str(path_root) + '\\static\\images\\'

    urllib.request.urlretrieve(url, img_dir + show_id() + '_' + get_title().strip() + '.jpg')

Other functions used:

def show_id():
  for  image_id in soup.findAll(itemprop="image"): 
    preview_id = image_id['src'].split('/v/')[1]
    preview_id = preview_id.split('?')[0]
  return preview_id

def get_title():
  title = soup.find('title').get_text()
  return title

All I can work out is that the problem must be in finding the images folder for the first image, yet the second one works perfectly.

I keep getting the FileNotFoundError mentioned above, and it seems to be breaking at request.py.

Thanks for any input.

alecxe · Accepted Answer · 2019-11-09T18:14:46.367

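It's quite likely that "special characters" in the image title, which you use as the filename, are throwing off urlretrieve() (and the open() call inside it). A slash in particular is treated as a path separator, so open() looks for a directory that does not exist: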

>>> from urllib import urlretrieve  # Python 3: from urllib.request import urlretrieve
>>> url = "https://i.stack.imgur.com/1RUYX.png"

>>> urlretrieve(url, "test.png")  # works
('test.png', <httplib.HTTPMessage instance at 0x10b284a28>)

>>> urlretrieve(url, "/tmp/test 07/11/2019.png") # does not work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 249, in retrieve
    tfp = open(filename, 'wb')
IOError: [Errno 2] No such file or directory: '/tmp/test 07/11/2019.png'
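The same failure can be reproduced on Python 3 without any network access, which suggests the download itself is not the culprit. This is only a minimal sketch: try_write is a hypothetical helper name (not part of urllib), and the filenames are the ones from the session above.

```python
import os
import tempfile


def try_write(directory, filename):
    """Try to create an empty file; return None on success, else the exception."""
    path = os.path.join(directory, filename)
    try:
        # urlretrieve() opens its target the same way: open(filename, 'wb')
        with open(path, "wb") as fh:
            fh.write(b"")
    except OSError as exc:
        return exc
    return None


with tempfile.TemporaryDirectory() as tmp:
    ok = try_write(tmp, "test.png")              # plain name: file is created
    bad = try_write(tmp, "test 07/11/2019.png")  # slashes: needs dirs "test 07/11"

print(ok)                  # None
print(type(bad).__name__)  # FileNotFoundError
```

The second call fails because the slashes turn the title into a nested path whose parent directories were never created.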

In other words, the image titles must be sanitized before you use them as filenames. I'd just "slugify" them to avoid the problem altogether. One way to do it is to use a slugify module (python-slugify, for example, provides this import):

import os
import urllib.request

from slugify import slugify

# url and img_dir are the values built in get_thumbnail()
image_filename = slugify(show_id() + '_' + get_title().strip()) + '.jpg'
image_path = os.path.join(img_dir, image_filename)
urllib.request.urlretrieve(url, image_path)

For instance, this is what slugify does to the image name test 07/11/2019:

>>> slugify("test 07/11/2019")
'test-07-11-2019'
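If adding a third-party dependency is not an option, a rough standard-library approximation might look like the sketch below. The names safe_filename and build_image_path are hypothetical helpers made up for illustration; this is not how the slugify module is implemented, and it handles fewer edge cases (non-ASCII characters that have no ASCII decomposition are simply dropped).

```python
import re
import unicodedata
from pathlib import Path


def safe_filename(text, replacement="-"):
    """Reduce text to lowercase ASCII letters and digits joined by `replacement`."""
    # Decompose accented characters, then drop whatever is not ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters (spaces, slashes, ...) into one separator.
    cleaned = re.sub(r"[^A-Za-z0-9]+", replacement, ascii_only)
    return cleaned.strip(replacement).lower()


def build_image_path(img_dir, show_id, title):
    """Join the image folder with a sanitized '<id>_<title>.jpg' style name."""
    return Path(img_dir) / (safe_filename(f"{show_id}_{title.strip()}") + ".jpg")


print(safe_filename("test 07/11/2019"))  # test-07-11-2019
```

The path returned by build_image_path can be passed straight to urllib.request.urlretrieve(url, path), since it no longer contains separators that came from the page title.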
