Download and save PDF file with Python requests module

Question

I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.

In [1]: import requests

In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

In [3]: response = requests.get(url)

In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
      1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2     f.write(response.text)
      3 

UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

In [5]: import codecs

In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:

I know it is a codec problem of some kind but I can't seem to get it to work.

score 184 · Accepted Answer · edited May 23 '17 at 12:18

You should use response.content in this case:

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

From the documentation:

You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

So that means: response.text returns the body as a string object. Use it when you're downloading a text file, such as an HTML file.

And response.content returns the body as a bytes object. Use it when you're downloading a binary file, such as a PDF file, an audio file, or an image.
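The blank PDFs produced by the codecs attempt in the question are consistent with this distinction: once binary data has been decoded to text, re-encoding it as UTF-8 does not generally give back the original bytes. A minimal sketch with no network access, assuming a made-up PDF-like byte string and Latin-1 as the decoding (the encoding requests actually guesses for a given response may differ):

```python
# Hypothetical stand-in for the first bytes of a PDF; real files differ.
raw = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

# Roughly what happens when a binary body is treated as text.
# Latin-1 is an assumption, chosen because it can decode any byte value.
as_text = raw.decode("latin-1")

# Re-encoding as UTF-8, as codecs.open(..., encoding='utf8') does on write.
round_tripped = as_text.encode("utf-8")

print(round_tripped == raw)          # compares the result with the original bytes
print(len(raw), len(round_tripped))  # every byte >= 0x80 is now two bytes long
```

Writing response.content avoids the decode step entirely, so the bytes on disk are the bytes the server sent.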

You can also use response.raw instead, which is worth considering when the file you're about to download is large. In practice, streaming is usually done with stream=True and response.iter_content rather than by reading response.raw directly. Below is a basic example, which you can also find in the documentation:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)

chunk_size is the chunk size you want to use. If you set it to 2000, requests reads the body in chunks of up to 2000 bytes, writes each chunk to the file, and repeats until the download is finished.

This saves RAM, because the whole file is never held in memory at once. But I'd prefer to use response.content in this case, since your file is small. As you can see, streaming is more complex.
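The streaming pattern can be wrapped in a small function. This is only a sketch: the name download_file, the default chunk size, the timeout, and the raise_for_status() check are assumptions added for illustration and are not part of the original answer.

```python
import requests


def download_file(url, dest_path, chunk_size=8192, timeout=30):
    """Stream the body at `url` into `dest_path` and return the bytes written."""
    written = 0
    # stream=True defers fetching the body until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=timeout) as response:
        # Raise on 4xx/5xx so an error page is not saved as if it were the file.
        response.raise_for_status()
        with open(dest_path, "wb") as fd:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fd.write(chunk)
                written += len(chunk)
    return written
```

With the URL from the question, this would be called as download_file(url, '/tmp/metadata.pdf').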

Cool, thank you for the additional information about response.raw. — Jim, Mar 01 '19 at 03:31

user6481870 · Answer 2 · 2020-08-04T03:06:40.487

27

In Python 3, I find pathlib is the easiest way to do this. Request's response.content marries up nicely with pathlib's write_bytes.

from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
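Path.write_bytes returns the number of bytes written, so an unexpectedly small number can hint that the server sent something other than the PDF, such as an error page. A minimal sketch of a guard, assuming a hypothetical helper save_if_pdf and stand-in bytes in place of a real response.content; PDF files normally begin with the bytes %PDF-, which is what the check relies on:

```python
from pathlib import Path
import tempfile


def save_if_pdf(data, dest):
    """Write `data` to `dest` only if it starts with the PDF signature."""
    # PDF files normally begin with the bytes "%PDF-".
    if not data.startswith(b"%PDF-"):
        raise ValueError("response body does not look like a PDF")
    # write_bytes returns the number of bytes written.
    return Path(dest).write_bytes(data)


# Usage with stand-in bytes instead of a real response.content.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "metadata.pdf"
    size = save_if_pdf(b"%PDF-1.4\n...", target)
    print(size, target.stat().st_size)
```

Checking response.status_code (or calling response.raise_for_status()) before writing is a complementary safeguard.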

edited Aug 04 '20 at 03:06

answered Nov 08 '18 at 08:39

user6481870

Thank you for posting this. The original question was Python 2.7 but I've moved on and now use Python 3. I didn't know about the pathlib library [new in version 3.4] and will incorporate it into my current projects. – Jim Nov 09 '18 at 17:50
It gives `544` and the file is broken, any ideas? – ah bon Apr 29 '20 at 08:34
@ahbon, what do you mean? – user6481870 May 01 '20 at 02:02

score 17 · Answer 3 · answered Oct 29 '19 at 19:56

You can use urllib:

import urllib.request

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'  # URL from the question
urllib.request.urlretrieve(url, "filename.pdf")
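urlretrieve takes no argument for request headers. When per-request headers are needed, a standard-library alternative is to build a urllib.request.Request and copy the response to disk. A minimal sketch, assuming a hypothetical helper named fetch_to_file; the timeout value is likewise an illustrative choice:

```python
import shutil
import urllib.request


def fetch_to_file(url, dest_path, headers=None, timeout=30):
    """Download `url` to `dest_path`, sending optional per-request headers."""
    request = urllib.request.Request(url, headers=headers or {})
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(dest_path, "wb") as out:
            # Copy in chunks so the whole body is never held in memory.
            shutil.copyfileobj(response, out)
    return dest_path
```

For example, fetch_to_file(url, "filename.pdf", headers={"User-Agent": "my-script"}) sends a custom User-Agent for that one request only.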

jugi

This is best one, tbh. – Dhaval Savalia Dec 02 '19 at 16:39
This one is best – roktim Jul 21 '20 at 05:43
`urlretrieve` relies on global settings to determine request headers, making it unsuitable for some use cases. – Michael Crenshaw Oct 21 '20 at 15:49

score 5 · Answer 4 · answered Jun 21 '20 at 11:42

Generally, this should work in Python 3:

import urllib.request

# ...
response = urllib.request.urlopen(url)

Remember that Python 2's urllib and urllib2 were reorganized in Python 3; their functionality now lives in urllib.request and urllib.error.

If in some mysterious cases requests doesn't work (this happened to me), you can also try the third-party wget package:

import wget

wget.download(url)

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

Duck Ling · Answer 5 · 2019-04-01T18:13:50.587

2

Please note I'm a beginner. If my solution is wrong, please feel free to correct it and/or let me know. I may learn something new too.

My solution:

Change downloadPath according to where you want your file to be saved. Feel free to use an absolute path too.

Save the below as downloadFile.py.

Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension

Remember to add an extension!

Example usage: python downloadFile.py http://www.google.co.uk google.html

import os
import sys

import requests


def downloadFile(url, fileName):
    # Fetch first so a failed request does not leave an empty file behind.
    response = requests.get(url)
    response.raise_for_status()
    with open(fileName, "wb") as file:
        file.write(response.content)


# sys.path[0] is the directory containing this script.
scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, os.path.join(downloadPath, fileName))
print('file downloaded...')
print('exiting program...')
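The positional sys.argv handling above could also be expressed with argparse, which adds a usage message and validation of the argument count. This is a sketch of the argument handling only; the function name parse_args and the --dir option are assumptions for illustration, and the download itself would still be done by downloadFile.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the URL, the output file name, and an optional target directory."""
    parser = argparse.ArgumentParser(
        description="Download a file and save it to disk."
    )
    parser.add_argument("url", help="URL of the file to download")
    parser.add_argument("filename", help="name of the saved file, with extension")
    # --dir is a hypothetical option; it is not part of the script above.
    parser.add_argument("--dir", type=Path, default=Path("Downloads"),
                        help="directory to save into (default: Downloads)")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the example usage shown earlier.
args = parse_args(["http://www.google.co.uk", "google.html"])
print(args.url, args.dir / args.filename)
```

Calling parse_args() with no arguments reads sys.argv[1:], which is what a command-line script would normally do.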

edited Apr 01 '19 at 18:13

answered Mar 31 '19 at 07:52

Duck Ling

Pawel, thank you for your answer. I was a Python novice when I first posted this question. Now I know the language very well. Your use case of writing a Python script to download a file from a command line can be covered by utilities like wget or curl. Also, your function downloadFile as posted seems to call itself. Did you intend to indent the second block of code? In stackoverflow you can correct that by out-denting that. I'd also like to suggest you have a look at Python's argparse library. You can use it to make nice command line utilities. It will take care of the parameters for you. – Jim Apr 01 '19 at 16:32
I do like your use of a context manager (with open... as file:, etc) to handle the file writing. Your code is neatly written. You are on a good path to learning Python. Good luck! – Jim Apr 01 '19 at 16:38
Thanks for the reply, @Jim! I've edited the post, and indeed I did not "intend to indent" :D the main part of the program. Thanks for your advices! :) – Duck Ling Apr 01 '19 at 18:15

score -4 · Answer 6 · edited May 25 '17 at 12:30

Regarding Kevin's answer: to write into a folder named tmp under the current directory, it should be like this:

with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

He forgot the . before the path, and of course the tmp folder must already exist.
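Note that '/tmp/...' is an absolute path and './tmp/...' is relative to the current working directory. If a relative tmp folder is really what is wanted, it can be created on demand instead of by hand. A minimal sketch, assuming a hypothetical helper write_in_subdir and stand-in bytes in place of response.content:

```python
from pathlib import Path
import tempfile


def write_in_subdir(base_dir, data, name="metadata.pdf"):
    """Write `data` to <base_dir>/tmp/<name>, creating the folder if needed."""
    target_dir = Path(base_dir) / "tmp"
    # exist_ok=True makes this a no-op when the folder already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(data)
    return target


# Usage inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    saved = write_in_subdir(base, b"%PDF-1.4\n...")
    print(saved.name, saved.exists())
```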

shiva


answered Apr 01 '17 at 23:52

Nima Sajedi


1- Kevin did not come up with the idea to write in `tmp`; it was like that in the OP's question. 2- the `/tmp` directory is the temporary directory on Unix systems, located at `/tmp`, with no `.` – realUser404 Jul 26 '17 at 22:44
