After reading a lot of posts about web scraping and how to follow URL redirections with Python, I am finally compelled to ask for your help!
Here is an example of a website I am trying to scrape: http://xmaths.free.fr/1ES/cours/index.php .
My goal is to automatically download the exercises and their corrections as PDF files. Saving the exercises works, but I am stuck when trying to download the correction PDF files.
For example, to reach a correction file, the website provides this link: http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01 . Clicking it opens a page telling you that you are about to access the correction. Then, after a few seconds, the file opens automatically at the URL http://xmaths.free.fr/corrections/rMu623S1NA.pdf.
My first guess was an HTTP redirection. I checked the history attribute of the requests response (see this post), but it is empty, which means no redirection was followed.
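For context, the `history` attribute of a requests response only lists HTTP-level redirects (3xx status codes). A page that waits a few seconds and then opens another URL does so on the client side, which requests does not execute, so an empty history is expected in that case. Below is a minimal sketch of the check; `redirect_chain` and `fake_response` are hypothetical names, and the stand-in object replaces a real `requests.get(...)` result so the sketch runs offline.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for each hop that was followed, ending with the final response."""
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in for the object returned by requests.get(); it only carries the
# three attributes used above (history, status_code, url).
fake_response = SimpleNamespace(
    history=[],  # empty list: no HTTP redirect was followed
    status_code=200,
    url="http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01",
)

print(redirect_chain(fake_response))
```

With a real response, a single entry in the returned list means the server answered directly and any later navigation happens in the browser.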
Here is the code I have written to try downloading the correction files:
    from bs4 import BeautifulSoup
    import requests

    correction_urls = [
        'http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderiex01',
        'http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderiex02',
        'http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderian02',
    ]

    # Accessing each webpage stored in correction_urls list
    for i, correction_url in enumerate(correction_urls):
        r = requests.get(correction_url)
        html_doc = r.text
        # Name the parser explicitly to avoid the bs4 "no parser specified" warning
        soup = BeautifulSoup(html_doc, "html.parser")
        # Iterate over each link on the page
        for link in soup.find_all("a"):
            href = link.get("href")
            # Identify links to corrections
            if str(href)[0:12] == "corrige.php?":
                # Build the full url and access it
                correction_pdf = "http://xmaths.free.fr/1ES/cours/" + href
                r = requests.get(correction_pdf)
                # Rename and save the pdf file
                with open("math_correction{}.pdf".format(i+1), "wb") as f:
                    f.write(r.content)
This way, I never reach the final URL of the PDF; I only get the intermediate page that is displayed before the file opens.
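One possible explanation, stated here as an assumption and not verified against the site, is that the intermediate page triggers the delayed jump with a `<meta http-equiv="refresh">` tag. If so, the PDF URL can be read from that page's HTML. The sketch below uses only the standard library; `MetaRefreshFinder`, `find_refresh_target` and `sample_html` are hypothetical, and the sample markup is invented for illustration. If the page uses JavaScript instead, the URL would have to be extracted from the script text in a similar way.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=some/path.pdf"
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)", attrs.get("content") or "", re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def find_refresh_target(html, page_url):
    """Return the absolute URL named by a meta refresh tag, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return urljoin(page_url, finder.target) if finder.target else None


# Invented markup: the real page may be structured differently.
sample_html = (
    '<html><head>'
    '<meta http-equiv="refresh" content="5; url=../../corrections/rMu623S1NA.pdf">'
    '</head><body>Correction loading...</body></html>'
)
page_url = "http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01"
pdf_url = find_refresh_target(sample_html, page_url)
print(pdf_url)
```

In the script above, `r.text` of the intermediate page would take the place of `sample_html`, and the returned URL would be requested in a second call before writing `r.content` to disk.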
Thanks in advance for your help!