After reading a lot of posts about web scraping and how to follow URL redirections with Python, I am finally compelled to ask for your help!

Here is an example of a website I am trying to scrape: http://xmaths.free.fr/1ES/cours/index.php .

My objective is to automatically download the exercises and their corrections as PDF files. I have managed to save the exercises, but I am having trouble downloading the correction PDF files.

For example, to reach a correction file, the website provides this link: http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01 . Clicking it opens a page telling you that you are about to access the correction. Then, after a few seconds, the file opens automatically at the URL http://xmaths.free.fr/corrections/rMu623S1NA.pdf.

I first thought this was a redirect. I checked the history attribute of the response returned by requests (see this post), but it is empty, which means no redirect was followed.
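For illustration, such a check can be sketched as follows. It assumes a `requests.Response`-like object, whose `history` list holds the intermediate responses of any HTTP redirects and whose `url` is the final address. The helper name `redirect_chain` and the stand-in object are hypothetical and only there so the snippet runs without network access:

```python
from types import SimpleNamespace


def redirect_chain(response):
    """List the URLs visited via HTTP redirects, ending with the final URL."""
    return [hop.url for hop in response.history] + [response.url]


# Stand-in (assumption) for a response whose history is empty,
# i.e. no HTTP redirect was followed, as described above.
no_redirect = SimpleNamespace(
    url="http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01",
    history=[],
)
print(redirect_chain(no_redirect))
```

With a real request, the object returned by `requests.get(...)` can be passed in directly; a chain containing only the requested URL means no HTTP redirect took place.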

Here is the code I have written to try downloading the correction files:

from bs4 import BeautifulSoup
import requests

correction_urls = ['http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderiex01', 'http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderiex02', 'http://xmaths.free.fr/1ES/cours/indications.php?nomexo=1ESderian02']

# Accessing each webpage stored in correction_urls list
for i, correction_url in enumerate(correction_urls):
    r = requests.get(correction_url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, "html.parser")
    
    # Iterate over each link on the page
    for link in soup.find_all("a"):
        href = link.get("href")
        
        # Identify links to corrections
        if href and href.startswith("corrige.php?"):
            
            # Build the full url and access it
            correction_pdf = "http://xmaths.free.fr/1ES/cours/" + href
            r = requests.get(correction_pdf)
            
            # Rename and save the pdf file
            with open("math_correction{}.pdf".format(i+1), "wb") as f:
                f.write(r.content)

With this code, I never reach the final link to the PDF. I only download the intermediate page that is shown before the file opens.

Thanks in advance for your help!

Jérémy

1 Answer

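The page does not issue an HTTP redirect, which is why the response history is empty. Instead, it uses a <meta> refresh tag inside <head>, and you can extract the path of the PDF from it: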

<META HTTP-EQUIV="Refresh" CONTENT="1 ; url=../../corrections/rMu623S1NA.pdf">

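import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


url = 'http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# soup.meta is the first <meta> tag, assumed here to be the Refresh tag shown above.
# Its content looks like "1 ; url=../../corrections/rMu623S1NA.pdf".
pdf_path = soup.meta['content'].split(';')[-1].split('=')[-1].strip()

# Resolve the relative path against the page URL, then download the PDF.
r = requests.get(urljoin(url, pdf_path))

with open('document.pdf', 'wb') as f_out:
    f_out.write(r.content)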
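
The snippet above relies on the refresh tag being the first <meta> element and on the target containing no "=" character. If that is not guaranteed, a slightly more defensive version might look like the sketch below. It uses only the standard library so that it runs offline; the names `MetaRefreshParser` and `meta_refresh_url` are hypothetical helpers, and the sample HTML is reduced to the single tag quoted above rather than the real page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not values.
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # CONTENT looks like "1 ; url=../../corrections/file.pdf".
        _delay, _sep, rest = attributes.get("content", "").partition(";")
        key, _eq, value = rest.partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip().strip("'\"")


def meta_refresh_url(page_url, html):
    """Return the absolute URL the page refreshes to, or None if there is none."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if parser.target is None:
        return None
    return urljoin(page_url, parser.target)


# Sample input (assumption): only the tag quoted above, not the full page.
sample_html = (
    '<html><head><META HTTP-EQUIV="Refresh" '
    'CONTENT="1 ; url=../../corrections/rMu623S1NA.pdf"></head></html>'
)
page_url = "http://xmaths.free.fr/1ES/cours/corrige.php?nomexo=1ESpctgex01"
print(meta_refresh_url(page_url, sample_html))
```

In a real run you would pass `requests.get(page_url).text` as the `html` argument and then download the returned URL with a second `requests.get` call. A return value of `None` means the page had no refresh tag, so there is nothing to follow.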
Andrej Kesely