I am trying to scrape some content from pages, but BeautifulSoup gets stuck on pages where there is no HTML source code, for example this one:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = requests.get(url).content
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

This works pretty well. The problem is that if a file like .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs :(

Joy
  • What exactly do you want to scrape? There's barely any HTML on the page. – Edeki Okoh Jan 28 '20 at 16:29
  • Yes, that's the issue. I want to apply some sort of check: if the page contains HTML, perform the scraping; otherwise return and take the next URL as argument without wasting any time. – Joy Jan 28 '20 at 16:30

2 Answers


I assume from your comment that you have a list of URLs you're looking to parse. In that case, you can loop through them, and when make_soup() returns None you can jump to the next iteration with the continue keyword.

def make_soup(url):
    try:
        html = requests.get(url).content
    except requests.exceptions.RequestException:
        # Catch only request failures; a bare except would also swallow
        # KeyboardInterrupt and hide unrelated bugs.
        return None
    return BeautifulSoup(html, "lxml")

urls = [
    "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer",
]

for url in urls:
    soup = make_soup(url)
    if soup is None:
        continue

    next_link = soup.select_one("a.next")
    if next_link is not None:
        print(next_link.get('href'))

For cases where the URL takes too long to respond, you can specify a timeout on the request.
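For example, a minimal sketch of make_soup's fetch step with a timeout (the helper name fetch_html and the 10-second default are my own choices, not from the original code):

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page's raw body, or return None on any request failure.

    The timeout applies to establishing the connection and to each read,
    so a stalled server cannot hang the crawler indefinitely.
    """
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
    except requests.exceptions.RequestException:
        return None
    return resp.content
```

Any URL that times out (or fails for any other reason) then yields None, which the loop can skip with continue.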

Cohan
  • I tried that, but it's not working for me either. The problem is that make_soup never returns for media URLs, like the one I added above – Joy Jan 28 '20 at 16:57
  • *it takes a long time to return – Joy Jan 28 '20 at 16:58
Check the Content-Type returned by a HEAD request before downloading anything:

import requests

def is_downloadable(url):
    """
    Does the url point to a downloadable resource rather than an HTML page?
    """
    h = requests.head(url, allow_redirects=True)
    # Default to '' so a missing Content-Type header doesn't crash .lower()
    content_type = h.headers.get('content-type', '')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

url = "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer"

print(is_downloadable(url))
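Building on this idea, the crawler can issue the HEAD request first and only parse URLs whose Content-Type looks like HTML; a sketch under that assumption (the helper names is_html_content_type and should_scrape are mine, not from the original answer):

```python
import requests

def is_html_content_type(content_type):
    """True when a Content-Type header value indicates an HTML page."""
    return bool(content_type) and "html" in content_type.lower()

def should_scrape(url, timeout=10):
    """HEAD the url and decide whether it is worth parsing.

    HEAD fetches only the response headers, so a large .mp3 or .mp4
    is never downloaded just to find out it isn't a page.
    """
    try:
        h = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.exceptions.RequestException:
        return False
    return is_html_content_type(h.headers.get("content-type"))
```

In the crawl loop you would then call make_soup(url) only when should_scrape(url) is True, and move on to the next URL otherwise.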
Joy