I am trying to scrape some content from pages, but BeautifulSoup gets stuck on pages where there is no HTML source code, for example this one:

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = requests.get(url).content
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer"
soup = make_soup(url)

print(soup.select_one("a.next").get('href'))

This works pretty well. The problem is that if a file like .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs :(

Joy
  • What exactly do you want to scrape? There's barely any HTML on the page. – Edeki Okoh Jan 28 '20 at 16:29
  • Yes, that's the issue. I want to apply some sort of check: if the page contains HTML, perform the scraping; otherwise return and take the next URL as argument without wasting any time. – Joy Jan 28 '20 at 16:30

2 Answers


I assume from your comment that you have a list of URLs you're looking to parse. In that case, you can loop through them, and when make_soup() returns None you can jump to the next iteration with the continue keyword.

def make_soup(url):
    try:
        html = requests.get(url).content
    except requests.exceptions.RequestException:
        # Catch only request failures; a bare except would also swallow
        # KeyboardInterrupt and hide unrelated bugs.
        return None
    return BeautifulSoup(html, "lxml")

urls = [
    "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer",
]

for url in urls:
    soup = make_soup(url)
    if soup is None:
        continue

    next_link = soup.select_one("a.next")
    if next_link is not None:
        print(next_link.get('href'))

For cases where the URL takes too long to respond, you can specify a timeout on the request.
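For example, a minimal sketch of make_soup's fetch step with a timeout (the helper name fetch_html and the 10-second default are my own choices, not from the original code):

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page's raw body, or return None on any request failure.

    The timeout applies to establishing the connection and to each read,
    so a stalled server cannot hang the crawler indefinitely.
    """
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
    except requests.exceptions.RequestException:
        return None
    return resp.content
```

Any URL that times out (or fails for any other reason) then yields None, which the loop can skip with continue.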

Cohan
  • I tried that, but it's not working for me either. The problem is that make_soup never returns for media URLs, like the one I added above – Joy Jan 28 '20 at 16:57
  • *it takes a long time to return – Joy Jan 28 '20 at 16:58
Check the Content-Type returned by a HEAD request before downloading anything:

import requests

def is_downloadable(url):
    """
    Does the url point to a downloadable resource rather than an HTML page?
    """
    h = requests.head(url, allow_redirects=True)
    # Default to '' so a missing Content-Type header doesn't crash .lower()
    content_type = h.headers.get('content-type', '')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

url = "https://cdn.podigee.com/uploads/u735/1d4d4b22-528e-4447-823e-b3ca5e25bccb.mp3?v=1578558565&source=webplayer"

print(is_downloadable(url))
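Building on this idea, the crawler can issue the HEAD request first and only parse URLs whose Content-Type looks like HTML; a sketch under that assumption (the helper names is_html_content_type and should_scrape are mine, not from the original answer):

```python
import requests

def is_html_content_type(content_type):
    """True when a Content-Type header value indicates an HTML page."""
    return bool(content_type) and "html" in content_type.lower()

def should_scrape(url, timeout=10):
    """HEAD the url and decide whether it is worth parsing.

    HEAD fetches only the response headers, so a large .mp3 or .mp4
    is never downloaded just to find out it isn't a page.
    """
    try:
        h = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.exceptions.RequestException:
        return False
    return is_html_content_type(h.headers.get("content-type"))
```

In the crawl loop you would then call make_soup(url) only when should_scrape(url) is True, and move on to the next URL otherwise.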
Joy