
I've got a problem parsing a document with BS4, and I'm not sure what's happening. The response code is OK, the URL is fine, the proxies work, and proxy shuffling behaves as expected, but the soup comes back blank with any parser other than html5lib, and the soup html5lib returns stops at the <body> tag.

I'm working in Colab and have run pieces of this function successfully in another notebook: I've been able to loop through a set of search results, make soup out of the links, and grab my desired data. But my target website eventually blocks me, so I've switched to using proxies.

check(proxy) is a helper function that checks a proxy from a list before I attempt a request to my target site. The problem seems to have started when I wrapped the request in try/except. I'm speculating that it might have something to do with the try/except being inside a for loop, but I don't know.

What's confounding is that I know the site isn't blocking scrapers/robots generally, as I can use BS4 in another notebook piecemeal and get what I'm looking for.

from bs4 import BeautifulSoup as bs
from itertools import cycle
import time
from time import sleep
import requests
import random

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36', "X-Requested-With": "XMLHttpRequest"}
ips = []  # fill this with proxy addresses; next() on a cycle() of an empty list raises StopIteration
proxy_pool = cycle(ips)

def scrape_boxrec():
  search_result_pages = list(range(0, 22700, 20))
  random.shuffle(search_result_pages)
  search_results_page_attempt = []  # must be initialised before the loop, or append() raises NameError
  for i in search_result_pages:
    search_results_page_attempt.append(i)
    proxy = next(proxy_pool)
    proxies = {
        'http': proxy,
        'https': proxy
    }
    if check(proxy):
      url = 'https://boxrec.com/en/locations/people?l%5Brole%5D=proboxer&l%5Bdivision%5D=&l%5Bcountry%5D=&l%5Bregion%5D=&l%5Btown%5D=&l_go=&offset=' + str(i)
      try: 
        results_source = requests.get(url, headers=head, timeout=5, proxies=proxies)
        results_content = results_source.content
        res_soup = bs(results_content, 'html.parser')
        # WHY IS IT NOT PARSING THIS PAGE CORRECTLY!!!!
      except Exception as ex:
        print(ex)
    else:
      print("Bad Proxy. Moving On")

def check(proxy):
    check_url = 'https://httpbin.org/ip'
    check_proxies = {
        'http': proxy,
        'https': proxy
    }
    try:
        response = requests.get(check_url, proxies=check_proxies, timeout=5)
        return response.status_code == 200  # previously fell through and returned None on non-200 responses
    except requests.RequestException:  # a bare except also swallows KeyboardInterrupt and friends
        return False
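As a side note (this is just a sketch, and build_proxy_pool is a name I'm introducing for illustration), the proxy list could be filtered once up front with a checker like check() above, instead of re-checking inside the scrape loop:

```python
from itertools import cycle

def build_proxy_pool(candidates, checker):
    """Keep only the proxies that checker() approves, then cycle over them.

    `checker` is any function that returns True for a working proxy,
    e.g. the check() helper above.
    """
    good = [p for p in candidates if checker(p)]
    if not good:
        raise ValueError("no working proxies in the list")
    return cycle(good)
```

Usage would be `proxy_pool = build_proxy_pool(ips, check)`, after which `next(proxy_pool)` only ever hands back proxies that passed the check.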

1 Answer


Since nobody took a crack at it, I thought I'd come back and post the solution: the "X-Requested-With": "XMLHttpRequest" entry in my head variable was what was causing the problem. I'm still new to programming, especially to making HTTP requests, but I know it has something to do with Ajax. Anyway, when I removed that entry from the headers I passed to my request, BeautifulSoup parsed the document in full.
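For anyone hitting the same thing, here's a minimal sketch of the working request (fetch_page is a name I made up for the example; the URL and proxies are whatever you're already passing):

```python
import requests
from bs4 import BeautifulSoup as bs

# Same User-Agent as before; the only change is that the
# "X-Requested-With": "XMLHttpRequest" entry is gone.
head = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/78.0.3904.70 Safari/537.36')
}

def fetch_page(url, proxies=None):
    """Fetch url and return the parsed soup, or None if the request fails."""
    try:
        resp = requests.get(url, headers=head, timeout=5, proxies=proxies)
        resp.raise_for_status()
        return bs(resp.content, 'html.parser')
    except requests.RequestException as ex:
        print(ex)
        return None
```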

This answer as well as this one explain in a lot more detail that rejecting unexpected XMLHttpRequest headers is a common approach to preventing Cross-Site Request Forgery, which is why my request was always coming back empty.