1

I am using Python 2.7 and version 4.5.1 of Beautiful Soup.

I'm at my wits' end trying to get this very simple script to work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product page and extracting the information in

<div class="status online-availability-status">             Sold out online     </div>

This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')

soup = BeautifulSoup(page.content, 'html.parser')

avail = soup.findAll('div', {"class": "status online-availability-status"})

But then I just get an empty list for avail. Any idea why?

Any help is greatly appreciated.

PollPenn
  • Are you sure you get the page and it contains the required div? – Nurjan Dec 29 '16 at 05:27
  • Might have to do with how the page is loading: loading it manually shows a progress bar first, while the page runs a background query to check stock and then displays "sold out online". That means the content is not present when the original page is loaded. – VBB Dec 29 '16 at 05:28
  • @Nurzhan yes I'm sure. I'm looking at the page's elements now and it is there. – PollPenn Dec 29 '16 at 05:33
  • the div actually has two classes. In this case you need to pass an array to selector: `{'class': ['status', 'online-availability-status']}`, or just discard the first class – Marat Dec 29 '16 at 05:39
  • @VBB Thanks for your comment. Any suggestions on how to get around this? – PollPenn Dec 29 '16 at 05:39
  • @Marat Thank you. I tried what you suggested: `avail = soup.findAll('div', {'class': ["status", "online-availability-status"]})`. It is still giving me an empty list. – PollPenn Dec 29 '16 at 05:43
  • Checking the source of the page (which is different from the DOM shown by your browser's inspector), the `div` doesn't exist. It's loaded by something else, most likely JavaScript. You'll need to figure out what calls are made after the page loads and request that with BeautifulSoup. – dirn Dec 29 '16 at 05:44
  • Are you sure it is in the HTML and not pulled in by XHR? Can you print the page content from Python and check whether the div is still there? – Marat Dec 29 '16 at 05:44
  • Actually, page URL would be very helpful – Marat Dec 29 '16 at 05:45
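
The check suggested in the comments (is the div in the HTML that Python actually received, or only in the browser's rendered DOM?) can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not part of the original thread: `ClassFinder`, `has_div_with_classes`, and both sample strings are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any <div> carries all of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute is a space-separated list; order does not matter.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.found = True


def has_div_with_classes(html, wanted):
    """Return True if the HTML text contains a div with every class in `wanted`."""
    finder = ClassFinder(wanted)
    finder.feed(html)
    return finder.found


# Hypothetical samples: markup as served vs. markup after JavaScript has run.
served = '<div id="availability"></div><script src="stock.js"></script>'
rendered = '<div class="status online-availability-status"> Sold out online </div>'

wanted = ["status", "online-availability-status"]
print(has_div_with_classes(served, wanted))
print(has_div_with_classes(rendered, wanted))
```

In the question's script the same check could be run on `page.text`. If it reports that the div is absent, no change to the `findAll` arguments will find it, and the content has to come from somewhere else.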

3 Answers

2

As the comments above suggest, the tag you are looking for appears to be generated client-side by JavaScript: it shows up when you inspect the loaded page, but not in the page source, which is what the call to requests returns. You might try dryscrape, which loads the page in a headless WebKit browser so that the page's scripts get a chance to run (you may need to install it with pip install dryscrape).

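import dryscrape
from bs4 import BeautifulSoup

url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'

# Load the page in dryscrape's headless browser so its JavaScript can run.
session = dryscrape.Session()
session.visit(url)
response = session.body()  # HTML of the page after loading

# Name the parser explicitly, as in the question, to avoid bs4's parser warning.
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})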

This was the most popular solution in a question relating to scraping dynamically generated content:

Web-scraping JavaScript page with Python

Chris Gumb
0

Availability is loaded as JSON from a separate API call, so you don't even need to parse HTML for it:

import urllib
import simplejson

sku = 10488665  # taken from the product page URL, which ends in /10488665.aspx
# Change postalCode and locations to query the stores you care about.
# Literal percent signs in the URL are doubled (%%) so that only %s is substituted.
url = ('http://api.bestbuy.ca/availability/products?callback=apiAvailability'
       '&accept-language=en&skus=%s'
       '&accept=application%%2Fvnd.bestbuy.standardproduct.v1%%2Bjson'
       '&postalCode=M5G2C3&locations=977%%7C203%%7C931%%7C62%%7C617&maxlos=3' % sku)
response = urllib.urlopen(url)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
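
One caveat worth sketching: the URL above passes callback=apiAvailability, and an API called that way often wraps its JSON in a function call (JSONP), which a JSON parser rejects. Whether this particular API does so is an assumption, not something confirmed here. The hypothetical helper below, written for Python 3 with only the standard library, accepts either form; the sample body is made up to mirror the indexing used above and is not real API output.

```python
import json
import re


def parse_jsonp(text):
    """Return the JSON payload from either plain JSON or a callback(...) wrapper."""
    text = text.strip()
    # Match name( ... ) with an optional trailing semicolon and keep the inside.
    match = re.match(r"^[\w$.]+\s*\((.*)\)\s*;?$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)


# Hypothetical response body; the real API's fields may differ.
sample = 'apiAvailability([{"shipping": {"status": "SoldOutOnline"}}]);'
data = parse_jsonp(sample)
print(data[0]["shipping"]["status"])
```

Plain JSON passes through unchanged, so the helper is safe to use even if the response turns out not to be wrapped.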
Marat
0

If you print soup, you will probably see something like Access Denied. This is because Best Buy only serves GET requests that carry an acceptable User-Agent. Your script does not set a User-Agent header, so requests sends its default one (python-requests/<version>), and the site does not return the page.

This question explains how to generate a User-Agent: How to use Python requests to fake a browser visit a.k.a and generate User Agent?

Alternatively, you could find the User-Agent your own browser sends when you view the page: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
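
As a rough sketch of what setting the header looks like, the snippet below builds (but does not send) a request with a browser-like User-Agent using Python 3's standard library. The User-Agent string and the `build_request` helper are assumptions chosen for illustration; substitute the string your own browser sends. Whether the site accepts it is not something this example can confirm.

```python
from urllib.request import Request

# Assumed example User-Agent string; copy the one your own browser sends.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build (but do not send) a GET request carrying a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.bestbuy.ca/")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Passing the object to `urllib.request.urlopen` would send it. With the requests library used in the question, the equivalent is to pass the same dictionary as `headers=` to `requests.get`.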