
I'm trying to send an HTTP request to a website (for example, Digikey) and read back the full HTML. Specifically, I'm using this link: https://www.digikey.com/products/en?keywords=part_number with a specific part number substituted in, such as: https://www.digikey.com/products/en?keywords=511-8002-KIT. However, what I get back is not the full HTML.

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse whatever HTML the server returns
r = requests.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

Output:

<!DOCTYPE html>
<html>
 <head>
  <script>
   var i10cdone =(function(){ function pingBeacon(msg){ var i10cimg = document.createElement('script'); i10cimg.src='/i10c@p1/botox/file/nv-loaded.js?status='+window.encodeURIComponent(msg); i10cimg.onload = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; i10cimg.onerror = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; ( document.head || document.documentElement).appendChild(i10cimg) }; pingBeacon('loaded'); if(String(document.cookie).indexOf('i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo')>=0) { document.cookie = 'i10c.bdddb=;path=/';}; var error=''; function errorHandler(e) { if (e && e.error && e.error.stack ) { error=e.error.stack; } else if( e && e.message ) { error = e.message; } else { error = 'unknown';}} if(window.addEventListener) { window.addEventListener('error',errorHandler, false); } else { if ( window.attachEvent ){ window.attachEvent('onerror',errorHandler); }} return function(){ if (window.removeEventListener) {window.removeEventListener('error',errorHandler); } else { if (window.detachEvent) { window.detachEvent('onerror',errorHandler); }} if(error) { pingBeacon('error-' + String(error).substring(0,500)); document.cookie='i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo;path=/'; }}; })();
  </script>
  <script src="/i10c@p1/client/latest/auto/instart.js?i10c.nv.bucket=pci&amp;i10c.nv.host=www.digikey.com&amp;i10c.opts=botox&amp;bcb=1" type="text/javascript">
  </script>
  <script type="text/javascript">
   INSTART.Init({"apiDomain":"assets.insnw.net","correlation_id":"1553546232:4907a9bdc85fe4e8","custName":"digikey","devJsExtraFlags":"{\"disableQuerySelectorInterception\" :true,  'rumDataConfigKey':'/instartlogic/clientdatacollector/getconfig/monitorprod.json','custName':'digikey','propName':'northamerica'}","disableInjectionXhr":true,"disableInjectionXhrQueryParam":"instart_disable_injection","iframeCommunicationTimeout":3000,"nanovisorGlobalNameSpace":"I10C","partialImage":false,"propName":"northamerica","rId":"0","release":"latest","rum":false,"serveNanovisorSameDomain":true,"third_party":["IA://www.digikey.com/js/geotargeting.js"],"useIframeRpc":false,"useWrapper":false,"ver":"auto","virtualDomains":4,"virtualizeDomains":["^auth\\.digikey\\.com$","^authtest\\.digikey\\.com$","^blocked\\.digikey\\.com$","^dynatrace\\.digikey\\.com$","^search\\.digikey\\.com$","^www\\.digikey\\.ca$","^www\\.digikey\\.com$","^www\\.digikey\\.com\\.mx$"]}
);
  </script>
  <script>
   typeof i10cdone === 'function' && i10cdone();
  </script>
 </head>
 <body>
  <script>
   setTimeout(function(){document.cookie="i10c.eac23=1";window.location.reload(true);},30);
  </script>
 </body>
</html>

The reason I need the full HTML is to search it for specific keywords, for example whether the terms "Lead free" or "Through hole" appear in the results for a particular part number. I'm not only doing this for Digikey, but for other sites as well.

Any help would be appreciated!

Thanks!

EDIT:

Thank you all for your suggestions/answers. More info here for others who are interested in this: Web-scraping JavaScript page with Python

  • This is because the website is rendered with JavaScript, which means you'll need a browser to retrieve the fully rendered HTML. Check out [`selenium`](https://selenium-python.readthedocs.io/) – C.Nivs Mar 25 '19 at 20:47

2 Answers


Most likely the parts of the page you are looking for include content that is generated dynamically with JavaScript.

Open view-source:https://www.digikey.com/products/en?keywords=part_number in your browser and you will see the same HTML that requests is fetching; it's just that the JavaScript code is never executed.

If you right-click the page and choose Inspect (in Chrome), you will see the final DOM that is created after the JavaScript code has executed.

To get the rendered content, you would need to use a full web driver like Selenium that is capable of executing the JavaScript and rendering the full page.

Here is an example of how to achieve that using Selenium:

How can I parse a website using Selenium and Beautifulsoup in python?

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser, so the JavaScript on the page gets executed
driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')

# page_source is the DOM after the JavaScript has run
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('title'):
    print(tag.text)

Output:

Hacker News
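
Applied to the question's use case, the same idea could look roughly like this. This is only a sketch: it assumes Firefox with geckodriver is installed and that the keywords appear as plain text in the rendered page, and in practice you may still need to wait for the page to finish rendering:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')

# page_source now contains the DOM produced after the JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Search the rendered text for the keywords mentioned in the question
page_text = soup.get_text()
for keyword in ('Lead free', 'Through hole'):
    print(keyword, 'found:', keyword in page_text)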

The issue is that requests never executes the page's JavaScript, so the HTML elements that are populated client-side are missing from the response. One solution to this is to drive a real browser with a Selenium webdriver:

from selenium import webdriver

chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")

# page_source contains the DOM after Chrome has executed the page's JavaScript
source = chrome.page_source

Often this is a lot less efficient, since you have to wait for the page to fully load. One way around this is to look for APIs that the website provides for accessing the data you want directly; I would recommend doing some research into what those might be.

Here are some of the potential APIs you can use to get the data directly:

https://api-portal.digikey.com/product
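
As a rough sketch of what the API route might look like: the endpoint, payload, and header below are placeholders, not the real Digi-Key API; the actual endpoints, request format, and authentication (an API key obtained through the portal above) are described in their documentation:

import requests

# Hypothetical keyword-search request; the URL, payload and header are placeholders.
# Consult the Digi-Key API portal for the real endpoint and authentication flow.
API_URL = 'https://api.digikey.com/example/keyword-search'     # placeholder
headers = {'Authorization': 'Bearer YOUR_API_TOKEN'}           # placeholder

response = requests.post(API_URL, json={'Keywords': '511-8002-KIT'}, headers=headers)
response.raise_for_status()
print(response.json())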

  • Seems like the APIs have limits on how many searches you can do per day and Selenium is painfully slow for searching thousands of parts. Thanks though! – ItM Mar 26 '19 at 00:18
  • Selenium isn't necessarily the part that makes it "slow"; it's the page running the script. Selenium will take however long the page takes to render. If you need it fast, as stated above, you either need to get the data directly (i.e. from the API) or just have to wait for the page to render. – chitown88 Mar 26 '19 at 07:08
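
To expand on the waiting point: rather than reloading or sleeping for a fixed time, Selenium can block until a specific element shows up. A minimal sketch, assuming Chrome with chromedriver; the CSS selector is a placeholder and would need to match an element that only exists once the product data has rendered:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')

# Wait up to 30 seconds for an element that only appears after the
# JavaScript has rendered the product data (placeholder selector)
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.product-details'))
)

source = driver.page_source
driver.quit()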