I'm using Diffbot to scrape products. It gets things right on most sites, and when it doesn't, the custom API usually lets me tweak the rules until the output is correct. However, a few cases are baffling me.
I know Diffbot doesn't execute JavaScript in the custom API preview window, but for the product endpoint it should always execute it when a request is made to the API (e.g. from the Diffbot client in a Python shell).
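For context, this is roughly the request I mean. It is a minimal sketch, assuming the v3 Product API endpoint and its `token` and `url` query parameters as I understand them from the docs; `MY_TOKEN` is a placeholder and `build_product_request` is just a helper name I made up for illustration. It only builds the URL and does not send anything.

```python
from urllib.parse import urlencode

# Assumed v3 Product API endpoint; check it against the Diffbot docs.
PRODUCT_ENDPOINT = "https://api.diffbot.com/v3/product"


def build_product_request(token, page_url):
    """Return the GET URL for a Product API call on page_url."""
    query = urlencode({"token": token, "url": page_url})
    return PRODUCT_ENDPOINT + "?" + query


request_url = build_product_request(
    "MY_TOKEN",  # placeholder, not a real token
    "https://www.footasylum.com/hugo-boss-three-pack-tshirt-103678/",
)
print(request_url)
```

Fetching that URL with any HTTP client should return the same JSON the Diffbot client gives me in the shell.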
Foot asylum
For products on this website, e.g. https://www.footasylum.com/hugo-boss-three-pack-tshirt-103678/, the offerPrice field is empty. I can see the price is in div#priceFrm, so I edited that field and added a custom selector to this effect. However, even when making a new API call from the Python shell, the response is 'offerPrice': ''.
This price is obviously being added by JavaScript, but why can't Diffbot deal with that? What can I do about it?
I can also see that the price I want is in some JSON data inside a <script>. Normally I could just scrape it from there with //script[contains(text(), "dataLayer")]/text() followed by a regex. However, in another Diffbot custom field I defined the selector script:contains(dataLayer), and even this comes back blank.
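To show what I mean by scraping it myself, here is a rough sketch of the regex step outside Diffbot. The HTML snippet, the dataLayer structure and the price value are made-up stand-ins, not the real page source, and `price_from_datalayer` is a hypothetical helper name.

```python
import json
import re

# Made-up stand-in for the page source; the real dataLayer will differ.
SAMPLE_HTML = """
<html><body>
<script>var dataLayer = [{"product": {"name": "Three Pack T-Shirt", "price": "36.00"}}];</script>
</body></html>
"""


def price_from_datalayer(html):
    """Return the price string from the first dataLayer array, or None."""
    # Capture the JSON array assigned to dataLayer, up to the first "];".
    match = re.search(r"dataLayer\s*=\s*(\[.*?\]);", html, re.DOTALL)
    if match is None:
        return None
    data = json.loads(match.group(1))
    # Assumed key layout; adjust to whatever the real JSON contains.
    return data[0]["product"]["price"]


print(price_from_datalayer(SAMPLE_HTML))
```

This only works when the assigned value is valid JSON; a JavaScript object literal with unquoted keys would need different handling.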
Any ideas on getting the price from this product with Diffbot?
Nike
I'm also trying to get the price from https://www.nike.com/gb/t/flyknit-trainer-shoe-GBXjsV/AH8396-600
The first problem is that the custom API preview window just gives a 500 error.
Next, I edited the offerPrice field with a custom selector of div[data-test=product-price], but this selector doesn't match anything, even when called from the client in a Python shell.
Footlocker
Finally, on this site, https://www.footlocker.co.uk/en/p/jordan-1-flight-2-men-shoes-6671?v=314100340604#!searchCategory=all, Diffbot cannot seem to get the product image.
The images are loaded by "scene7", and with XPath they can be found with //div[@class="s7thumb"][@data-namespace="s7classic"]/@style and then parsing out the "background-url". I tried to at least get the style attribute with Diffbot using the selector div.s7thumb div[data-namespace=s7classic] and then adding the Attribute filter "style", but again nothing at all is returned.
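For reference, this is roughly how I would parse the image URL out of the style attribute once I have it. It is only a sketch: the style value below is invented (including the example.com URL and the background-image property name), and `url_from_style` is a hypothetical helper.

```python
import re

# Made-up style value; the real scene7 URL and CSS property may differ.
SAMPLE_STYLE = (
    'width: 60px; '
    'background-image: url("https://example.com/is/image/shoe_01?wid=60");'
)


def url_from_style(style):
    """Return the first url(...) target in a CSS style string, or None."""
    # Quotes around the URL are optional in CSS, so accept both forms.
    match = re.search(r"url\(\s*['\"]?(.*?)['\"]?\s*\)", style)
    return match.group(1) if match else None


print(url_from_style(SAMPLE_STYLE))
```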