I'm using diffbot to scrape products. It gets things right on most sites, and when it doesn't, the custom API usually lets me tweak the rules until the result is correct. However, there are a few cases that are baffling me.

I know diffbot doesn't execute JavaScript in the custom API preview window, but for the product endpoint it should always execute it when a request is made to the API (e.g. from the diffbot client in a Python shell).

Footasylum

For products on this website, e.g. https://www.footasylum.com/hugo-boss-three-pack-tshirt-103678/, the offerPrice field is empty. I can see the price is in a div#priceFrm, so I edited that field and added a custom selector for it. However, even when making a new API call from the Python shell, the response is 'offerPrice': ''.

This price is obviously being added by JavaScript, but why can't diffbot deal with that? What can I do about it?

I can also see that the price I want is in some JSON data inside a <script>. Normally I could just scrape it from there, with //script[contains(text(), "dataLayer")]/text() followed by a regex. However, in another diffbot custom field I defined the selector script:contains(dataLayer), and even this comes back blank.
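For reference, the "script plus regex" fallback mentioned above can be sketched outside diffbot with only the Python standard library. This is a minimal sketch, not the site's actual markup: the sample HTML and the "price" key are assumptions for illustration, and the real dataLayer JSON may be shaped differently.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def price_from_datalayer(html):
    """Return the first price found in a script mentioning dataLayer, else None."""
    parser = ScriptCollector()
    parser.feed(html)
    for script in parser.scripts:
        if "dataLayer" not in script:
            continue
        # Hypothetical key name; adjust the pattern to the page's actual JSON.
        match = re.search(r'"price"\s*:\s*"?(\d+(?:\.\d+)?)"?', script)
        if match:
            return match.group(1)
    return None


# Made-up sample page, only to show the call.
sample = '<html><script>dataLayer = [{"product": {"price": "32.00"}}];</script></html>'
print(price_from_datalayer(sample))
```

This only works on HTML that already contains the script, so it applies to the raw page source rather than to anything diffbot returns.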

Any ideas on getting the price from this product with diffbot?

Nike

I'm also trying to get the price from https://www.nike.com/gb/t/flyknit-trainer-shoe-GBXjsV/AH8396-600

The first problem is that the custom API preview window just returns a 500 error.

Next, I edited the offerPrice field with a custom selector of div[data-test=product-price]. However, this selector doesn't match anything, even when called from the client in a Python shell.

Footlocker

Finally, on this site, https://www.footlocker.co.uk/en/p/jordan-1-flight-2-men-shoes-6671?v=314100340604#!searchCategory=all, diffbot cannot seem to get the product image.

The images are loaded by "scene7", and with XPath they can be found with //div[@class="s7thumb"][@data-namespace="s7classic"]/@style and then parsing out the "background-url".
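The parsing step could look roughly like the sketch below. It assumes the style attribute holds a CSS background declaration of the form url(...), which is what "background-url" is taken to mean here; the sample style string and its URL are made up.

```python
import re


def background_url(style):
    """Return the URL inside the first CSS url(...) in a style string, else None."""
    # Accepts url(x), url("x") and url('x'), with optional inner whitespace.
    match = re.search(r'url\(\s*["\']?(.*?)["\']?\s*\)', style)
    return match.group(1) if match else None


# Made-up style value, only to show the call.
style = 'background-image: url("https://example.com/is/image/shoe_01?wid=70"); width: 70px;'
print(background_url(style))
```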

I tried to at least get the style attribute with diffbot using the selector div.s7thumb div[data-namespace=s7classic] and then adding the Attribute filter "style", but again nothing at all is returned.

fpghost

1 Answer

In some cases, specific rendering of certain elements will be blocked either by Diffbot's renderer or by a target site's anti-block measures. That's why Diffbot has X-eval functionality which lets you add custom JavaScript into calls which will get executed on a target site, as if running from the console. In this case, something like the following helps:

function() {
    start();
    setTimeout(function() {
        // Look up the price and currency inside the Offers microdata.
        var price = document.querySelector('[itemprop="Offers"] [itemprop="price"]');
        var currency = document.querySelector('[itemprop="Offers"] [itemprop="priceCurrency"]').getAttribute("content");
        // Clear the inline style on the price's parent, then append the price as plain markup.
        price.parentElement.setAttribute("style", "");
        price.parentElement.innerHTML += '<h1 class="thePrice">' + price.innerText + " " + currency + '</h1>';
        setTimeout(function() {
            end();
        }, 500);
    }, 500);
}

This has been applied as a fix and the price returns now.
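For anyone wanting to send custom JavaScript like this from Python, a rough sketch of building the request follows. Several things here are assumptions rather than something stated in this answer: the header name X-Evaluate, the v3 product endpoint URL, and the token placeholder. Check them against Diffbot's documentation before relying on this. The request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the Diffbot documentation.
DIFFBOT_PRODUCT_ENDPOINT = "https://api.diffbot.com/v3/product"


def build_product_request(token, page_url, custom_js):
    """Build (but do not send) a product API request carrying custom JavaScript."""
    query = urllib.parse.urlencode({"token": token, "url": page_url})
    request = urllib.request.Request(DIFFBOT_PRODUCT_ENDPOINT + "?" + query)
    # Header values must be a single line, so collapse all whitespace runs.
    # Caveat: this also alters whitespace inside JS string literals and
    # would break code that uses // line comments.
    one_line_js = " ".join(custom_js.split())
    # Assumed header name for the custom JavaScript.
    request.add_header("X-Evaluate", one_line_js)
    return request


js = """
function() {
    start();
    end();
}
"""
req = build_product_request("YOUR_TOKEN", "https://www.example.com/product/1", js)
print(req.full_url)
```

Sending it would be a matter of passing the request to urllib.request.urlopen and decoding the JSON response.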

Swader
  • The price now works out-of-the-box for the Footasylum target, but it seems the price is still missing for Nike and images are still missing for Footlocker. – fpghost Apr 19 '18 at 11:26
  • For the `X-Eval` example, which of the 3 targets was this for? Or was it just an illustrative example? – fpghost Apr 19 '18 at 11:27
  • Note that the web preview for Nike still 500s, so it's difficult to edit anything. Think that needs a fix. – fpghost Apr 19 '18 at 18:23
  • Nike has iframe protection so won't be renderable in preview probably ever, unfortunately. If a site wants to stop scrapers, they can. There's simply not much that can be done about that. But diffbot's new renderers should improve things dramatically in the coming months, stay tuned. – Swader Apr 20 '18 at 08:26