
I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript/Backbone.js to dynamically load most of its content after rendering the initial response.

So for example, when I route to the following address:

https://connect.garmin.com/modern/activity/1915361012

And then enter this into the console (after the page has loaded):

var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());

Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:

Working Screenshot


However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only receive the initial response (the same content you'd see via view-source):

view-source:https://connect.garmin.com/modern/activity/1915361012

So if I use either of the following AJAX calls:

// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim()    );
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

I'll still get the statically loaded footer, but none of the dynamically loaded page content:

Broken Screenshot


I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:

jQuery.get(url,function(data) {
    var $page = $("<div>").html(data)
    $page.find("script").each(function() {
        var scriptContent = $(this).html(); //Grab the content of this tag
        eval(scriptContent); //Execute the content
    });
    console.log("%c✖: ", "color:red;",   $page.find(".page-title").text().trim());
    console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});

Q: Are there any options to fully load a webpage so that it's scrapable via JavaScript?

KyleMit
  • What's the end goal? If you want to obtain the data, perhaps direct access to original data is easier (depending on your source, apparently if you know your ID you can get the data [like that](https://connect.garmin.com/modern/proxy/activity-service/activity/1915361012/splits?_=1504076007555) without cookies or anything). If you really want to load the full page and then mine data from the DOM, the only general solution is using an "instrumentable" headless browser such as [PhantomJS](http://phantomjs.org/) or [Headless Chrome](https://developers.google.com/web/updates/2017/04/headless-chrome) – Hugues M. Aug 30 '17 at 07:02
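A minimal sketch of the direct-data route suggested in that comment, using the endpoint it links to (the response shape is an assumption, not documented):

var dataUrl = "https://connect.garmin.com/modern/proxy/activity-service/activity/1915361012/splits";
// Request the activity's splits directly as JSON - no DOM scraping required
jQuery.getJSON(dataUrl, function (data) {
    console.log(data);
});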

3 Answers


You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.

The only way I see is using a headless browser such as PhantomJS, Headless Chrome, or Headless Firefox.

I wanted to try Headless Chrome so let's see what it can do with your page:

Quick check using internal REPL

Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find the page title with JavaScript from the REPL:

% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim() 
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}

NB: to get chrome command line working on a Mac I did this beforehand:

alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"

Using it programmatically with Node & Puppeteer

Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

(Step 0: install Node & Yarn if you don't have them)

In a new directory:

yarn init
yarn add puppeteer

Create index.js with this:

const puppeteer = require('puppeteer');
(async() => {
    const url = 'https://connect.garmin.com/modern/activity/1915361012';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Go to URL and wait for page to load
    await page.goto(url, {waitUntil: 'networkidle2'});
    // Wait for the results to show up
    await page.waitForSelector('.page-title');
    // Extract the results from the page
    const text = await page.evaluate(() => {
        const title = document.querySelector('.page-title');
        return title.innerText.trim();
    });
    console.log(`Found: ${text}`);
    await browser.close();
})();

Result:

$ node index.js 
Found: Daily Mile - Round 2 - Day 27
Hugues M.

First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.

The problem you're describing is common to all SPAs - when a person visits, they get your app shell script, which then loads in the rest of the content - all good. When a bot visits, it ignores the scripts and sees only the empty shell.

The solution is server-side rendering. If you're using a JS renderer (say React) and Node.js on the server, you can fairly easily render the same components on the server and serve the resulting HTML statically.
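For example, a minimal sketch of that setup with Express and React's renderToString (the App component and port are placeholders, not anything from the question's site):

const express = require('express');
const React = require('react');
const { renderToString } = require('react-dom/server');

// Placeholder component standing in for the real app shell
const App = () => React.createElement('h1', null, 'Hello from the server');

const app = express();

app.get('*', (req, res) => {
    // Render the same component tree the client would, but on the server
    const html = renderToString(React.createElement(App));
    res.send('<!DOCTYPE html><html><body><div id="root">' + html + '</div></body></html>');
});

app.listen(3000);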

However, if you aren't, you'll need to run a headless browser on your server that executes all the JS a user's browser would, and then serve the result to the bot.
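A rough sketch of that approach with Express and Puppeteer (the /render route and port are hypothetical):

const express = require('express');
const puppeteer = require('puppeteer');

const app = express();

// Hypothetical prerender endpoint: load the requested URL in headless
// Chrome, let its scripts run, then return the serialized DOM
app.get('/render', async (req, res) => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(req.query.url, { waitUntil: 'networkidle2' });
    const html = await page.content();
    await browser.close();
    res.send(html);
});

app.listen(3000);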

Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:

Rendertron preview

Keith
  • I agree on avoiding eval, but I eventually need a way to fire the scripts that load page content. Garmin.com isn't **my** site, so I can't enable any server-side rendering solutions. Rendertron doesn't seem to actually load the subsequent content the same way simply navigating to the page does, at least not after a couple of tries (even your included screenshot is blank). But a headless browser might be something to explore. – KyleMit Aug 30 '17 at 12:16
  • @KyleMit you don't need `eval` for that. Instead of AJAX, add a `<script>` element to the page instead. – Keith Aug 30 '17 at 12:25
  • @KyleMit you can use a headless browser to render someone else's site, but you may need additional steps. Rendertron can handle Shadow DOM, but that Garmin site is also lazy loading in D3 and Map API JS libraries that are fairly large, so you'll need to wait for those to finish before you create the static copy (Rendertron doesn't wait by default) – Keith Aug 30 '17 at 12:29

It helps to understand the concept of an SPA. A Single Page Application is served as a single static HTML file; when the route changes, the page creates or modifies DOM nodes dynamically with JavaScript to achieve the effect of switching pages.

Therefore, if you use $.get(), the server responds with that same static HTML file, so you won't get the content you want.

If you want to use $.get(), there are two ways. The first is using a headless browser, for example Headless Chrome, PhantomJS, etc.; it will load the page for you, and you can then read the DOM nodes of the fully rendered page. The second is SSR (Server Side Rendering): with SSR, $.get() returns the rendered HTML of the page directly, because the server responds with the HTML for the corresponding route.
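As a quick illustration of the first way, headless Chrome can print the rendered DOM straight from the command line (same Chrome 59+ requirement as in the first answer; lazy-loaded content may still need extra wait time):

% chrome --headless --disable-gpu --dump-dom https://connect.garmin.com/modern/activity/1915361012 > rendered.html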

Reference:

SSR

the SSR framework for Vue: Nuxt.js

PhantomJS

Node API of Headless Chrome

Kermit