
I've seen some answers to this that refer the asker to other libraries (like PhantomJS), but I'm here wondering if it is at all possible to do this in just Node.js?

Consider my code below. It requests a web page using request, then explores the DOM with cheerio to scrape the page for data. It runs without errors, and if everything had gone as planned, I believe it would have output the file I imagined in my head.

The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am, trying to find a way to "wait" for this data to load before I scrape it for my new file.

var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
    method: 'GET',
    url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
    if (err) return console.error(err);
    // Tell Cheerio to load the HTML
    var $ = cheerio.load(body);

    // Create an empty object to write to the file later
    var toSort = {};

    // Iterate over the DOM and fill the toSort object
    $('#emb table td.list_right').each(function() {
        var row = $(this).parent();
        toSort[$(this).text()] = {
            [$("#lastdate").text()]: $(row).find(".idx1").html(),
            [$("#currdate").text()]: $(row).find(".idx2").html()
        };
    });

    // Write/overwrite a new file
    var stream = fs.createWriteStream("/tmp/shipping.txt");
    var toWrite = "";

    stream.once('open', function(fd) {
        toWrite += "{\r\n";
        for (var i in toSort) {
            toWrite += "\t" + i + ": { \r\n";
            for (var j in toSort[i]) {
                toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
            }
            toWrite += "\t" + "}, \r\n";
        }
        toWrite += "}";

        stream.write(toWrite);
        stream.end();
    });
});

The expected result is a text file with information formatted like a JSON object.

It should contain several entries like this:

"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": { 
     "2016-09-29": 26.7,
     "2016-09-30": 26.8,
}, 
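As an aside on the output format: if strict JSON is acceptable, the manual string-building loop in the write callback could be replaced with JSON.stringify, which produces the same shape with quoted keys and no trailing commas. A minimal sketch, using hypothetical data in the same shape as the scraped toSort object:

```javascript
// Hypothetical data in the same shape the scraper builds up.
var toSort = {
    'QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)': {
        '2016-09-29': '26.7',
        '2016-09-30': '26.8'
    }
};

// JSON.stringify with an indent argument replaces the manual loop;
// the result is valid JSON that JSON.parse can read back.
var toWrite = JSON.stringify(toSort, null, '\t');
```

The string can then be passed to stream.write exactly as in the code above.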

But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.

I tried just adding a setTimeout in various places in the code. The script will only be touched by developers who can afford to rerun it if it fails a few times, so while not ideal, even a setTimeout (of up to maybe 5 seconds) would be good enough.

It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page as it was when I received it; I'm not in fact looking at a live thing whose dynamic content I can wait for.

I've considered investigating how to intercept the requests as they come in, but I don't understand HTTP well enough to know where to start.

NachoDawg

2 Answers


The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this URL: http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp

and their server returns the HTML, and in this HTML there are the JS and CSS imports. That's as far as your program gets: you just have the HTML, and that's it. A browser, by contrast, knows how to parse the HTML document, so it understands the JavaScript it references and executes/runs it, and this is exactly your problem: your program does not execute the JavaScript that comes with the HTML. You need to find or write a scraper that is able to run JavaScript. I just found this similar question on Stack Overflow: Web-scraping JavaScript page with Python

The answer there suggests https://github.com/niklasb/dryscrape, and that tool is able to run JavaScript. It is written in Python, though.
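The same idea is available in Node itself: a headless browser such as Puppeteer can load the page, run its JavaScript, and hand the fully rendered HTML to Cheerio. A sketch only, assuming npm install puppeteer cheerio has been run (the function is defined but not invoked here, since it needs network access):

```javascript
// Sketch: drive headless Chrome so the page's own JS runs, then scrape
// the rendered DOM. Requires are done lazily inside the function so the
// file loads even where puppeteer is not installed.
async function scrapeRendered(url) {
    var puppeteer = require('puppeteer');
    var cheerio = require('cheerio');
    var browser = await puppeteer.launch();
    var page = await browser.newPage();
    // 'networkidle0' waits until no network connections remain,
    // i.e. the AJAX/JSONP calls that fill the table have finished.
    await page.goto(url, { waitUntil: 'networkidle0' });
    var html = await page.content(); // rendered DOM, not the raw source
    await browser.close();
    return cheerio.load(html);
}

// Usage:
// scrapeRendered('http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp')
//     .then(function($) { console.log($('#emb table td.list_right').length); });
```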

Stavros Zavrakas
  • Your explanation solidified my suspicion, thank you. Though I can't use Python, I will investigate how I can parse the page. Maybe I could look into hosting the page and streaming the results or something – NachoDawg Oct 04 '16 at 10:48
  • If you search on Google for something like "node.js scraper framework" you'll get a few results. This article may be helpful: http://blog.webkid.io/nodejs-scraping-libraries/ If you want, you can accept the answer :) – Stavros Zavrakas Oct 04 '16 at 10:51
  • 1
    With the link in your last comment included, accepting the answer seems appropriate. consider editing it in to the answer :) – NachoDawg Oct 04 '16 at 10:56

You are trying to scrape the original page, which doesn't include the data you need. When the page is loaded, the browser evaluates the JS code it includes, and this code knows where and how to get the data.

The first option is to evaluate the same code, as PhantomJS does.

The other (and you seem to be interested in this one) is to investigate the page's network activity and work out which additional requests you should perform to get the data you need. In your case, these are:

http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626

and

http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325

In both requests, the jc parameter is the JSONP callback name (the server wraps the data in a call to it), and _ is a cache-busting timestamp; both can be replaced with values of your own.

So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine them into the same data structure you see in the browser.

Sergey Lapin