
Complete Node.js noob, so don't judge me...

I have a simple requirement: crawl a website, find all the product pages, and save some data from each product page.

Easier said than done.

Looking at Node.js samples, I can't find anything similar.

There's a request-based scraper:

var request = require('request');
var jsdom = require('jsdom');

request({uri:'http://www.google.com'}, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    // Build a DOM window from the response body
    var window = jsdom.jsdom(body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
      // jQuery is now loaded on the jsdom window created from 'body'
      jquery('.someClass').each(function () { /* Your custom logic */ });
    });
  }
});

But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs for it to work through.
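The closest I can get is wrapping that request call in a function that feeds a queue of URLs and calls itself until the queue is empty — something like this untested sketch (the /product/ filter and .someClass selector are just placeholders):

var request = require('request');
var jsdom = require('jsdom');
var url = require('url');

var queue = ['http://www.example.com/'];  // start at the root page
var seen = {};                            // URLs already crawled

function crawlNext() {
    var current = queue.shift();
    if (!current) return;                   // queue drained: done
    if (seen[current]) return crawlNext();  // skip duplicates
    seen[current] = true;

    request({uri: current}, function (error, response, body) {
        if (error || response.statusCode != 200) return crawlNext();
        var window = jsdom.jsdom(body).createWindow();
        jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
            // feed every link on this page back into the queue
            // (a real crawler would also restrict this to one domain)
            jquery('a').each(function () {
                var href = jquery(this).attr('href');
                if (href) queue.push(url.resolve(current, href));
            });
            // if this looks like a product page, grab the data
            if (/\/product\//.test(current)) {
                jquery('.someClass').each(function () { /* save to db here */ });
            }
            crawlNext();  // recurse until the queue is empty
        });
    });
}

crawlNext();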

Then there's the http-agent way:

var httpAgent = require('http-agent');
var jsdom = require('jsdom');
var sys = require('sys');  // 'util' in newer Node versions

// Visit www.google.com/finance, /news and /images in sequence
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
  var window = jsdom.jsdom(agent.body).createWindow();
  jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
    // jQuery is now loaded on the jsdom window created from 'agent.body'
    jquery('.someClass').each(function () { /* Your custom logic */ });

    agent.next();  // move on to the next location in the list
  });
});

agent.addListener('stop', function (agent) {
  sys.puts('the agent has stopped');
});

agent.start();

It takes an array of locations, but once you've started it with that array, I can't see how to add more locations to it so it can work through all the product pages.

And I can't even get Apricot working; for some reason I'm getting an error.

So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jquery('.someClass') example should do the trick) and save that to a DB?

Thanks!

R0b0tn1k
  • Possible duplicate of [Scrape web pages in real time with Node.js](http://stackoverflow.com/questions/5211486/scrape-web-pages-in-real-time-with-node-js) – user157251 May 31 '16 at 02:35

2 Answers


Personally, I use node.io to scrape some websites: https://github.com/chriso/node.io

More details about scraping can be found in the wiki!
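For reference, a node.io scraping job looks roughly like this. This is a minimal, untested sketch adapted from the project's README (the URL and .someClass selector are placeholders):

var nodeio = require('node.io');

// Run with: node.io scrape.js
exports.job = new nodeio.Job({timeout: 10}, {
    input: ['http://www.yoursitehere.com/product/1'],
    run: function (url) {
        this.getHtml(url, function (err, $) {
            // $ is node.io's selector helper; matches expose a .text property
            this.emit($('.someClass').text);
        });
    }
});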


Sandro Munda
  • The best answer for me, simple and fast – Roger Garzon Nieto Aug 16 '13 at 03:36
  • Thanks. I was searching for `node crawl addon` on Google and found this answer by clicking this question. Thanks for sharing. This should be the accepted answer. In the past I did it similarly to how the author did, but this is amazing. – GottZ Nov 09 '13 at 19:33

I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct: callbacks can be chained as foo.then(), which is super simple to understand, and since PhantomJS is a headless WebKit browser I can even use jQuery. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.

var casper = require("casper").create();

var numberOfLinks = 0;
var currentLink = 0;
var links = [];
// capture is defined below; the other steps live in the full gist linked at the end
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");

    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});


// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    // selectLink (from the full gist) decides which link to visit next
    this.then(selectLink);
};

// In the full example, casper.then(capture) queues this step and
// casper.run() actually starts the crawl.

You can then use the fs module (or whatever else you want, really) to push your data into XML, CSV, or whatever format you like; a sketch follows below. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
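Note that CasperJS scripts run under PhantomJS, so file writing goes through PhantomJS's own fs module rather than Node's. A minimal sketch, assuming links has been populated by capture() above:

var fs = require('fs');  // PhantomJS's fs module, not Node's

casper.then(function() {
    // fs.write takes (path, content, mode); 'w' overwrites the file
    fs.write('links.csv', links.join('\n'), 'w');
});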

This is a 10,000-foot view of what Casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).

My full scraping example is here: https://gist.github.com/imjared/5201405.

imjared
  • +1 for CasperJS. Your answer led me to try it out, and within 3 hours I got a lot done - it's pretty easy to get into. – Izhaki Apr 30 '14 at 00:12