
I'm trying to scrape Instagram (built with React) with Node.js / Cheerio. Debugging the document shows an object returned, but it doesn't look like the typical response.

I'm guessing this has to do with React. Is there a way to get around this, and pull the rendered DOM to parse with Cheerio? Or am I missing something entirely?

BenMorel
Kyle Chadha
  • No code, no exact error message, no exact steps to reproduce. I'm guessing you're missing [how-to-ask](http://stackoverflow.com/help/how-to-ask). Sure, with your keyboard and your debugger you can scrape even a website built with `React` using `Cheerio`. But you may need a stronger tool like `PhantomJS` or `SeleniumHQ` that can run scripts, wait for their execution, etc. – xmojmr Apr 18 '15 at 12:32
  • This is a conceptual question with a binary answer -- thanks for being unhelpful. – Kyle Chadha Apr 18 '15 at 12:55
  • Dear @Kyle, helpfulness is opinion-based. I do believe that your question is not good enough and that you can improve it. I don't see a "concept" in your question. The binary answer is yes, it is possible. But what exactly do you mean by debugging the document? What document? What's the error message? A jsFiddle to reproduce? – xmojmr Apr 18 '15 at 13:01
  • Fair enough. I've posted the answer below. The code is what is below, minus the User-Agent. Unfortunately no jsFiddle since this is server side code, and no error message since there was a response returned, just not one that was parseable by Cheerio (React creates a virtual DOM). – Kyle Chadha Apr 18 '15 at 13:07

1 Answer


In the general case -- if the website is SEO friendly (that is, it serves pre-rendered HTML to search engine crawlers), you can spoof the user agent string of a web crawler. The server then returns rendered HTML that Cheerio can parse.
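As a minimal sketch of the crawler case: the helper below is hypothetical (the name `crawlerOptions` is not from my scraper) and only builds the options object for the `request` package. It assumes the target site treats Googlebot's published user agent string as a crawler; whether a given site actually serves rendered HTML to it is something you have to check per site.

```javascript
// Hypothetical helper: builds `request` options that identify the client
// as a web crawler. The string below is Googlebot's published desktop
// crawler user agent; swap in another crawler's string if needed.
var CRAWLER_USER_AGENT =
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

function crawlerOptions(url) {
  if (typeof url !== 'string' || url.length === 0) {
    throw new Error('crawlerOptions expects a non-empty URL string');
  }
  return {
    url: url,
    headers: {
      'User-Agent': CRAWLER_USER_AGENT
    }
  };
}
```

The returned object can be passed to `request(options, callback)` in the same way as the mobile options in the Instagram snippet.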

In the specific case -- Instagram returns a rendered DOM on its mobile website. Spoof the user agent string of a mobile phone and you can parse the data that is returned.

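      var request = require('request');
      var cheerio = require('cheerio');

      // `user` comes from the surrounding application code.
      // The mobile User-Agent makes Instagram return rendered HTML.
      var options = {
        url: user.instagram_url,
        headers: {
          'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'
        }
      };

      request(options, function(error, response, html) {
        if (error) {
          console.error('Request failed:', error);
          return;
        }

        console.log('Scraper running on Instagram user page.');

        // Use Cheerio to load the page.
        var $ = cheerio.load(html);

        // Code to parse the DOM here
      });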
Kyle Chadha
  • Can you explain "_virtual DOM rendered on a mobile web site not parseable by Cheerio_"? Some "see also" hyperlink or some `html` snippet sample returned from the _unspoofed_ query? Something so that someone else can comprehend what kind of problem you've found and solved? I know what Instagram, node.js, cheerio, html, css, javascript, document object model, search engine optimization and other stuff are, but I still find it hard to imagine what you see when looking at your computer screen... – xmojmr Apr 18 '15 at 13:29
  • @Kyle: I am not able to find a mobile website for Instagram that can be opened on my desktop. Please give a link if you have one. Thank you. – huzefa biyawarwala Jan 08 '16 at 10:03
  • You have to change your user agent string. You can do so using Chrome browser emulation or in the `request` options as I have done above. – Kyle Chadha Jan 08 '16 at 17:53
  • @KyleChadha Thanks for posting this. Did you ever manage to take this concept further for cases when the site returns the same React string whether or not you've used a search engine/mobile UA? – Tim Malone Feb 01 '17 at 07:22
  • @KyleChadha Actually, just found this: http://stackoverflow.com/questions/29972996/how-to-parse-dom-react – Tim Malone Feb 01 '17 at 07:26
  • @TimMalone Hi Tim, didn't have to in my case. This is about 2 years old though, so things may have changed... here's my code in the off chance it's helpful: https://github.com/kylechadha/lookbook-scraper/blob/master/app/services/scraper.js#L106 – Kyle Chadha Feb 01 '17 at 16:49