72

What are the best options for web scraping a page that is not in a currently open tab from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.

The important thing is to mask the scraping so that it behaves like a normal web request: no indications of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing stopping a solution, then disregard the bonus points.

Seb Nilsson
  • @buffer Thanks! I think so too, even though 3 people have voted to close it (??!!). If the answer is "not possible", then that is correct and will be accepted, if nothing else comes along in a while. – Seb Nilsson Jun 30 '11 at 21:16
  • iMacros is doing something similar, although I'm not sure how much help it offers. https://chrome.google.com/webstore/detail/cplklnmnlbnpmjogncfgfijoopmnlemp – user Jul 01 '11 at 04:01
  • @buffer It seems to only open tabs and listen to already open tabs, not really do requests in the code. At least from what I could find. Thanks for the attempt! :D – Seb Nilsson Jul 01 '11 at 11:23
  • @SebNilsson did you finally find a way? – Christophe Apr 06 '13 at 15:17
  • @Christophe Nopes, no good answer that fulfills the given criteria. – Seb Nilsson Apr 08 '13 at 06:10
  • @SebNilsson are the question's requirements out of curiosity or necessity? – xst May 29 '14 at 12:43
  • @xst Curiosity. But it could open up some potentially interesting personal projects. – Seb Nilsson May 29 '14 at 16:31
  • @Seb Nilsson Wait, you mean web scraping a page that's **not a currently open tab in chrome?** Can you clarify that in the question please? – MGOwen Jul 24 '15 at 12:47

7 Answers

12

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (by synchronously checking response === null on an object URL created from a text/html blob).
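
A minimal sketch of the approach described above, assuming it runs in an extension page that has host permissions for the target URL. The name fetchDocument and its last two parameters are hypothetical (they exist only so the function can be exercised outside a browser). The fallback passes a fixed "text/html" type rather than the raw Content-Type header, on the assumption that the DOMParser in use accepts text/html, which is what the patch mentioned above provides in older engines.

```javascript
// Fetch `url` and hand a parsed Document to `callback(err, doc)`.
// `XhrCtor` and `ParserCtor` are hypothetical injection points; when omitted,
// the browser's own XMLHttpRequest and DOMParser are used.
function fetchDocument(url, callback, XhrCtor, ParserCtor) {
  var xhr = new (XhrCtor || XMLHttpRequest)();
  xhr.open('GET', url, true);
  try {
    // XHR2: ask the browser to parse the response as a document.
    // Engines without support ignore this and leave responseType as ''.
    xhr.responseType = 'document';
  } catch (e) {
    // Some older engines throw instead; the fallback below handles them.
  }
  xhr.onload = function () {
    if (xhr.responseType === 'document') {
      if (xhr.response) {
        callback(null, xhr.response);
      } else {
        callback(new Error('Response could not be parsed as a document'));
      }
      return;
    }
    // Fallback: parse the raw text ourselves.
    var Parser = ParserCtor || DOMParser;
    callback(null, new Parser().parseFromString(xhr.responseText, 'text/html'));
  };
  xhr.onerror = function () {
    callback(new Error('Request failed: ' + url));
  };
  xhr.send();
}
```

With this in place, the getPageContent placeholder from the question could become a call such as fetchDocument(url, function (err, doc) { ... }), with the .item selection done on doc inside the callback.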

Use the Chrome webRequest API to hide headers such as X-Requested-With.
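
A rough sketch of what the header stripping could look like. It assumes a Manifest V2 extension with the "webRequest" and "webRequestBlocking" permissions plus host permissions for the target sites; stripAjaxHeaders and AJAX_HEADERS are hypothetical names, and the list of headers is taken from the question. Newer Chrome versions may additionally require "extraHeaders" in the last argument before they let an extension see or change Origin, so treat this as a starting point rather than a guaranteed recipe.

```javascript
// Header names to remove, compared case-insensitively.
var AJAX_HEADERS = ['x-requested-with', 'origin'];

// Return a copy of `requestHeaders` without the headers listed above.
function stripAjaxHeaders(requestHeaders) {
  return requestHeaders.filter(function (header) {
    return AJAX_HEADERS.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Register the filter only when the webRequest API is actually available.
if (typeof chrome !== 'undefined' && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      return { requestHeaders: stripAjaxHeaders(details.requestHeaders) };
    },
    { urls: ['<all_urls>'], types: ['xmlhttprequest'] },
    ['blocking', 'requestHeaders']
  );
}
```

Keeping the filtering in a plain function separates the decision about which headers to drop from the Chrome-specific registration, so the filter can be checked without a browser.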

P̲̳x͓L̳
Eli Grey
10

If you are fine with looking beyond a Google Chrome plugin, look at PhantomJS, which uses QtWebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display its output on a screen and can quietly work in the background while you are doing other things. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking on buttons, etc., much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it uses WebKit, its rendering behaviour is very close to Google Chrome's.

Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.

Anshul
  • Really got me going there for a while, but neither of them seems to be able to integrate into a Google Chrome Extension, unfortunately :( They are both stand-alone products that have to be handled in their own environments. Very nice try though. – Seb Nilsson Aug 25 '11 at 19:36
  • @SebNilsson Forgot to mention that earlier, have edited the answer. I assumed your need to have the solution as chrome extension is solely based on your need to get a real browser interacting with the site. – Anshul Aug 26 '11 at 09:19
8

A lot of tools have been released since this question was asked.

artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.

potar
6

Web scraping is kind of convoluted in a Chrome Extension. Some points:

  • You run content scripts for access to the DOM.
  • Background pages (one per extension, not one per tab) can exchange messages with content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as a response.
  • You can execute content scripts in all frames of a webpage, then stitch the document tree (composed of the 1..N frames that the page contains) together.
  • As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
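
The content-script and background-page messaging described in the points above can be sketched roughly as follows. This assumes the chrome.runtime and chrome.tabs messaging APIs; the message shape ({ type: 'extract', selector: ... }) and the names extractItems, handleMessage and requestItems are hypothetical. In a real extension the first half would live in the content script file and requestItems in the background page.

```javascript
// Shared helper: collect the outer HTML of every element matching `selector`.
function extractItems(doc, selector) {
  return Array.prototype.map.call(doc.querySelectorAll(selector), function (el) {
    return el.outerHTML;
  });
}

// Content script side: answer "extract" requests sent by the background page.
function handleMessage(message, sender, sendResponse) {
  if (message && message.type === 'extract') {
    sendResponse({ items: extractItems(document, message.selector) });
  }
}

// Register the endpoint only where the messaging API exists.
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(handleMessage);
}

// Background page side: ask the content script in `tabId` for matching items
// and pass the resulting array of HTML strings to `callback`.
function requestItems(tabId, selector, callback) {
  chrome.tabs.sendMessage(tabId, { type: 'extract', selector: selector }, function (response) {
    callback(response ? response.items : []);
  });
}
```

The items arrive in the background page as plain strings, which matches the question's requirement that the scraped content be available to JavaScript for further manipulation.
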
Novikov
5

I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.

The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
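
On the extension side, reading from such a proxy might look like the sketch below. The proxy location, its url query parameter, and the names buildProxyUrl and getPageContentViaProxy are all assumptions made for illustration; the server-side script itself is not shown.

```javascript
// Build the URL of the (hypothetical) server-side proxy script for `targetUrl`.
function buildProxyUrl(proxyBase, targetUrl) {
  return proxyBase + '?url=' + encodeURIComponent(targetUrl);
}

// Ask the proxy to fetch `targetUrl` and pass the returned HTML string
// to `callback(err, html)`.
function getPageContentViaProxy(proxyBase, targetUrl, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', buildProxyUrl(proxyBase, targetUrl), true);
  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      callback(null, xhr.responseText);
    } else {
      callback(new Error('Proxy returned HTTP ' + xhr.status));
    }
  };
  xhr.onerror = function () {
    callback(new Error('Could not reach proxy at ' + proxyBase));
  };
  xhr.send();
}
```

Only the request from the extension to the proxy is an AJAX request; the request that reaches the scraped site is whatever the server-side script sends.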

Seb Nilsson
Steve
4

I think you can start from this example.

So basically you can try using an extension + plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.

I can recommend using FireBreath as a cross-platform Chrome/Firefox plugin platform; in particular, take a look at this example: FireBreath - Making HTTP Requests with SimpleStreamsHelper

Dmitry Chichkov
3

Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?
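
A sketch of how the frame idea might be combined with a content script, which one of the comments on this answer reports as working. It assumes the content script is declared in the manifest with "all_frames": true and a match pattern covering the target site; loadInHiddenFrame and reportFrameContent are hypothetical names, and sites that refuse to be framed will still not load.

```javascript
// Background page side: load `url` in a hidden iframe inside `doc`
// (normally the background page's own document).
function loadInHiddenFrame(doc, url) {
  var frame = doc.createElement('iframe');
  frame.style.display = 'none';
  frame.src = url;
  doc.body.appendChild(frame);
  return frame;
}

// Content script side: when running inside a frame (not the top-level page),
// report the frame's URL and serialized DOM through `send`.
// Returns true if something was reported.
function reportFrameContent(win, send) {
  if (win.top === win) {
    return false; // top-level page, nothing to report
  }
  send({
    url: win.location.href,
    html: win.document.documentElement.outerHTML
  });
  return true;
}

// Wire the content script to the extension's messaging API when available.
if (typeof chrome !== 'undefined' && chrome.runtime && typeof window !== 'undefined') {
  reportFrameContent(window, function (message) {
    chrome.runtime.sendMessage(message);
  });
}
```

The background page would then receive the HTML string through chrome.runtime.onMessage and could run its selections on it.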

tim
  • I've tried that, but I can't access the content of the iframe, as specified in the W3C standards. Was hoping Chrome Extensions would give me something around that. – Seb Nilsson Aug 13 '11 at 10:59
  • You can access iframe content by including a content script. This is the best solution I have found and I'm using it in many of my extensions. – hamczu Sep 28 '11 at 14:14
  • By launching Chrome with $ chrome --disable-web-security you can access iframes, but some sites don't like this and break out of iframes – denysonique Oct 30 '12 at 19:26