I want to be able to access /robots.txt from a variety of sites using JavaScript. This is for a side project that tests the availability of sites, not all of which are under my control. I've tried this:

    $.get(robotsUrl, function () {
        console.log('success!');
    }, 'text')
        .fail(function () {
            console.log('failed :(');
        });

However, this fails with

    XMLHttpRequest cannot load https://my.test.url/robots.txt. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin

MDN's page on the Same-Origin Policy says that it's possible to embed cross-origin content with certain elements, such as `<script>`, `<iframe>`, and `<embed>`. Could I load /robots.txt from an arbitrary site with any of these? Is there any other way I can access this file on other domains?

Wilfred Hughes
  • @Sushanth-- JSONP is out of the question since it is robots.txt. CORS is out of the question since it is for arbitrary sites. – Quentin Aug 04 '13 at 19:07
  • I'm curious what specifically your goals are with this project. Are you just gathering data? Or trying to provide real-time info to users across the internet? – Zach Lysobey Aug 04 '13 at 19:17
  • I want to build a site that detects whether the user's internet connection is being filtered. I have a list of domains that are likely to be blocked. So server-side fetching isn't an option. – Wilfred Hughes Aug 04 '13 at 19:20

4 Answers


You could load it with any of them, you just won't be able to make the data available to JavaScript. That's rather the point of the Same Origin Policy.

If you want to get arbitrary data from arbitrary sites, you need to do it server side.

Quentin
  • Would it be possible to detect whether the URL loaded? Ideally I'd like to see the content, but if not, knowing whether or not I got an HTTP 200 would be sufficient. – Wilfred Hughes Aug 04 '13 at 19:09
  • No. The status code is also protected by the same origin policy. (Otherwise, oh look, 200, not 403, my visitor *is* logged in to some other site I don't control). – Quentin Aug 04 '13 at 19:10
  • That's a browser security feature that can be turned off (in Chrome at least): http://stackoverflow.com/questions/3102819/chrome-disable-same-origin-policy – Zach Lysobey Aug 04 '13 at 19:13
  • Thanks, that's good to know. Sounds like my only option would be to use `/favicon.ico` in an `<img>` tag instead. – Wilfred Hughes Aug 04 '13 at 19:14
  • Assuming that there *is* a favicon for the site, and that it is at that URI. – Quentin Aug 04 '13 at 19:15
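The favicon idea from the comments above can be sketched as follows: an image load fires `onload`/`onerror` even across origins, so you learn whether a URL resolved without ever reading its content. The `faviconUrl` helper, the `probeFavicon` name, and the 5-second timeout are my own illustrative choices, not anything standard:

```javascript
// Hypothetical helper: build the probe URL for a host.
function faviconUrl(host) {
    return 'http://' + host + '/favicon.ico';
}

// Probe by loading the favicon into an <img>; report success or failure
// exactly once. Browser-only: relies on the Image constructor.
function probeFavicon(host, callback) {
    var img = new Image();
    var done = false;
    var timer = setTimeout(function () { finish(false); }, 5000);
    function finish(ok) {
        if (done) { return; }
        done = true;
        clearTimeout(timer);
        callback(ok);
    }
    img.onload = function () { finish(true); };
    img.onerror = function () { finish(false); };
    img.src = faviconUrl(host); // setting src starts the request
}
```

Note this only tells you whether the image loaded, not whether robots.txt exists; a reachable site that simply has no favicon at that URI will look "blocked".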

To get around the same-origin policy, you need to either have control over the host site and set the Access-Control-Allow-Origin header (not an option here), or load the file by a method other than XMLHttpRequest (which is what JSONP does; the response is loaded as an ordinary script).

That means you could display the robots.txt in an iframe, for example, by just setting its src attribute.

If you want to manipulate the contents in JavaScript, that won't work (even after you load the content in an iframe, you're still not allowed to interact with it). Your final option is to set up a proxy: have a script on your server which, when called, will load the relevant file and relay its content. It's not hard to do, but it means your server will see higher traffic (and you'll need to lock it down so that it isn't used maliciously).

Dave
  • OK so since this is to look for blocking, the proxy isn't an option. Your best bets are: look for a resource on the site which has an open origin policy, use images (you will need to find a unique image per-site), or just display the sites to the user and let them decide if they worked. I'll clarify that even plugins have similar restrictions. For example, Flash needs a crossdomain.xml file. – Dave Aug 04 '13 at 19:26

iframes won't let you peek at the content. You could show it to your user, but I'm guessing you want to analyze it with code.


You could do it on your server, even with just a /cors/robots/domain.tld handler (and others for any other files you need to access). This is probably the best way, if it's feasible for your situation.


AnyOrigin is a free service that allows you to make cross-origin requests.

    $.getJSON('http://anyorigin.com/get?url=google.com/robots.txt&callback=?', function (data) {
        console.log(data.contents); // contents of robots.txt
    });
Brigand

Pretty sure this is possible with Chrome by running the browser with the Same Origin Policy disabled: Disable same origin policy in Chrome.

It may be preferable to do something like this outside the context of a browser, however, perhaps on the command line using something like curl.
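For the command-line route, a small curl wrapper is enough to record a status code per host. The `probe` function name and the 10-second timeout are just illustrative:

```shell
# Print the HTTP status for a host's robots.txt, or 000 when the request
# fails outright (DNS failure, timeout, connection refused).
probe() {
    curl -s -o /dev/null -m 10 -w '%{http_code}\n' "http://$1/robots.txt"
}

# probe example.com    # prints a status code such as 200
```

This sidesteps the same-origin policy entirely, but of course it measures blocking from wherever the script runs, not from the end user's connection.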

Zach Lysobey