
I want to read the HTML source of, say, www.google.com with Ajax or jQuery. I don't just want to display the source; I need to parse it, so having xmlhttp.responseText is nice.

"read contents of an external webpage and get specific elements" shows a nice way of doing it server-side with PHP. "Can Javascript read the source of any web page?" is nice if you are trying to read a page on the local domain.

YQL + JSON is a possibility, as noted in the questions above, but it seems slow and adds a lot of overhead.

I'd prefer plain Ajax, because then I don't need to load a 90k jQuery lib, and as far as I can see...

var xmlhttp=null;
var url = 'bot.html?url=http://google.com';  //must redirect in bot.html
//var url='http://www.google.com';  won't work: xmlhttp.status comes back as 0
if (window.XMLHttpRequest) { // code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp=new XMLHttpRequest();  //src says buggy for IE7
} else {// code for IE6, IE5
  xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}

// attach the handler before sending so that no state change is missed
xmlhttp.onreadystatechange=function() {
 if (xmlhttp.readyState==4 && xmlhttp.status==200) {
    document.getElementById("result").innerHTML= xmlhttp.responseText;
 }
};

xmlhttp.open("GET",url,true);
xmlhttp.send(null);

which is much the same as this in jQuery...

$("#result").load(url);

What the other Stack Overflow questions mentioned above don't cover is how to handle the ?url= parameter. To keep everything in JS, I did...

bot.html:
<head>
<script type="text/javascript">
var query = window.location.search.substring(1);  // query string without the leading "?"
var vars = query.split("&");
var pair = vars[0].split("=");
if (pair[0]=='url') {  // ex bot.html?url=http://www.google.com
    alert('hi '+pair[1]);
    window.location = pair[1];
    //top.location.href=pair[1];  or
}
</script>
... above jquery or ajax ...
<div id="result">Fill Me</div>
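
A hedged aside on the parsing above: splitting on "=" truncates any target URL that itself contains "=" or "&". The following is a minimal sketch using the standard URLSearchParams API. getTargetUrl is a hypothetical helper name, and it assumes the target is percent-encoded when the link to bot.html is built.

```javascript
// Hypothetical helper: extract the "url" parameter from a query string
// such as window.location.search. Returns null when the parameter is absent.
function getTargetUrl(search) {
  var params = new URLSearchParams(search);
  return params.get("url"); // the value comes back percent-decoded
}
```

In bot.html this would be called as getTargetUrl(window.location.search). It only fixes the parsing; it does not change the cross-domain behaviour described below.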

All this works fine for a local page, var url='index.php' (without the redirect). HOWEVER, none of it works for external links like google.com: I can't seem to use var url='google.com'. If I try to proxy through bot.html (as alluded to for jQuery, without an example, in the Stack Overflow question mentioned above), it loads the source of bot.html itself and never does the alert or the redirect. That makes sense, I think, because the request only loads the page's text; it doesn't execute it. I figured I could use the same proxy trick for Ajax.

Trying to redirect / proxy via .htaccess won't fit this application.

dako

1 Answer


I don't see what you're trying to accomplish with the second bit of code in your question (from the bot.html down).

But! I think I have a solution for you. You're probably running up against the same-origin policy (see the Wikipedia or MDN documentation), which basically states that XMLHttpRequest objects cannot make requests to domains other than the one the page was originally served from. The idea behind this is that without such enforcement at the browser level (in other words, at a higher authority than the JS runtime itself), it would be too easy for an external script to eavesdrop on, corrupt or hijack your AJAX requests by changing the domain or parameters those requests are sent to.
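
To make "same origin" concrete, here is a minimal sketch using the standard URL API. sameOrigin is a hypothetical helper name and the example.com URLs are placeholders; the point is only that scheme, host and port all have to match.

```javascript
// Two URLs share an origin only when scheme, host, and port all match.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}
```

With a page served from http://example.com/bot.html, a request to http://example.com/index.php passes this check, while http://www.google.com/ does not, which matches the behaviour described in the question.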

The workaround is to use script tags instead. Here's a bit of code I adapted from the jQuery source (search for 'DOMContentLoaded' for the relevant part) to do just that. I also didn't want to include the entire jQuery library to make cross-domain Ajax requests - we were testing for speed of client side actions and some of the test targets didn't require jQuery already, so including it would have skewed the test.

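function saveTime() {
    // url is assumed to be defined in an enclosing scope, as in the question
    var s = document.createElement("script"),
        h = document.head || document.getElementsByTagName("head")[0] || document.documentElement;
    s.async = true;
    s.type = "text/javascript";
    // onload covers current browsers; onreadystatechange only fires in old IE
    s.onload = s.onreadystatechange = function() {
        // callback function (note: the response text is not passed in)
        // Append the result into the inner HTML here
    };
    s.src = url;
    h.insertBefore(s, h.firstChild);
}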

This should get you what you need, but you might have to tweak the type attribute to get raw/complete HTML contents back. It inserts a <script> tag with the source you specify in url at the beginning of the <head> element (or of the root <html> element when no head is found, as in very old versions of IE). I did not adapt the cleanup code. If you look through the jQuery source, you'll see that it has extra handlers for removing the tag from the DOM after the request completes or fails.
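
For the cleanup step, something along these lines might work. This is a rough sketch, not jQuery's actual code: loadScript, done and the optional doc parameter are hypothetical names, and it only covers the standard load and error events.

```javascript
// Hypothetical helper: load a script and remove the tag once it settles.
// `doc` defaults to the page's document; `done` receives an Error or null.
function loadScript(url, done, doc) {
  doc = doc || document;
  var s = doc.createElement("script");
  var h = doc.head || doc.getElementsByTagName("head")[0] || doc.documentElement;

  function finish(err) {
    s.onload = s.onerror = null; // make sure the callback runs only once
    if (s.parentNode) {
      s.parentNode.removeChild(s); // take the tag back out of the DOM
    }
    done(err);
  }

  s.onload = function () { finish(null); };
  s.onerror = function () { finish(new Error("failed to load " + url)); };
  s.async = true;
  s.src = url;
  h.insertBefore(s, h.firstChild);
}
```

This only tidies up the inserted tag. The callback still gets no response text, so it does not by itself give you the HTML of the remote page.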

Patrick M
  • I tried this using var url = 'http://www.google.com'; and s.type = "text/html";, removed the function saveTime() wrapper to run just its innards, and put an alert in the onreadystatechange function. I never see the alert. What am I doing wrong? – dako Dec 02 '12 at 22:52
  • I just played with it a bit myself. It appears that Chrome won't dynamically load a script element with `type="text/html"`, and if you use `type="text/javascript"` then you get syntax errors when the browser runs the Javascript Interpreter against the returned html. I wonder how the guts of YQL work then... – Patrick M Dec 02 '12 at 23:19
  • Oh, silly me. YQL runs through querying their server-side API. I was assuming it was a released JS library that you could run self-contained from a browser. – Patrick M Dec 02 '12 at 23:22
  • I've been reading around. It seems I can't do this without PHP or YQL (which, as you say, goes through their server). It makes no sense, since I can iframe an external page (and indirectly get all the code). – dako Dec 03 '12 at 00:17
  • Maybe that's your answer, then. iFrame in the content with a css rule set for `display:none`, use JS to detect when it's finished loading, then get the inner html of the iFrame. (Also theoretical, I haven't tried this.) – Patrick M Dec 03 '12 at 02:56
  • Ahh, won't work: the famous "can't get iframe.contentDocument of a child that isn't on the same domain". You can disable the security 'feature' in your browser, but that doesn't help users... – dako Feb 26 '13 at 22:58
  • In retrospect, I should have known that before making my suggestion. I ran into this very recently iframing content between two of our different production servers as part of a phased migration. The last resort to make it ajax-y would be to proxy it through your own server: let that page curl the content and return it as a string. – Patrick M Feb 26 '13 at 23:45
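
The proxy idea in the last comment could look roughly like the sketch below. It uses Node.js instead of PHP and curl, which is a swap made here for illustration, and it assumes Node 18 or later for the global fetch. targetFrom, createProxyServer and the /proxy path are hypothetical names.

```javascript
const http = require("http");

// Hypothetical helper: pull the target out of "/proxy?url=..." and accept
// only http(s) URLs. Returns null for anything else.
function targetFrom(requestUrl) {
  const target = new URL(requestUrl, "http://localhost").searchParams.get("url");
  if (!target) return null;
  try {
    const parsed = new URL(target);
    return /^https?:$/.test(parsed.protocol) ? parsed.href : null;
  } catch (e) {
    return null; // not an absolute URL
  }
}

// Fetch the remote page server-side and return its source as plain text.
function createProxyServer() {
  return http.createServer(async (req, res) => {
    const target = targetFrom(req.url);
    if (!target) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("missing or invalid url parameter");
      return;
    }
    try {
      const upstream = await fetch(target); // global fetch: assumes Node 18+
      const body = await upstream.text();
      res.writeHead(200, { "Content-Type": "text/plain; charset=utf-8" });
      res.end(body);
    } catch (e) {
      res.writeHead(502, { "Content-Type": "text/plain" });
      res.end("upstream request failed");
    }
  });
}
```

Started with createProxyServer().listen(8080), for example, and exposed under the same origin as the page, this lets the XMLHttpRequest code from the question request /proxy?url=... and read the remote source from xmlhttp.responseText. An open proxy like this should be restricted to known targets before it goes anywhere public.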