11

In C# or else VB.Net, I need to access to a webpage through a web-proxy service to do a web-scraping on the target url which I am interested to.

Let's give as example a random web-proxy service (really no matter which one, I'm open to suggestions) for example this below, which does not complicate things like others do with hashes in the query (that's a thing that I don't know how to handle):

http://proxyanonimo.es/browse.php?u=http%3a%2f%2furl.com

Then, when i perform an HttpWebRequest to that url I expected to encounter in the response the target url's html content, but instead of that I get this content:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 
<html>
<head>
<title>Proxy Anonimo :: Spanish Web Proxy</title>
<meta name="keywords" content="proxy, webproxy, proxy online, spanish proxy" />
<meta name="description" content="Usa nuestro WebProxy An&#65533;nimo para comprobar como se ve una web desde otro sitio que no sea el ordenador en el que est&#65533;s sentado. Es un acceso remoto desde nuestro servidor." />
 
<style type="text/css">
    html, body {
       text-align: center;
    }
    #wrapper {
       width: 740px;
       margin: 0 auto 0 auto;
       text-align: left;
       padding: 10px;
       background: #eee;
       border: 4px outset #ccc;
    }
    #footer {
       margin: 10px 0 0 0; 
       font-size: 80%;
       color: #ccc;
    }
    #error {
       border: 1px solid red;
       padding: 2px;
       margin: 5px 0 15px 0;
       background: #eee;
    }
    .center { text-align: center; }
 
    /* TOOLTIP HOVER EFFECT */
    #tooltip{ 
       width:20em; background: #fff;
    }
</style>
    <script type="text/javascript">ginf={url:'http://proxyanonimo.es',script:'browse.php',target:{h:'http://myurl.com',p:'/',b:'',u:'http://myurl.com'},enc:{u:'iawpK1Q337kKRtEraNzZubjsx46C64Qd4aqEZ6vR2GrHZTZXxmNPoU7JM4aGYQJROYjBUFiKbxiYh5LEhmjt4g3G83dVHKClyLMhgTRfgX1nSBPYLYhG38a11bMwMcF8',e:'',x:'',p:''},b:'12'}</script>
    <script type="text/javascript" src="http://proxyanonimo.es/includes/main.js?1.4.1"></script></head>
<body>
<div id="wrapper">
 
    <h1 class="center"><a href="index.php">Proxy Anonimo</a></h1>
    <h2 class="center">IPv6 Ready!</h2> 
    <div id="error">Hotlinking directly to proxied pages is not permitted.</div><p style="text-align:right">[<a href="http://proxyanonimo.es/browse.php?u=http%3a%2f%2fmyurl.com&amp;b=12&amp;f=norefer">Reload http://myurl.com</a>]</p>
 
    <h2>Proxy</h2>
 
       Usa nuestro WebProxy An&#65533;nimo para comprobar como se ve una web desde otro sitio que no sea el ordenador en el que est&#65533;s sentado. Es un acceso remoto desde nuestro servidor. Si tu conexi&#65533;n tiene alguna restricci&#65533;n, con nuestro Proxy An&#65533;nimo no tendr&#65533;as que tener problema o por lo menos, asegurarte de si la web es accesible o no. 
 
    <h2>URL</h2>
 
    <form action="includes/process.php?action=update" method="post" onsubmit="return updateLocation(this);">
        <input type="text" name="u" id="input" size="60">
 
 
 
        <!--<input type="submit" value="Go">-->
 
        <h3>Options</h3>
        <ul id="options">
            <li><input type="checkbox" name="encodeURL" id="encodeURL"><label for="encodeURL" class="tooltip" onmouseover="tooltip('Encrypts the URL of the page you are viewing so that it does not contain the target site in plaintext.')" onmouseout="exit();">Encrypt URL</label></li><li><input type="checkbox" name="encodePage" id="encodePage"><label for="encodePage" class="tooltip" onmouseover="tooltip('Helps avoid filters by encrypting the page before sending it and decrypting it with javascript once received.')" onmouseout="exit();">Encrypt Page</label></li><li><input type="checkbox" name="allowCookies" id="allowCookies" checked="checked"><label for="allowCookies" class="tooltip" onmouseover="tooltip('Cookies may be required on interactive websites (especially where you need to log in) but advertisers also use cookies to track your browsing habits.')" onmouseout="exit();">Allow Cookies</label></li><li><input type="checkbox" name="tempCookies" id="tempCookies" checked="checked"><label for="tempCookies" class="tooltip" onmouseover="tooltip('This option overrides the expiry date for all cookies and sets it to at the end of the session only - all cookies will be deleted when you shut your browser. (Recommended)')" onmouseout="exit();">Force Temporary Cookies</label></li><li><input type="checkbox" name="stripTitle" id="stripTitle"><label for="stripTitle" class="tooltip" onmouseover="tooltip('Removes titles from proxied pages.')" onmouseout="exit();">Remove Page Titles</label></li><li><input type="checkbox" name="stripJS" id="stripJS"><label for="stripJS" class="tooltip" onmouseover="tooltip('Remove scripts to protect your anonymity and speed up page loads. However, not all sites will provide an HTML-only alternative. (Recommended)')" onmouseout="exit();">Remove Scripts</label></li><li><input type="checkbox" name="stripObjects" id="stripObjects"><label for="stripObjects" class="tooltip" onmouseover="tooltip('You can increase page load times by removing unnecessary Flash, Java and other objects. If not removed, these may also compromise your anonymity.')" onmouseout="exit();">Remove Objects</label></li>      </ul>
    </form>
 
    <br>
 
    <br><br><br>
 
    <p><a href="http://s07.flagcounter.com/more/xu5M"><img src="http://s07.flagcounter.com/count/xu5M/bg=FFFFFF/txt=000000/border=CCCCCC/columns=8/maxflags=248/viewers=De+donde+nos+visitan/labels=1/pageviews=1/" alt="free counters" border="0"></a></p>
 
 
    <div id="eXTReMe"><a href="http://extremetracking.com/open?login=proxyes">
<img src="http://t1.extreme-dm.com/i.gif" style="border: 0;"
height="38" width="41" id="EXim" alt="eXTReMe Tracker" /></a>
<script type="text/javascript"><!--
EXref="";top.document.referrer?EXref=top.document.referrer:EXref=document.referrer;//-->
</script><script type="text/javascript"><!--
var EXlogin='proxyes' // Login
var EXvsrv='s10' // VServer
EXs=screen;EXw=EXs.width;navigator.appName!="Netscape"?
EXb=EXs.colorDepth:EXb=EXs.pixelDepth;EXsrc="src";
navigator.javaEnabled()==1?EXjv="y":EXjv="n";
EXd=document;EXw?"":EXw="na";EXb?"":EXb="na";
EXref?EXref=EXref:EXref=EXd.referrer;
EXd.write("<img "+EXsrc+"=http://e1.extreme-dm.com",
"/"+EXvsrv+".g?login="+EXlogin+"&amp;",
"jv="+EXjv+"&amp;j=y&amp;srw="+EXw+"&amp;srb="+EXb+"&amp;",
"l="+escape(EXref)+" height=1 width=1>");//-->
</script><noscript><div id="neXTReMe"><img height="1" width="1" alt=""
src="http://e1.extreme-dm.com/s10.g?login=proxyes&amp;j=n&amp;jv=n" />
</div></noscript></div>
 
<p class="center">Powered by <a href="http://www.glype.com/">Glype</a>&reg; v1.4.1.</p> 
</div>
 
<script type="text/javascript">
var infolinks_pid = 1993344;
var infolinks_wsid = 0;
</script>
<script type="text/javascript" src="http://resources.infolinks.com/js/infolinks_main.js"></script>
 
</body>
</html>

Then... this is possibly to do?.

What I'm missing?.

Maybe the web-proxy service that I'm trying is resctricting me something?, maybe another web-proxy service could help me better for my needs?.

barny
  • 5,280
  • 4
  • 16
  • 21
ElektroStudios
  • 17,150
  • 31
  • 162
  • 376

2 Answers2

5

I would like to suggest you use direct proxy IP:port, for example 115.238.225.26:80. Then you could easy handle problem using next code:

HttpWebRequest req = (HttpWebRequest) WebRequest.Create(new Uri("http://example.com"));
WebProxy webproxy = new WebProxy("115.238.225.26", 80);
webproxy.BypassProxyOnLocal = false;
req.Method = "GET";
req.Proxy = webproxy;
HttpWebResponse response = (HttpWebResponse) req.GetResponse();
var respStream = response.GetResponseStream();
var result = "";
if (respStream != null) {
    var strReader = new StreamReader(respStream);
    result = strReader.ReadToEnd();
}

Then in result variable you will find result page content or empty string in case some problems occurs(respStream==null). Additionally it may be required add exceptions handling for this code in case any connection problems occurs or so.

Volodymyr
  • 991
  • 1
  • 12
  • 25
  • 1
    Thanks for answer but my question is related to use an online web proxy service, not a common normal http proxy, I knew how to perform a request using a normal proxy, but my needs aren't that. Thanks anyways. – ElektroStudios Jul 25 '15 at 18:03
  • Not sure I understood you last comment, could you explain more detailed what is the difference? – Volodymyr Jul 25 '15 at 18:35
  • 1
    I mean various things. The web-proxy services puts iframes and js so due to my newbie network knowledges for me its difficult to understand how to retrieve the target url's source code. Also a normal (free)proxy needs daily maintenance because its poor lifetime, so I should retrieve a +5.000 proxy lists everyday, and go checking one by one to find a proxy that its anonymity level its ok for me, its too much hard job that with a web-proxy service I can avoid all those things, just using a web-proxy I could stop worrying about other things. – ElektroStudios Jul 26 '15 at 06:07
5

The main issue you seem to be encountering is that the proxy example you're using requires a POST to update the destination URL you're trying to browse through the proxy. That's why you're not getting any content from the target page, and the error message

<div id="error">Hotlinking directly to proxied pages is not permitted.</div>

I don't know how your code looks like, but it seems like you could use the HttpWebRequest POST Method

WebRequest request = (HttpWebRequest)WebRequest.Create("http://www.glype-proxy.info/includes/process.php?action=update");

var postData = "url="+"http://www.example.com";
postData += "&allowCookies=on";
var data = Encoding.ASCII.GetBytes(postData);

request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
request.ContentLength = data.Length;

using (var stream = request.GetRequestStream()) {
    stream.Write(data, 0, data.Length);
}

var response = (HttpWebResponse)request.GetResponse();
var responseString = new StreamReader(response.GetResponseStream()).ReadToEnd();

You're going to need to find or host a proxy that returns the HTML of the page, such as http://www.glype-proxy.info/. Even so, in order for a proxy to function correctly, it must change the link to the page's resources to it's own "proxied" path.

http://www.glype-proxy.info/browse.php?u=https%3A%2F%2Fwww.example.com%2F&b=4&f=norefer

In the URL above, if you want the path to the original resources, you'll have to find all the resources that have been redirected and unencode the path passed in as the u= parameter to this specific proxy. Also, you may wish to ignore additional elements injected by the proxy , in this case the <div id="include"> element.


I believe the proxy you're using works the same way as the "Glype" proxy I used in this example, but I do not have access to it at the time of posting. Also, if you want to use use other proxies, you may want to note that many proxies display the result in an iFrame (probably for XSS prevention, navigation, or skinning).

Note: Generally, using another service outside of a built-in API is a bad practice, since services often get a GUI update or some other change that could break your script. Also, those services could experiences interruptions or just be taken down.

Community
  • 1
  • 1
andrewgu
  • 1,510
  • 16
  • 22