
My Java app is trying to read content from the following url : https://www.iplocation.net/?query=62.92.63.48

I used the following method :

  StringBuffer readFromUrl(String Url)
  {
    StringBuffer sb=new StringBuffer();
    BufferedReader in=null;
    
    try
    {
      in=new BufferedReader(new InputStreamReader(new URL(Url).openStream()));
      String inputLine;
    
      while ((inputLine=in.readLine()) != null) sb.append(inputLine+"\n");
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally 
    {
      try 
      {
        if (in!=null)
        {
          in.close();
          in=null;
        }
      }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return sb;
  }

It usually works fine for other URLs, but for this one the result differs from what a browser shows. It looks like this:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function(){function getSessionCookies(){var cookieArray=new Array();var cName=/^\s?incap_ses_/;var c=document.cookie.split(";");for(var i=0;i<c.length;i++){var key=c[i].substr(0,c[i].indexOf("="));var value=c[i].substr(c[i].indexOf("=")+1,c[i].length);if(cName.test(key)){cookieArray[cookieArray.length]=value}}return cookieArray}function setIncapCookie(vArray){var res;try{var cookies=getSessionCookies();var digests=new Array(cookies.length);for(var i=0;i<cookies.length;i++){digests[i]=simpleDigest((vArray)+cookies[i])}res=vArray+",digest="+(digests.join())}catch(e){res=vArray+",digest="+(encodeURIComponent(e.toString()))}createCookie("___utmvc",res,20)}function simpleDigest(mystr){var res=0;for(var i=0;i<mystr.length;i++){res+=mystr.charCodeAt(i)}return res}function createCookie(name,value,seconds){var expires="";if(seconds){var date=new Date();date.setTime(date.getTime()+(seconds*1000));var expires="; expires="+date.toGMTString()}document.cookie=name+"="+value+expires+"; path=/"}function test(o){var res="";var vArray=new Array();for(var j=0;j<o.length;j++){var test=o[j][0];switch(o[j][1]){case"exists":try{if(typeof(eval(test))!="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=true")}else{vArray[vArray.length]=encodeURIComponent(test+"=false")}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=false")}break;case"value":try{try{res=eval(test);if(typeof(res)==="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=undefined")}else if(res===null){vArray[vArray.length]=encodeURIComponent(test+"=null")}else{vArray[vArray.length]=encodeURIComponent(test+"="+res.toString())}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=cannot evaluate");break}break}catch(e){vArray[vArray.length]=encodeURIComponent(test+"="+e)}case"plugin_extentions":try{var extentions=[];try{i=extentions.indexOf("i")}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=indexOf is not a function");break}try{var num=navigator.plugins.length 
if(num==0||num==null){vArray[vArray.length]=encodeURIComponent("plugin_ext=no plugins");break}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=cannot evaluate");break}for(var i=0;i<navigator.plugins.length;i++){if(typeof(navigator.plugins[i])=="undefined"){vArray[vArray.length]=encodeURIComponent("plugin_ext=plugins[i] is undefined");break}var filename=navigator.plugins[i].filename var ext="no extention";if(typeof(filename)=="undefined"){ext="filename is undefined"}else if(filename.split(".").length>1){ext=filename.split('.').pop()}if(extentions.indexOf(ext)<0){extentions.push(ext)}}for(i=0;i<extentions.length;i++){vArray[vArray.length]=encodeURIComponent("plugin_ext="+extentions[i])}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext="+e)}break}}vArray=vArray.join();return vArray}var o=[["navigator","exists"],["navigator.vendor","value"],["navigator.appName","value"],["navigator.plugins.length==0","value"],["navigator.platform","value"],["navigator.webdriver","value"],["platform","plugin_extentions"],["ActiveXObject","exists"],["webkitURL","exists"],["_phantom","exists"],["callPhantom","exists"],["chrome","exists"],["yandex","exists"],["opera","exists"],["opr","exists"],["safari","exists"],["awesomium","exists"],["puffinDevice","exists"],["navigator.cpuClass","exists"],["navigator.oscpu","exists"],["navigator.connection","exists"],["window.outerWidth==0","value"],["window.outerHeight==0","value"],["window.WebGLRenderingContext","exists"],["document.documentMode","value"],["eval.toString().length","value"]];try{setIncapCookie(test(o));document.createElement("img").src="/_Incapsula_Resource?SWKMTFSR=1&e="+Math.random()}catch(e){img=document.createElement("img");img.src="/_Incapsula_Resource?SWKMTFSR=1&e="+e}})();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D2273746128......6F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>

So what's the proper way to read the html content that shows up in the browser, in this case ?

Edit : After reading the suggestions, I've updated my program to look like the following :

StringBuilder response=new StringBuilder();
String USER_AGENT="Mozilla/5.0",inputLine;
BufferedReader in=null;    

try
{
  HttpURLConnection con=(HttpURLConnection)new URL(Url).openConnection();
  con.setRequestMethod("GET");
  con.setRequestProperty("Accept-Charset","UTF-8");
  con.setRequestProperty("User-Agent",USER_AGENT);                         // Add request header

  int responseCode=con.getResponseCode();
  in=new BufferedReader(new InputStreamReader(con.getInputStream()));
  while ((inputLine=in.readLine())!=null) { response.append(inputLine); }
  in.close();
}
catch (Exception e) { e.printStackTrace(); }
finally 
{
  try { if (in!=null) in.close(); }
  catch (Exception ex) { ex.printStackTrace(); }
}
return response.toString();

It still didn't work; the response I got looks like this:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&xinfo=8-75933493-0 0NNN RT(1479758027223 127) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U10000&incident_id=516000100118713619-514529209419563176&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 516000100118713619-514529209419563176</iframe></body></html>

Could someone show some sample code that works ?

Thanks to @thatguy I've modified my program to look like the following :

import java.util.*;
import java.util.concurrent.*;
import java.io.*;
import java.net.*;
import java.util.Map.Entry;

public class Read_From_Url_Runner implements Callable<String[]>
{
  int Id;
  String Read_From_Url_Result[]=null,IP_Location_Url="https://www.iplocation.net/?query=[IP]",IP="62.92.63.48",Cookie,Result[],A_Url;
  
  public Read_From_Url_Runner(int Id)
  {
    this.Id=Id;
    
    A_Url=IP_Location_Url.replace("[IP]",IP);
    Cookie=getIncapsulaCookie(A_Url);
    Out("Cookie = [ "+Cookie+" ]");
    
    try
    {
      Result=call();
//      for (int i=0;i<Result.length;i++) Out(Result[i]);
    }
    catch (Exception e) { e.printStackTrace(); }
  }
  
  public String[] call() throws InterruptedException
  {
    String Text;
    
    try
    {
      Text=readUrl(A_Url,Cookie);
      Out(Text);
    }
    catch (Exception e)
    {
      Out(" --> Error in data : IP = "+IP);
//    e.printStackTrace();
    }
    return Read_From_Url_Result;
  }
  
  public static String readUrl(String url,String incapsulaCookie)
  {
    StringBuilder response=new StringBuilder();
    String USER_AGENT="Mozilla/5.0",inputLine;
    BufferedReader in=null;

    try
    {
      HttpURLConnection connection=(HttpURLConnection)new URL(url).openConnection();
      connection.setRequestMethod("GET");
      connection.setRequestProperty("Accept","text/html; charset=UTF-8");
      connection.setRequestProperty("User-Agent",USER_AGENT);
      connection.setDoInput(true);
      connection.setDoOutput(true);
      connection.setRequestProperty("Cookie",incapsulaCookie);                           // Set the Incapsula cookie
      Out(connection.getRequestProperty("Cookie"));

      in=new BufferedReader(new InputStreamReader(connection.getInputStream()));
      while ((inputLine=in.readLine())!=null) { response.append(inputLine+"\n"); }
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return response.toString();
  }
  
  public static String getIncapsulaCookie(String url)
  {
    String USER_AGENT="Mozilla/5.0",incapsulaCookie=null,visid=null,incap=null;          // Cookies for Incapsula, preserve order
    BufferedReader in=null;

    try
    {
      HttpURLConnection cookieConnection=(HttpURLConnection)new URL(url).openConnection();
      cookieConnection.setRequestMethod("GET");
      cookieConnection.setRequestProperty("Accept","text/html; charset=UTF-8");
      cookieConnection.setRequestProperty("User-Agent",USER_AGENT);
      cookieConnection.connect();
      
      for (Entry<String,List<String>> header : cookieConnection.getHeaderFields().entrySet())
      {
        if (header.getKey()!=null && header.getKey().equals("Set-Cookie"))               // Incapsula gives you the required cookies
        {
          for (String cookieValue : header.getValue())                                   // Search for the desired cookies
          {
            if (cookieValue.contains("visid")) visid=cookieValue.substring(0,cookieValue.indexOf(";")+1);
            if (cookieValue.contains("incap_ses")) incap=cookieValue.substring(0,cookieValue.indexOf(";"));
          }
        }
      }
      incapsulaCookie=visid+" "+incap;
      cookieConnection.disconnect();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return incapsulaCookie;
  }
  
  private static void out(String message) { System.out.print(message); }
  private static void Out(String message) { System.out.println(message); }
  
  public static void main(String[] args)
  {
    final Read_From_Url_Runner demo=new Read_From_Url_Runner(0);
  }
}

But this only returned the first portion of the response, as shown below:

enter image description here

What I really wanted to get is something like the following :

enter image description here

This result was obtained by running my program from: How to shut down Javafx?

Frank
  • You need to make the same request the browser does, essentially. You can probably work out through trial and error which headers are causing the markup to change – Clive Nov 21 '16 at 17:57
  • It's probably a user-agent check I'd imagine. – CollinD Nov 21 '16 at 17:58

1 Answer


The likely cause is the HTTP request header, which your code never sets explicitly. Websites often deliver different representations depending on the attributes in the HTTP header (and payload), for example to serve desktop and mobile clients appropriately. Because your code sets nothing, it sends whatever default header the library chooses. If you inspect the header your browser actually sends, you will most likely find differences (such as the user agent or the accepted encoding). If you rebuild that header in your code, the result should be the same.

Additionally, you could use an HttpURLConnection, which lets you easily set or read the corresponding HTTP headers, as in this SO post. For a plain URLConnection, look here.
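
As a minimal sketch of rebuilding a header set with HttpURLConnection: the class name BrowserLikeRequest and all header values below are illustrative assumptions, not values copied from a real browser session. Inspect the request your own browser sends and substitute its values.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class BrowserLikeRequest {

    // Prepares (but does not yet send) a GET request with explicit headers.
    // The header values are placeholders; copy the real ones from the
    // request your browser sends.
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("GET");
            con.setRequestProperty("User-Agent", "Mozilla/5.0");
            con.setRequestProperty("Accept", "text/html");
            con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing goes over the network until you call getResponseCode() or getInputStream() on the returned connection, so you can still check what was set with getRequestProperty(...) beforehand.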

Further investigation

Your method retrieves a special error page, which indicates that the website uses additional security features from Incapsula. The page you get looks like this:

Incapsula error page
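
To recognise this page in code, a rough heuristic is to look for the markers that appear in the two responses quoted in the question. This is only a sketch: the class and method names are hypothetical, and it is an assumption that these two strings cover every variant of the page (a normal page served through Incapsula might also reference them).

```java
final class IncapsulaPageCheck {

    // Markers taken from the responses quoted in the question.
    // Assumption: they are sufficient to tell the error page apart.
    private static final String[] MARKERS = {
        "_Incapsula_Resource",
        "Incapsula incident ID"
    };

    // Returns true if the HTML contains any of the known markers.
    static boolean looksLikeErrorPage(String html) {
        if (html == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (html.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```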

When I investigated the headers, I noticed two cookie strings that must be present for you to reach the website directly instead of the security check:

visid_incap_...=...
incap_ses_..._...=...

What you can do is land on the error page with a single request, which gives you both cookie strings in the Set-Cookie headers. Then you can request the website directly with the cookie strings set as visid_incap_...=...; incap_ses_..._...=.... You can repeat the request until the cookie expires; check for the error page to detect that. Here is working code, which obviously lacks style and additional checks, but solves your problem. The rest is up to you.

// Needs: java.net.HttpURLConnection, java.net.URL,
//        java.util.List, java.util.Map.Entry
public static String getIncapsulaCookie(String url) {

    String USER_AGENT = "Mozilla/5.0";
    String incapsulaCookie = null;

    try {

        HttpURLConnection cookieConnection =
                (HttpURLConnection) new URL(url).openConnection();
        cookieConnection.setRequestMethod("GET");
        cookieConnection.setRequestProperty("Accept",
                "text/html; charset=UTF-8");
        cookieConnection.setRequestProperty("User-Agent", USER_AGENT);

        // Disable 'keep-alive'
        cookieConnection.setRequestProperty("Connection", "close");

        // Cookies for Incapsula, preserve order
        String visid = null;
        String incap = null;

        cookieConnection.connect();

        for (Entry<String, List<String>> header : cookieConnection
                .getHeaderFields().entrySet()) {

            // Incapsula gives you the required cookies. The status line has
            // a null key, and header names are case-insensitive.
            if (header.getKey() != null
                    && header.getKey().equalsIgnoreCase("Set-Cookie")) {

                // Search for the desired cookies
                for (String cookieValue : header.getValue()) {
                    if (cookieValue.contains("visid")) {
                        // Keep the trailing ';' as the separator
                        visid = cookieValue.substring(0,
                                cookieValue.indexOf(";") + 1);
                    }
                    if (cookieValue.contains("incap_ses")) {
                        incap = cookieValue.substring(0,
                                cookieValue.indexOf(";"));
                    }
                }
            }
        }

        // Yields "visid_incap_...=...; incap_ses_..._...=..."
        incapsulaCookie = visid + " " + incap;

        // Explicitly disconnect, also essential in this method!
        cookieConnection.disconnect();

    } catch (Exception e) {
        e.printStackTrace();
    }

    return incapsulaCookie;

}

This method extracts the Incapsula cookie for you. Here is a modified version of your method, which uses the cookie:

public static String readUrl(String url, String incapsulaCookie) {

    StringBuilder response = new StringBuilder();
    String USER_AGENT = "Mozilla/5.0", inputLine;
    BufferedReader in = null;

    try {

        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "text/html; charset=UTF-8");
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // Set the Incapsula cookie
        connection.setRequestProperty("Cookie", incapsulaCookie);

        in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }

        in.close();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null)
                in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    return response.toString();

}
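
The cookie-assembly step inside getIncapsulaCookie(...) can also be isolated into a small pure function, which makes it easy to check without touching the network. The following is a sketch only: the class and method names are hypothetical, and returning null when either cookie is missing is my own choice, not something the code above does.

```java
import java.util.List;

final class IncapsulaCookies {

    // Builds a Cookie request header from raw Set-Cookie values, keeping
    // only the name=value part of the cookies whose names start with
    // "visid_incap_" or "incap_ses_". Returns null if either is missing.
    static String toCookieHeader(List<String> setCookieValues) {
        String visid = null;
        String session = null;
        for (String raw : setCookieValues) {
            int end = raw.indexOf(';');
            String pair = (end >= 0 ? raw.substring(0, end) : raw).trim();
            if (pair.startsWith("visid_incap_")) {
                visid = pair;
            } else if (pair.startsWith("incap_ses_")) {
                session = pair;
            }
        }
        if (visid == null || session == null) {
            return null;
        }
        return visid + "; " + session;
    }
}
```

You could feed it the list stored under the Set-Cookie key of getHeaderFields() and pass the result to readUrl(...).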

As far as I have observed, the user agent and other attributes do not seem to matter. You can now call getIncapsulaCookie(String url) once, or whenever you need a new cookie, and then call readUrl(String url, String incapsulaCookie) as often as you like to request the page until the cookie expires. The result is the complete HTML page, as seen in this partial image:

enter image description here
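
The call pattern described above (one cookie, many reads, a fresh cookie once the error page comes back) could be wrapped roughly like this. It is a sketch under assumptions: CookieRefreshingReader is a hypothetical name, the three functions are supplied by the caller, and retrying exactly once is an arbitrary choice.

```java
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;

final class CookieRefreshingReader {

    private final Function<String, String> cookieSource;          // url -> cookie
    private final BiFunction<String, String, String> pageReader;  // (url, cookie) -> html
    private final Predicate<String> isErrorPage;                  // html -> blocked?
    private String cookie;                                        // cached between calls

    CookieRefreshingReader(Function<String, String> cookieSource,
                           BiFunction<String, String, String> pageReader,
                           Predicate<String> isErrorPage) {
        this.cookieSource = cookieSource;
        this.pageReader = pageReader;
        this.isErrorPage = isErrorPage;
    }

    // Reads the page with the cached cookie. If the error page comes back,
    // fetches a fresh cookie once and retries.
    String read(String url) {
        if (cookie == null) {
            cookie = cookieSource.apply(url);
        }
        String html = pageReader.apply(url, cookie);
        if (isErrorPage.test(html)) {
            cookie = cookieSource.apply(url);
            html = pageReader.apply(url, cookie);
        }
        return html;
    }
}
```

Assuming the two methods from this answer live in a class called Demo (a made-up name), you would construct it as new CookieRefreshingReader(Demo::getIncapsulaCookie, Demo::readUrl, html -> html.contains("Incapsula incident ID")).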

Important details: There are two essential commands in the getIncapsulaCookie(...) method, namely cookieConnection.setRequestProperty("Connection", "close"); and cookieConnection.disconnect();. Both are required if you want to call readUrl(...) immediately afterwards. If you omit these commands, the HTTP connection is kept alive on the server side after you have received the cookie, and the next call to readUrl(...) returns the wrong page. You can verify this by leaving out these commands, calling getIncapsulaCookie(...), waiting 5 to 65 seconds, and then calling readUrl(...). This also works, because the connection times out automatically. See also here.
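
Stripped of the cookie handling, the two commands amount to the following pattern. This is only a sketch with hypothetical names (OneShotConnection, open, statusOf), not a replacement for the method above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class OneShotConnection {

    // Prepares a request that asks the server not to keep the socket alive.
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(url).openConnection();
            con.setRequestProperty("Connection", "close");
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the request, reads the status code, and always releases the
    // connection so that the next request starts on a fresh socket.
    static int statusOf(String url) {
        HttpURLConnection con = open(url);
        try {
            return con.getResponseCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            con.disconnect();
        }
    }
}
```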

thatguy
  • Updated my answer with remarks to `HttpUrlConnection` and `URLConnection`. – thatguy Nov 21 '16 at 18:16
  • Updated my code, but still doesn't work. Any sample code ? – Frank Nov 21 '16 at 19:59
  • Solution with working code, but a strange encoding bug, added. It solves your problem and explains the cause. – thatguy Nov 22 '16 at 04:07
  • Thanks for the detailed answer, I tried your approach, but it only got the first part of the page, not the content with results, see my edited question. – Frank Nov 22 '16 at 17:09
  • I tested it again and I get the complete HTML website with all the results. The page starts with ` – thatguy Nov 22 '16 at 17:19
  • **Please pay attention to the last part of my answer**. If I do not copy and paste the cookie string, I always get ` – thatguy Nov 22 '16 at 17:24
  • I'm trying to do what you suggested, but just don't know how to "Cut & Paste" the cookie, because the program ran in one piece, saves the cookie to the variable, then pass it to another method, how do I manually paste it into the program ? – Frank Nov 22 '16 at 18:04
  • Write a separate program, with simply a `main` method, where you call `System.out.println(getIncapsulaCookie("YOUR_URL"))`. Copy the string from the console and paste it into your application like `readUrl(String url, "COOKIE_STRING_COPIED_FROM_CONSOLE")`. This is a really disappointing solution I know, because you have to do this each time the cookie expires. Sadly, I do not have a solution for this bug. – thatguy Nov 22 '16 at 18:27
  • OK, I'll try. Meanwhile, from what you said, I wonder whether we can figure out the encoding by comparing the encoding used in the cut-and-paste approach with the encoding used when passing the cookie as a parameter. Once we know that, we could apply that encoding and pass it as a parameter, something like this : readUrl(String url, URLEncoder.encode("COOKIE_STRING_COPIED_FROM_CONSOLE", "UTF-8")) ? – Frank Nov 22 '16 at 19:13
  • Forget the previous conversation. I have investigated the problem further and found the cause, which involves HTTP keep-alive. See my updated answer, both the code samples and parts of the description. According to my tests, it now runs without any copy-pasting, as you intended. You can copy it directly. – thatguy Nov 26 '16 at 00:29