I am trying to download the HTML content of a web page, but I get a 416 status. I found a workaround that changes the status code to 200, but the downloaded content is still not correct. I am very close but missing something. Please help.

Code with 416 status:

    public static void main(String[] args) {
        String URL="http://www.xyzzzzzzz.com.sg/";

        HttpClient client = new org.apache.commons.httpclient.HttpClient();
        org.apache.commons.httpclient.methods.GetMethod method = new org.apache.commons.httpclient.methods.GetMethod(URL);
        client.getHttpConnectionManager().getParams().setConnectionTimeout(AppConfig.CONNECTION_TIMEOUT);
        client.getHttpConnectionManager().getParams().setSoTimeout(AppConfig.READ_DATA_TIMEOUT);
        String html = null; InputStream ios = null;
        try {
            int statusCode = client.executeMethod(method);

            ios = method.getResponseBodyAsStream();
            html = IOUtils.toString(ios, "utf-8");
            System.out.println(statusCode);
        }catch (Exception e) {
            e.printStackTrace();
        } finally {
            if(ios!=null) {
                try {ios.close();} 
                catch (IOException e) {e.printStackTrace();}
            }
            if(method!=null) method.releaseConnection();
        }

        System.out.println(html);
    }

Code with 200 status (but the HTML content is not correct):

    public static void main(String[] args) {

        String URL="http://www.xyzzzzzzz.com.sg/";

        HttpClient client = new org.apache.commons.httpclient.HttpClient();
        org.apache.commons.httpclient.methods.GetMethod method = new org.apache.commons.httpclient.methods.GetMethod(URL);
        client.getHttpConnectionManager().getParams().setConnectionTimeout(AppConfig.CONNECTION_TIMEOUT);
        client.getHttpConnectionManager().getParams().setSoTimeout(AppConfig.READ_DATA_TIMEOUT);
        String html = null; InputStream ios = null;
        try {
            int statusCode = client.executeMethod(method);
            if(statusCode == HttpStatus.SC_REQUESTED_RANGE_NOT_SATISFIABLE) {
                method.setRequestHeader("User-Agent", "Mozilla/5.0");
                method.setRequestHeader("Accept-Ranges", "bytes=100-1500");
                statusCode = client.executeMethod(method);
            }
            ios = method.getResponseBodyAsStream();
            html = IOUtils.toString(ios, "utf-8");
            System.out.println(statusCode);
        }catch (Exception e) {
            e.printStackTrace();
        } finally {
            if(ios!=null) {
                try {ios.close();} 
                catch (IOException e) {e.printStackTrace();}
            }
            if(method!=null) method.releaseConnection();
        }

        System.out.println(html);
    }
Shashank

2 Answers

You can do this with a URLConnection.

See the post "Using java.net.URLConnection to fire and handle HTTP requests".
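
As a rough illustration of the URLConnection approach, here is a minimal sketch using only the JDK. The class name UrlConnectionFetch, the helper names readAll and fetch, the timeout values, and the UTF-8 assumption are all mine, not from the question; adjust them to your setup.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UrlConnectionFetch {

    // Reads a stream to the end and decodes the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Prints the status code and returns the response body.
    // Assumption: the page is UTF-8, as in the question's IOUtils call.
    static String fetch(String url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMs);
        conn.setReadTimeout(readTimeoutMs);
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try {
            int status = conn.getResponseCode();
            System.out.println(status);
            // For 4xx/5xx responses the body is exposed through the error stream.
            InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (body == null) {
                return "";
            }
            try (InputStream in = body) {
                return readAll(in, StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java UrlConnectionFetch <url>");
            return;
        }
        System.out.println(fetch(args[0], 10_000, 10_000));
    }
}
```

Run it with the URL as the only argument. If it also returns 416, the cause is probably not the HTTP library.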

sanket

Your first sample works for me without problems, and the second works if I remove the block that sets the headers:

if(statusCode == HttpStatus.SC_REQUESTED_RANGE_NOT_SATISFIABLE) {
    method.setRequestHeader("User-Agent", "Mozilla/5.0");
    method.setRequestHeader("Accept-Ranges", "bytes=100-1500");
    statusCode = client.executeMethod(method);
}

It's a bit strange; it may be a LAN configuration issue (firewall, proxy, etc.). In any case, HttpClient 3.1 is quite old. With HttpClient 4.x from Apache HttpComponents, the following code

import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class Snippet {

    public static void main(String[] args) {

        String url = "http://www.jobstreet.com.sg/";
        HttpClient client = new DefaultHttpClient();
        HttpGet get = new HttpGet(url);
        try {
            HttpResponse res = client.execute(get);
            System.out.println(res.getStatusLine().getStatusCode());
            System.out.println(IOUtils.toString(res.getEntity().getContent()));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}

works as expected.

Try HttpClient 4; if you still get the same error, the problem is not in your code.

vzamanillo
  • I am already using HttpClient 4. It's not working. Please suggest what else we can do. – Shashank Jan 01 '14 at 07:37
  • vzamanillo, thanks for replying. I am using html 4 only. Please suggest some other solution. – Shashank Jan 03 '14 at 09:37
  • It seems that some intermediary software, perhaps a proxy or firewall, is modifying your request. Are you on a LAN behind a proxy or firewall, or are you using a direct connection? Try debugging the HTTP request and response headers to find the Range header; more info at http://tools.ietf.org/html/rfc2616#section-10.4.17 – vzamanillo Jan 03 '14 at 16:57
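
The header debugging suggested in the last comment could look roughly like the sketch below. It uses only the JDK, and the names HeaderDump, format and dump are made up for illustration. Note that a 416 is the server's reply to a Range request header it cannot satisfy, whereas Accept-Ranges is a response header, so a Range header in the request that your code never set would point to a proxy or similar intermediary.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

class HeaderDump {

    // Formats a header map as "Name: value" lines. HttpURLConnection stores
    // the response status line under a null key, so that entry is printed bare.
    static String format(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String values = String.join(", ", e.getValue());
            if (e.getKey() == null) {
                sb.append(values).append('\n');
            } else {
                sb.append(e.getKey()).append(": ").append(values).append('\n');
            }
        }
        return sb.toString();
    }

    // Prints the explicitly set request headers, then the status and response headers.
    static void dump(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Must be read before connecting. Only headers set by this code are listed;
        // anything added later (by the JDK or by a proxy) is not visible here.
        String requestHeaders = format(conn.getRequestProperties());
        try {
            int status = conn.getResponseCode();
            System.out.println("Request headers set by this client:");
            System.out.print(requestHeaders);
            System.out.println("Status: " + status);
            System.out.println("Response headers:");
            // Look for Content-Range or Accept-Ranges in this output.
            System.out.print(format(conn.getHeaderFields()));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java HeaderDump <url>");
            return;
        }
        dump(args[0]);
    }
}
```

This only shows what the client sets and what comes back. To see what actually reaches the server after any proxy, you would need a capture on the server side or a packet sniffer.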