I want to get the HTML source of https://www2.cslb.ca.gov/OnlineServices/CheckLicenseII/LicenseDetail.aspx?LicNum=872423. I am using the method below, but it does not return any HTML.

public static String getHTML(URL url) {
    HttpURLConnection conn; // The actual connection to the web page
    BufferedReader rd; // Used to read results from the web page
    String line; // An individual line of the web page HTML
    String result = ""; // A long string containing all the HTML
    try {
        conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        while ((line = rd.readLine()) != null) {
            result += line;
        }
        rd.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
Yatendra
  • `rd.readLine()` returns null even on the first call. – Yatendra Nov 15 '11 at 19:40
  • Can you provide us more context? What do you mean by "not getting the html source code"? – JavierIEH Nov 15 '11 at 19:41
  • @JavierIEH The method returns an empty string. – Yatendra Nov 15 '11 at 19:43
  • Have you tried `curl` (the command-line utility) to get the HTML? Some websites check whether the request is coming from a web browser. – gigadot Nov 15 '11 at 19:45
  • @gigadot I think the server can tell whether the request is coming from a browser by looking at the `User-Agent` HTTP header. Is there any other way it can check this? – Yatendra Nov 16 '11 at 18:37
  • `User-Agent` and `referer` are the headers servers most commonly use to check whether a request comes from a web browser, or whether a link was followed from the same host. `referer` is less reliable: some people block it from being sent because of privacy concerns. – gigadot Nov 16 '11 at 18:49

1 Answer

The server filters out Java's default User-Agent. This works:

public static String getHTML(URL url) {
    try {
        final URLConnection urlConnection = url.openConnection();
        // Override Java's default User-Agent, which the server rejects
        urlConnection.addRequestProperty("User-Agent", "Foo?");
        final InputStream inputStream = urlConnection.getInputStream();
        final String html = IOUtils.toString(inputStream);
        inputStream.close();
        return html;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

It looks like certain user agents are blacklisted. By default, my JDK sends:

User-Agent: Java/1.6.0_26

Note that I'm using the IOUtils class (from Apache Commons IO) to simplify the example, but the key thing is:

urlConnection.addRequestProperty("User-Agent", "Foo?");
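
For readers who would rather not add Commons IO, the same idea can be sketched with only the JDK. This is an illustration rather than the code from the answer: the class name `HtmlFetcher`, the helper names, the `USER_AGENT` value, and the choice of UTF-8 are all assumptions made for the example, and the server may apply other checks as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HtmlFetcher {

    // Hypothetical value: any User-Agent the server does not reject should do.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    /** Creates a connection (not yet connected) that carries an explicit User-Agent. */
    static URLConnection openWithUserAgent(String address, String userAgent) {
        try {
            URLConnection connection = new URL(address).openConnection();
            // Replaces the default "Java/<version>" header value.
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole stream into a String and closes it, even on failure. */
    static String readAll(InputStream in, Charset charset) {
        try (InputStream input = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches the page body; UTF-8 is an assumption about the page encoding. */
    static String getHTML(String address) {
        URLConnection connection = openWithUserAgent(address, USER_AGENT);
        try {
            return readAll(connection.getInputStream(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `HtmlFetcher.getHTML(...)` with the URL from the question should then send the custom header instead of the JDK default. Reading bytes rather than lines also keeps the original line breaks, which the `readLine()` loop in the question drops.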
Tomasz Nurkiewicz