
I am writing a small Java program to get the number of results for a given Google search term. For some reason, I get a 403 Forbidden in Java, but the same URL returns the right results in a web browser. Code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;


public class DataGetter {

    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL("https://www.google.com/search?q=" + query).openConnection()
                .getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }

}

And the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at DataGetter.getResultAmount(DataGetter.java:15)
    at DataGetter.main(DataGetter.java:10)

Why is it doing this?

tckmn
  • @Perception um... what's an SSL endpoint? (sorry I'm clueless about this kind of stuff) – tckmn Dec 02 '12 at 15:38
  • 2
    SSL (Secure Sockets Layer) is a method of securing data passed back and forth between a client and a server. An SSL endpoint is a regular URL, but with https instead of http. Using SSL is more complicated than plain HTTP because the client and server need to perform a handshake, which is unnecessary in your case, since you can just use the 'normal' http endpoint for Google (http://www.google.com/search) – Perception Dec 02 '12 at 15:42
  • @Perception if I use normal http:// the same thing happens – tckmn Dec 02 '12 at 15:54
  • Add the query you are working with to the question. – Perception Dec 02 '12 at 15:58

4 Answers

116

You just need to set a User-Agent header for it to work:

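// Needs: java.io.BufferedReader, java.io.InputStreamReader, java.net.URL,
// java.net.URLConnection, java.nio.charset.Charset
URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
// Request headers must be set before the connection is opened
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
// Keep the page source in a variable so it can be searched later
String response = sb.toString();
System.out.println(response);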

SSL was handled transparently for you, as the stack trace of your exception shows.

Getting the result count is not quite this simple, though. After this request you have to pretend to be a browser: fetch the cookie from the response, parse the redirect link out of the page, and request that link with the cookie set.

// Needs: java.util.regex.Matcher, java.util.regex.Pattern
// response holds the page source read so far (sb.toString())
String cookie = connection.getHeaderField("Set-Cookie").split(";")[0];
// Find the meta-refresh redirect URL in the first response
Pattern pattern = Pattern.compile("content=\"0;url=(.*?)\"");
Matcher m = pattern.matcher(response);
if (m.find()) {
    String url = m.group(1);
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie);
    connection.connect();
    r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();
    // Extract the count from the result statistics element
    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if (m.find()) {
        long amount = Long.parseLong(m.group(1).replace(",", ""));
        return amount;
    }
}

Running the full code I get 2930000000L as a result.
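
The counting step can be tried on its own, without any network access. This is a minimal sketch, assuming the page still contains a `resultStats` div in the form the regex above expects (Google's markup may well have changed since). The names `ResultStatsParser` and `parseResultCount` are hypothetical, chosen only for illustration.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultStatsParser {
    // Same pattern as in the answer; assumes the markup is unchanged.
    private static final Pattern RESULT_STATS =
            Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");

    // Returns the result count, or an empty value if the element is missing.
    static OptionalLong parseResultCount(String html) {
        Matcher m = RESULT_STATS.matcher(html);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Strip the thousands separators before parsing.
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        // Hand-written sample input, not a real Google response.
        String sample = "<div id=\"resultStats\">About 2,930,000,000 results</div>";
        System.out.println(parseResultCount(sample));
    }
}
```

Returning an `OptionalLong` rather than a bare `long` makes the "element not found" case explicit, which matters here because the page layout is outside your control.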

Esailija
  • Dude, I owe you a keg of beer, this is such a perfect solution to my problem! Can google restrict/throttle your results using this method? – benscabbia Mar 28 '15 at 21:25
  • @gudthing throttling is ip-based, so it's not about the method but whether you change your ip :-) – Esailija Mar 29 '15 at 00:29
  • I see! A simple router restart (for WAN change) will solve the problem :). Thanks again!! – benscabbia Mar 29 '15 at 08:26
  • connection.connect(); will throw exception "already connected" – Java Main May 20 '18 at 22:57
  • @Esailija What should the variable `response` contain? – Harshita Sethi Jun 19 '18 at 13:43
  • The full code link is dead. Can it be re-hosted on a service without expirations? – killjoy Mar 14 '19 at 06:28
  • This made my day. Now I know why my HTTP URL was not working when calling a web API. This is very useful for me on Android 9 and Android 10: connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"); – Nikunjkumar Kapupara Aug 14 '20 at 07:11
  • Amazon Cloudfront httpcon.addRequestProperty("Accept-Encoding", "gzip, deflate, br"); – XX Terror Sep 05 '20 at 15:55
5

For me it worked by adding the header: "Accept": "*/*"

rpajaziti
2

You probably aren't setting the correct headers. Use LiveHttpHeaders (or equivalent) in the browser to see what headers the browser is sending, then emulate them in your code.
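
One way to apply a set of browser-like headers is sketched below. The header values are illustrative assumptions, not a list Google is known to require; copy the real ones from what your browser sends. `BrowserHeaders`, `defaults` and `openWithHeaders` are hypothetical names. Note that `setRequestProperty` has to be called before the connection is actually opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserHeaders {
    // Example values only (assumption); replace with the headers your browser sends.
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 "
                + "(KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    // Creates the connection object and sets the headers on it.
    // No request is sent until connect() or getInputStream() is called.
    static URLConnection openWithHeaders(String url, Map<String, String> headers) {
        try {
            URLConnection connection = new URL(url).openConnection();
            for (Map.Entry<String, String> header : headers.entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `BrowserHeaders.openWithHeaders("https://www.google.com/search?q=test", BrowserHeaders.defaults())` returns a connection that can then be read with `getInputStream()` as in the question.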

Ivan
Kevin Day
  • I tried `"https://www.google.com/search?q=" + query + "&rlz=1C1RNNN_enUS371&aq=f&oq=" + query + "&sugexp=chrome,mod=6&sourceid=chrome&ie=UTF-8"`, still didn't work – tckmn Dec 02 '12 at 15:32
  • 1
    @PicklishDoorknob you added a query string parameter, you didn't change the headers. You can set headers with `.setRequestProperty()` on the `URLConnection` object – Esailija Dec 02 '12 at 16:28
  • Here's an SO article that talks about adding request headers: http://stackoverflow.com/questions/480153/how-to-modify-the-header-of-a-httpurlconnection – Kevin Day Dec 02 '12 at 19:58
0

It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to the actual security.

  • 1
    No it isn't, it works just by emulating browser http headers like @KevinDay said in his answer. – Esailija Dec 02 '12 at 16:24
  • 3
    @Ben Brunk - there is a good lesson here - at the core, all of programming is built up of layer upon layer of additional abstraction. Understanding the low level is super useful. Using a higher level client like you describe might work - but only because it's making a low level call that you yourself could make if you choose to. I will never forget how illuminating it was for me to sit down and interact with a web server using a telnet client and crafting the HTTP request by hand. Cheerio! – Kevin Day Dec 02 '12 at 20:02
  • Actually, I'm still not sure why that code worked because you typically have to add the site's public certificate to your local Java keystore in order to use SSL like that, even with URLConnection, so something doesn't add up about that URL. Also, what makes you think I never connected to a website using telnet? I do this for a living and I often forget there are a lot of people on this site who are students or hobby programmers. I just try to be helpful. –  Dec 03 '12 at 01:09
  • If the site uses a certificate that has a trust chain to a CA that is included with Java in its cacerts truststore (located in jdk\jre\lib\security), then explicitly adding the site's certificate is not needed. – user472749 Feb 10 '17 at 16:16