0

I was wondering how I can search a website from Java. I want to first search for a word on the site. The site will then return some links. I want to follow these links, which will redirect me to another page, and finally get data from that page. I looked at jsoup for parsing the HTML page, but I don't know how to perform the search on the site and click the links using Java.

Ahmet Tanakol
  • Can you clarify more on the search functionality? What is the ideal source for links? Apache Lucene might be helpful for you. – amit Nov 13 '12 at 20:06
  • There is a search box on the website. I want to search for a drug's properties. When I type the drug name into the search box, it returns a link. When I click on this link, it displays the drug's properties on another HTML page. – Ahmet Tanakol Nov 13 '12 at 20:09
  • At first I thought I could get the HTML page by opening a URL connection, but I'm looking for easier and faster ways to do that. – Ahmet Tanakol Nov 13 '12 at 20:10
  • @AhmetTanakol You're not making a lot of sense. In the original question you seemed to refer to three pages (search page, search results page, page "redirected to"). You can omit the first page if you know how the search request is formulated, but that still leaves two pages, not one. But you're now talking as if there is only one page you have to get. – Robin Green Nov 13 '12 at 20:59

3 Answers

1

You need to make HTTP requests, just like a browser would. Use e.g. the Network panel in Google Chrome to see what HTTP requests Chrome makes when you do a search manually, ignore the ones that don't matter and write code to simulate the ones that do.
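As a rough sketch of simulating such a request, the class below builds and sends a search with the JDK's own java.net.http client (Java 11+). The endpoint https://example.com/search and the parameter name q are placeholders, not taken from any real site; the actual URL, method, and parameters have to be read from the Network panel.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SearchRequest {

    // Hypothetical endpoint and parameter name; replace with what the browser actually sends.
    static final String SEARCH_URL = "https://example.com/search";
    static final String QUERY_PARAM = "q";

    /** Builds the GET request a browser would send for the given search term. */
    static HttpRequest buildSearchRequest(String term) {
        // URL-encode the term so spaces and special characters survive in the query string
        String query = QUERY_PARAM + "=" + URLEncoder.encode(term, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(SEARCH_URL + "?" + query))
                .header("User-Agent", "Mozilla/5.0") // some sites reject requests without one
                .GET()
                .build();
    }

    /** Sends the request and returns the HTML of the search results page. */
    static String fetchResultsPage(String term) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildSearchRequest(term), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With these placeholder values, buildSearchRequest("aspirin 500 mg") targets https://example.com/search?q=aspirin+500+mg. If the site submits its search form with POST instead, the same builder accepts a POST body, but that depends on what the Network panel shows.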

To find the right search result to request ("click on"), you will also need something like jsoup to parse the results page.
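To illustrate what "finding the link" amounts to, here is a dependency-free sketch that pulls the href values out of a results page with a regular expression and resolves them against the page URL. This is a stand-in for illustration only: a real HTML parser such as jsoup copes with unquoted attributes, entities, and malformed markup, which this regex does not. The class and method names are made up for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResultLinks {

    // Matches href="..." or href='...' inside an <a ...> tag.
    // Simplification: unquoted attribute values and HTML entities are not handled.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the absolute URLs of all links in the page, resolved against baseUrl. */
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            // Relative links such as "/drug/aspirin" become absolute here
            links.add(base.resolve(m.group(2).trim()).toString());
        }
        return links;
    }
}
```

"Clicking" a result then just means issuing another GET request for one of the returned URLs and parsing that page in turn.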

You could use Selenium instead, but that would be ridiculously heavyweight, unless the site uses some complicated JavaScript or plugin to do the search, which is unlikely.

Robin Green
1

Take a look at this example. Download the latest HtmlUnit jar, create a new project, import the jar, and add the following class. I hope it helps you reach your objective.

package com.examples.htmlunit;

import java.io.IOException;
import java.net.URL;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.RefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;

public class YahooMail {

    public static void main(String[] args) throws Exception {

        // Create and initialize the WebClient object
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_2);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRefreshHandler(new RefreshHandler() {
            public void handleRefresh(Page page, URL url, int arg) throws IOException {
                System.out.println("handleRefresh");
            }
        });

        // Visit the Yahoo Mail login page and get the form object
        HtmlPage page = (HtmlPage) webClient.getPage("https://login.yahoo.com/config/login_verify2?.intl=us&.src=ym");
        HtmlForm form = page.getFormByName("login_form");

        // Enter login and passwd
        form.getInputByName("login").setValueAttribute("@@@@@@@");
        form.getInputByName("passwd").setValueAttribute("@@@@@@@");

        // Click the "Sign In" button/link
        page = (HtmlPage) form.getInputByValue("Sign In").click();

        // Click the "Inbox" link
        HtmlAnchor anchor = (HtmlAnchor) page.getHtmlElementById("WelcomeInboxFolderLink");
        page = (HtmlPage) anchor.click();

        // Get the table object containing the mails
        HtmlTable dataTable = (HtmlTable) page.getHtmlElementById("datatable");

        // Go through each row and count the rows with class=msgnew
        int newMessageCount = 0;
        List<HtmlTableRow> rows = dataTable.getRows();
        for (HtmlTableRow row : rows) {
            if ("msgnew".equals(row.getAttribute("class"))) {
                newMessageCount++;
            }
        }

        // Print the newMessageCount to screen
        System.out.println("newMessageCount = " + newMessageCount);

        // System.out.println(page.asXml());
    }
}
Sajid Hussain
0

First, you should become familiar with the HTTP protocol. Then it is a simple matter of programming your website as a socket server that, once a client connects, sends back data that makes sense. I've made a web server in Python using only the socket, os and sys libraries.

In the basic HTTP protocol, the client sends the server:

GET /path/file.extension HTTP/1.0   <- GET is the type of request, /path/file.extension is the file being requested, and HTTP/1.0 is the protocol
Host: yourwebsite.url   <- I don't believe this is needed (true for HTTP/1.0; HTTP/1.1 requires it)
User-Agent: HTTPTool/1.0   <- Identifies the client sending the HTTP request, like Chrome or Firefox
[blank]

The server would respond with something like:

HTTP/1.0 200 OK   <- Once again the protocol, then the status code and message (404 is Not Found, etc.)
Date: Mon, 19 Nov 2012 14:15:45 GMT   <- This isn't necessary, but you might as well include it
Content-Type: text/html   <- Type of content you're sending; HTML is text/html, and there are also types for images, zips, etc. Just google it (it's pretty simple)
Content-Length: 12313131   <- How long the data is, in bytes. This is NEEDED
[blank]
<html>
<head>
<h2>Hi</h2>
</head>
<body>
Welcome to my poop
</body>
</html>

Then, after the server has sent the data, it closes the socket. In Java, the byte length of a string is:
String blah = "foobar"; int length = blah.getBytes(StandardCharsets.UTF_8).length;
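The response layout described above could be assembled in Java roughly as follows. This is only a sketch: the class and method names are invented for illustration, and UTF-8 is an assumed encoding. The point it shows is that Content-Length must count the bytes of the encoded body, which differs from String.length() as soon as the body contains non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

class HttpResponseSketch {

    /** Builds a minimal HTTP/1.0 "200 OK" response carrying the given HTML body. */
    static byte[] buildResponse(String html) {
        // Encode the body first: Content-Length is a byte count, not a character count
        byte[] body = html.getBytes(StandardCharsets.UTF_8);

        // Header lines end with CRLF; an empty line separates headers from the body
        String head = "HTTP/1.0 200 OK\r\n"
                + "Content-Type: text/html; charset=utf-8\r\n"
                + "Content-Length: " + body.length + "\r\n"
                + "\r\n";
        byte[] headBytes = head.getBytes(StandardCharsets.US_ASCII);

        // Concatenate headers and body into one buffer ready to write to the socket
        byte[] response = new byte[headBytes.length + body.length];
        System.arraycopy(headBytes, 0, response, 0, headBytes.length);
        System.arraycopy(body, 0, response, headBytes.length, body.length);
        return response;
    }
}
```

The returned array can be written as-is to the OutputStream of an accepted java.net.Socket before closing the connection.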

For more information about Sockets in Java read this: http://docs.oracle.com/javase/tutorial/networking/sockets/index.html
After that, it is a matter of storing the words you want to look up in an array and handling the data sent to the client. You will also want to understand POST. Then all you do is get the file they want to see and give it to them, and when they search for something, look it up in a database and return either the link or an item-not-found message.

Brian Smith
  • Actually, I already did what you said. I opened TCP connections to read web sites byte by byte, but then I still have to parse the returned HTML page. I was hoping to find another solution. – Ahmet Tanakol Nov 13 '12 at 20:26
  • After reading further on what you want this to do, what you could do is simply run a Wikipedia search for the drug. http://en.wikipedia.org/w/api.php?action=opensearch&search= will return an array(?) of the possible links. Then en.wikipedia.org/wiki/ is the link to your page. I'd just loop through them all and link them. – Brian Smith Nov 13 '12 at 20:38
  • @AhmetTanakol If you have already done this you should have stated that in the question! – Robin Green Nov 13 '12 at 20:56