56

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).

For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?

What would this process even be called? I have no idea where to even begin researching this.

Edit: Okay, I'm running a test for Jsoup (the one posted by BalusC), but I keep getting this error:

Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons

Ram kiran
James
  • 1
    You have a problem with LinkedList because LinkedList.peekFirst was added in Java 1.6, and you seem to be using an earlier version – zamza Aug 09 '11 at 22:49
  • 2
    This process is commonly called "screen scraping" and is used when an API (like SOAP) is not available but a web GUI is. It involves having your application pretend to be a web browser and parse the HTML pages (more or less) manually. I suggest you consider one of the APIs listed below that automate much of the parsing. – Chris Nava Sep 19 '11 at 19:22

10 Answers

99

Use an HTML parser like Jsoup. I prefer it over the other HTML parsers available in Java because it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to wrestle with the verbose Node and NodeList-like classes of the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file on the classpath):

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        // Fetch the page over HTTP and parse it into a Document.
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        // CSS selector: the post text inside the element with id "question".
        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        // Elements is Iterable, so an enhanced for loop works directly.
        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.

Community
BalusC
  • 2
    Wow, this is nice! I have a question though, I just copy and pasted this just to do a test run, but I keep getting this error(look at edited OP) – James May 14 '10 at 22:30
  • 2
    @James: This requires at least Java 1.6 (which has already been out for over 3 years). The mentioned [`LinkedList#peekFirst()`](http://java.sun.com/javase/6/docs/api/java/util/LinkedList.html#peekFirst%28%29) method was introduced in Java 1.6. Upgrade your JVM (JDK) or configure your IDE (Eclipse?) to Java 6 compliance mode. – BalusC May 14 '10 at 22:38
  • 9
    If any .NET programmers are interested, I've ported jsoup to .NET: http://nsoup.codeplex.com/. Hope this helps anyone. – GeReV May 31 '10 at 09:29
  • @BalusC, the example you gave made my day! I wasn't aware of this amazing library. I was going crazy with URL fetch... Jsoup is what I should have been looking for. Huge thanks! – Daniel Apr 13 '12 at 20:40
  • @user2602807: Jsoup is an HTML parser, not an HTML client. Just use an HTML client library. See also http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers – BalusC Sep 15 '14 at 20:16
  • Add the Gradle dependency: `implementation 'org.jsoup:jsoup:1.13.1'` – Wilmer Jul 13 '20 at 13:46
  • JSoup is a great solution. However, it takes a relatively long time because it has to read the webpage first. Is there a faster solution? – Cardinal System Feb 18 '21 at 03:53
  • @CardinalSystem: Yes, that was exactly what the OP asked. Jsoup however also just supports taking HTML code in a `String` variable as in `Document document = Jsoup.parse(html);`. See also its documentation. – BalusC Feb 18 '21 at 09:55
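
As a rough illustration of the `Jsoup.parse(html)` route mentioned above, applied to the title/price/description case from the question, here is a minimal sketch. The markup, the selectors (`h1.product-title`, `span.price`, `#description`) and the `ProductParser` class are all made up for the example; a real product page will need selectors found by inspecting its HTML. It assumes the Jsoup JAR is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ProductParser {

    // Hypothetical product-page markup; a real page's ids and classes will differ.
    static final String HTML =
        "<html><head><title>Example Widget - Shop</title></head><body>"
        + "<h1 class=\"product-title\">Example Widget</h1>"
        + "<span class=\"price\">$19.99</span>"
        + "<div id=\"description\"><p>A small widget.</p></div>"
        + "</body></html>";

    /** Returns {title, price, description} extracted with CSS selectors. */
    static String[] extract(String html) {
        // Parse HTML that is already in memory; no network access happens here.
        Document document = Jsoup.parse(html);

        // select(...) returns all matches; text() joins their visible text.
        String title = document.select("h1.product-title").text();
        String price = document.select("span.price").text();
        String description = document.select("#description").text();
        return new String[] { title, price, description };
    }

    public static void main(String[] args) {
        String[] fields = extract(HTML);
        System.out.println("Title: " + fields[0]);
        System.out.println("Price: " + fields[1]);
        System.out.println("Description: " + fields[2]);
    }
}
```

Swapping `Jsoup.parse(html)` for `Jsoup.connect(url).get()` would fetch a live page instead, at the cost of the network round trip discussed in the comments.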
10

This is referred to as screen scraping; Wikipedia has an article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.

sblundy
6

I would use JTidy. It is similar to JSoup, though I don't know JSoup well. JTidy handles broken HTML and returns a w3c Document, so you can use it as the source for an XSLT transformation that extracts the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as its Document model is nicer to work with than w3c.

EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting content from the document. This may be a lot easier to work with than getting into XSLT.
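
To give a feel for what working with a w3c Document looks like, here is a small sketch that uses only the JDK. It does not call JTidy itself: the `XHTML` string is a made-up stand-in for markup that has already been cleaned into well-formed XML, and it uses plain XPath (the expression language XSLT builds on) rather than a full stylesheet. The class name, element ids and class names are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class W3cDomExtract {

    // Hypothetical, already well-formed markup standing in for tidied HTML.
    static final String XHTML =
        "<html><body>"
        + "<h1 id=\"title\">Example Widget</h1>"
        + "<span class=\"price\">$19.99</span>"
        + "</body></html>";

    /** Parses well-formed XML text into a w3c Document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates an XPath expression and returns the first match's text. */
    static String extract(Document document, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, document);
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(XHTML);
        System.out.println("Title: " + extract(document, "//h1[@id='title']"));
        System.out.println("Price: " + extract(document, "//span[@class='price']"));
    }
}
```

Compared with CSS selectors, the XPath route is more verbose but needs no third-party parser once the markup is well-formed.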

mdma
4

You may use an html parser (many useful links here: java html parser).

The process is called 'grabbing website content'. Search 'grab website content java' for further investigation.

Roman
3

jsoup supports Java 1.5:

https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3

It looks like that stack trace was caused by a bug, which has been fixed.
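
Whichever jsoup build is used, it can help to confirm which Java runtime the program is actually launched with, since `LinkedList.peekFirst()` only exists from Java 1.6 onward. A minimal sketch follows; the `JavaVersionCheck` class and its `isAtLeastJava6` helper are hypothetical names, and the only JDK facilities used are the standard `java.specification.version` and `java.version` system properties.

```java
class JavaVersionCheck {

    /** Returns true when the specification version denotes Java 6 or later. */
    static boolean isAtLeastJava6(String specVersion) {
        // Older JVMs report "1.5", "1.6", ...; Java 9 and later report "9", "11", "17", ...
        String[] parts = specVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        if (major == 1 && parts.length > 1) {
            major = Integer.parseInt(parts[1]);
        }
        return major >= 6;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java specification version: " + spec);
        System.out.println("Full runtime version: " + System.getProperty("java.version"));
        if (!isAtLeastJava6(spec)) {
            System.out.println("LinkedList.peekFirst() is not available on this runtime.");
        }
    }
}
```

If the reported version is older than expected, the IDE or launch script is probably pointing at a different JVM than the one installed most recently.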

2

You could also try jARVEST.

It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.

Example:

Find all links inside a web page (wget and xpath are constructs of jARVEST's language):

wget | xpath('//a/@href')

Inside a Java program:

Jarvest jarvest = new Jarvest();
String[] results = jarvest.exec(
    "wget | xpath('//a/@href')", // robot!
    "http://www.google.com" // inputs
);
for (String s : results) {
    System.out.println(s);
}
t0mm13b
lipido
2

You'd probably want to look at the HTML to see if you can find strings that are unique and near your text, then you can use line/char-offsets to get to the data.

Could be awkward in Java, if there aren't any XML classes similar to the ones found in System.XML.Linq in C#.

Peter Mortensen
Kurru
2

My answer probably won't be useful to the writer of this question (I am 8 months late, so not the right timing I guess), but I think it will be useful for many other developers who come across this answer.

Today, I just released (on behalf of my company) a complete HTML-to-POJO framework that you can use to map HTML to any POJO class with just a few annotations. The library itself is quite handy and offers many other features, all while being very pluggable. You can have a look at it here: https://github.com/whimtrip/jwht-htmltopojo

How to use : Basics

Imagine we need to parse the following html page :

<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to :

public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class as well :

public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provided some more explanations on the above code on our github page.

For the moment, let's see how to scrape this.

private static final String MY_HTML_FILE = "my-html-file.html";

// main must declare IOException because getHtmlBody() can throw it
public static void main(String[] args) throws IOException {

    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If there were several restaurants on the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants, as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}

// Reads the whole HTML file into a String, decoding it as UTF-8
private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));
}

Another short example can be found here

Hope this will help someone out there!

Louis-wht
1

The JSoup solution is great, but if you only need to extract something really simple, it may be easier to use regex or String.indexOf.

As others have already mentioned, the process is called scraping.

Anton
  • Why would it be easier to use regex? I have tried regex, and it really can't handle real-life HTML; it's possibly dangerous to use it to parse HTML. Jsoup is an out-of-the-box solution: just a few lines of code and you can do whatever you need to do with your HTML. – newbie May 28 '10 at 18:42
  • Oversimplified example - Imagine all you want is to extract the date the page was generated. So you check the html and see something like `07/07/07`. Well, then I would use String.indexOf or some of my own utilities like textBetween("", ""). An added benefit is that you don't have to parse the whole html. I've had success extracting data from html with a home-grown StringScanner class with methods like moveBefore(String what), moveAfter(String what), getTextUpTo(String what), ... It all depends on how complicated your problem is. – Anton May 29 '10 at 19:11
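
A helper along the lines of the `textBetween` idea in the comment above might look like the sketch below. The three-argument signature, the `TextBetween` class name and the sample markup are assumptions for illustration, not the commenter's actual utility. It only works when the surrounding markers are stable and unique on the page, which is exactly the limitation of this approach.

```java
class TextBetween {

    /**
     * Returns the text between the first occurrence of 'before' and the next
     * occurrence of 'after', or null when either marker is missing.
     */
    static String textBetween(String source, String before, String after) {
        int start = source.indexOf(before);
        if (start < 0) {
            return null;
        }
        start += before.length();
        // Search for the closing marker only after the opening marker.
        int end = source.indexOf(after, start);
        if (end < 0) {
            return null;
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical snippet of page source containing a generation date.
        String html = "<p>Generated on <span id=\"gen\">07/07/07</span></p>";
        System.out.println(textBetween(html, "<span id=\"gen\">", "</span>"));
    }
}
```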
-1

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to 'scrape'. The request will return a string with the source code to the page. From there, you will use regex to parse whatever data you want from the source code. That's generally how you are going to do it.
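
The fetching half of this approach does not strictly need cURL bindings. As a sketch, assuming Java 11 or later, the JDK's own `java.net.http.HttpClient` can return the page source as a string; the `PageFetcher` class, its method names and the User-Agent value are made up for the example. The returned string is usually better handed to an HTML parser than to regex.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class PageFetcher {

    /** Builds a plain GET request for the given URL. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                // Some sites reject requests without a User-Agent; this value is arbitrary.
                .header("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)")
                .GET()
                .build();
    }

    /** Downloads the page and returns its source as a String. */
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("Usage: java PageFetcher <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters fetched from " + args[0]);
    }
}
```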

Nelson
  • 4
    [Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – BalusC May 14 '10 at 21:11