Text Extraction from HTML Java

Question

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

while ((s = buffRd.readLine()) != null) {
    if(s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}

I was trying to add another while loop that would tell the program to keep writing to the file until the line contains the </p> tag:

while ((s = buffRd.readLine()) != null) {
    if(s.contains("<p>")) {
        while(!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}

But this doesn't work. Could someone please help?

We definitely are seeing a bug in SO's escaping of HTML tags. — Yishai, Sep 06 '09 at 16:55

score 32 · Answer 1 · edited Jun 26 '14 at 06:45

jsoup

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");

Then write it out to a file in one more line:

out.write(ps.text());  //it will append all of the p elements together in one long string

Or, if you want them on separate lines, you can iterate through the elements and write them out separately.

answered Apr 23 '12 at 14:04 by Danny · edited Jun 26 '14 at 06:45 by Basil Bourque

If a document doesn't use `p` tags (non-semantic markup), I assume this won't work – sinθ Jun 14 '14 at 23:36
@sinθ The Question explicitly asked for `p` elements. This answer is spot-on correct. – Basil Bourque Jun 26 '14 at 06:49
Thanks @Danny, I ♥ this soup ! – frogatto Jan 06 '15 at 17:02

score 10 · Answer 2 · answered Sep 06 '09 at 17:02

Jericho is one of several possible HTML parsers that could make this task both easy and safe.

answered Sep 06 '09 at 17:02 by Gareth Davis

skaffman · Answer 3 · 2009-09-06T17:14:13.610

score 4

JTidy can represent an HTML document (even a malformed one) as a document model, making the process of extracting the contents of a <p> tag a rather more elegant process than manually thunking through the raw text.

answered Sep 06 '09 at 17:08 by skaffman · edited Sep 06 '09 at 17:14

score 0 · Answer 4 · answered Sep 06 '09 at 17:32

I've had success using TagSoup & XPath to parse HTML.

http://home.ccil.org/~cowan/XML/tagsoup/

answered Sep 06 '09 at 17:32 by Billy Bob Bain

score 0 · Answer 5 · answered Sep 06 '09 at 22:04

Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. Simple example:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackTest extends HTMLEditorKit.ParserCallback
{
    private int tabLevel = 1;
    private int line = 1;

    public void handleComment(char[] data, int pos)
    {
        displayData(new String(data));
    }

    public void handleEndOfLineString(String eol)
    {
        System.out.println( line++ );
    }

    public void handleEndTag(HTML.Tag tag, int pos)
    {
        tabLevel--;
        displayData("/" + tag);
    }

    public void handleError(String errorMsg, int pos)
    {
        displayData(pos + ":" + errorMsg);
    }

    public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData("mutable:" + tag + ": " + pos + ": " + a);
    }

    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + "::" + a );
//      tabLevel++;
    }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + ":" + a );
        tabLevel++;
    }

    public void handleText(char[] data, int pos)
    {
        displayData( new String(data) );
    }

    private void displayData(String text)
    {
        for (int i = 0; i < tabLevel; i++)
            System.out.print("\t");

        System.out.println(text);
    }

    public static void main(String[] args)
    throws IOException
    {
        ParserCallbackTest parser = new ParserCallbackTest();

        // args[0] is the file to parse

        Reader reader = new FileReader(args[0]);
//      URLConnection conn = new URL(args[0]).openConnection();
//      Reader reader = new InputStreamReader(conn.getInputStream());

        try
        {
            new ParserDelegator().parse(reader, parser, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}

So all you need to do is set a boolean flag when the paragraph tag is found. Then in the handleText() method you extract the text.
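
A minimal sketch of that flag-based approach, assuming only the JDK's Swing HTML parser. The class name ParagraphCollector and the helper extractFromString are made up for illustration; the overridden callbacks are the same ones used in the example above. Text is collected only while the flag is set:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class: collects the text of each <p> element as one string.
class ParagraphCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false;  // the flag: true between <p> and its end tag

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.P) {
            inParagraph = true;
            current.setLength(0);  // start a fresh paragraph buffer
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            inParagraph = false;
            paragraphs.add(current.toString().trim());
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inParagraph) {
            current.append(data);  // keep text only while inside a paragraph
        }
    }

    // Parses the HTML from the reader and returns the paragraph texts in document order.
    static List<String> extract(Reader reader) throws IOException {
        ParagraphCollector collector = new ParagraphCollector();
        new ParserDelegator().parse(reader, collector, true);
        return collector.paragraphs;
    }

    // Convenience wrapper for in-memory HTML (hypothetical helper).
    static List<String> extractFromString(String html) {
        try {
            return extract(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
                + "<p>First line\nsecond line</p><p>Other</p></body></html>";
        for (String paragraph : extractFromString(html)) {
            System.out.println(paragraph);
        }
    }
}
```

Each paragraph ends up as a single string, so writing one paragraph per line to the output file is just a loop over the returned list.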

score 0 · Answer 6 · answered Jun 20 '13 at 05:33

Try this.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParagraphPrinter {  // class name is illustrative
    public static void main(String[] args) {
        String url = "http://en.wikipedia.org/wiki/Big_data";

        try {
            // Download and parse the page, then select every <p> element.
            Document document = Jsoup.connect(url).get();
            Elements paragraphs = document.select("p");

            // Print the text of each paragraph on its own line.
            for (Element p : paragraphs) {
                System.out.println("*  " + p.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

What is this 'Element' and 'Document'. Is this any third party parser? Show the import lines too — James, Aug 29 '17 at 06:34

Niall · Answer 7 · 2009-09-06T17:08:37.380

Try (if you don't want to use an HTML parser library):


        FileReader fileReader = new FileReader(file);
        BufferedReader buffRd = new BufferedReader(fileReader);
        BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
        String s;
        int writeTo = 0;  // 1 while inside a paragraph, 0 otherwise
        while ((s = buffRd.readLine()) != null)
        {
            if (s.contains("<p>"))
            {
                writeTo = 1;

                try
                {
                    out.write(s);
                }
                catch (IOException e)
                {
                    // write errors are ignored here
                }
            }
            if (s.contains("</p>"))
            {
                writeTo = 0;

                try
                {
                    out.write(s);
                }
                catch (IOException e)
                {
                    // write errors are ignored here
                }
            }
            else if (writeTo == 1)
            {
                try
                {
                    out.write(s);
                }
                catch (IOException e)
                {
                    // write errors are ignored here
                }
            }
        }
        out.close();  // flush the buffered output to the file

What happens if the `<p>` and `</p>` are on the same line? In this case the string will be written out twice. I guess it really depends on the input. — pjp, Sep 06 '09 at 17:13
You could add some state to see if you have already written out the line before writing it out again. — pjp, Sep 06 '09 at 17:21
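
One way to add that state, sketched with plain JDK I/O: track whether the reader is currently inside a paragraph and decide once per line whether to write it, so a line holding both `<p>` and `</p>` is written only once. The class and method names here are hypothetical, and like the code above it only recognizes a literal `<p>` with no attributes:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: copies every line that belongs to a <p>...</p> block.
class ParagraphLineFilter {

    static void copyParagraphLines(BufferedReader in, Writer out) throws IOException {
        boolean inParagraph = false;  // true while an opened <p> has not been closed yet
        String line;
        while ((line = in.readLine()) != null) {
            int lastOpen = line.lastIndexOf("<p>");
            int lastClose = line.lastIndexOf("</p>");

            // Write the line at most once, whether it opens, continues, or closes a paragraph.
            if (inParagraph || lastOpen >= 0) {
                out.write(line);
                out.write(System.lineSeparator());
            }

            // Update the state from whichever tag appears last on this line.
            if (lastOpen >= 0 || lastClose >= 0) {
                inParagraph = lastOpen > lastClose;
            }
        }
    }

    // Convenience wrapper for in-memory input.
    static String filter(String html) {
        StringWriter sink = new StringWriter();
        try {
            copyParagraphLines(new BufferedReader(new StringReader(html)), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(filter("<p>one</p>\n<div>skip</div>\n<p>two\nthree</p>\n"));
    }
}
```

Comparing the last positions of the two tags keeps the state correct when a line closes one paragraph and opens the next. For anything beyond simple, well-behaved markup, one of the parser libraries mentioned in the other answers is still the safer route.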

score -3 · Answer 8 · answered Sep 06 '09 at 17:14

You may just be using the wrong tool for the job:

perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt

answered Sep 06 '09 at 17:14 by brianary

That's a fair cop. Kind of a late hit, though. – brianary Dec 26 '09 at 02:09
