Text Extraction from HTML using Java including source line number and code

Question

The Question how to extract Text from HTML using Java has been viewed and duplicated a zillion times: Text Extraction from HTML Java

Thanks to the answers found on Stackoverflow my current state of affairs is that I am using JSoup

<!-- Jsoup maven dependency -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>

and this piece or code:

// parse the html from the givne string
Document doc = Jsoup.parse(html);
// loop over children elements of the body tag
for (Element el:doc.select("body").select("*")) {
  // loop over all textnodes of these children
  for (TextNode textNode:el.textNodes()) {
    // make sure there is some text other than whitespace
    if (textNode.text().trim().length()>0) {
        // show:
        //    the original node name
        //    the name of the subnode witht the text 
        //    the text 
        System.out.println(el.nodeName()+"."+textNode.nodeName()+":"+textNode.text());
    }
  }
}

Now I'd also like to show the line number and the original html source code the textNode at hand came from. I doubt JSoup can do this (e.g. see)

and trying a work around like:

int pos = html.indexOf(textNode.outerHtml());

does not reliably find the original html. So I assume I might have to switch to another Library or approach. Jericho-html: is it possible to extract text with reference to positions in source file? has an answer that says "Jericho can do it" as the link above also points out. But the pointer to real working code is missing.

Whith Jericho I got as far as:

Source htmlSource=new Source(html);
boolean bodyFound=false;
// loop over all elements
for (net.htmlparser.jericho.Element el:htmlSource.getAllElements()) {
    if (el.getName().equals("body")) {
        bodyFound=true;
    }
    if (bodyFound) {
        TagType tagType = el.getStartTag().getTagType();
        if (tagType==StartTagType.NORMAL) {
            String text=el.getTextExtractor().toString();
            if (!text.trim().equals("")) {
                int cpos = el.getBegin();               
                System.out.println(el.getName()+"("+tagType.toString()+") line "+   htmlSource.getRow(cpos)+":"+text);
            }
        } // if
    } // if
} // for

Which is pretty good already since it will give you output like:

body(normal) line 91: Some Header. Some Text
div(normal) line 93: Some Header
div(normal) line 95: Some Text

but now the followup problem is that TextExtractor outputs the whole text of all subnodes recursively so that text shows up multiple times.

What would be a working solution that filters as well as the above JSoup solution (please note the correct order of text elements) but shows source lines as the above Jericho Code snippet does?

score 2 · Answer 1 · edited Oct 06 '14 at 11:05

The feature that you need and jsoup is lacking is IMHO more difficult to implement. Go with Jericho and implement something like this, for finding the immediate text nodes.

package main.java.com.adacom.task;

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.EndTag;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.Tag;
import net.htmlparser.jericho.TagType;

public class MainParser {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String html = "<body><div>divtextA<span>spanTextA<p>pText</p>spanTextB</span>divTextB</div></body>";

        Source htmlSource=new Source(html);
        boolean bodyFound=false;
        // loop over all elements
        for (net.htmlparser.jericho.Element el:htmlSource.getAllElements()) {
            if (el.getName().equals("body")) {
                bodyFound=true;
            }
            if (bodyFound) {
                TagType tagType = el.getStartTag().getTagType();
                if (tagType==StartTagType.NORMAL) {
                    String text = getOwnTextSegmentsString(el);
                    if (!text.trim().equals("")) {
                        int cpos = el.getBegin();               
                        System.out.println(el.getName()+"("+tagType.toString()+") line "+   htmlSource.getRow(cpos)+":"+text);
                    }
                } // if
            } // if
        } // for

    }

    /**
     * this function is not used it's shown here only for reference
     */ 
    public static Iterator<Segment> getOwnTextSegmentsIterator(Element elem) {
        final Iterator<Segment> it = elem.getContent().getNodeIterator();
        final List<Segment> results = new LinkedList<Segment>();
        int tagCounter = 0;
        while (it.hasNext()) {
            Segment cur = it.next();            
            if(cur instanceof StartTag) 
                tagCounter++;
            else if(cur instanceof EndTag) 
                tagCounter--;

            if (!(cur instanceof Tag) && tagCounter == 0) {
                System.out.println(cur);
                results.add(cur);
            }
        }
        return results.iterator();
    }

    public static String getOwnTextSegmentsString(Element elem) {
        final Iterator<Segment> it = elem.getContent().getNodeIterator();
        StringBuilder strBuilder = new StringBuilder();
        int tagCounter = 0;
        while (it.hasNext()) {
            Segment cur = it.next();            
            if(cur instanceof StartTag) 
                tagCounter++;
            else if(cur instanceof EndTag) 
                tagCounter--;

            if (!(cur instanceof Tag) && tagCounter == 0) {
                strBuilder.append(cur.toString() + ' ');
            }
        }
        return strBuilder.toString().trim();
    }

}

I'd like to award the bounty to you for your effort. I am not clear whether your solution would find a text at the end of a body as in the JUnit Test of my answer. I'll check and get back. — Wolfgang Fahl, Oct 04 '14 at 05:49
I've upvoted your question after editing. Unfortunately it does not seem to return the text in the correct order. I have refactored it to implement package com.bitplan.texttools; import java.util.List; public interface TextExtractor { List extractTextSegments(String html); } and it would not create single segments but try to concatenate all text of an element. Thisway the text at the start of a body or div will be put together with the text at the end of a body or div and the text in the subnodes will follow later. It's not what the Jsoup filter above did. — Wolfgang Fahl, Oct 06 '14 at 11:02
I figured that it had bugs but as I mention, this is just an example, to get you going. I'm glad you found a solution. Have fun. — Alkis Kalogeris, Oct 06 '14 at 11:06

score 1 · Accepted Answer · answered Oct 04 '14 at 05:47

Here is a Junit Test testing the expected output and a Jericho based SourceTextExtractor that makes the JUnit Test work which is based on the original Jericho TextExtractor source code.

@Test
public void testTextExtract() {
    // https://github.com/paepcke/CorEx/blob/master/src/extraction/HTMLUtils.java
    String htmls[] = {
            "<!DOCTYPE html>\n" + "<html>\n" + "<body>\n" + "\n"
                    + "<h1>My First Heading</h1>\n" + "\n"
                    + "<p>My first paragraph.</p>\n" + "\n" + "</body>\n" + "</html>",
            "<html>\n"
                    + "<body>\n"
                    + "\n"
                    + "<div id=\"myDiv\" name=\"myDiv\" title=\"Example Div Element\">\n"
                    + "  <h5>Subtitle</h5>\n"
                    + "  <p>This paragraph would be your content paragraph...</p>\n"
                    + "  <p>Here's another content article right here.</p>\n"
                    + "</div>" + "\n" + "Text at end of body</body>\n" + "</html>" };
    int expectedSize[] = { 2, 4 };
    String expectedInfo[][]={
        { 
            "line 5 col 5 to  line 5 col 21: My First Heading",
            "line 7 col 4 to  line 7 col 23: My first paragraph."
        },
        { 
            "line 5 col 7 to  line 5 col 15: Subtitle",
            "line 6 col 6 to  line 6 col 55: This paragraph would be your content paragraph...",
            "line 7 col 6 to  line 7 col 48: Here's another content article right here.",
            "line 8 col 7 to  line 9 col 20: Text at end of body"
        }
    };
    int i = 0;
    for (String html : htmls) {
        SourceTextExtractor extractor=new SourceTextExtractor();
        List<TextResult> textParts = extractor.extractTextSegments(html);
        // List<String> textParts = HTMLCleanerTextExtractor.extractText(html);
        int j=0;
        for (TextResult textPart : textParts) {
            System.out.println(textPart.getInfo());
            assertTrue(textPart.getInfo().startsWith(expectedInfo[i][j]));
            j++;
        }
        assertEquals(expectedSize[i], textParts.size());
        i++;
    }
}

This is an adapted TextExtractor see http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source

/**
 * TextExtractor that makes source line and col references available
 * http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
 */
public class SourceTextExtractor {

    public static class TextResult {
        private String text;
        private Source root;
        private Segment segment;
        private int line;
        private int col;

        /**
         * get a textResult
         * @param root
         * @param segment
         */
        public TextResult(Source root,Segment segment) {
            this.root=root;
            this.segment=segment;
            final StringBuilder sb=new StringBuilder(segment.length());
            sb.append(segment);
            setText(CharacterReference.decodeCollapseWhiteSpace(sb));
            int spos = segment.getBegin();  
            line=root.getRow(spos);
            col=root.getColumn(spos);

        }

        /**
         * gets info about this TextResult
         * @return
         */
        public String getInfo() {
            int epos=segment.getEnd();

            String result=
                    " line "+   line+" col "+col+
                    " to "+
                    " line "+   root.getRow(epos)+" col "+root.getColumn(epos)+
                    ":"+getText();
            return result;
        }

        /**
         * @return the text
         */
        public String getText() {
            return text;
        }

        /**
         * @param text the text to set
         */
        public void setText(String text) {
            this.text = text;
        }

        public int getLine() {
            return line;
        }

        public int getCol() {
            return col;
        }

    }

    /**
     * extract textSegments from the given html
     * @param html
     * @return
     */
    public List<TextResult> extractTextSegments(String html) {
        Source htmlSource=new Source(html);
        List<TextResult> result = extractTextSegments(htmlSource);
        return result;
    }

    /**
     * get the TextSegments from the given root segment
     * @param root
     * @return
     */
    public List<TextResult> extractTextSegments(Source root) {
        List<TextResult> result=new ArrayList<TextResult>();
        for (NodeIterator nodeIterator=new NodeIterator(root); nodeIterator.hasNext();) {
            Segment segment=nodeIterator.next();
            if (segment instanceof Tag) {
                final Tag tag=(Tag)segment;
                if (tag.getTagType().isServerTag()) {
                    // elementContainsMarkup should be made into a TagType property one day.
                    // for the time being assume all server element content is code, although this is not true for some Mason elements.
                    final boolean elementContainsMarkup=false;
                    if (!elementContainsMarkup) {
                        final net.htmlparser.jericho.Element element=tag.getElement();
                        if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                    }
                    continue;
                }
                if (tag.getTagType()==StartTagType.NORMAL) {
                    final StartTag startTag=(StartTag)tag;
                    if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                        nodeIterator.skipToPos(startTag.getElement().getEnd());
                        continue;
                    }

                }
                // Treat both start and end tags not belonging to inline-level elements as whitespace:
                if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                    // sb.append(' ');
                }
            } else {
                if (!segment.isWhiteSpace())
                    result.add(new TextResult(root,segment));
            }
        }
        return result;
    }

    /**
     * extract the text from the given segment
     * @param segment
     * @return
     */
    public String extractText(net.htmlparser.jericho.Segment pSegment) {

        // http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
        // this would call the code above
        // String result=segment.getTextExtractor().toString();
        final StringBuilder sb=new StringBuilder(pSegment.length());
        for (NodeIterator nodeIterator=new NodeIterator(pSegment); nodeIterator.hasNext();) {
            Segment segment=nodeIterator.next();
            if (segment instanceof Tag) {
                final Tag tag=(Tag)segment;
                if (tag.getTagType().isServerTag()) {
                    // elementContainsMarkup should be made into a TagType property one day.
                    // for the time being assume all server element content is code, although this is not true for some Mason elements.
                    final boolean elementContainsMarkup=false;
                    if (!elementContainsMarkup) {
                        final net.htmlparser.jericho.Element element=tag.getElement();
                        if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                    }
                    continue;
                }
                if (tag.getTagType()==StartTagType.NORMAL) {
                    final StartTag startTag=(StartTag)tag;
                    if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                        nodeIterator.skipToPos(startTag.getElement().getEnd());
                        continue;
                    }

                }
                // Treat both start and end tags not belonging to inline-level elements as whitespace:
                if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                    sb.append(' ');
                }
            } else {
                sb.append(segment);
            }
        }
        final String result=net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(sb);
        return result;
    }
}

Text Extraction from HTML using Java including source line number and code

2 Answers2