The question of how to extract text from HTML using Java has been viewed and duplicated a zillion times, e.g.: Text Extraction from HTML Java
Thanks to the answers found on Stack Overflow, I am currently using JSoup
<!-- Jsoup maven dependency -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.3</version>
</dependency>
and this piece of code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// parse the HTML from the given string
Document doc = Jsoup.parse(html);
// loop over the body element and all of its descendants
for (Element el : doc.select("body").select("*")) {
    // loop over the text nodes that are direct children of this element
    for (TextNode textNode : el.textNodes()) {
        // make sure there is some text other than whitespace
        if (textNode.text().trim().length() > 0) {
            // show the name of the containing element, the name of the
            // text node ("#text") and the text itself
            System.out.println(el.nodeName() + "." + textNode.nodeName() + ":" + textNode.text());
        }
    }
}
Now I'd also like to show the line number and the original HTML source that the text node at hand came from. I doubt JSoup can do this (e.g. see)
and trying a workaround like:
int pos = html.indexOf(textNode.outerHtml());
does not reliably find the original HTML, since JSoup's serialized output need not match the source character for character. So I assume I might have to switch to another library or approach. The question "Jericho-html: is it possible to extract text with reference to positions in source file?" has an answer that says "Jericho can do it", as the link above also points out, but a pointer to real working code is missing.
With Jericho I got as far as:
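Even in the cases where indexOf does find a match, the resulting character offset still has to be turned into a line number. A minimal sketch of that step in plain Java follows. The names LineLocator and lineOf are made up for illustration and are not part of JSoup or Jericho, and the sketch assumes that lines are separated by '\n'.

```java
public class LineLocator {

    /**
     * Returns the 1-based line number of the character at the given offset
     * by counting the '\n' characters that occur before it.
     * Hypothetical helper, not part of any library.
     */
    public static int lineOf(String source, int offset) {
        if (offset < 0 || offset > source.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (source.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<div>Some Text</div>\n</body>\n</html>";
        // offset of the text inside the original source
        int pos = html.indexOf("Some Text");
        System.out.println("line " + lineOf(html, pos)); // prints "line 3"
    }
}
```

This only helps once a trustworthy offset into the original source is available, which is exactly what the indexOf workaround fails to deliver reliably.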
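import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.TagType;

Source htmlSource = new Source(html);
boolean bodyFound = false;
// loop over all elements in document order
for (net.htmlparser.jericho.Element el : htmlSource.getAllElements()) {
    // start reporting once the body element has been reached
    if (el.getName().equals("body")) {
        bodyFound = true;
    }
    if (bodyFound) {
        TagType tagType = el.getStartTag().getTagType();
        // only consider elements that begin with a normal start tag
        if (tagType == StartTagType.NORMAL) {
            // text of this element including the text of all its descendants
            String text = el.getTextExtractor().toString();
            if (!text.trim().isEmpty()) {
                // character position of the element in the source
                int cpos = el.getBegin();
                System.out.println(el.getName() + "(" + tagType.toString() + ") line " + htmlSource.getRow(cpos) + ":" + text);
            }
        }
    }
}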
This is pretty good already, since it gives output like:
body(normal) line 91: Some Header. Some Text
div(normal) line 93: Some Header
div(normal) line 95: Some Text
The follow-up problem is that TextExtractor outputs the text of all subnodes recursively, so the same text shows up multiple times.
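To illustrate the kind of output I am after, here is a naive sketch that uses neither library. It scans the source once and reports each run of text between two tags exactly once, together with the line it starts on. This is a hand-written scanner, not JSoup's or Jericho's parser: it assumes well-formed markup with '\n' line endings, does not decode entities, and does not skip comments, script or style content. All names (TextRunScanner, TextRun, scan) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TextRunScanner {

    /** A run of text found between two tags, with the 1-based line it starts on. */
    public static final class TextRun {
        public final int line;
        public final String text;

        TextRun(int line, String text) {
            this.line = line;
            this.text = text;
        }

        @Override
        public String toString() {
            return "line " + line + ":" + text;
        }
    }

    /** Naive scan: everything from '<' to the next '>' is a tag, the rest is text. */
    public static List<TextRun> scan(String html) {
        List<TextRun> runs = new ArrayList<>();
        int line = 1; // line number of the character at pos
        int pos = 0;
        while (pos < html.length()) {
            int end;
            if (html.charAt(pos) == '<') {
                // skip the tag, up to and including the closing '>'
                int close = html.indexOf('>', pos);
                end = close < 0 ? html.length() : close + 1;
            } else {
                // collect text up to the next tag
                int open = html.indexOf('<', pos);
                end = open < 0 ? html.length() : open;
                String raw = html.substring(pos, end);
                String text = raw.trim();
                if (!text.isEmpty()) {
                    // report the line of the first non-whitespace character
                    int lead = raw.indexOf(text);
                    runs.add(new TextRun(line + countNewlines(raw, 0, lead), text));
                }
            }
            line += countNewlines(html, pos, end);
            pos = end;
        }
        return runs;
    }

    /** Counts the '\n' characters in s between from (inclusive) and to (exclusive). */
    private static int countNewlines(String s, int from, int to) {
        int count = 0;
        for (int i = from; i < to; i++) {
            if (s.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<body>\n<div>Some Header</div>\n<div>Some Text</div>\n</body>";
        for (TextRun run : scan(html)) {
            System.out.println(run);
        }
    }
}
```

For the small input in main this prints "line 2:Some Header" and "line 3:Some Text", each text once and in document order. It does not report the name of the enclosing element, and it is far too fragile for real-world HTML, which is why I am looking for a library-based solution.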
What would be a working solution that filters as well as the above JSoup solution (please note the correct order of text elements) but also shows source lines, as the above Jericho code snippet does?