Jsoup: Extract text as a human would read it

Question

I need to extract all of the text from an HTML fragment.

Example:

INPUT: <p><div>how are</div> you doing?</p><p>I'm doing well</p>

OUTPUT: how are you doing? I'm doing well

I've found questions that deal with similar problems, such as "Text Extraction from HTML Java", but their answers only remove the <p> tags and leave the inner elements in place.

Initially, I tried looping through the children of each <p> tag and concatenating their contents, recursively examining each grandchild and its children until only text was left. The issue is that some text isn't wrapped in a tag and is just plain text.

I've also tried Jsoup.parse(html).select("p").text(), but I get "[]I'm doing well" as the output.

This seems like a very common need for web-crawler type programs, but I can't find a solution.

This is something of an abuse of HTML. A child element implies related but separate content. Hence why you're struggling to find a similar solution. What you need to implement is a recursive solution, to ensure that all children of the current element are parsed before moving into the next one. — christopher, Jun 15 '14 at 00:12
@christopher Ohhhh... By accident, I typed in `div` instead of `span` when I was writing my unit test (which I subsequently copied to this question). If I replace it with `span` tags, it works now. Sorry to waste everyone's time. — sinθ, Jun 15 '14 at 00:15
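
The recursive traversal suggested in the comments can be sketched as follows. This is a minimal illustration, not jsoup code: `Node`, `BLOCK_TAGS`, `collect`, `extractText` and `sampleTree` are hypothetical names, and the tree is built by hand to mirror the nesting as written in the question (a real HTML parser may restructure a `div` placed inside a `p`). The point it shows is that bare text has to be modelled as a node of its own, so the walk visits it alongside the elements; in jsoup the comparable building blocks would be `Element.childNodes()` and `TextNode`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical minimal node model for illustration; these are NOT jsoup classes.
class TextExtractSketch {

    static final class Node {
        final String tag;          // element name, or null for a text node
        final String text;         // text content, or null for an element
        final List<Node> children; // empty for a text node

        private Node(String tag, String text, List<Node> children) {
            this.tag = tag;
            this.text = text;
            this.children = children;
        }

        static Node text(String text) {
            return new Node(null, text, Collections.<Node>emptyList());
        }

        static Node element(String tag, Node... children) {
            return new Node(tag, null, Arrays.asList(children));
        }
    }

    // Assumption: only these tags count as block-level in this sketch.
    static final List<String> BLOCK_TAGS = Arrays.asList("p", "div");

    // Depth-first walk in document order: text nodes are appended as-is,
    // elements are recursed into, and block elements end with a space.
    static void collect(Node node, StringBuilder out) {
        if (node.text != null) {
            out.append(node.text);
            return;
        }
        for (Node child : node.children) {
            collect(child, out);
        }
        if (BLOCK_TAGS.contains(node.tag)) {
            out.append(' ');
        }
    }

    // Collects all text, then collapses runs of whitespace into single spaces.
    static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, out);
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // The question's fragment, nested exactly as it was typed.
    static Node sampleTree() {
        return Node.element("body",
            Node.element("p",
                Node.element("div", Node.text("how are")),
                Node.text(" you doing?")),
            Node.element("p", Node.text("I'm doing well")));
    }

    public static void main(String[] args) {
        // prints: how are you doing? I'm doing well
        System.out.println(extractText(sampleTree()));
    }
}
```

Appending a space only after block-level elements keeps inline content such as `a<span>b</span>c` glued together, while paragraphs and divs still end up separated by a single space after whitespace is collapsed.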

score 2 · Answer 1 · answered Jun 15 '14 at 00:14

Try this:

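import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
Document doc = Jsoup.parse("<p><div>how are</div> you doing?</p><p>I'm doing well</p>");
String body = doc.body().text(); // combined text of everything inside <body>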

Answered by Jeeshu Mittal

`doc.body().text()` only strips `span` tags when it compiles the text; it doesn't if a `div` tag is used. – sinθ Jun 15 '14 at 00:17
That's a really nice solution. +1 from me! – christopher Jun 15 '14 at 08:45
