I need to extract all of the text from an HTML fragment.
Example:
INPUT: <p><div>how are</div> you doing?</p><p>I'm doing well</p>
OUTPUT: how are you doing? I'm doing well
I've found questions, such as this one (Text Extraction from HTML Java), that deal with similar problems, but the answers only strip the <p> tags and leave the inner elements in place.
Initially, I tried iterating over the children of each <p> tag and concatenating their contents, then recursively examining each grandchild and concatenating its children, and so on until only text was left. The issue is that some text isn't wrapped in a tag of its own and is just plain text sitting between elements.
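To make the recursive idea concrete, here is a minimal sketch of a walk that treats untagged text as nodes in its own right. It is not Jsoup: it uses the JDK's built-in XML DOM parser, so it assumes the fragment happens to be well-formed XML (true for the example above, often false for real-world HTML). The class name TextExtractionSketch, the helpers collectText and extractText, and the <root> wrapper element are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TextExtractionSketch {

    // Appends the content of every text node under `node`, in document order.
    // Text that is not wrapped in its own element still shows up as a
    // TEXT_NODE child of its parent, so it is collected like any other text.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // A space after each text node keeps adjacent blocks apart.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // ASSUMPTION: `fragment` is well-formed XML. It is wrapped in a
    // hypothetical <root> element so the parser sees a single root.
    static String extractText(String fragment) {
        try {
            String wrapped = "<root>" + fragment + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            // Collapse the runs of whitespace introduced by the separators.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p><div>how are</div> you doing?</p><p>I'm doing well</p>";
        System.out.println(extractText(html));
    }
}
```

The point of the sketch is only that the recursion has to iterate over child nodes rather than child elements. Jsoup draws a similar distinction (Element.childNodes() and TextNode versus Element.children()). One trade-off here: appending a space after every text node also splits words that span inline tags, such as <b>a</b>b.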
I've also tried Jsoup.parse(html).select("p").text(), but I get "[]I'm doing well" as the output.
This seems like a very common need for web crawlers and similar programs, but I can't find a solution.