
I use Jericho HTML Parser 3.1.

I need to extract text from HTML, process it, and then insert tags into the original HTML based on the result.

For this I need a mapping between the extracted text and the source HTML.

net.htmlparser.jericho.TextExtractor extracts text quite well, but I could not find a way to determine where a piece of extracted text is located in the original file.

Is it possible to do so with Jericho-html?

sergtk

1 Answer


You can't do this with TextExtractor as is. I've needed to do similar things in the past, and the simplest solution is to copy Jericho's TextExtractor implementation and edit it to add your own custom behaviour. It's a fairly simple class, so you should be able to see easily where to add your own hooks.
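
To illustrate the kind of bookkeeping such a hook would need, here is a minimal sketch in plain Java. It does not use any Jericho classes: `OffsetMappedText`, `extract` and `wrap` are hypothetical names invented for this example, and the tag stripping is deliberately naive (no entity decoding, no whitespace collapsing, no special handling of comments or scripts). The point is only that recording the source index of every extracted character lets you map a range of extracted text back to the original HTML.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Jericho): extracted text plus a map back to source offsets. */
final class OffsetMappedText {
    final String html;          // the original markup
    final String text;          // extracted text with tags removed
    final int[] sourceOffsets;  // sourceOffsets[i] is the index in html of text.charAt(i)

    private OffsetMappedText(String html, String text, int[] sourceOffsets) {
        this.html = html;
        this.text = text;
        this.sourceOffsets = sourceOffsets;
    }

    /** Naive extraction: drops everything between '<' and '>' and remembers where each kept char came from. */
    static OffsetMappedText extract(String html) {
        StringBuilder text = new StringBuilder();
        int[] offsets = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[text.length()] = i; // record the source position before appending
                text.append(c);
            }
        }
        return new OffsetMappedText(html, text.toString(), Arrays.copyOf(offsets, text.length()));
    }

    /** Wraps the extracted-text range [start, end) with the given tags in the original HTML. */
    String wrap(int start, int end, String open, String close) {
        int srcStart = sourceOffsets[start];
        int srcEnd = sourceOffsets[end - 1] + 1; // exclusive end in the source
        return html.substring(0, srcStart) + open
                + html.substring(srcStart, srcEnd) + close
                + html.substring(srcEnd);
    }
}
```

A usage example: for the input `<p>Hello <b>big</b> world</p>` the extracted text is `Hello big world`, and `wrap(10, 15, "<em>", "</em>")` puts the `<em>` tags around `world` in the original markup. Note that a range spanning a tag boundary would produce badly nested markup, so a real implementation has to split such ranges. In a modified TextExtractor, the same offset array would presumably be filled wherever the class appends text to its output.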

Wolfgang Fahl
Joel
  • Thanks, will try! Do you know of any other libraries that can do this? – sergtk Apr 07 '11 at 10:33
  • There's also Jericho's Renderer. You would have to modify it yourself as well, but its text formatting is much better (it includes bullets, spacing, links, etc., somewhat like the Lynx browser's HTML rendering). As for other libraries, I don't know of any, but if you just want simple text formatting with newlines in the appropriate places, you could write a basic implementation yourself using a DOM parser. Modifying TextExtractor or Renderer to do what you want will be faster, though, and you get the added benefit of Jericho's handling of badly formatted HTML. – Joel Apr 07 '11 at 10:42