I am able to fetch all the images and get their coordinates from the PDF using PDFBox. But when I parse the PDF using Tika server, I get only the text. How can I know where an image occurs, so that I can place the image exactly after the text that precedes it? I am using the code from the first answer to: extract images from pdf using pdfbox

I am using Tika server 1.7. I am taking the data of the PDF in the parser and using the plain text version. I just want to know how, while parsing, I can tell that an image has been encountered.

I got the HTML output using parseToHTML() from the examples at https://tika.apache.org/1.10/examples.html, but it still does not give me the images present in the PDF, nor does it give any tag for them.

  • Please mention what TIKA version you're using. See also https://issues.apache.org/jira/browse/TIKA-1396 – Tilman Hausherr Aug 08 '15 at 13:14
  • How are you calling Apache Tika? Are you asking Tika for the HTML version, or the Plain Text version? If the latter, what happens when you switch to the former? – Gagravarr Aug 09 '15 at 07:32
  • I am using Tika server 1.7. I am taking the data of the PDF in the parser and using the plain text version. I just want to know how, while parsing, I can tell that an image has been encountered. – deepak sharma Aug 10 '15 at 05:33
  • Try with Apache Tika 1.10, and fetching the HTML version. I think you should get an `<img>` tag in the place in the html where the image sits. Would that do you? – Gagravarr Aug 10 '15 at 06:11
  • Can you please suggest a method or some code to extract the HTML from the PDF file? I googled it but had no luck. – deepak sharma Aug 10 '15 at 09:44
  • "code to extract the html from the pdf file" - ??? PDF doesn't contain HTML, except maybe if an HTML file is embedded in a PDF. Or do you mean parse the HTML that is produced by TIKA? Or something else? – Tilman Hausherr Aug 10 '15 at 13:40
  • Yes, I am asking for code to get the HTML output produced by Tika. – deepak sharma Aug 11 '15 at 05:41
  • I got the HTML output using parseToHTML() from the examples at https://tika.apache.org/1.10/examples.html, but it still does not give me the images present in the PDF. – deepak sharma Aug 11 '15 at 08:22
  • If it doesn't work with tika 1.10, and if you don't get an answer here, then try in the TIKA user mailing list. https://mail-archives.apache.org/mod_mbox/tika-user/ Include a link to your file. – Tilman Hausherr Aug 11 '15 at 13:19
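
If the HTML output does end up containing an `<img>` tag where the image sits, as suggested in the comments above, the position of each image relative to the text could be recovered by walking that output in document order. The following is only a minimal sketch using the JDK's SAX parser, not Tika itself. The class name `ImagePositions`, the method `flatten`, the `TEXT:`/`IMAGE:` markers, and the sample markup (including the `embedded:image0.jpg` value) are all hypothetical and chosen for illustration; whether Tika actually emits such a tag for a given PDF and version is an assumption that has to be checked against real output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ImagePositions {

    /** Returns text runs and image markers in the order they appear in the XHTML. */
    static List<String> flatten(String xhtml) throws Exception {
        final List<String> items = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Emit the text collected so far as one item, then reset the buffer.
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) {
                    items.add("TEXT:" + t);
                }
                text.setLength(0);
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("img".equalsIgnoreCase(qName)) {
                    // Close the preceding text run so the image lands right after it.
                    flushText();
                    items.add("IMAGE:" + atts.getValue("src"));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("p".equalsIgnoreCase(qName)) {
                    flushText();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endDocument() {
                flushText();
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return items;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for the XHTML produced by the parser.
        String sample = "<html><body><div class=\"page\">"
                + "<p>Text before the image.</p>"
                + "<img src=\"embedded:image0.jpg\" alt=\"image0.jpg\"/>"
                + "<p>Text after the image.</p>"
                + "</div></body></html>";
        for (String item : flatten(sample)) {
            System.out.println(item);
        }
    }
}
```

Each `IMAGE:` entry then follows the text that precedes the image in the output, which is where an image extracted separately with PDFBox could be placed. This only works on well-formed XHTML; plain text output carries no such markers.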

0 Answers