Encounter images in between the text while parsing pdf using tika-server

Question

I am able to fetch all the images and get their coordinates from the PDF using PDFBox. But when I parse the PDF using Tika server, I get the text only. How can I tell where an image occurs, so that I can place the image exactly after the text that precedes it? I am using the code given in the first answer to: extract images from pdf using pdfbox

I am using Tika server 1.7. I am taking the data of the PDF from the parser and using the plain text version. I just want to know how, while parsing, I can tell that an image has been encountered.

I got the HTML output using parseToHTML() from the examples at https://tika.apache.org/1.10/examples.html, but this still does not give me the images present in the PDF, nor does it give any tag for them.

Please mention what TIKA version you're using. See also https://issues.apache.org/jira/browse/TIKA-1396 — Tilman Hausherr, Aug 08 '15 at 13:14
How are you calling Apache Tika? Are you asking Tika for the HTML version, or the Plain Text version? If the latter, what happens when you switch to the former? — Gagravarr, Aug 09 '15 at 07:32
I am using Tika server 1.7. I am taking the data of the PDF from the parser and using the plain text version. I just want to know how, while parsing, I can tell that an image has been encountered. — deepak sharma, Aug 10 '15 at 05:33
Try with Apache Tika 1.10, and fetching the HTML version. I think you should get an `<img>` tag in the place in the html where the image sits. Would that do you? — Gagravarr, Aug 10 '15 at 06:11
Can you please suggest some method or code to extract the html from the pdf file? I googled it but had no luck. — deepak sharma, Aug 10 '15 at 09:44
"code to extract the html from the pdf file" - ??? PDF doesn't contain HTML, except maybe if a HTML file is embedded in a PDF. Or do you mean parse the HTML that is produced by TIKA? Or something else? — Tilman Hausherr, Aug 10 '15 at 13:40
Yes, I am asking for code to get the HTML output produced by TIKA. — deepak sharma, Aug 11 '15 at 05:41
I got the HTML output using parseToHTML() from the examples at https://tika.apache.org/1.10/examples.html, but this still does not give me the images present in the PDF. — deepak sharma, Aug 11 '15 at 08:22
If it doesn't work with tika 1.10, and if you don't get an answer here, then try in the TIKA user mailing list. https://mail-archives.apache.org/mod_mbox/tika-user/ Include a link to your file. — Tilman Hausherr, Aug 11 '15 at 13:19
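
To illustrate the idea from the comments: if the XHTML that Tika returns does contain an `<img>` element at the spot where an image sits, then reading that XHTML in document order is enough to know which text comes before each image. Below is a minimal sketch of that step only, using the JDK's SAX parser. It does not call Tika itself; the class name `ImageMarkerScanner`, the method `interleave`, and the `TEXT:` / `IMAGE:` labels are made up for this illustration, and whether your Tika version actually emits `<img>` tags for a given PDF is an assumption you would have to verify.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: walks an XHTML string in document order and
// records text runs and <img> elements in the order they appear.
class ImageMarkerScanner {

    static List<String> interleave(String xhtml) throws Exception {
        final List<String> events = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Emit any text collected so far as one entry, then reset the buffer.
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) {
                    events.add("TEXT: " + t);
                }
                text.setLength(0);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("img".equals(qName)) {
                    // An image starts here: everything buffered so far precedes it.
                    flushText();
                    String src = atts.getValue("src");
                    events.add("IMAGE: " + (src == null ? "" : src));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Treat the end of a paragraph or div as the end of a text run.
                if ("p".equals(qName) || "div".equals(qName)) {
                    flushText();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endDocument() {
                flushText();
            }
        };

        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return events;
    }
}
```

For a hand-written input such as `<html><body><p>Before</p><img src="embedded:image0.png"/><p>After</p></body></html>` (the `src` value is a placeholder, not real Tika output), `ImageMarkerScanner.interleave(...)` returns the entries `TEXT: Before`, `IMAGE: embedded:image0.png`, `TEXT: After` in that order. Each `IMAGE:` entry could then be matched against the images extracted separately with PDFBox.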


0 Answers