Questions tagged [pdf2htmlex]

pdf2htmlEX renders PDF files in HTML, utilizing modern Web technologies. It aims to provide an accurate rendering while staying optimized for Web display.

pdf2htmlEX is best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML such that you can still search and copy. Math formulas, figures and images are also supported. The generated HTML file is static, with optional features powered by JavaScript.

pdf2htmlEX is also a publishing tool: almost 50 options make it flexible for many different use cases, such as PDF preview, book/magazine publishing, and personal resumes.

30 questions
5
votes
1 answer

pdf2HtmlEX - Text on html is different than the source pdf

I am using pdf2htmlEX to convert PDF files to HTML, and I extract the text from the file afterwards. The problem: I encountered a file where the text in the converted HTML is…
Montoya
  • 2,010
  • 2
  • 21
  • 44
4
votes
0 answers

Internal Error: Attempt to output 65872 into a 16-bit field. It will be truncate

I am converting a PDF file to htmldom using pdf2htmlEX and getting this error: Internal Error: Attempt to output 65872 into a 16-bit field. It will be truncate and the file may not be useful.
3
votes
1 answer

pdf2htmlEX text selection issue

I have converted a PDF into HTML using pdf2htmlEX. When selecting more than one line, the selection jumps upwards as the cursor passes between two lines. Could someone please help to get this fixed? The issue is already raised here…
3
votes
0 answers

Extract data from pdf

Please don't mark as duplicate. I have already been through many Stack Overflow links, but they didn't solve my problem. What I'm trying to do: I have to extract data from around 1,50,000 PDF files. A sample PDF: All these PDFs are identical in…
Akshay Soam
  • 1,482
  • 3
  • 20
  • 37
3
votes
2 answers

Running pdf2htmlEX on Heroku

I'm trying to run pdf2htmlEX on Heroku. At first I thought of compiling pdf2htmlEX on a VM with the same stack as Heroku and then including the binary on the git repo. That did not work (I kept getting problems with dependencies). As there is no…
Anthony Silva
  • 193
  • 13
2
votes
1 answer

Replace word even if it has empty HTML tags between it, which breaks it up

So this is a rather odd question, I know that. I use a tool called pdf2htmlEX, which converts a PDF to HTML. So far the results have been pretty damn impressive. I have yet to see a single error in all the PDFs I have converted to HTML. With this HTML,…
MortenMoulder
  • 5,021
  • 6
  • 44
  • 89
2
votes
1 answer

Transforming pdf to html in Python

Python 2.6. I'm trying to parse my PDF files, and one way to do that is to transform them into HTML and extract headings along with their paragraphs. So, I tried pdf2htmlEX and it converted my PDF into HTML without disturbing the PDF's formatting... So far,…
Falcon
  • 67
  • 1
  • 1
  • 8
1
vote
0 answers

pdf2htmlEX on Debian 10 for use with Drupal

There's a server migration going on, and they're moving from Debian 8 to Debian 10. Everything works great except for pdf2htmlEX. The old server used v0.14.6, which I tried to compile without success. Using the jessie package results in…
1
vote
1 answer

pdf2htmlEX converts text but not visible (program can't find font file on linux?)

I'm using pdf2htmlEX to convert a PDF to HTML, and the output displays correctly when it's generated locally on a Mac, but not when it's generated in production on Amazon Linux. Multiple pages have this issue, but I'll use page 22 of this pdf as a…
JustCodin
  • 69
  • 5
1
vote
1 answer

Pdf2htmlEx: The html contains images, how could i have instead graphics as output instead of images?

I have tried every command found in the documentation. How could I get only the text part as output, and not the images at all? https://github.com/coolwanglu/pdf2htmlEX/wiki/Command-Line-Options.
user10556198
1
vote
0 answers

pdf2htmlEX cannot save font to

I get an error when converting some PDF files: Internal Error: File Offset wrong for ttf table (name-data), -1 expected 174 Save Failed Cannot save font to C:\Users\test\AppData\Local\Temp//pdf2htmlEX-a14136/__tmp_font1.ttf I'm using Windows…
WP8_CT
  • 147
  • 1
  • 2
  • 12
1
vote
1 answer

Converting multiple files using pdf2htmlEX

How do you use pdf2htmlEX on multiple files or on a folder that contains pdf files? I am able to convert single files just fine, but I obviously don't want to run a command 100 times for 100 files. I couldn't find anything in the documentation and…
Procyon82
  • 45
  • 6
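
One way to approach the batch conversion asked about above is to loop over the folder and invoke pdf2htmlEX once per file. The following is only a minimal sketch, not a method confirmed by the question or its answers: it assumes the `pdf2htmlEX` binary is on `PATH`, that the `--dest-dir` option is available in the installed version, and the helper names `build_commands` and `convert_all` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_dir, out_dir):
    """Return one pdf2htmlEX argument list per .pdf file in pdf_dir, sorted by name."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: --dest-dir selects the folder the generated HTML is written to.
        commands.append(["pdf2htmlEX", "--dest-dir", str(out_dir), str(pdf)])
    return commands


def convert_all(pdf_dir, out_dir):
    """Run pdf2htmlEX once per PDF; assumes the binary is installed and on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_dir, out_dir):
        # check=True raises CalledProcessError if a conversion exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `convert_all("pdfs", "html")` would then run one conversion per file in a hypothetical `pdfs` folder. Building the argument lists separately from running them makes it easy to inspect the commands before anything is executed.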
1
vote
2 answers

pdf2htmlEX cannot open or read file

I installed docker and run pdf2htmlEX through it alias pdf2htmlEX="docker run -ti --rm -v ~/pdf:/pdf bwits/pdf2htmlex pdf2htmlEX" pdf2htmlEX -h pdf2htmlEX --zoom 1.3 test.pdf This is my path and the pdf's contained inside: ~/Desktop/pdf$ ls…
Sean
  • 227
  • 4
  • 9
1
vote
0 answers

How to get sticky notes attached to pdf documents while using pdf2htmlEx tool?

I used the option --process-annotation 1 to view annotations in PDF documents. This works fine for highlight, underline, strikethrough, and rectangular box annotations, but not for notes added as sticky notes: the converted HTML contains only the note icon - missing…
Tom Taylor
  • 2,378
  • 1
  • 27
  • 48
1
vote
0 answers

Extract all content from PDF file (not just text, but also tables/diagrams)?

I'd like to reformat a PDF's main content, so I need to extract that content: not just text, but also tables, diagrams, etc., with their layout information. I'm only interested in the main part of the content, for example, for a technical paper, I'm…
Yu Shen
  • 2,225
  • 3
  • 30
  • 37