I am thinking of developing a tool for commercial use (I intend to sell it) that will manipulate document files.

The manipulations will include:

1. Concatenating several PDF files into one.
2. Converting a doc/docx file into a PDF file.
3. Splitting a single PDF file into two separate PDF files.
4. Numbering the pages of a PDF file (with a sequentially running number).

For this, I'm looking for a free library or code to help me with the PDF manipulations. I would prefer a C# library, because my software will be written in C# (it has a GUI), but I can manage with a Java library too.

I found "pdftk", which could help me a lot, but unfortunately its license doesn't allow commercial use.

Does anyone have an idea of a free library or code which can help me with that?

Thanks a lot!!

user1028741

2 Answers

If you want to manipulate PDFs with Java, PDFBox is a good choice.

You can also take a look at itextpdf, which supports both Java and C#. There is a community version of the library.

kame
AValchev
  • Thank you for your fast answer. I checked the two libraries that you mentioned and, to my understanding, both are under licenses that allow me to use them only if my software is free as well. Am I wrong? – user1028741 Oct 13 '12 at 12:10
  • I'm not quite sure, but PDFBox is under Apache License 2, which requires you not to modify the code (if you modify the code you have to submit it) and to include its license file in your distribution. – AValchev Oct 13 '12 at 12:12
  • iText is under the GNU Affero General Public License version 3, under which most cloud providers run their SaaS products, but it's less permissive. – AValchev Oct 13 '12 at 12:14
  • Please see number 4 in the Apache License: [link](http://www.apache.org/licenses/LICENSE-2.0). Isn't it then forbidden to sell software that uses this library? Or number 6 in the GNU license: [link](http://itextpdf.com/terms-of-use/agpl.php). Is there any way I can use these libraries without breaking these contracts? – user1028741 Oct 13 '12 at 15:25
  • http://stackoverflow.com/questions/1007338/can-i-use-a-library-under-the-apache-software-license-2-0-in-a-commercial-applic – AValchev Oct 13 '12 at 15:38
  • "You can use Apache-licensed libraries in your program so long as you include a copy of the Apache license, and you display a copy of the required copyright notice wherever your program displays copyright notices, for example in an installer package or "about" screen." – AValchev Oct 13 '12 at 15:39
  • Generally speaking, you should include the license file and give the user the ability to accept this license. – AValchev Oct 13 '12 at 15:41
  • @AValchev: What makes you think you can not modify code under Apache License 2? – Martin Schröder Oct 14 '12 at 23:50
  • I'm just saying that after modifying the code, the modification should be submitted back... – AValchev Oct 15 '12 at 09:26
  • ITextPdf is NOT free. The original question asked for free software only, and in a commercial setting a licence must be bought, unless you release all of your own source under the same AGPL. – tigerswithguitars Feb 14 '13 at 09:57

Take a look at pdftotext at http://www.foolabs.com/xpdf/download.html.

It provides an option for extracting the contents of a PDF file into a text file. Where it stands out in comparison to other libraries is that it maintains the formatting from the PDF file in the extracted text file. This is really helpful when your PDF contains structured data such as tables and the PDF files are untagged. PDFBox and other libraries fail to maintain the structure of the contents of your PDF while parsing it.

Once you have the text file extracted from your PDF, you are free to use your favorite programming language to parse the text file.
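
As a rough illustration of that parsing step, here is a minimal Java sketch. It assumes the extracted text keeps table columns apart with runs of two or more spaces, which may not hold for every PDF. `LayoutTextParser` and `splitColumns` are hypothetical names made up for this example, not part of Xpdf.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Xpdf): splits one line of
// layout-preserved text into its table cells.
final class LayoutTextParser {
    // Assumption: neighbouring columns are separated by two or more spaces,
    // while single spaces belong to the cell text itself.
    private static final Pattern COLUMN_GAP = Pattern.compile(" {2,}");

    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return cells; // blank line: no cells
        }
        for (String cell : COLUMN_GAP.split(trimmed)) {
            cells.add(cell);
        }
        return cells;
    }
}
```

A cell that itself contains a double space would be split in two, so real data may need a stricter rule such as fixed column offsets.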

Take a look at the license policy here: http://www.glyphandcog.com/Xpdf.html. It clearly states that if you directly use the executables without modifying the source code, you are free to redistribute your application that uses the executables. If performance is not a concern, you don't need to touch their source code.
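
Using the executable directly could look roughly like the Java sketch below, which starts pdftotext as a separate process. It assumes the `-layout` option is the one that keeps the physical layout and that the executable's location is passed in by the caller. `PdfToTextRunner`, `buildCommand` and `run` are hypothetical names for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper: runs the pdftotext executable as an external
// process instead of linking against the Xpdf source code.
final class PdfToTextRunner {
    // Builds the command line: pdftotext -layout <pdf> <txt>.
    // Assumption: "-layout" is the option that preserves the page layout.
    static List<String> buildCommand(String executable, Path pdf, Path txt) {
        return List.of(executable, "-layout", pdf.toString(), txt.toString());
    }

    // Starts the process, waits for it to finish and returns its exit code.
    static int run(String executable, Path pdf, Path txt)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(executable, pdf, txt))
                .inheritIO() // forward the tool's messages to our console
                .start();
        return process.waitFor();
    }
}
```

Something like `PdfToTextRunner.run("pdftotext", Path.of("in.pdf"), Path.of("out.txt"))` would then write the text file, provided the executable is on the PATH. Starting a new process for every file is what makes this approach slower than calling the code directly.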

If performance is a concern, you can create a trial version of your application that highlights the functionality but is naturally slow, as it will run the executable every time you want to process a PDF. The paid version can call the pdftotext API directly and will be faster. You can make up for the money spent on licensing very easily. I would have done this if I were you, but I already have some big projects on my plate at the moment :)

I can vouch for pdftotext as I have used it myself. All other libraries seem to forget that the users may be interested in keeping the format of the PDF files as it is.

CKing
  • Thank you @bot, but this library is also under the Apache2 license, which forbids using the library without adding the source code to it, etc. – user1028741 Oct 13 '12 at 15:28
  • @user1028741: This is wrong: The library is under GPL2 (or commercial), and your interpretation of the Apache license is [wrong](http://stackoverflow.com/q/1007338/821436). – Martin Schröder Oct 14 '12 at 23:52