1

I'm using PDFBox to read PDF files. But some characters are not printing well and printing like control characters. Some one help to read the values from the control characters. I've attached the image Kindly have a look at that image Sample PDF:

Screenshot:

enter image description here Sample Code

class PDFManager {

   private PDFParser parser;
   private PDFTextStripper pdfStripper;
   private PDDocument pdDoc ;
   private COSDocument cosDoc ;

   private String Text ;
   private String filePath;
   private File file;

   public PDFManager() {

   }

   public String ToText() throws IOException {
       this.pdfStripper = null;
       this.pdDoc = null;
       this.cosDoc = null;
       file = new File(filePath);
       parser = new PDFParser(new FileInputStream(file));

       parser.parse();
       cosDoc = parser.getDocument();
       pdfStripper = new PDFTextStripper(); 
       pdDoc = new PDDocument(cosDoc);  
       pdDoc.getNumberOfPages();
       pdfStripper.setStartPage(3);
       pdfStripper.setEndPage(4); 
       Text = pdfStripper.getText(pdDoc);

       return Text;
   }

   public void setFilePath(String filePath) {
       this.filePath = filePath;
   }
}
Shaido
  • 22,716
  • 18
  • 57
  • 64
Bharath Babu
  • 67
  • 2
  • 8
  • Please read this: https://pdfbox.apache.org/1.8/faq.html#notext and the next entry there. – Tilman Hausherr Mar 13 '16 at 16:46
  • Tilman Hausherr any other suggestions or references pls – Bharath Babu Mar 13 '16 at 17:09
  • Can you copy & paste from Adobe Reader? If not, then PDFBox won't be able to help you. All you can do then is OCR. – Tilman Hausherr Mar 13 '16 at 17:59
  • is there any chance getting the text from OCR?? kindly help me in this issue – Bharath Babu Mar 13 '16 at 18:07
  • You can convert the PDF to images ( https://stackoverflow.com/questions/23326562/apache-pdfbox-convert-pdf-to-images ) and then run your favourite OCR. But the OCR part isn't really PDFBox. Although there was a project about that: https://issues.apache.org/jira/browse/PDFBOX-1912 but it's not fully "official". – Tilman Hausherr Mar 13 '16 at 18:10
  • Actually these pdf files are generate by web application which is developed in asp.net some pdf files are easily extracted but some files giving errors like this – Bharath Babu Mar 13 '16 at 18:13
  • You'd need to discuss this with the team / vendor who created that web application. If they used a standard tool (e.g. itext), then they should ask the creator of that tool. If you share such a PDF (please edit your question and link to it), you might get some feedback why there is no extractable text. – Tilman Hausherr Mar 13 '16 at 18:22
  • But the problem is I can't able to ask this to the team coz its developed by our competitive vendor, so there s no chance to ask about this issue we have to find it ourself only, if we asked it means they will not reveal it also so kindly help me please – Bharath Babu Mar 13 '16 at 18:35
  • 1
    You can only hope to get help if you share sample pdfs and the pivotal parts of your code on this issue. – mkl Mar 13 '16 at 22:53
  • Hey @mkl its my sample pdf kindly have a look at this bro [link](https://drive.google.com/open?id=0B5_UfP0K0BJKTDlyQ05Eb1FlQWs) and the code – Bharath Babu Mar 15 '16 at 07:37
  • @BharathBabu Following the link I get the message that permissions are missing. – mkl Mar 15 '16 at 14:02
  • Sorry for the late bro i've changed to open here after no more permission needed – Bharath Babu Mar 15 '16 at 17:45
  • bro i need the text from page 3, so no problem about page 1. – Bharath Babu Mar 15 '16 at 18:08
  • The fonts used on page 3 have an encoding which only partially uses standard glyph names. Where sufficiently standard glyph names are used, PDFBox gives you the correct Tamil letter. Where there is a glyph name without any standard meaning, you'll get rubbish. – mkl Mar 16 '16 at 09:27
  • Then how to get correct Tamil letter from that glyph name bro. – Bharath Babu Mar 16 '16 at 10:40

1 Answers1

7

The reason why you get both correct Tamil letters and incorrect control sequences, is that the respective fonts

  • don't have a ToUnicode map and
  • have an Encoding entry using non-standard names for some glyphs.

In such a situation PDFBox cannot properly extract associated characters without help.

To help PDFBox you have to check whether in all the documents (or at least in a sufficiently big subset to be of interest) for each non-standard name the drawn glyphs are identical. If this is the case, you can tell PDFBox to add mappings from each of these names to the Unicode value of the respectively drawn letter to its reservoir of known glyph mappings.

In more detail:

The issue

I'll illustrate the issue here with an example.

On the page 3 mentioned by the OP the first text is drawn using instructions equivalent to these:

/R9 8.04 Tf
0.999418 0 0 1 519.6 791.721 Tm
[<01>6.75242<0C>-0.371893<0D>4.89295<07>3.77727<14>-6.13989<35>-4.51376<02>-5.00233<0F>187.988]TJ 

(I merely changed the representation of the strings to hexadecimal as the individual codes mostly are in the control character range and, therefore, would not display properly here.)

The font R9 of this page does not have a ToUnicode map. Neither are there any ActualText entries. Thus, PDFBox can merely use the Encoding entry of the font:

<<
  /BaseEncoding/WinAnsiEncoding
  /Differences[1
    /u0BC6/u0B9A/g125/u0BC8/u0BA9/g121/u0B9F/u0BAE
    /u0BB1/g123/space/u0BA4/u0BBE/g148/u0BBF/u0B8E
    /g122/u0BAA/u0BAF/g129/g130/g178/g127/u0B92
    /g162/g116/u0B95/u0BC0/g158/u0BA8/u0BB2/colon
    /u0B85/g117/g173/g132/u0BB3/g182/g142/one
    /period/g175/u0BB5/u0BB0/g126/u0B86/u0BC7/g186
    /g156/g131/g143/two/g118/g133/g190/hyphen
    /zero/five/g171/g120/g146/g169/g152/parenleft
    /seven/parenright/three/g180/u0BA3/eight/g136/u0BB4
    /u0B9C/four/six/g124/nine/g135/slash/g172
    /comma/u0B87/numbersign/g128/g147/g160/u0B9E/u0B89
    /u0BB7/g119/g157/g167/g191/g188/g170/g145
    /g181/u0BB8/u0B90/uni25CC/u0BCD/u0BB9/u0BC1/u0B88
    /g163/u0BD7/g184/u0B8F/g174/g153/g138/g185
    /g134/g149/g176]
  /Type/Encoding
>> 

As you see it first claims a base encoding WinAnsiEncoding which can be ignored because more or less all the mappings in the range of the codes the font uses are then replaced in the Differences array.

In the Differences array you can find

  • a number of standard names like comma and two;
  • many names representing unicode code points using the uXXXX scheme, like u0BC6 and u0B9A;
  • one name representing a unicode code point using the uniXXXX scheme: uni25CC; and
  • many completely non-standard names using a gXXX scheme like g121 and g176.

PDFBox supports standard names (obviously) and additionally both used unicode code point naming variants (which are often found and whose interpretation is very straight forward).

It does not support other names out of the box.

Thus, the text extracted for that first text drawing instruction is:

<01> - /u0BC6 - 0BC6 - ெ
<0C> - /u0BA4 - 0BA4 - த
<0D> - /u0BBE - 0BBE - ா
<07> - /u0B9F - 0B9F - ட
<14> - /g129 ?? 0014 - <DEVICE CONTROL FOUR>
<35> - /g118 ?? 0035 - 5
<02> - /u0B9A - 0B9A - ச
<0F> - /u0BBF - 0BBF - ி

resulting in your first line of extracted text:

first line of extracted text

By the way, this corresponds to this section from the actual PDF:

matching PDF content

A possible way to allow proper extraction

PDFBox provides mechanisms allowing you to add names to its map of known names. If those gXXX names regularly represent the same respective character in your documents, therefore, you can tweak PDFBox text extraction to meet your requirements.

The stable PDFBox version 1.8.X use a different mechanism than the 2.0.0 release candidates. Thus:

For PDFBox 1.8.X you have to create a glyph list text file. For each glyph it contains one linewith 2 semicolon-delimited fields, the glyph name and the Unicode scalar value, e.g.

A;0041
AE;00C6

Then you define a system property glyphlist_ext pointing towards that list, e.g. when starting your program

java -Dglyphlist_ext=/path/to/my/extra/glyphs ...

For PDFBox 2.0.0 this mechanism has been replaced and moved multiple times, I have no idea which is the current one.

While working on PDFBOX-2379, an exception has been introduced to be thrown if the above mentioned system property is found:

throw new UnsupportedOperationException("glyphlist_ext is no longer supported, "
    + "use GlyphList.DEFAULT.addGlyphs(Properties) instead");

Unfortunately GlyphList does not have that method addGlyphs anymore.

While working on PDFBOX-2380 it has been removed and replaced:

I've replaced the static DEFAULT glyph list with a getAdobeGlyphList() method, as some PDFBox font internals require this to be the AGL and not some other additional glyph list. The loading and use of the additional glyphlist is application specific and so has been moved to PDFStreamEngine, where the getGlyphList() method can be overridden to pass custom glyph lists to fonts.

Unfortunately, PDFStreamEngine does not have that getGlyphList method anymore.

And I'm not currently in a mood to continue hunting high and low to find that feature again. Arg.

The making-of

In a comment the OP asked how I retrieved the information above from the PDF file in question.

First of all I used a PDF internals browsing application, e.g. iText RUPS or PDFBox PDFDebugger, to inspect the PDF and the PDF specification ISO 32000-1 to understand what I'm inspecting.

The OP in particular pointed to page 3 of his document, so I looked for the first text showing operators (cf. ISO 32000-1 section 9.4.3) in the content stream (cf. ISO 32000-1 section 7.8.2) of that page (cf. ISO 32000-1 section 7.7.3.3):

RUPS page 3 content

These almost are the instructions I quoted above. As you see, though, the character strings unfortunately cannot be inspected here because their contents are mostly in the Unicode control character range. Thus, I saved the contents stream (right-click, context menu) and inspected a hex view of those instructions

Hex page 3 content

Using these information I created the instruction quote above.

The font (cf. ISO 32000-1 section 9.6) selected in those instructions is R9 (cf. ISO 32000-1 section 9.3.1), so I continued by looking at the font resource (cf. ISO 32000-1 section 7.8.3) with that name on page 3, first unsuccessfully searching a ToUnicode entry (cf. ISO 32000-1 section 9.10.3), then succeeding in finding an Encoding (cf. ISO 32000-1 section 9.6.6):

RUPS page 3 font R9 encoding

This I copied and prettied-up somewhat to get the encoding quote above.

From these information I manually created the table with the glyph ids (from the text showing operation in the instruction block), the corresponding name (from the encoding differences), the assumed unicode code point (derived from uXXXX names, for gXXX names the glyph id again), and the character (from one of the many Unicode table sites on the internet).

To find the corresponding section from the actual PDF page I took the final two arguments of the Tm text matrix setting operation (cf. ISO 32000-1 section 9.4.2) taking the aggregated transformation matrix changes (cf. ISO 32000-1 section 8.4) into account. These are the coordinates of the start of the base line of the text drawn by the following text showing instruction.

mkl
  • 77,874
  • 12
  • 103
  • 212
  • Great Bro...@mkl and thanks for your explanation.. Now i've to learn about glyph list.. Thanks for your explanation bro... will u pls give me ur email id to contact you – Bharath Babu Mar 16 '16 at 18:01
  • In case of further questions please ask on stackoverflow. And if an answer sufficiently answers your question, please accept it (click on the tick at its upper left). – mkl Mar 18 '16 at 10:46
  • Bro still i've got problem to getting exact glyph text bro... do u have any sample codes for that.. kindly help me – Bharath Babu Apr 03 '16 at 07:10
  • Have you checked whether *those gXXX names regularly represent the same respective character in your documents*? – mkl Apr 03 '16 at 09:40
  • thatz my problem... I dont know how to separate those glyphs.. bro.. kindly help me regarding that.. – Bharath Babu Apr 03 '16 at 15:33
  • In essence you have to check your documents and determine which glyph they show for which **gXXX** name. I don't have any code ready for that (I usually only have to deal with European languages/alphabets where such funny constructs are seldom-used or non-existent). – mkl Apr 03 '16 at 16:57
  • Will you pls tell me how you extracted like this bro.. kindly guide me for this <01> - /u0BC6 - 0BC6 - ெ <0C> - /u0BA4 - 0BA4 - த <0D> - /u0BBE - 0BBE - ா <07> - /u0B9F - 0B9F - ட <14> - /g129 ?? 0014 - <35> - /g118 ?? 0035 - 5 <02> - /u0B9A - 0B9A - ச <0F> - /u0BBF - 0BBF - ி – Bharath Babu Apr 03 '16 at 17:54
  • @BharathBabu I added some information on how to retrieve those information to my answer. – mkl Apr 04 '16 at 13:06