
I am using iTextSharp to read PDF documents, but lately I'm getting a

{"Object reference not set to an instance of an object."}

or NullReferenceException when getting the text from a page of the PdfReader. It worked before, but as of today it no longer does. I didn't change my code.

Below is my code:

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(reader, i, its);
            if (currentText.Contains("ADVANCES"))
            {
                return i;
            }
        }

        return 0;

The above code throws a null reference exception. reader is not null, and i obviously cannot be null since it is an int.

I am instantiating the PdfReader from the input stream:

PdfReader reader = new PdfReader(_stream);

Below is the stack trace:

   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)

To keep things simple, I created a small console application that just reads all the text from the PDF file and displays it. The code is below. The result is the same as above: it throws a NullReferenceException.

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine(ExtractTextFromPdf(@"stockQuotes_03232015.pdf"));
    }

    // Concatenates the text of every page; iTextSharp page numbers are 1-based.
    public static string ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }

            return text.ToString();
        }
    }
}

Does anyone know what might be going on here or how I might work around it?

heyou
  • [What is a `NullReferenceException` and how do I fix it?](http://stackoverflow.com/questions/4660142/what-is-a-nullreferenceexception-and-how-do-i-fix-it) – Soner Gönül Mar 23 '15 at 11:55
  • You say your code didn't change, but... maybe you are now trying to extract text from a PDF that is corrupt? Does your code still work with the PDFs you processed before? And also: which version of iTextSharp are you using? – Bruno Lowagie Mar 23 '15 at 12:05
  • No, it is not corrupted, and I've been using the same PDF file. The PDF can be viewed in the Reader app or Adobe Reader. And yes, my code still works with the PDFs I've processed before. I'm using 5.5.3 and tried 5.5.4, and the result was the same. I am getting the PDF file from here (http://www.pse.com.ph/stockMarket/marketInfo-marketActivity.html?tab=4). It's weird that I am getting this kind of error. – heyou Mar 23 '15 at 12:08
  • Can you point at any specific problem PDF one can download from there? I just tested with a random one which worked, and I don't intend to try each and every file there. – mkl Mar 23 '15 at 13:03
  • Really? Tried using the current date (March 23, 2015). This PDF file gives me a null exception. – heyou Mar 23 '15 at 13:25
  • Can you provide a very simple 10 or so line example of exactly what you are doing including the `Stream` and `PdfReader` parts? – Chris Haas Mar 23 '15 at 13:35
  • @ChrisHaas Edited my post to include sample code. – heyou Mar 23 '15 at 14:18
  • When I open your sample PDF in Adobe Acrobat I get a message saying that there's an error in the PDF. When I open the PDF in iText RUPS I see a bunch of referenced but missing images which is what the `XObject` exception is about. Unless this is just a personal project, per your license agreement with PSE (bottom of the page you provided) I would get in touch with your representative at PSE and ask them to fix the PDFs on their side. – Chris Haas Mar 23 '15 at 14:48
  • Could it be that the PDFs on that site are somewhat volatile? When I downloaded some PDFs from that site about two hours ago, I received a `stockQuotes_03232015.pdf` (size: 250933 bytes) which could be properly parsed. A few minutes ago I downloaded that single PDF again and retrieved a file `stockQuotes_03232015.pdf` (size: 81062) which is clearly broken ... – mkl Mar 23 '15 at 15:06
  • Personally, that whole site seems volatile. It takes way too much time to load, and for a site that has stock information I'd expect SSL to be present, which it isn't. The PDF that I downloaded this morning had the same name as yours but was 81079 bytes. – Chris Haas Mar 23 '15 at 16:56
  • @mkl Hey, can you email me the 03232015 file you got that is 250933 bytes, so I can try it? Thanks. This is my email: yuris932000@yahoo.com – heyou Mar 24 '15 at 05:08
  • @ChrisHaas I tried opening the PDF (81062 bytes) in Foxit Reader, and Foxit Reader can display the PDF file successfully. I am not getting any error message. It's weird. Maybe you guys are right that their PDFs are somehow corrupted. Can you please help me get in touch with them (PSE) about this PDF issue? Thank you so much. – heyou Mar 24 '15 at 05:08
  • *foxit reader can successfully view the pdf* - PDF viewers have a history of trying to display even the most corrupt files. If they display a file, this does not prove that it is valid. *can you send it to my email* - when I'm in office later. – mkl Mar 24 '15 at 05:48
  • @heyou I just sent you the file. – mkl Mar 24 '15 at 08:27
  • Hey, thanks @mkl. Thank you, guys. It looks like the PDF is now working; I was able to parse it successfully. I hope iTextSharp can handle weird PDFs in the future, just like Foxit and the Chrome PDF viewer. Paging the iTextSharp devs here. – heyou Mar 24 '15 at 10:40
  • I don't speak for the iText team but blindly handling corrupt and/or invalid data is generally a bad idea. Imagine a zip program that just blindly skipped corrupt files or an image editing program that just skipped a corrupt layer. And I'd say at the library level you especially want to be aware of invalid data. – Chris Haas Mar 24 '15 at 13:13
  • @ChrisHaas You are right, blindly handling them would definitely be wrong. But iText could indeed throw more appropriate exceptions, or even provide an error handling interface via which a program could explicitly tell the parser to ignore specific types of errors. – mkl Mar 25 '15 at 08:31
  • @mkl Yes, that would be better. I hope they add some configuration for this kind of scenario, so that the parser can ignore the error and continue parsing. Thanks a lot, guys! – heyou Mar 25 '15 at 13:35

2 Answers


To summarize what has been found out in the comments to the question...

In short

The PDF the OP used at first is invalid: it is missing required objects that the parser needs.

Since he eventually got hold of a valid version, he is now able to parse it successfully.

In detail

Depending on the time and mode of the request, the web site from which the PDFs in question were downloaded returned different versions of the same document: sometimes complete, sometimes incomplete in a way that makes the file invalid.

The test file was stockQuotes_03232015.pdf, i.e. the PDF containing the data generated on the test day:

The complete file can be recognized by its size alone: in my downloads it is 250933 bytes long, while my incomplete file is 81062 bytes long.

Inspecting the files, it looks like the incomplete file was derived from the complete one by some tool that removed duplicate image streams but forgot to replace the references to the removed streams with references to the retained stream object.
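
Until a valid copy of such a file is available, calling code can at least find out which pages are affected instead of failing on the first one. The following is only a minimal sketch, assuming that a damaged page surfaces as a NullReferenceException as in the stack trace from the question. The names TolerantExtraction, Result and ExtractPages are hypothetical and not part of iTextSharp; the page extractor is passed in as a delegate, so the sketch itself depends only on the .NET base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class TolerantExtraction
{
    public sealed class Result
    {
        public string Text { get; set; }            // text of all pages that could be read
        public List<int> FailedPages { get; set; }  // 1-based numbers of pages that threw
    }

    // Calls extractPage for pages 1..pageCount and records the pages that fail
    // with a NullReferenceException instead of aborting the whole run.
    public static Result ExtractPages(int pageCount, Func<int, string> extractPage)
    {
        var text = new StringBuilder();
        var failed = new List<int>();

        for (int i = 1; i <= pageCount; i++)
        {
            try
            {
                text.Append(extractPage(i));
            }
            catch (NullReferenceException)
            {
                // Assumption: this indicates a defective page, e.g. a missing XObject.
                failed.Add(i);
            }
        }

        return new Result { Text = text.ToString(), FailedPages = failed };
    }
}
```

With the reader from the question, this could be called as `TolerantExtraction.ExtractPages(reader.NumberOfPages, i => PdfTextExtractor.GetTextFromPage(reader, i))`. A non-empty `FailedPages` list should be treated as a sign that the PDF itself is defective and needs to be fixed at the source, not as something to ignore silently.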

mkl

Please use the code below to read text from a PDF. It shows the text from the PDF in a RichTextBox named richTextBox1.

Reference Youtube: https://www.youtube.com/watch?v=22C9N4WP4-s

        using (OpenFileDialog ofd = new OpenFileDialog() { Filter = "PDF files|*.pdf", ValidateNames = true, Multiselect = false })
        {
            if(ofd.ShowDialog() == DialogResult.OK)
            {
                try
                {
                    iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(ofd.FileName);
                    StringBuilder sb = new StringBuilder();
                    for (int i = 1; i <= reader.NumberOfPages; i++) // pages are 1-based; <= includes the last page
                    {
                        sb.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader,i));
                    }
                    richTextBox1.Text = sb.ToString();
                    reader.Close();

                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message, "Message", MessageBoxButtons.OK, MessageBoxIcon.Error);
                }
            }
        }
  • There is no difference between your extraction code and the extraction code of the OP. So what solution do you want to present to his problem in your answer? In particular as the actual problem of the OP had already been recognized more than two years before: defects in the source PDFs. – mkl Nov 11 '17 at 19:31