
I have a SQL Server table with a column called Attachment of data type NVARCHAR(MAX). I upload PDF/Docx files into this column for different rows based on certain criteria. Here is the statement I use to upload a file into the database:

UPDATE dbo.[Document] 
SET Attachment = (SELECT BulkColumn FROM OPENROWSET(BULK N'E:\1.pdf', SINGLE_BLOB) blob) 
WHERE ID = 1; 

The upload is successful. My goal is to use textract, or a similar tool, to read the underlying text from the attachment. I see there are a few APIs. As there is no file or URL involved, I'm guessing the correct API is Buffer + MIME type, but what exactly is the MIME type for PDF and Docx? I tried "application/pdf" for PDF and "application/vnd.openxmlformats-officedocument.wordprocessingml.document" for Docx, and I get this error:

[Error: Incorrect parameters passed to textract.]

What is the correct value for the MIME type in this case? Or should this not be treated as a buffer? If so, which API is the correct one to use?

I'm able to use textract to open the actual physical file and read its contents, though.

I'd appreciate it if anyone can advise on this matter.

Lee
  • If you're storing a binary file, the data type should really be `VARBINARY(MAX)` - not `NVARCHAR(MAX)` (which is a **textual** data type) – marc_s Feb 02 '16 at 08:49
  • Thanks, marc_s. That's a correct tip and it helped me to read from Docx. I still see an error for PDF though: [Error: File not correctly recognized as zip file, end of central directory record signature not found] – Lee Feb 02 '16 at 09:01
  • Ignore that. Issue resolved now. Thanks a lot, Marc_s. – Lee Feb 02 '16 at 09:19
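
The two MIME strings quoted in the question are the standard ones for PDF and Docx. The "not correctly recognized as zip file" error above is what one would expect if PDF bytes were handed to a Docx (ZIP-based) extractor, so it can help to check the first bytes of the buffer before choosing a MIME type. Below is a minimal Node.js sketch, assuming the attachment comes back from SQL Server as a `Buffer` (for example from a `VARBINARY(MAX)` column). `sniffMimeType` is a hypothetical helper, not part of textract, and treating every ZIP container as Docx is a simplifying assumption, since other Office formats share the same signature.

```javascript
// Hypothetical helper (not part of textract): guess a MIME type
// from the leading "magic" bytes of a Buffer.
const PDF_MIME = 'application/pdf';
const DOCX_MIME =
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document';

function sniffMimeType(buffer) {
  // Anything that is not a Buffer, or is too short to hold a signature, is unknown.
  if (!Buffer.isBuffer(buffer) || buffer.length < 4) {
    return null;
  }
  // PDF files begin with the ASCII signature "%PDF".
  if (buffer.subarray(0, 4).toString('latin1') === '%PDF') {
    return PDF_MIME;
  }
  // Docx is a ZIP container; ZIP local file headers begin with "PK\x03\x04".
  // Assumption: any ZIP stored in this column is a Docx file.
  if (
    buffer[0] === 0x50 &&
    buffer[1] === 0x4b &&
    buffer[2] === 0x03 &&
    buffer[3] === 0x04
  ) {
    return DOCX_MIME;
  }
  return null;
}

// Usage with textract (left commented out because it needs the package installed);
// `attachment` stands for the Buffer read from the database:
// const textract = require('textract');
// textract.fromBufferWithMime(sniffMimeType(attachment), attachment, (err, text) => {
//   if (err) throw err;
//   console.log(text);
// });
```

If the helper returns `null`, the bytes are probably not what was uploaded, which is one symptom of storing binary data in a textual column.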

0 Answers