
In the end, my ultimate goals are:

  • Read from a URL (what this question is about)
  • Save the retrieved [PDF] content to a BLOB field in a DB (already have that nailed down)
  • Read from the BLOB field and attach that content to an email
  • All without going to a filesystem

The goal with the following method is to get a byte[] that can be used downstream as an email attachment (to avoid writing to disk):

public byte[] retrievePDF() {
    HttpClient httpClient = new HttpClient();

    GetMethod httpGet = new GetMethod("http://website/document.pdf");
    httpClient.executeMethod(httpGet);
    InputStream is = httpGet.getResponseBodyAsStream();

    byte[] byteArray = new byte[(int) httpGet.getResponseContentLength()];

    is.read(byteArray, 0, byteArray.length);

    return byteArray;
}

For a particular PDF, the getResponseContentLength() method returns 101,689 as the length. The strange part is that if I set a breakpoint and inspect the byteArray variable, it has 101,689 byte elements, but after byte #3744 the remaining bytes of the array are all zeroes (0). The resulting PDF is then not readable by a PDF reader such as Adobe Reader.

Why would that happen?

Retrieving this same PDF via browser and saving to disk, or using a method like the following (which I patterned after an answer to this StackOverflow post), results in a readable PDF:

public void retrievePDF() {
    FileOutputStream fos = null;
    URL url;
    ReadableByteChannel rbc = null;

    url = new URL("http://website/document.pdf");

    DataSource urlDataSource = new URLDataSource(url);

    /* Open a connection, then set appropriate time-out values */
    URLConnection conn = url.openConnection();
    conn.setConnectTimeout(120000);
    conn.setReadTimeout(120000);

    rbc = Channels.newChannel(conn.getInputStream());

    String filePath = "C:\\temp\\";
    String fileName = "testing1234.pdf";
    String tempFileName = filePath + fileName;

    fos = new FileOutputStream(tempFileName);
    fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    fos.flush();

    /* Clean-up everything */
    fos.close();
    rbc.close();
}

For both approaches, the size of the resulting PDF is 101,689 bytes according to Right-click > Properties... in Windows.

Why would the byte array essentially "stop" part-way through?

PattMauler
3 Answers


InputStream.read reads up to byteArray.length bytes but might not read exactly that much. It returns how many bytes it read. You should call it repeatedly to fully read the data, like this:

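int bytesRead = 0;
while (true) {
    // Ask only for the space still free in the array: offset + length must not exceed byteArray.length
    int n = is.read(byteArray, bytesRead, byteArray.length - bytesRead);
    if (n == -1) break; // end of stream
    bytesRead += n;
}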
Joe K
  • What happens if bytesRead >= byteArray.length? I think zero will be returned and you will be stuck in an infinite loop. – Dunes Oct 03 '12 at 23:48
  • Correct. This assumes that byteArray.length >= total bytes in the stream, which in this case is based on the content length given by the server. So I suppose if the server were malicious and decided to lie about the content length, this could in theory happen, but probably won't. – Joe K Oct 03 '12 at 23:52
  • I tried this, and I do get more this way (97,943 bytes), but the array still shows the remaining bytes of the 101,689 as zeroes. Also, the resulting PDF still does not open. – PattMauler Oct 04 '12 at 13:35
  • Pardon me, I get 97,944 bytes. (I failed to count the "zeroth" byte.) – PattMauler Oct 04 '12 at 13:49
  • @PattMauler - Note to self, when you copy paste code from StackOverflow, make sure you **replace** the code that is intended to be replaced. (_Hint_: 101689 - 97944 = 3745; that's right... the bytes you read from the "other" call to `is.read` that you forgot to take out.) – PattMauler Oct 04 '12 at 14:01
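
To illustrate the short-read behaviour discussed above, here is a minimal sketch, not the code from the question. The class ShortReadDemo and its helpers trickle and fill are hypothetical names made up for this example; trickle simulates a network stream by returning at most a fixed number of bytes per call, and fill is a variant of the loop that also stops once the array is full, so it cannot spin forever when the stream holds more data than the array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ShortReadDemo {

    /**
     * Hypothetical test stream: serves the given data, but never hands out
     * more than maxChunk bytes per read call, much like a slow socket.
     */
    static InputStream trickle(byte[] data, final int maxChunk) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, maxChunk));
            }
        };
    }

    /**
     * Keeps calling read until the buffer is full or the stream ends.
     * Returns the number of bytes actually stored in the buffer.
     */
    static int fill(InputStream in, byte[] buffer) throws IOException {
        int bytesRead = 0;
        while (bytesRead < buffer.length) {
            // Request only the remaining free space in the buffer.
            int n = in.read(buffer, bytesRead, buffer.length - bytesRead);
            if (n == -1) {
                break; // end of stream reached before the buffer was full
            }
            bytesRead += n;
        }
        return bytesRead;
    }
}
```

With a trickle stream, a single read call stores only the first chunk and leaves the rest of the array at its default zeroes, which matches the symptom in the question; fill keeps going until everything has arrived. The caller should still compare the returned count with the expected length.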

Check the return value of InputStream.read. It's not going to read all at one go. You have to write a loop. Or, better yet, use Apache Commons IO to copy the stream.
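
If pulling in Commons IO is not an option, a hand-written loop along these lines should do roughly the same job. This is only a sketch under my own assumptions: the class name StreamBytes, the method name readAll, and the 8 KB chunk size are made up for illustration. It copies the stream into a ByteArrayOutputStream, so it does not depend on the Content-Length reported by the server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamBytes {

    /**
     * Reads the stream to its end and returns everything as one byte array.
     * The caller remains responsible for closing the stream.
     */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192]; // arbitrary chunk size, an assumption
        int n;
        // read returns the number of bytes it stored in chunk, or -1 at end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // copy only the bytes that were actually read
        }
        return out.toByteArray();
    }
}
```

In the method from the question, this would replace both the array allocation and the single is.read call, for example byte[] byteArray = StreamBytes.readAll(is);.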

bmargulies

101689 = 2^16 + 36153, so it looks as if there is a 16-bit limitation on the buffer size. The difference between 36153 and 3744 may stem from the header part having been read into a separate small buffer (1K or so) that already contained some bytes.

Joop Eggen