6

I have a very large file (several GB) in AWS S3, and I only need a small number of lines from the file which satisfy a certain condition. I don't want to load the entire file into memory and then search for and print those few lines; the memory load for this would be too high. The right way would be to load only those lines into memory which are needed.

As per the AWS documentation, reading from the file looks like this:

S3Object fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());

private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}

Here we are using a BufferedReader. It is not clear to me what is happening underneath.

Are we making a network call to S3 each time we read a new line, keeping only the current line in the buffer? Or is the entire file loaded into memory and then read line by line by the BufferedReader? Or is it somewhere in between?

sbhatla
  • From the [link you posted](https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html): **Note** Your network connection remains open until you read all of the data or close the input stream. We recommend that you read the content of the stream as quickly as possible. – Johannes Kuhn Jul 24 '18 at 19:44
  • My question is more along the lines of - will the entire file be loaded in-memory, or only the lines I'm reading, or a buffer that's somewhere in between? – sbhatla Jul 25 '18 at 22:20
  • Simply write a small sample application and try to read the file from S3 using the above code. If it read the whole file into memory at once, you would encounter an OOM for sure. – dpr Aug 03 '18 at 10:11

4 Answers

10

One answer to your question is already given in the documentation you linked:

Your network connection remains open until you read all of the data or close the input stream.

A BufferedReader doesn't know where the data it reads comes from, because you pass another Reader to it. It creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data to calls of read() or read(char[] buf).

The Reader you pass to the BufferedReader uses, by the way, another buffer of its own to do the conversion from a byte-based stream to a char-based reader. It works the same way as with the BufferedReader: the internal buffer is filled by reading from the passed InputStream, which here is the InputStream returned by your S3 client.
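To make the layering concrete, here is a minimal sketch of the reader chain (the charset and the explicit buffer size of 8192 chars are illustrative assumptions, not something mandated by the S3 client):

InputStream s3Stream = fullObject.getObjectContent();           // raw byte stream from S3
// bytes -> chars; the InputStreamReader keeps a small internal byte buffer of its own
InputStreamReader decoder = new InputStreamReader(s3Stream, StandardCharsets.UTF_8);
// the chars are buffered again; 8192 is an explicit, illustrative buffer size
BufferedReader reader = new BufferedReader(decoder, 8192);

Each readLine() call is served from the BufferedReader's char buffer; only when that buffer runs dry does the chain pull more bytes from the S3 stream.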

What exactly happens inside this client when you pull data from the stream is implementation-dependent. One possibility is that a single network connection is kept open and you read from it as you wish; another is that the network connection is closed after a chunk of data has been read and a new one is opened when you request the next chunk.

The documentation quoted above seems to say that we have the former situation here. So: no, individual calls to readLine do not result in individual network calls.

And to answer your other question: no, the BufferedReader, the InputStreamReader, and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could then simply return a byte[][] instead (to get around the limit of 2^31 - 1 bytes per byte array).

Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, calling readLine will in fact read all of the data into memory (and most likely cause an OutOfMemoryError). I assumed a "regular" text document while answering your question.
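Putting this together for the use case in the question, a minimal sketch of a streaming filter could look like this (printMatchingLines and the substring condition are illustrative, not part of the AWS SDK):

private static void printMatchingLines(InputStream input, String needle) throws IOException {
    // try-with-resources closes the stream, and with it the network connection
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(input, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(needle)) {   // the "certain condition"; a simple substring match here
                System.out.println(line);
            }
        }
    }
}

At any point in time only the internal buffers and the current line are held in memory.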

Lothar
  • This statement seems unclear to me: "No, calls of readLine are not leading to single network calls." Are you suggesting a network call is made per call to the readLine API? – Arshan Qureshi Sep 02 '19 at 18:05
  • @ArshanQureshi That particular statement is the summary of the above. A call to `readLine` does not lead to a network call, because there are byte-based buffers between you and the S3 bucket (at least so it seems according to the documentation) that are filled independently of the actual data, and you only take a part of that buffer (up to the occurrence of a line break) when calling `readLine`. – Lothar Sep 02 '19 at 19:50
3

If you are not searching for a specific word or words but you do know the byte range you need, you can also use the Range header in S3. This should be particularly useful since you are working with a single file of several GB. Specifying a Range not only helps reduce memory use, it is also faster, as only the specified part of the file is read.

See Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file?
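For example, with the AWS SDK for Java v1 used in the question, a ranged read could look like this (the byte offsets are placeholders you would compute yourself):

// Fetch only the first 1000 bytes of the object; the range is inclusive.
GetObjectRequest rangeRequest = new GetObjectRequest(bucketName, key)
        .withRange(0, 999);
S3Object rangedObject = s3Client.getObject(rangeRequest);
// rangedObject.getObjectContent() now streams just that slice of the file.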

Hope this helps.

Sreram
0

It depends on the size of the lines in your file. readLine() will keep building the string, fetching data from the stream in blocks of your buffer size, until it hits a line-termination character. So the memory used will be on the order of your line length plus the buffer length.
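If lines can be arbitrarily long, you can bound memory use regardless of line length by reading fixed-size chunks with read(char[]) instead of readLine(). A rough sketch (the chunk size of 8192 is an arbitrary assumption; reader is the BufferedReader from the question's snippet):

char[] chunk = new char[8192];   // fixed upper bound on memory per read
int n;
while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
    // scan chunk[0..n) for your condition without materializing whole lines
}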

Jesse
0

Only a single HTTP call is made to the AWS infrastructure, and the data is read into memory in small blocks whose size may vary and is not directly under your control.

This is already very memory-efficient, assuming each line in the file is reasonably small.

One way to optimize further (for network and compute resources), assuming that your "certain condition" is a simple string match, is to use S3 Select: https://aws.amazon.com/s3/features/#s3-select
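A minimal sketch of such an S3 Select call with the v1 SDK, assuming the file is plain text parsed as CSV lines and the condition is a substring match (the SQL expression and serialization settings are illustrative):

SelectObjectContentRequest request = new SelectObjectContentRequest();
request.setBucketName(bucketName);
request.setKey(key);
request.setExpressionType(ExpressionType.SQL);
// With CSV input, each line is exposed as column _1 (assuming no commas in the lines).
request.setExpression("SELECT * FROM S3Object s WHERE s._1 LIKE '%needle%'");

InputSerialization inputSerialization = new InputSerialization();
inputSerialization.setCsv(new CSVInput());
inputSerialization.setCompressionType(CompressionType.NONE);
request.setInputSerialization(inputSerialization);

OutputSerialization outputSerialization = new OutputSerialization();
outputSerialization.setCsv(new CSVOutput());
request.setOutputSerialization(outputSerialization);

SelectObjectContentResult result = s3Client.selectObjectContent(request);
// Only the matching records travel over the network.
InputStream matches = result.getPayload().getRecordsInputStream();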

Alex R