38

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?

jm3
  • 1,510
  • 1
  • 14
  • 19
  • If you're using Python, just use the [smart_open](https://github.com/RaRe-Technologies/smart_open) library and save yourself the trouble. – user124114 Oct 19 '19 at 08:15

4 Answers

71

S3 files can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 API supports the HTTP Range: header (see RFC 2616), which takes a byte-range argument.

Just add a Range: bytes=0-NN header to your S3 request, where NN is the number of bytes to read, and you'll fetch only those bytes rather than the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. See the full GET Object documentation in Amazon's developer docs.
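As a sketch of the idea: the Range value is just a string, and HTTP byte ranges are inclusive at both ends. The helper below is illustrative (not part of any SDK), and the commented boto3 call shows where the value would go (bucket/key names are placeholders):

```python
def range_header(first_byte, last_byte=None):
    # HTTP byte ranges are inclusive; omitting the end means "to end of object"
    end = "" if last_byte is None else str(last_byte)
    return "bytes={}-{}".format(first_byte, end)

# Fetch only the first 100 bytes of an object (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# head = s3.get_object(Bucket="my_bucket", Key="big.csv",
#                      Range=range_header(0, 99))["Body"].read()

print(range_header(0, 99))  # bytes=0-99
```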

jm3
  • 1,510
  • 1
  • 14
  • 19
  • 11
    Sample S3 call: aws s3api get-object --bucket my_bucket --key path/to/my/file/file1.gz file1.gz --range bytes=1000-2000 – hello_harry Mar 13 '17 at 15:34
  • In your example, it would be better to use `Range: bytes=K-N` as you can start from a different value than `0` (see the answer from @Rick W). – The Dude Jun 19 '20 at 08:23
  • A clarification: S3 does not support the entire RFC 2616 specification regarding the Range header; in fact it [only supports single ranges and not multiple ones](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#AmazonS3-GetObject-request-header-Range). – fox91 Jan 25 '21 at 14:11
6

The AWS .NET SDK only exposes fixed-ended ranges (RE: public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at byte 1000 and read to the end", but I do not believe the .NET library allows for this.
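For comparison, boto3 passes the raw header value through, so the open-ended form works there; S3's response to a ranged GET then carries a Content-Range header such as "bytes 1000-8999/9000" that tells you what was actually served. The parsing helper below is illustrative, not from any SDK:

```python
import re

def parse_content_range(value):
    # e.g. "bytes 1000-8999/9000" -> (1000, 8999, 9000)
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value)
    if m is None:
        raise ValueError("unexpected Content-Range: %r" % value)
    first, last, total = (int(g) for g in m.groups())
    return first, last, total

# With boto3 (requires AWS credentials), the open-ended range looks like:
# resp = s3.get_object(Bucket="my_bucket", Key="big.csv", Range="bytes=1000-")
# first, last, total = parse_content_range(resp["ContentRange"])
```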

Rick W
  • 61
  • 1
  • 1
  • The question does not mention .NET so this may be of no help to the person that asked the question – Mark Dec 10 '20 at 10:47
3

Using Python, you can preview the first records of a compressed file.

Connect using the (legacy) boto library:

# Connect:
import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket('my_bucket', validate=False)

Read the first 20 lines from a gzip-compressed file:

# Read first 20 records
import csv
import io
from gzip import GzipFile
from boto.s3.key import Key

limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
# Decompress on the fly as the object streams in, then parse as '^'-delimited CSV
gzipped = GzipFile(None, 'rb', fileobj=k)
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for i, line in enumerate(reader):
    if i >= limit:
        break
    print(i, line)

So it's the equivalent of the following Unix command:

zcat my_file.gz | head -20
Alex B
  • 1,615
  • 1
  • 20
  • 27
1

The get_object API has a Range argument for partial reads:

import boto3

s3 = boto3.client('s3')
# The Range header is inclusive at both ends, hence stop_byte - 1
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte - 1))
res = resp['Body'].read()
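A small follow-up sketch: once you have the partial bytes, a head-style preview is just decoding and splitting. Since a byte range can cut mid-line, the final partial line is dropped (names and sample data here are illustrative):

```python
def head_lines(data, n):
    # Decode the fetched byte range and keep the first n complete lines.
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    if text and not text.endswith("\n"):
        lines = lines[:-1]  # last line may be truncated by the byte range
    return lines[:n]

sample = b"id,name\n1,alice\n2,bob\n3,car"  # pretend this came from get_object
print(head_lines(sample, 2))  # ['id,name', '1,alice']
```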
lambda
  • 21
  • 6