8

I am using the boto3 module in Python to interact with S3, and I can currently get the size of every individual key in an S3 bucket. But my goal is to find the storage used by only the top-level folders (every folder is a different project), because we need to charge each project for the space it uses. I am able to get the names of the top-level folders, but the implementation below gives me no details about the size of those folders. The following is my implementation to get the top-level folder names.

import boto
import boto.s3.connection

AWS_ACCESS_KEY_ID = "access_id"
AWS_SECRET_ACCESS_KEY = "secret_access_key"
Bucketname = 'Bucket-name' 

conn = boto.s3.connect_to_region('ap-south-1',
   aws_access_key_id=AWS_ACCESS_KEY_ID,
   aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
   is_secure=True,  # True = use SSL
   calling_format=boto.s3.connection.OrdinaryCallingFormat(),
   )

bucket = conn.get_bucket(Bucketname)
folders = bucket.list("", "/")  # delimiter "/" returns only the top-level prefixes

for folder in folders:
    print(folder.name)

Here folder is of type boto.s3.prefix.Prefix and it does not expose any size details. Is there any way to look up a folder/object in an S3 bucket by its name and then fetch the size of that object?

Nilesh Kumar Guria
  • Your code is using `boto`, not `boto3`. – helloV Apr 10 '18 at 23:38
  • Depending on your needs, you may want to use S3 Storage Inventory rather than running the list_objects iterator: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html – mootmoot Apr 11 '18 at 16:52
  • @mootmoot S3 Storage Inventory in my case gives some weird .gz files and not the actual list of files and objects. – DJ_Stuffy_K Dec 17 '20 at 22:39

5 Answers

11

To get the size of an S3 folder, objects (accessible via boto3.resource('s3').Bucket) provide the method filter(Prefix) that lets you retrieve ONLY the keys matching the Prefix condition, which keeps the listing reasonably efficient.

import boto3

def get_size(bucket, path):
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket(bucket)
    total_size = 0

    # sum the size of every object whose key starts with the given prefix
    for obj in my_bucket.objects.filter(Prefix=path):
        total_size = total_size + obj.size

    return total_size

So let's say you want to get the size of the folder s3://my-bucket/my/path/; then you would call the previous function like this:

get_size("my-bucket", "my/path/")

This is of course easily applied to top-level folders as well, as sketched below.
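
For instance, a minimal sketch (reusing the get_size function above; the helper name and pagination setup are just illustrative, not part of the original answer) that lists the top-level prefixes with Delimiter='/' and sums each one:

import boto3

def get_top_level_folder_sizes(bucket):
    s3 = boto3.client('s3')
    sizes = {}
    paginator = s3.get_paginator('list_objects_v2')
    # Delimiter='/' makes S3 return the top-level "folders" as CommonPrefixes
    for page in paginator.paginate(Bucket=bucket, Delimiter='/'):
        for prefix in page.get('CommonPrefixes', []):
            sizes[prefix['Prefix']] = get_size(bucket, prefix['Prefix'])
    return sizes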

Vzzarr
  • Useful out-of-the-box function that also considers subfolders – Guido Jun 27 '19 at 13:15
  • @Vzzarr could you update the answer to take object versions into account as well, since only that would indicate the true size (the same size AWS charges for)? – DJ_Stuffy_K Dec 17 '20 at 22:41
  • @DJ_Stuffy_K when I posted this answer it was working; could you provide a reference for what you mentioned? – Vzzarr Dec 18 '20 at 11:26
  • S3 has object versions if versioning is enabled on a given bucket, e.g. https://stackoverflow.com/a/39262411/4590025 and https://stackoverflow.com/a/59779947/4590025; without versions a prefix's size could be 10 GB, while taking versioning into account it could be 20 GB. @Vzzarr – DJ_Stuffy_K Dec 18 '20 at 18:44
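
Regarding the versioning discussion in these comments: if versioning is enabled on the bucket, a rough sketch that also counts non-current versions could use the object_versions collection. This is an assumption about what the "true size" should include and is not part of the original answer; delete markers report no size and are counted as 0 here.

import boto3

def get_size_all_versions(bucket, path):
    s3 = boto3.resource('s3')
    total_size = 0
    # object_versions iterates current and non-current versions under the prefix
    for version in s3.Bucket(bucket).object_versions.filter(Prefix=path):
        total_size += version.size or 0  # delete markers have no size
    return total_size
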
7

To find the size of the top-level "folders" in S3 (S3 does not really have a concept of folders, but displays a folder-like structure in the UI), something like this will work:

from boto3 import client
conn = client('s3')

top_level_folders = dict()

# note: list_objects returns at most 1000 keys per call; for larger buckets,
# use a paginator (see the comments below and the next answer)
for key in conn.list_objects(Bucket='kitsune-buildtest-production')['Contents']:
    folder = key['Key'].split('/')[0]
    print("Key %s in folder %s. %d bytes" % (key['Key'], folder, key['Size']))

    if folder in top_level_folders:
        top_level_folders[folder] += key['Size']
    else:
        top_level_folders[folder] = key['Size']


for folder, size in top_level_folders.items():
    print("Folder: %s, size: %d" % (folder, size))
Josh Kupershmidt
  • For a single top-level "folder," it *should* be possible to somehow pass a prefix, but not a delimiter, to `list_objects()`, and the results would only include that one "folder" and its descendants... at least, the REST API supports this (see the sketch after these comments). – Michael - sqlbot Apr 11 '18 at 01:18
  • Thanks for the answer. Implemented something along similar lines, along with a paginator, as it seems S3 iterates over only the first 1000 objects. – Nilesh Kumar Guria Apr 11 '18 at 11:08
  • @Josh How do I use this with an IAM role instead of access/secret keys? – DJ_Stuffy_K Sep 23 '20 at 23:49
  • @DJ_Stuffy_K if you run code (like this snippet) using `boto3` on, say, an EC2 instance with an IAM role attached, boto3 should automatically detect the role and use it. See the "IAM Roles" section at: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#id3 – Josh Kupershmidt Sep 24 '20 at 15:50
  • @JoshKupershmidt Could you kindly expand your answer to include the size of versioned objects as well, since only that shows the true size if versioning is enabled on the bucket? – DJ_Stuffy_K Dec 10 '20 at 04:34
  • @NileshKumarGuria could you update the answer to reflect the use of paginators, as it will help more people, especially those with millions of objects. – DJ_Stuffy_K Dec 17 '20 at 22:40
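
Following up on the comment above about passing a prefix without a delimiter: a minimal sketch (using a paginator and a hypothetical folder name; not part of the original answer) that sums one top-level folder and all of its descendants:

from boto3 import client

def folder_size(bucket, prefix):
    conn = client('s3')
    paginator = conn.get_paginator('list_objects')
    total = 0
    # a Prefix without a Delimiter returns the "folder" and every descendant key
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for key in page.get('Contents', []):
            total += key['Size']
    return total

# e.g. folder_size('my-bucket', 'my-project/')
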
3

To get more than 1000 objects from S3, use a paginator with list_objects_v2. Try this:

from boto3 import client
conn = client('s3')

top_level_folders = dict()

paginator = conn.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='bucket', Prefix='prefix')
index = 1  # which path component to group by; depends on the prefix depth (see the note below)
for page in pages:
    for key in page.get('Contents', []):
        folder = key['Key'].split('/')[index]
        print("Key %s in folder %s. %d bytes" % (key['Key'], folder, key['Size']))

        if folder in top_level_folders:
            top_level_folders[folder] += key['Size']
        else:
            top_level_folders[folder] = key['Size']

for folder, size in top_level_folders.items():
    size_in_gb = size/(1024*1024*1024)
    print("Folder: %s, size: %.2f GB" % (folder, size_in_gb))

If the prefix is notes/ and the delimiter is a slash (/), as in notes/summer/july, the common prefix is notes/summer/. In other words, for prefix "notes/" use index = 1, and for prefix "notes/summer/" use index = 2 (a sketch for deriving index from the prefix follows below).
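
As a small illustrative helper (not part of the original answer), index can also be derived from the prefix instead of being hard-coded:

def folder_index(prefix):
    # 'notes/' -> 1, 'notes/summer/' -> 2, '' (whole bucket) -> 0
    return 0 if not prefix else prefix.rstrip('/').count('/') + 1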

  • How do I include versioned data size as well, since only that will show the true size? Also, how do I use a named profile here? – DJ_Stuffy_K Dec 10 '20 at 04:28
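
Regarding the named-profile question above: a minimal sketch, assuming a profile called my-profile exists in your AWS credentials file (the profile name is hypothetical); the client it creates is used exactly like the module-level client in the snippet above:

import boto3

# select credentials from a named profile via a Session
session = boto3.Session(profile_name='my-profile')
conn = session.client('s3')
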
2

This doesn't use boto3, just the AWS CLI, but this quick one-liner serves the purpose. I usually pipe through tail -1 to get only the summary folder size. It can be a bit slow, though, for folders with many objects.

aws s3 ls --summarize --human-readable --recursive s3://bucket-name/folder-name | tail -1

  • @DJ_Stuffy_K, you can install the AWS CLI for Windows by following this link: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html . – Yeamin Rajeev Sep 25 '20 at 02:05
  • I already have the AWS CLI installed on Windows, but when I use tail -1 it says the command is unrecognized. – DJ_Stuffy_K Sep 25 '20 at 04:07
  • @DJ_Stuffy_K, sorry for this very late reply. I've found that if you have PowerShell installed, this command works: `Get-Content filenamehere -Wait -Tail 30`. You can find a few other solutions here: https://stackoverflow.com/questions/187587/a-windows-equivalent-of-the-unix-tail-command – Yeamin Rajeev Jan 24 '21 at 12:08
1
def find_size(name, conn):
    # conn is a boto (boto2) S3 connection
    for bucket in conn.get_all_buckets():
        if name == bucket.name:
            total_bytes = 0
            for key in bucket:
                total_bytes += key.size
            # convert to GB only once, after summing all keys
            print(total_bytes / 1024 / 1024 / 1024)
galaxyan