
I want to read 100K+ files from S3 and zip them into a single large file. The individual file sizes range from a few KB to 1 MB, and the final zip file can easily go beyond 3 GB. Given that AWS Lambda has a memory limit of 3 GB and tmp directory storage of 512 MB, how would you do that using AWS Lambda? I am using .NET Core 3.

The code below will fail when the zip size goes beyond 3 GB:

    var zipStream = new MemoryStream();
    using (System.IO.Compression.ZipArchive zip = new ZipArchive(zipStream, ZipArchiveMode.Create, true))
    {
        for (int i = 0; i < sourceFiles.Count; i++)
        {
            // One archive entry per source file.
            var zipItem = zip.CreateEntry("file" + i.ToString() + ".pdf");

            using (var entryStream = zipItem.Open())
            {
                // Copy the S3 object into the entry; the compressed bytes
                // accumulate in zipStream, i.e. entirely in memory.
                var source = GetFileFromS3(sourceFiles[i]);
                await source.CopyToAsync(entryStream);
            }
        }
    }

    // Upload zip file to S3. For brevity, the upload code is not included.
    _s3Client.Upload(zipStream);

Most of the examples I have seen for large file processing use Node.js, and they also don't go beyond 3 GB. I am looking for a C# .NET Core example. I am also trying to avoid splitting the zip into multiple zip files of less than 3 GB each.

1. How would you do this using AWS Lambda without splitting the zip file?
2. Is there an S3 stream available that would directly read/write from S3?

  • If you have a total storage space of 3GB (RAM) + 0.5GB (disk), you obviously cannot go beyond 3.5GB (and it's probably less than that). You need more space -> AWS Lambda is not suitable for this task – Camilo Terevinto Nov 24 '20 at 17:29
  • You can use swap space when you run out of memory, which will use a temp file in place of memory. It runs slower but will resolve the issue. See: https://www.reddit.com/r/aws/comments/b2zijf/swap_space_when_using_aws_linux_based_ami/ – jdweng Nov 24 '20 at 17:35
  • @jdweng I think that's only valid for an EC2 instance, not for serverless Lambda. I could be wrong though. – LP13 Nov 24 '20 at 18:09
  • @LP13 : If the machine has a file system (smart card) then it is applicable. Has nothing to do with being serverless. – jdweng Nov 24 '20 at 18:13
  • Lambda is event driven. If you're doing this once, it would be easiest to spin up an EC2 instance with a reasonable amount of disk space, run the program, and get rid of the instance. If you need to do it frequently, create an EC2 AMI with the correct "stuff" and use it to create and run a temporary instance as needed. – stdunbar Nov 24 '20 at 20:14
  • If the use case is that you need to react to a file upload, you can use AWS Batch jobs as an EventBridge (formerly known as CloudWatch Events) event target https://docs.aws.amazon.com/batch/latest/userguide/batch-cwe-target.html – jimmone Nov 26 '20 at 16:49

1 Answer


Starting today (December 1, 2020), you can allocate up to 10 GB of memory. This may be enough for your purposes, at least for now. https://aws.amazon.com/blogs/aws/new-for-aws-lambda-functions-with-up-to-10-gb-of-memory-and-6-vcpus/
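
If you would rather change this setting from code than from the console, the new limit can also be applied through the AWS SDK for .NET. A minimal sketch, assuming the AWSSDK.Lambda NuGet package; the function name "zip-builder" is a placeholder:

    using System.Threading.Tasks;
    using Amazon.Lambda;
    using Amazon.Lambda.Model;

    public static class RaiseMemory
    {
        public static async Task Main()
        {
            using (var lambdaClient = new AmazonLambdaClient())
            {
                // Raise the function's memory to the new 10 GB maximum.
                // "zip-builder" is a placeholder for your function's name.
                await lambdaClient.UpdateFunctionConfigurationAsync(new UpdateFunctionConfigurationRequest
                {
                    FunctionName = "zip-builder",
                    MemorySize = 10240 // in MB; larger sizes also get more vCPUs
                });
            }
        }
    }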

Another option may be to utilize Amazon EFS for storage if you can adapt your code to avoid requiring it all to be in memory. EFS support for Lambda was launched earlier this year. https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/
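
With EFS the archive can be written to the mounted file system instead of a MemoryStream, so memory use stays flat no matter how large the zip grows. A rough sketch, assuming the function is attached to a VPC with an EFS access point mounted at /mnt/efs (a VPC is required for EFS on Lambda), that _s3Client is an IAmazonS3, and reusing the question's GetFileFromS3 helper; bucket and key names are placeholders, and it needs System.IO, System.IO.Compression, and the AWSSDK.S3 package:

    // Build the zip on EFS instead of in memory. "/mnt/efs" is the
    // (assumed) local mount path configured on the function.
    var zipPath = "/mnt/efs/archive.zip";

    using (var fileStream = new FileStream(zipPath, FileMode.Create))
    using (var zip = new ZipArchive(fileStream, ZipArchiveMode.Create))
    {
        for (int i = 0; i < sourceFiles.Count; i++)
        {
            var zipItem = zip.CreateEntry("file" + i.ToString() + ".pdf");
            using (var entryStream = zipItem.Open())
            using (var source = GetFileFromS3(sourceFiles[i])) // same helper as in the question
            {
                await source.CopyToAsync(entryStream);
            }
        }
    }

    // TransferUtility streams the finished file from EFS to S3, so the
    // whole zip never has to fit in memory at once.
    var transfer = new Amazon.S3.Transfer.TransferUtility(_s3Client);
    await transfer.UploadAsync(zipPath, "my-output-bucket", "archive.zip");

One thing to keep in mind: the EFS file system persists between invocations, so you will want to delete the archive after the upload succeeds.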

– stefansundin