
I would like to launch multiple Amazon EC2 spot instances (fleet?) using a custom AMI (docker?) for performing a deep-learning training task. I would like all the instances to share a common set of files for the purposes of training the model.

The idea here is not to lose training history and keep a backup in EBS (network drive?) when the spot instance is terminated by AWS due to pricing-limit/demand. The task state can be updated in a file and then resumed when instances are available.

Is it possible to launch all instances and let them work cooperatively to complete the training task? What kind of a setup could accomplish this?

John Rotenstein

1 Answer


Firstly, you might be interested in the Deep Learning AMI from the AWS Marketplace, which comes fully-configured with popular Deep Learning tools.

If the software you are using needs to save its data to a local file system (as opposed to Amazon S3), then you could use Amazon Elastic File System (EFS) to share a file system amongst multiple Amazon EC2 instances (including Spot instances). Amazon EFS is similar to a NAS and can be mounted simultaneously on multiple instances.
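
As a rough illustration of how training state could survive a Spot termination, here is a minimal sketch that keeps a small state file on the shared file system and resumes from it. The function names, the JSON layout, and the `epoch` counter are hypothetical and only for illustration; a real job would also save model weights with its framework's own checkpoint mechanism. It assumes a single writer per state file.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state to a temp file, then rename it over the old one,
    so a termination mid-write does not leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_state(path, default):
    """Return the saved state, or `default` if no state file exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def train(path, total_epochs):
    """Resume from the last completed epoch recorded in the state file."""
    state = load_state(path, {"epoch": 0})
    for epoch in range(state["epoch"], total_epochs):
        # Placeholder: one epoch of real training would run here.
        state["epoch"] = epoch + 1
        save_state(path, state)
    return state
```

With the state file on the shared mount (for example a hypothetical `/mnt/efs/state.json`), a replacement instance calling `train` picks up at the first epoch that was not recorded as finished.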

The EFS volume could be mounted via a User Data script, together with a setup script to load and run your desired application (which can be easier than making a new AMI).
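
A User Data script along those lines might be generated as in the sketch below. The file system ID, region, mount point, and start command are placeholders, not values from the question. It assumes the AMI already has an NFS client installed and that the instance's security group allows NFS traffic (port 2049) to the EFS mount target. The base64 helper is there because the EC2 API expects User Data in a launch specification to be base64-encoded.

```python
import base64


def build_user_data(fs_id, region, mount_point="/mnt/efs",
                    start_cmd="echo 'start training here'"):
    """Build a bash User Data script that mounts EFS and starts the job."""
    # EFS DNS names follow the pattern <fs-id>.efs.<region>.amazonaws.com
    dns_name = f"{fs_id}.efs.{region}.amazonaws.com"
    # NFS mount options commonly recommended for EFS
    options = "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
    lines = [
        "#!/bin/bash",
        f"mkdir -p {mount_point}",
        f"mount -t nfs4 -o {options} {dns_name}:/ {mount_point}",
        start_cmd,  # placeholder for your own setup/launch script
    ]
    return "\n".join(lines) + "\n"


def encode_user_data(script):
    """Base64-encode the script for use in a launch specification."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # "fs-12345678" and "us-east-1" are made-up example values.
    print(build_user_data("fs-12345678", "us-east-1"))
```

Keeping the mount and launch logic in User Data means the same stock AMI can be reused, and only the script changes between experiments.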

John Rotenstein
  • Thanks for pointing out the DL AMI. Your inputs are greatly appreciated. As I see, the spot instance fleet is a very valuable and cost-effective tool in AWS. I will experiment and post my learning on this thread. I am also looking at their API to automate some of the tasks. – svanimisetti Feb 21 '17 at 03:14
  • Hi @SampathVanimisetti, if this or any answer has solved your question please consider [accepting it](http://meta.stackexchange.com/q/5234/179419) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – John Rotenstein Feb 21 '17 at 03:40
  • Apologies! New around here as you may have noticed. I tried upvoting, but it seems I need reputation points before I am able to do so. I have accepted the answer. – svanimisetti Feb 21 '17 at 21:42
  • I am assuming that the options you indicated can be instantiated using the Amazon EC2 API. If I use the API to launch a spot instance fleet, some of the instances _may_ be terminated due to pricing/demand. I understand there are different [storage options](http://stackoverflow.com/questions/29575877/aws-efs-vs-ebs-vs-s3-differences-when-to-use). What is the most cost-efficient option for training and for storing DL training data and models? I can see that the spot instance fleet _may_ need to be started multiple times before the training can be completed. – svanimisetti Feb 21 '17 at 21:55
  • As mentioned in my answer, you might consider using Amazon EFS (a shared disk between instances). Other choices are Amazon S3 (but this would need software to specifically handle it) or a database hosted outside of the Spot instances (eg Amazon RDS or Amazon DynamoDB). – John Rotenstein Feb 21 '17 at 22:16
  • Thank you once again, John. Are there any tutorials or walkthroughs that can help me get started with what I want to achieve? For example, high-level steps like this: 1. Create disk image with software stack for AMI. 2. Store it in a particular location in S3/EFS. 3. Create a VMI and define some storage and network configuration. 4. Define a spot instance fleet with some configuration. 5. Launch spot instance fleet with cost / budget constraints. 6. Keep checking for some criteria that define completion of training job. 7. Keep launching the fleet until the criteria are met. – svanimisetti Feb 24 '17 at 22:55
  • There is unlikely to be a tutorial for your specific use-case, but there's lots of information out there about each individual step. The [AWS Documentation](https://aws.amazon.com/documentation/) and [re:Invent 2016 presentations](http://aws-reinvent-audio.s3-website.us-east-2.amazonaws.com/2016/2016.html) are a great resource. – John Rotenstein Feb 26 '17 at 05:00
  • Sampath, did you find any further information? I am interested in the same thing. https://aws.amazon.com/hpc/cfncluster/ handles spot instances and job scheduling and works with AMIs (but not Docker images?). Then https://seanbe.github.io/blog/deep-learning-aws/ discusses nvidia-docker (but not job scheduling or dealing with spot instances). – seanv507 Jul 21 '17 at 14:13