
I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job flows, but I can't for the life of me find an example that shows:

  1. How to define the cluster to be used (by cluster_id)
  2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?

– eran, George Stocker

4 Answers


Boto and the underlying EMR API are currently mixing the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.

You create a new cluster by calling the connection's run_jobflow() method. It returns the cluster ID which EMR generates for you.

First, all the mandatory setup:

#!/usr/bin/env python

import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))

Finally we start a new cluster:

cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

We can also print the cluster ID if we care about that:

print "Starting cluster", cluster_id
– Vilsepi
  • Any updates to this answer with `boto3` instead of boto? – Navneet Dec 27 '16 at 21:17
  • @Vilsepi this gives me the error "Amazon EMR Cluster (Cluster made in python) has terminated with errors at 2017-10-02 08:21 UTC with a reason of VALIDATION_ERROR." Ideas? – thebeancounter Oct 02 '17 at 08:26

I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:

import boto3

client = boto3.client('emr', region_name='us-east-1')

response = client.run_job_flow(
    Name="Boto3 test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)

Notes: you'll have to create EMR_EC2_DefaultRole and EMR_DefaultRole. The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.
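The response from run_job_flow carries the new cluster's ID under 'JobFlowId', and that same ID is what you pass to add_job_flow_steps to run a job on an existing cluster (the first part of the question). Below is a minimal sketch: the cluster ID, bucket, and script names are hypothetical, and the network calls are left commented out so only the payload construction runs.

```python
# Build a Steps payload for add_job_flow_steps. command-runner.jar lets EMR
# run arbitrary commands (e.g. spark-submit) on the master node.
def build_step(name, args):
    return {
        'Name': name,
        'ActionOnFailure': 'CONTINUE',  # keep the cluster alive if the step fails
        'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': args},
    }

step = build_step('My job', ['spark-submit', 's3://my-bucket/my_job.py'])

# client = boto3.client('emr', region_name='us-east-1')
# cluster_id = response['JobFlowId']  # from the run_job_flow call above
# client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```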

– Jose Quinteiro
  • +1 for listing out the [new API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow) method – y2k-shubham Oct 29 '18 at 12:27
  • Hi, a bit late to the thread, do you know if it's possible to create and launch a cluster with a Custom AMI using boto3? – Gudzo Jul 25 '19 at 07:27
  • I've added an example to GitHub that shows how to create both short- and long-lived clusters and add steps using Boto3. It's here: [aws-doc-sdk-examples](https://github.com/awsdocs/aws-doc-sdk-examples/blob/904ccdab93220d59eb45bdc9c299288036406e5d/python/example_code/emr/emr_basics.py#L18). – Laren Crawford Aug 25 '20 at 18:49

I use the following code to create an EMR cluster with Flink installed, including 3 instance groups. Reference document: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow

import boto3

masterInstanceType = 'm4.large'
coreInstanceType = 'c3.xlarge'
taskInstanceType = 'm4.large'
coreInstanceNum = 2
taskInstanceNum = 2
clusterName = 'my-emr-name'

emrClient = boto3.client('emr')

logUri = 's3://bucket/xxxxxx/'
releaseLabel = 'emr-5.17.0' #emr version
instances = {
    'Ec2KeyName': 'my_keyxxxxxx',
    'Ec2SubnetId': 'subnet-xxxxxx',
    'ServiceAccessSecurityGroup': 'sg-xxxxxx',
    'EmrManagedMasterSecurityGroup': 'sg-xxxxxx',
    'EmrManagedSlaveSecurityGroup': 'sg-xxxxxx',
    'KeepJobFlowAliveWhenNoSteps': True,
    'TerminationProtected': False,
    'InstanceGroups': [{
        'InstanceRole': 'MASTER',
        'InstanceCount': 1,
        'InstanceType': masterInstanceType,
        'Market': 'SPOT',
        'Name': 'Master'
    }, {
        'InstanceRole': 'CORE',
        'InstanceCount': coreInstanceNum,
        'InstanceType': coreInstanceType,
        'Market': 'SPOT',
        'Name': 'Core'
    }, {
        'InstanceRole': 'TASK',
        'InstanceCount': taskInstanceNum,
        'InstanceType': taskInstanceType,
        'Market': 'SPOT',
        'Name': 'Task'
    }]
}
bootstrapActions = [{
    'Name': 'Log to Cloudwatch Logs',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/bootstrap_cwl.sh'
    }
}, {
    'Name': 'Custom action',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/install.sh'
    }
}]
applications = [{'Name': 'Flink'}]
serviceRole = 'EMR_DefaultRole'
jobFlowRole = 'EMR_EC2_DefaultRole'
tags = [{'Key': 'keyxxxxxx', 'Value': 'valuexxxxxx'},
        {'Key': 'key2xxxxxx', 'Value': 'value2xxxxxx'}
        ]
steps = [
    {
        'Name': 'Run Flink',
        'ActionOnFailure': 'TERMINATE_JOB_FLOW',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['flink', 'run',
                     '-m', 'yarn-cluster',
                     '-p', str(taskInstanceNum),
                     '-yjm', '1024',
                     '-ytm', '1024',
                     '/home/hadoop/test-1.0-SNAPSHOT.jar'
                     ]
        }
    },
]
configurations = []  # application configuration overrides; none needed here
response = emrClient.run_job_flow(
    Name=clusterName,
    LogUri=logUri,
    ReleaseLabel=releaseLabel,
    Instances=instances,
    Steps=steps,
    Configurations=configurations,
    BootstrapActions=bootstrapActions,
    Applications=applications,
    ServiceRole=serviceRole,
    JobFlowRole=jobFlowRole,
    Tags=tags
)
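The three groups above all use Market 'SPOT' without a BidPrice, in which case EMR caps the spot bid at the on-demand price. To set an explicit cap, add a BidPrice string per group. A small helper sketch (the prices and names are hypothetical):

```python
def spot_group(role, name, instance_type, count, bid_price=None):
    """Build one InstanceGroups entry; BidPrice is optional and, when
    omitted, EMR defaults the bid cap to the on-demand price."""
    group = {
        'InstanceRole': role,
        'Name': name,
        'InstanceType': instance_type,
        'InstanceCount': count,
        'Market': 'SPOT',
    }
    if bid_price is not None:
        group['BidPrice'] = bid_price  # max USD per hour, as a string
    return group

groups = [
    spot_group('MASTER', 'Master', 'm4.large', 1),
    spot_group('TASK', 'Task', 'm4.large', 2, bid_price='0.10'),
]
```

These dicts drop straight into the `Instances['InstanceGroups']` list used above.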
– shifu.zheng
  • Hi, a bit late to the thread, do you know if it's possible to create and launch a cluster with a Custom AMI using boto3? – Gudzo Jul 25 '19 at 07:27
  • I don't know if there is a way to create a EMR using custom AMI. But a workaround is to set up your software in bootstrapActions. – shifu.zheng Jul 30 '19 at 00:49
  • https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow has an option for a custom AMI ID. – Pruthvi Raj Nov 06 '19 at 22:57

My step Arguments are: `bash -c /usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar`

Trying to execute the same run_job_flow call, I get this error:

Cannot run program "/usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar" (in directory "."): error=2, No such file or directory

Executing the same command from the master node works fine, but not from Python boto3.

It seems the issue is due to the quotation marks which EMR or boto3 adds around the Arguments.

UPDATE:

Split all your Arguments on whitespace. That is, if you need to execute `flink run myflinkjob.jar`, pass your Arguments as this list:

['flink','run','myflinkjob.jar']
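If the command starts out as a single string, the standard-library shlex.split produces exactly this kind of whitespace-split list (and keeps quoted arguments intact):

```python
import shlex

# Split a shell-style command string into an Args list for HadoopJarStep
cmd = "flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar"
args = shlex.split(cmd)
# → ['flink', 'run', '-m', 'yarn-cluster', '-yn', '2', '/home/hadoop/mysflinkjob.jar']
```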

– ADV-IT