
I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in this message so that the recipient knows which cluster the message is about.

Does the master node know its ID (j-*************)? If not, then is there some other piece of identifying information that could allow the message recipient to infer this ID?

I've taken a look through the config files in /home/hadoop/conf, and I haven't found anything useful. I found the ID in /mnt/var/log/instance-controller/instance-controller.log, but it looks like it'll be difficult to grep for. I'm wondering where instance-controller might get that ID from in the first place.

caffreyd
bstempi

5 Answers


You can look at /mnt/var/lib/info/ on the master node to find a lot of info about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId (the cluster ID).

You can use the pre-installed JSON parser (jq) to get the job flow ID:

cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"

(updated as per @Marboni)
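To tie this back to the original question, an agent on the master node could read this file once at startup and tag every message it sends with the cluster ID. A minimal Python sketch (the message format and function names here are my own assumptions; the path exists only on EMR nodes):

```python
import json

JOB_FLOW_INFO = "/mnt/var/lib/info/job-flow.json"  # present on EMR nodes

def read_cluster_id(path=JOB_FLOW_INFO):
    """Return the jobFlowId (cluster ID, j-XXXXXXXX) from job-flow.json."""
    with open(path) as f:
        return json.load(f)["jobFlowId"]

def tag_message(payload, cluster_id):
    """Attach the cluster ID so the central queue knows which cluster sent it."""
    return {"clusterId": cluster_id, "payload": payload}
```

An agent would call `read_cluster_id()` once and reuse the value for every message.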

jc mannem
  • Awesome, I'll check this out! – bstempi Apr 09 '15 at 04:41
  • 3
    See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/Config_JSON.html – ChristopherB Apr 09 '15 at 21:44
  • @jcmannem This folder contains everything that I need. It even avoids AWS API throttling. Filename for my use: /mnt/var/lib/info/job-flow-state.txt. Now the problem is, how can I parse this file? Do you know? If so, I can use the Jackson library. – devsda Aug 12 '16 at 06:35
  • 2
    @devsda You can parse the file with jq that is pre-installed: `cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"` – Marboni Feb 13 '17 at 12:53

You can use the Amazon EC2 API to figure this out. The example below uses shell commands for simplicity; in real life you should use the appropriate API for these steps.

First you should find out your instance ID:

 INSTANCE=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`

Then you can use your instance ID to find out the cluster ID:

ec2-describe-instances $INSTANCE | grep TAG | grep aws:elasticmapreduce:job-flow-id

Hope this helps.
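The same lookup can be done from code. The sketch below covers only the tag-filtering step; the function name is my own, and fetching the tag list (e.g. with boto3's `describe_tags`, shown in the comment) is assumed to happen elsewhere:

```python
def find_cluster_id(tags):
    """Given EC2 tag dicts (the shape DescribeTags returns), pull out the
    aws:elasticmapreduce:job-flow-id value, or None if it is absent."""
    for tag in tags:
        if tag.get("Key") == "aws:elasticmapreduce:job-flow-id":
            return tag.get("Value")
    return None

# On a real node you would fetch the tags first, e.g. with boto3:
#   tags = ec2.describe_tags(
#       Filters=[{"Name": "resource-id", "Values": [instance_id]}])["Tags"]
```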

Vlad
  • The first query is doable without any special permissions. I assume the second one requires the ability to perform EC2 operations, yes? – bstempi Feb 10 '15 at 06:37
  • 1
    Doesn't work for me, get error on 2nd command "Client.InvalidInstanceID.NotFound: The instance ID 'xxxx' does not exist (Service: AmazonEC2; Status Code: 400; Error Code: InvalidInstanceID.NotFound; Request ID: xxxx-xxxx)" – spats Feb 10 '15 at 06:44
  • Weird. Works fine for me. Are you sure you are using correct AWS access keys? – Vlad Feb 11 '15 at 14:02
  • 1
    In case anyone else is wary of that IP, it's an [official AWS IP](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html) used to distribute meta information to EC2 instances. – makhdumi May 03 '16 at 18:52
  • 1
    @spats you probably just need to specify the region your instance is running in: `ec2-describe-instances --region $INSTANCE` – Phil Apr 24 '20 at 10:12

As specified above, the information is in the job-flow.json file. This file has several other attributes, so, knowing where it's located, you can get the ID in a very easy way:

cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d: | cut -f2 -d'"'

Edit: This command works on core nodes too.
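The grep/cut pipeline depends on the file's exact formatting (one key per line, double quotes). Parsing the JSON properly is more robust; a small Python sketch (the function name is my own):

```python
import json

def job_flow_id(raw_json):
    """Return jobFlowId from the text of job-flow.json. Unlike grep/cut,
    this survives reformatting, key reordering, and extra whitespace."""
    return json.loads(raw_json)["jobFlowId"]
```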

chomp
  • This folder contains everything that I need. It even avoids AWS API throttling. Filename for my use: /mnt/var/lib/info/job-flow-state.txt. Now the problem is, how can I parse this file? Do you know? If so, I can use the Jackson library. – devsda Aug 12 '16 at 06:36
  • I don't know the content of that file, so I don't know how to parse it; maybe you should ask another question for that ;) – chomp Aug 12 '16 at 15:29
  • Is it possible to read job-flow.json from a Spark application? `Process p = Runtime.getRuntime().exec("cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d:");` I tried, but it seems that the process input stream doesn't return any result. Thanks – dnocode Sep 15 '18 at 13:44

Another option - query the metadata server:

curl -s http://169.254.169.254/2016-09-02/user-data/ | sed -r 's/.*clusterId":"(j-[A-Z0-9]+)",.*/\1/g'
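The sed pattern is fragile if the user-data layout shifts; the same extraction in Python (the regex and function name are my own, and the user-data shape in the example is an assumption based on the sed command above):

```python
import re

# Matches the clusterId field as it appears in the EMR user-data blob
CLUSTER_ID_RE = re.compile(r'"clusterId"\s*:\s*"(j-[A-Z0-9]+)"')

def extract_cluster_id(user_data):
    """Pull the j-... cluster ID out of the instance user-data text."""
    m = CLUSTER_ID_RE.search(user_data)
    return m.group(1) if m else None
```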
David Rabinowitz

Apparently the Hadoop MapReduce job has no way to know which cluster it is running on - I was surprised to find this out myself.

BUT: you can use other identifiers for each map to uniquely identify the mapper which is running, and the job that is running.

These are specified in the environment variables passed on to each mapper. If you are writing a job in Hadoop streaming, using Python, the code would be:

import os

if 'map_input_file' in os.environ:
    fileName = os.environ['map_input_file']
if 'mapred_tip_id' in os.environ:
    mapper_id = os.environ['mapred_tip_id'].split("_")[-1]
if 'mapred_job_id' in os.environ:
    jobID = os.environ['mapred_job_id']

That gives you: input file name, the task ID, and the job ID. Using one or a combination of those three values, you should be able to uniquely identify which mapper is running.

If you are looking for a specific job: "mapred_job_id" might be what you want.
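Combining those values into a single sender identity for the message might look like this (the fallback strings and function name are my own assumptions for when a variable is unset):

```python
import os

def mapper_identity(environ=os.environ):
    """Build a best-effort identity string from Hadoop streaming env vars."""
    job_id = environ.get("mapred_job_id", "unknown-job")
    tip_id = environ.get("mapred_tip_id", "")
    mapper_id = tip_id.split("_")[-1] if tip_id else "unknown-task"
    return f"{job_id}/{mapper_id}"
```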

Suman
  • 8,407
  • 5
  • 43
  • 61