Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

2717 questions
59
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR (1 master, 4 slaves), each node an m3.xlarge. I successfully processed a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on…
lauri108
  • 1,083
  • 1
  • 11
  • 16
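This error usually means an executor's total footprint (JVM heap plus off-heap overhead) exceeded the YARN container ceiling. A common mitigation is to trade a little heap for more overhead headroom. A minimal sketch; the values are illustrative, not tuned for any particular instance type:

```python
from pyspark.sql import SparkSession

# Leave more off-heap headroom so YARN does not kill the container.
# On Spark 1.x/2.2 the key is spark.yarn.executor.memoryOverhead;
# newer releases spell it spark.executor.memoryOverhead.
spark = (
    SparkSession.builder
    .appName("memory-overhead-example")
    .config("spark.executor.memory", "8g")                 # heap per executor
    .config("spark.yarn.executor.memoryOverhead", "2g")    # off-heap headroom
    .getOrCreate()
)
```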
49
votes
13 answers

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html). I am running the command below on an EC2 instance: ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname…
Sam
  • 1,271
  • 5
  • 21
  • 35
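An application stuck in ACCEPTED usually means YARN cannot place the requested containers (or no NodeManagers are registered). One way to rule that out is to submit with explicit, modest resource requests. A sketch only; the class name, jar, and sizes below are hypothetical stand-ins for the question's elided values:

```python
import subprocess

# Submit with small explicit requests so YARN can schedule the app even
# on a lightly provisioned cluster. All names/paths are hypothetical.
subprocess.run([
    "./spark/bin/spark-submit",
    "--class", "com.example.MyStreamingApp",   # hypothetical class
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--num-executors", "2",
    "--executor-memory", "1g",
    "--executor-cores", "1",
    "app.jar",                                 # hypothetical jar
], check=True)
```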
38
votes
7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get…
nickponline
  • 22,615
  • 27
  • 86
  • 138
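Nested JSON is typically mapped with a JSON SerDe over an external table. A sketch issued through Spark SQL with Hive support, assuming the hive-hcatalog-core SerDe is on the classpath (it ships with EMR's Hive); the table name, schema, and S3 path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Declare an external table over the raw JSON; nested fields map to
# STRUCT columns. Schema and location are illustrative.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id STRING,
        payload STRUCT<action: STRING, ts: BIGINT>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://my-bucket/events/'
""")

spark.sql("SELECT user_id, payload.action FROM events LIMIT 10").show()
```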
31
votes
7 answers

Extremely slow S3 write times from EMR/ Spark

Does anyone know how to speed up S3 write times from Spark running in EMR? My Spark job takes over 4 hours to complete, but the cluster is only under load during the first 1.5 hours. I was curious what Spark was doing all…
jspooner
  • 10,146
  • 9
  • 54
  • 81
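The classic cause of "cluster idle but job still running" is the v1 file output committer renaming every task file on S3, which is a copy, not a rename. Switching to committer algorithm 2 (or, on newer EMR releases, the EMRFS S3-optimized committer) avoids most of it. A minimal sketch; the S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

# Algorithm 2 commits task output directly instead of a final
# sequential rename pass, which is very slow against S3.
spark = (
    SparkSession.builder
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/input/")    # hypothetical path
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```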
27
votes
1 answer

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster. Currently I'm just trying to get some…
Tim Ryan
  • 970
  • 1
  • 10
  • 19
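Breeze delegates linear algebra to netlib-java, which silently falls back to a pure-JVM implementation unless native BLAS/LAPACK libraries are installed on every node. One way to get them onto an EMR cluster is a bootstrap action. A hedged boto3 sketch; the release label, roles, and the bootstrap script path are hypothetical (the script would run something like "sudo yum install -y atlas-sse3 lapack"):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a cluster whose bootstrap action installs native BLAS/LAPACK
# on every node so netlib-java (and thus Breeze) can pick them up.
response = emr.run_job_flow(
    Name="breeze-native-blas",
    ReleaseLabel="emr-5.36.0",                 # illustrative release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    BootstrapActions=[{
        "Name": "install-native-blas",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install_blas.sh",  # hypothetical
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```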
25
votes
2 answers

AWS VPC identify private and public subnet

I have a VPC in an AWS account and there are 5 subnets associated with that VPC. Subnets are of two types: public and private. How do I identify which subnet is public and which is private? Each subnet has a CIDR in the 10.249.?.? range. Basically, when I launch…
user1846749
  • 1,255
  • 2
  • 14
  • 28
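The distinguishing fact: a subnet is public when its route table sends 0.0.0.0/0 to an internet gateway (igw-…); otherwise it is effectively private. That check is easy to script with boto3. A sketch that, for brevity, only inspects explicitly associated route tables (subnets without one fall back to the VPC's main table); the subnet ID is hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

def is_public(subnet_id):
    """True if the subnet's route table has a 0.0.0.0/0 route to an
    internet gateway; such a subnet is conventionally 'public'."""
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    for table in tables:
        for route in table.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("GatewayId", "").startswith("igw-")):
                return True
    return False

print(is_public("subnet-0123456789abcdef0"))  # hypothetical subnet ID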
24
votes
2 answers

Strange spark ERROR on AWS EMR

I have a really simple PySpark script that creates a DataFrame from some Parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN…
seiya
  • 1,140
  • 1
  • 11
  • 21
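For reference, the script the question describes amounts to roughly the following minimal reconstruction (the S3 path is hypothetical; the actual WARN message is elided in the excerpt):

```python
from pyspark.sql import SparkSession

# Read Parquet from S3, count, print -- the whole job in the question.
spark = SparkSession.builder.appName("parquet-count").getOrCreate()
df = spark.read.parquet("s3://my-bucket/data/")   # hypothetical path
print("record count:", df.count())
```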
23
votes
5 answers

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to this question, but I have added a few details specific to my case: Matplotlib Plotting using AWS-EMR jupyter notebook. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code snippet in…
Matt
  • 4,580
  • 2
  • 23
  • 36
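With EMR's PySpark (Sparkmagic) kernel, code runs on the cluster, so matplotlib must be installed there and the figure shipped back to the notebook. A sketch of the documented pattern, assuming an EMR release (5.26+) whose kernel provides install_pypi_package and the %matplot magic:

```python
# In an EMR Notebook PySpark cell; `sc` is the SparkContext the kernel
# creates. install_pypi_package and %matplot are EMR/Sparkmagic features.
sc.install_pypi_package("matplotlib")

import matplotlib
matplotlib.use("Agg")           # headless backend on the cluster
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])
%matplot plt                    # render the figure back in the notebook
```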
23
votes
3 answers

Amazon EC2 vs. Amazon EMR

I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS, but I don't know anything about AWS. If I deploy it, should I choose Amazon EC2 or Amazon EMR? I want…
Bhavesh Shah
  • 2,989
  • 8
  • 47
  • 69
22
votes
10 answers

pyspark error does not exist in the jvm error when initializing SparkContext

I am using Spark on EMR and writing a PySpark script. I am getting an error when trying to: from pyspark import SparkContext; sc = SparkContext(). This is the error: File "pyex.py", line 5, in sc = SparkContext() File…
thebeancounter
  • 3,373
  • 4
  • 36
  • 80
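This error ("… does not exist in the JVM") is most often a version mismatch between a pip-installed pyspark and the Spark distribution actually on the cluster. One common fix is to point the script at the cluster's own Spark. A sketch using the third-party findspark helper; /usr/lib/spark is EMR's usual install location, but verify it on your release:

```python
import os

# Run against the cluster's Spark rather than a pip-installed copy.
os.environ["SPARK_HOME"] = "/usr/lib/spark"   # EMR's usual location

import findspark          # third-party helper: pip install findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext()
print(sc.version)
```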
21
votes
2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN has correctly allocated all the worker nodes to the Spark job…
retnuH
  • 1,365
  • 1
  • 10
  • 18
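On EMR releases of that era, Spark's defaults request only a couple of executors, leaving most NodeManagers idle. The usual remedies are requesting executors explicitly or enabling dynamic allocation. A sketch in the SparkConf style of Spark 1.5; values are illustrative:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("use-all-nodes")
    # Let YARN grow executors to fill the cluster; the external shuffle
    # service must be enabled for dynamic allocation to work.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
)
sc = SparkContext(conf=conf)
```

Alternatively, EMR 4.x exposes a maximizeResourceAllocation classification that sizes executors to the cluster at launch time.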
21
votes
5 answers

Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify…
bstempi
  • 1,933
  • 1
  • 15
  • 26
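Yes: EMR writes cluster metadata to local files on every node, and the job flow ID (the j-XXXXXXXX cluster ID) is in /mnt/var/lib/info/job-flow.json. A minimal sketch an agent on the master could run:

```python
import json

# EMR drops instance/cluster metadata under /mnt/var/lib/info on each
# node; job-flow.json carries the cluster (job flow) ID.
with open("/mnt/var/lib/info/job-flow.json") as f:
    job_flow = json.load(f)

print(job_flow["jobFlowId"])   # e.g. "j-..." cluster ID
```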
20
votes
4 answers

How to launch and configure an EMR cluster using boto

I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't for the life of me find an example that shows: how to define the cluster to be used (by cluster_id); how to configure and launch…
eran
  • 12,534
  • 28
  • 87
  • 133
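The question predates boto3, but the modern equivalent is compact: target an existing cluster by its cluster ID and submit a step to it. A sketch; the cluster ID, S3 path, and arguments are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark step to an already-running cluster identified by ID.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "example-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",       # EMR's generic step runner
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],
        },
    }],
)
```

Launching a new cluster instead goes through run_job_flow, as in the bootstrap-action sketch earlier on this page.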
19
votes
6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the experts to clarify: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance and cost savings by avoiding over- or under-provisioning resources, besides running…
Yuva
  • 1,842
  • 3
  • 17
  • 42
19
votes
2 answers

How to tune spark job on EMR to write huge data quickly on S3

I have a Spark job where I am doing an outer join between two DataFrames. The first DataFrame is 260 GB of text files split into 2200 files, and the second DataFrame is 2 GB. Then I write the DataFrame output, which is…
SUDARSHAN
  • 985
  • 2
  • 24
  • 65
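For a large join written to S3, the usual levers are the output committer (see the committer notes earlier on this page) and the partition count at write time. A sketch under stated assumptions: paths are hypothetical, the join key is the single "value" column that spark.read.text produces, and 2200 simply mirrors the input split count:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Skip the slow sequential rename pass against S3.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

big = spark.read.text("s3://my-bucket/big/")      # hypothetical, ~260 GB
small = spark.read.text("s3://my-bucket/small/")  # hypothetical, ~2 GB

joined = big.join(small, on="value", how="outer")

# Control output file count: too few partitions serializes the upload,
# too many adds per-object S3 overhead.
joined.repartition(2200).write.mode("overwrite").parquet("s3://my-bucket/out/")
```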