Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

2717 questions
59
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR (1 master, 4 slaves), each node an m3.xlarge. I successfully processed a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on…
lauri108
  • 1,083
  • 1
  • 11
  • 16
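This error usually means an executor's total footprint (JVM heap plus off-heap overhead) exceeded the YARN container ceiling. A common mitigation is to trade a little heap for more overhead headroom. A minimal sketch; the values are illustrative, not tuned for any particular instance type:

```python
from pyspark.sql import SparkSession

# Leave more off-heap headroom so YARN does not kill the container.
# On Spark 1.x/2.2 the key is spark.yarn.executor.memoryOverhead;
# newer releases spell it spark.executor.memoryOverhead.
spark = (
    SparkSession.builder
    .appName("memory-overhead-example")
    .config("spark.executor.memory", "8g")                 # heap per executor
    .config("spark.yarn.executor.memoryOverhead", "2g")    # off-heap headroom
    .getOrCreate()
)
```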
49
votes
13 answers

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

I am running a Kinesis plus Spark application (https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html). I am running the command below on an EC2 instance: ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname…
Sam
  • 1,271
  • 5
  • 21
  • 35
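An application stuck in ACCEPTED usually means YARN cannot place the requested containers (or no NodeManagers are registered). One way to rule that out is to submit with explicit, modest resource requests. A sketch only; the class name, jar, and sizes below are hypothetical stand-ins for the question's elided values:

```python
import subprocess

# Submit with small explicit requests so YARN can schedule the app even
# on a lightly provisioned cluster. All names/paths are hypothetical.
subprocess.run([
    "./spark/bin/spark-submit",
    "--class", "com.example.MyStreamingApp",   # hypothetical class
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--num-executors", "2",
    "--executor-memory", "1g",
    "--executor-cores", "1",
    "app.jar",                                 # hypothetical jar
], check=True)
```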
38
votes
7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get…
nickponline
  • 22,615
  • 27
  • 86
  • 138
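Nested JSON is typically mapped with a JSON SerDe over an external table. A sketch issued through Spark SQL with Hive support, assuming the hive-hcatalog-core SerDe is on the classpath (it ships with EMR's Hive); the table name, schema, and S3 path are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Declare an external table over the raw JSON; nested fields map to
# STRUCT columns. Schema and location are illustrative.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id STRING,
        payload STRUCT<action: STRING, ts: BIGINT>
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://my-bucket/events/'
""")

spark.sql("SELECT user_id, payload.action FROM events LIMIT 10").show()
```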
31
votes
7 answers

Extremely slow S3 write times from EMR/ Spark

Does anyone know how to speed up S3 write times from Spark running in EMR? My Spark job takes over 4 hours to complete, but the cluster is only under load during the first 1.5 hours. I was curious what Spark was doing all…
jspooner
  • 10,146
  • 9
  • 54
  • 81
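The classic cause of "cluster idle but job still running" is the v1 file output committer renaming every task file on S3, which is a copy, not a rename. Switching to committer algorithm 2 (or, on newer EMR releases, the EMRFS S3-optimized committer) avoids most of it. A minimal sketch; the S3 paths are hypothetical:

```python
from pyspark.sql import SparkSession

# Algorithm 2 commits task output directly instead of a final
# sequential rename pass, which is very slow against S3.
spark = (
    SparkSession.builder
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/input/")    # hypothetical path
df.write.mode("overwrite").parquet("s3://my-bucket/output/")
```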
27
votes
1 answer

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

I am trying to set up an environment to support exploratory data analytics on a cluster. Based on an initial survey of what's out there, my target is to use Scala/Spark with Amazon EMR to provision the cluster. Currently I'm just trying to get some…
Tim Ryan
  • 970
  • 1
  • 10
  • 19
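Breeze delegates linear algebra to netlib-java, which silently falls back to a pure-JVM implementation unless native BLAS/LAPACK libraries are installed on every node. One way to get them onto an EMR cluster is a bootstrap action. A hedged boto3 sketch; the release label, roles, and the bootstrap script path are hypothetical (the script would run something like "sudo yum install -y atlas-sse3 lapack"):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a cluster whose bootstrap action installs native BLAS/LAPACK
# on every node so netlib-java (and thus Breeze) can pick them up.
response = emr.run_job_flow(
    Name="breeze-native-blas",
    ReleaseLabel="emr-5.36.0",                 # illustrative release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    BootstrapActions=[{
        "Name": "install-native-blas",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install_blas.sh",  # hypothetical
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```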
25
votes
2 answers

AWS VPC identify private and public subnet

I have a VPC in an AWS account and there are 5 subnets associated with that VPC. Subnets are of two types: public and private. How do I identify which subnet is public and which is private? Each subnet has a CIDR in the 10.249.?.? range. Basically, when I launch…
user1846749
  • 1,255
  • 2
  • 14
  • 28
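The distinguishing fact: a subnet is public when its route table sends 0.0.0.0/0 to an internet gateway (igw-…); otherwise it is effectively private. That check is easy to script with boto3. A sketch that, for brevity, only inspects explicitly associated route tables (subnets without one fall back to the VPC's main table); the subnet ID is hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

def is_public(subnet_id):
    """True if the subnet's route table has a 0.0.0.0/0 route to an
    internet gateway; such a subnet is conventionally 'public'."""
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    for table in tables:
        for route in table.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("GatewayId", "").startswith("igw-")):
                return True
    return False

print(is_public("subnet-0123456789abcdef0"))  # hypothetical subnet ID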
24
votes
2 answers

Strange spark ERROR on AWS EMR

I have a really simple PySpark script that creates a DataFrame from some Parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN…
seiya
  • 1,140
  • 1
  • 11
  • 21
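For reference, the script the question describes amounts to roughly the following minimal reconstruction (the S3 path is hypothetical; the actual WARN message is elided in the excerpt):

```python
from pyspark.sql import SparkSession

# Read Parquet from S3, count, print -- the whole job in the question.
spark = SparkSession.builder.appName("parquet-count").getOrCreate()
df = spark.read.parquet("s3://my-bucket/data/")   # hypothetical path
print("record count:", df.count())
```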
23
votes
5 answers

How do I make matplotlib work in AWS EMR Jupyter notebook?

This is very close to this question, but I have added a few details specific to my case: Matplotlib Plotting using AWS-EMR jupyter notebook. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code snippet in…
Matt
  • 4,580
  • 2
  • 23
  • 36
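With EMR's PySpark (Sparkmagic) kernel, code runs on the cluster, so matplotlib must be installed there and the figure shipped back to the notebook. A sketch of the documented pattern, assuming an EMR release (5.26+) whose kernel provides install_pypi_package and the %matplot magic:

```python
# In an EMR Notebook PySpark cell; `sc` is the SparkContext the kernel
# creates. install_pypi_package and %matplot are EMR/Sparkmagic features.
sc.install_pypi_package("matplotlib")

import matplotlib
matplotlib.use("Agg")           # headless backend on the cluster
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])
%matplot plt                    # render the figure back in the notebook
```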
23
votes
3 answers

Amazon EC2 vs. Amazon EMR

I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS, but I don't know anything about AWS. If I deploy it, should I choose Amazon EC2 or Amazon EMR? I want…
Bhavesh Shah
  • 2,989
  • 8
  • 47
  • 69
22
votes
10 answers

pyspark error does not exist in the jvm error when initializing SparkContext

I am using Spark on EMR and writing a PySpark script. I am getting an error when trying to: from pyspark import SparkContext; sc = SparkContext(). This is the error: File "pyex.py", line 5, in sc = SparkContext() File…
thebeancounter
  • 3,373
  • 4
  • 36
  • 80
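This error ("… does not exist in the JVM") is most often a version mismatch between a pip-installed pyspark and the Spark distribution actually on the cluster. One common fix is to point the script at the cluster's own Spark. A sketch using the third-party findspark helper; /usr/lib/spark is EMR's usual install location, but verify it on your release:

```python
import os

# Run against the cluster's Spark rather than a pip-installed copy.
os.environ["SPARK_HOME"] = "/usr/lib/spark"   # EMR's usual location

import findspark          # third-party helper: pip install findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext()
print(sc.version)
```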
21
votes
2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN has correctly allocated all the worker nodes to the Spark job…
retnuH
  • 1,365
  • 1
  • 10
  • 18
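On EMR releases of that era, Spark's defaults request only a couple of executors, leaving most NodeManagers idle. The usual remedies are requesting executors explicitly or enabling dynamic allocation. A sketch in the SparkConf style of Spark 1.5; values are illustrative:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("use-all-nodes")
    # Let YARN grow executors to fill the cluster; the external shuffle
    # service must be enabled for dynamic allocation to work.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
)
sc = SparkContext(conf=conf)
```

Alternatively, EMR 4.x exposes a maximizeResourceAllocation classification that sizes executors to the cluster at launch time.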
21
votes
5 answers

Does an EMR master node know its cluster ID?

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify…
bstempi
  • 1,933
  • 1
  • 15
  • 26
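Yes: EMR writes cluster metadata to local files on every node, and the job flow ID (the j-XXXXXXXX cluster ID) is in /mnt/var/lib/info/job-flow.json. A minimal sketch an agent on the master could run:

```python
import json

# EMR drops instance/cluster metadata under /mnt/var/lib/info on each
# node; job-flow.json carries the cluster (job flow) ID.
with open("/mnt/var/lib/info/job-flow.json") as f:
    job_flow = json.load(f)

print(job_flow["jobFlowId"])   # e.g. "j-..." cluster ID
```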
20
votes
4 answers

How to launch and configure an EMR cluster using boto

I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't for the life of me find an example that shows: how to define the cluster to be used (by cluster_id); how to configure and launch…
eran
  • 12,534
  • 28
  • 87
  • 133
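The question predates boto3, but the modern equivalent is compact: target an existing cluster by its cluster ID and submit a step to it. A sketch; the cluster ID, S3 path, and arguments are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark step to an already-running cluster identified by ID.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",               # hypothetical cluster ID
    Steps=[{
        "Name": "example-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",       # EMR's generic step runner
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],
        },
    }],
)
```

Launching a new cluster instead goes through run_job_flow, as in the bootstrap-action sketch earlier on this page.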
19
votes
6 answers

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the experts to clarify: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance and cost savings by avoiding over- or under-provisioning resources, besides running…
Yuva
  • 1,842
  • 3
  • 17
  • 42
19
votes
2 answers

How to tune spark job on EMR to write huge data quickly on S3

I have a Spark job where I am doing an outer join between two DataFrames. The first DataFrame is 260 GB of text files split into 2200 files, and the second DataFrame is 2 GB. Then I write the DataFrame output, which is…
SUDARSHAN
  • 985
  • 2
  • 24
  • 65
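For a large join written to S3, the usual levers are the output committer (see the committer notes earlier on this page) and the partition count at write time. A sketch under stated assumptions: paths are hypothetical, the join key is the single "value" column that spark.read.text produces, and 2200 simply mirrors the input split count:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Skip the slow sequential rename pass against S3.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

big = spark.read.text("s3://my-bucket/big/")      # hypothetical, ~260 GB
small = spark.read.text("s3://my-bucket/small/")  # hypothetical, ~2 GB

joined = big.join(small, on="value", how="outer")

# Control output file count: too few partitions serializes the upload,
# too many adds per-object S3 overhead.
joined.repartition(2200).write.mode("overwrite").parquet("s3://my-bucket/out/")
```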