Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1202 questions
59 votes, 5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully processed a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108 • 1,083 • 1 • 11 • 16
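The error in that title typically means the executor JVM plus its off-heap overhead outgrew the YARN container, and the common fix is raising `spark.yarn.executor.memoryOverhead` (or lowering executor memory). A minimal sketch of the arithmetic, assuming the commonly documented default overhead of max(384 MB, 10% of executor memory) — check the docs for your Spark version:

```python
def yarn_container_limit_mb(executor_memory_mb, overhead_fraction=0.10, overhead_min_mb=384):
    """Approximate the YARN physical-memory cap for a Spark executor.

    Assumes the commonly documented default overhead of
    max(384 MB, 10% of executor memory); verify against your Spark version.
    """
    overhead = max(overhead_min_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead

# A 9.5 GB executor is capped near 10.4 GB once the default overhead is added:
print(yarn_container_limit_mb(9728))  # 9728 + 972 = 10700 MB, i.e. roughly the 10.4 GB in the error
```

Raising the overhead (or shrinking the executor) moves that cap, which is usually enough to stop YARN killing the container.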
43 votes, 5 answers

How to bootstrap installation of Python modules on Amazon EMR?

I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
Evan Zamir • 6,460 • 11 • 44 • 68
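The usual answer is a bootstrap action: a shell script stored in S3 (containing, say, `sudo pip install arrow`) that EMR runs on every node at startup. A hedged sketch of wiring that up through boto3's `run_job_flow` — the bucket and script names here are hypothetical placeholders, and a real call would also need an `Instances` section:

```python
# Hypothetical names: "my-bucket" and install_deps.sh are placeholders.
# install_deps.sh would contain something like:  sudo pip install arrow
bootstrap_actions = [
    {
        "Name": "Install Python deps",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install_deps.sh",
            "Args": [],
        },
    }
]

def job_flow_kwargs(bootstrap_actions):
    """Build (partial) kwargs for boto3's emr_client.run_job_flow(...);
    a real call also needs Instances, roles, etc."""
    return {
        "Name": "spark-with-deps",
        "ReleaseLabel": "emr-4.2.0",
        "Applications": [{"Name": "Spark"}],
        "BootstrapActions": bootstrap_actions,
    }

print(job_flow_kwargs(bootstrap_actions)["BootstrapActions"][0]["Name"])
```

The same bootstrap action can be attached from the EMR console's "Bootstrap Actions" step when launching the cluster.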
38 votes, 7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance but I don't know what to type in the Hive console to get…
nickponline • 22,615 • 27 • 86 • 138
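It is possible — the standard route is a JSON SerDe (e.g. `org.openx.data.jsonserde.JsonSerDe`) declared in the table's `ROW FORMAT`. An alternative that avoids SerDe setup entirely is to flatten the nested JSON into plain delimited columns before loading. A toy flattener, assuming dot-joined column names (names like `user.geo.lat` are illustrative only):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b': value} so each key can map to a Hive column."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

record = {"id": 7, "user": {"name": "ada", "geo": {"lat": 51.5}}}
print(flatten(record))  # {'id': 7, 'user.name': 'ada', 'user.geo.lat': 51.5}
```

Once flattened, each record can be written as a tab-separated line matching a `ROW FORMAT DELIMITED` table definition.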
28 votes, 4 answers

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart YARN to bring the changes into effect. Is there a command I can use to do this?
nish • 6,230 • 17 • 60 • 113
28 votes, 2 answers

Compress file on S3

I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed. I know that by compressing it, it'll be about 2.2GB (gzip). How can I download this file locally as quickly as possible when transfer is the…
Matt Joiner • 100,604 • 94 • 332 • 495
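Since gzip shrinks this file roughly 8×, the usual trick is to compress it on an EC2 instance (or EMR node) in the same region as the bucket and then download only the compressed copy — in-region transfer from S3 is fast. A sketch of streaming compression in Python, so the 17.7 GB never has to fit in memory; the in-memory buffers below stand in for a boto3 streaming download and a local output file:

```python
import gzip
import io
import shutil

def gzip_stream(src, dst, chunk_size=1 << 20):
    """Compress a binary stream chunk by chunk without loading it all into memory."""
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, chunk_size)

# Stand-ins for the real streams: src would be s3_object["Body"],
# dst an open local file ("wb").
src = io.BytesIO(b"row,row,row\n" * 10000)
dst = io.BytesIO()
gzip_stream(src, dst)
print(f"{src.getbuffer().nbytes} bytes -> {dst.getbuffer().nbytes} bytes")
```

Another option is re-emitting the data from Hive with output compression enabled, so S3 only ever holds the ~2.2 GB version.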
27 votes, 3 answers

How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive over EMR and need to copy some files to all EMR instances. One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy them to HDFS. However, I haven't found a simple way to…
Tomer • 849 • 3 • 11 • 18
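The standard tool for this is s3-dist-cp, which copies from S3 into HDFS in parallel across the cluster; on emr-4.x and later it is invoked through `command-runner.jar`. A sketch of the step definition you would submit, e.g. via boto3's `add_job_flow_steps` — the bucket and paths are hypothetical:

```python
def s3_to_hdfs_step(src, dest):
    """Build an EMR step that runs s3-dist-cp (assumes an emr-4.x+ release
    where command-runner.jar is available)."""
    return {
        "Name": f"Copy {src} to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

# Hypothetical locations:
step = s3_to_hdfs_step("s3://my-bucket/input/", "hdfs:///input/")
print(step["HadoopJarStep"]["Args"])
```

For a one-off copy you can also SSH to the master node and run the same `s3-dist-cp` command directly.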
25 votes, 2 answers

SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

I am trying to run a simple SQL query on S3 events using Spark. I am loading ~30GB of JSON files as follows: val d2 =…
eexxoo • 280 • 1 • 4 • 10
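This error comes from Spark's long-standing cap of roughly 2 GB on a single shuffle or cached block (blocks are backed by byte arrays indexed with `Integer.MAX_VALUE`). The usual fix is more partitions, so no single partition approaches 2 GB. A back-of-envelope helper, assuming a target of ~128 MB per partition (the target size is a rule of thumb, not a Spark constant):

```python
import math

TWO_GB = 2 * 1024**3  # a single Spark block must stay below roughly this size

def partition_count(total_bytes, target_partition_mb=128):
    """Choose a repartition() count keeping partitions far below the ~2 GB block cap."""
    target = target_partition_mb * 1024**2
    assert target < TWO_GB
    return max(1, math.ceil(total_bytes / target))

print(partition_count(30 * 1024**3))  # ~30 GB of JSON -> 240 partitions
```

Passing that count to `repartition(n)` (or raising `spark.sql.shuffle.partitions`) typically makes the error go away.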
23 votes, 3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/), but I cannot read data from different dates in AWS EMR Spark because some column types do not match, and I get one of…
V. Samma • 2,308 • 4 • 25 • 33
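Spark can merge compatible Parquet schemas at read time (`spark.read.option("mergeSchema", "true")`), but genuinely conflicting column types across days usually have to be reconciled by hand, e.g. by casting each day to a common widened schema before unioning. A toy sketch of that widening step — the `int → long → double` ordering is an illustrative assumption, and non-numeric conflicts would need explicit casts:

```python
# Hypothetical numeric widening order; non-numeric conflicts need explicit handling.
WIDEN_RANK = {"int": 0, "long": 1, "double": 2}

def common_schema(schemas):
    """Merge per-day {column: type} schemas, widening numeric type conflicts."""
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            if col not in merged:
                merged[col] = typ
            elif merged[col] != typ:
                merged[col] = max(merged[col], typ, key=WIDEN_RANK.__getitem__)
    return merged

day1 = {"id": "long", "price": "int"}
day2 = {"id": "long", "price": "double", "currency": "string"}
print(common_schema([day1, day2]))  # {'id': 'long', 'price': 'double', 'currency': 'string'}
```

Each daily DataFrame can then be cast to the merged schema before `unionAll`, which avoids the type-mismatch failure at read time.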
22 votes, 3 answers

Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH…
seedhead • 3,444 • 4 • 27 • 38
21 votes, 2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN has correctly allocated all the worker nodes to the Spark job…
retnuH • 1,365 • 1 • 10 • 18
21 votes, 4 answers

Spark resources not fully allocated on Amazon EMR

I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I…
Michel Lemay • 1,944 • 1 • 13 • 34
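Under-allocation usually comes from default executor settings: EMR's defaults leave most of each node idle unless you set `--num-executors`, `--executor-cores`, and `--executor-memory` (or enable EMR's `maximizeResourceAllocation`). A rough sizing sketch — the m3.xlarge figures used below (4 vCPUs, ~11 GB usable YARN memory per node) are assumptions to check against `yarn.nodemanager.resource.memory-mb` on your cluster:

```python
def executor_plan(worker_nodes, vcores_per_node, yarn_mem_mb_per_node,
                  cores_per_executor=2, overhead_fraction=0.10):
    """Rough executor sizing that tries to use a whole node's cores and memory,
    leaving room for the YARN memory overhead."""
    executors_per_node = max(1, vcores_per_node // cores_per_executor)
    mem_per_executor = int(yarn_mem_mb_per_node / executors_per_node / (1 + overhead_fraction))
    return {
        "num_executors": worker_nodes * executors_per_node,
        "executor_cores": cores_per_executor,
        "executor_memory_mb": mem_per_executor,
    }

# Assumed m3.xlarge numbers: 4 vcores, ~11 GB YARN memory per worker node.
print(executor_plan(worker_nodes=2, vcores_per_node=4, yarn_mem_mb_per_node=11264))
```

The resulting numbers map directly onto the `spark-submit` flags above; dynamic allocation (`spark.dynamicAllocation.enabled`) is the alternative when workloads vary.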
21 votes, 7 answers

Pyspark --py-files doesn't work

I'm using this as the documentation suggests (http://spark.apache.org/docs/1.1.1/submitting-applications.html), Spark version 1.1.0: ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /home/hadoop/loganalysis/ship-test.py and conf in…
C19 • 678 • 2 • 7 • 15
20 votes, 4 answers

AWS EMR Spark Python Logging

I'm running a very simple Spark job on AWS EMR and can't seem to get any log output from my script. I've tried printing to stderr: from pyspark import SparkContext import sys if __name__ == '__main__': sc =…
jarbaugh • 423 • 4 • 10
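Output from the driver and executors lands in the YARN container logs rather than the step's stdout; it can be retrieved with `yarn logs -applicationId <id>` on the master node, or from the cluster's S3 log URI if one was configured. Within Python, the stdlib logging module pointed at stderr is usually enough — a minimal sketch:

```python
import logging
import sys

def get_logger(name="emr-job", level=logging.INFO):
    """Logger writing to stderr; on EMR the output lands in the YARN container logs."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

log = get_logger()
log.info("job started")
```

Note that code running inside executors logs into each executor's own container log, not the driver's, so all containers' logs may need to be collected.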
20 votes, 5 answers

Spark on yarn mode end with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1TB of data into Spark on AWS using the latest EMR. The running time is so long that it doesn't finish even in 6 hours, but after running for 6h30m I get an error announcing that the container was released on a lost node…
John Zeng • 914 • 2 • 7 • 20
19 votes, 2 answers

How do you delete an AWS EMR Cluster?

I've been playing around with AWS EMR and I now have a few clusters that are terminated and that I want to delete. However, there is no obvious option to delete them. How do I make them go away?
vy32 • 24,271 • 30 • 100 • 197