Questions tagged [emr]

Questions relating to Amazon's Elastic MapReduce (EMR) product.

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

http://aws.amazon.com/elasticmapreduce/

1202 questions
59 votes, 5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully processed a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108 • 1,083 • 1 • 11 • 16
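The error in that title typically means the executor JVM plus its off-heap overhead outgrew the YARN container, and the common fix is raising `spark.yarn.executor.memoryOverhead` (or lowering executor memory). A minimal sketch of the arithmetic, assuming the commonly documented default overhead of max(384 MB, 10% of executor memory) — check the docs for your Spark version:

```python
def yarn_container_limit_mb(executor_memory_mb, overhead_fraction=0.10, overhead_min_mb=384):
    """Approximate the YARN physical-memory cap for a Spark executor.

    Assumes the commonly documented default overhead of
    max(384 MB, 10% of executor memory); verify against your Spark version.
    """
    overhead = max(overhead_min_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead

# A 9.5 GB executor is capped near 10.4 GB once the default overhead is added:
print(yarn_container_limit_mb(9728))  # 9728 + 972 = 10700 MB, i.e. roughly the 10.4 GB in the error
```

Raising the overhead (or shrinking the executor) moves that cap, which is usually enough to stop YARN killing the container.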
43 votes, 5 answers

How to bootstrap installation of Python modules on Amazon EMR?

I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
Evan Zamir • 6,460 • 11 • 44 • 68
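The usual answer is a bootstrap action: a shell script stored in S3 (containing, say, `sudo pip install arrow`) that EMR runs on every node at startup. A hedged sketch of wiring that up through boto3's `run_job_flow` — the bucket and script names here are hypothetical placeholders, and a real call would also need an `Instances` section:

```python
# Hypothetical names: "my-bucket" and install_deps.sh are placeholders.
# install_deps.sh would contain something like:  sudo pip install arrow
bootstrap_actions = [
    {
        "Name": "Install Python deps",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install_deps.sh",
            "Args": [],
        },
    }
]

def job_flow_kwargs(bootstrap_actions):
    """Build (partial) kwargs for boto3's emr_client.run_job_flow(...);
    a real call also needs Instances, roles, etc."""
    return {
        "Name": "spark-with-deps",
        "ReleaseLabel": "emr-4.2.0",
        "Applications": [{"Name": "Spark"}],
        "BootstrapActions": bootstrap_actions,
    }

print(job_flow_kwargs(bootstrap_actions)["BootstrapActions"][0]["Name"])
```

The same bootstrap action can be attached from the EMR console's "Bootstrap Actions" step when launching the cluster.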
38 votes, 7 answers

How do you make a HIVE table out of JSON data?

I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance but I don't know what to type in the Hive console to get…
nickponline • 22,615 • 27 • 86 • 138
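It is possible — the standard route is a JSON SerDe (e.g. `org.openx.data.jsonserde.JsonSerDe`) declared in the table's `ROW FORMAT`. An alternative that avoids SerDe setup entirely is to flatten the nested JSON into plain delimited columns before loading. A toy flattener, assuming dot-joined column names (names like `user.geo.lat` are illustrative only):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b': value} so each key can map to a Hive column."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out

record = {"id": 7, "user": {"name": "ada", "geo": {"lat": 51.5}}}
print(flatten(record))  # {'id': 7, 'user.name': 'ada', 'user.geo.lat': 51.5}
```

Once flattened, each record can be written as a tab-separated line matching a `ROW FORMAT DELIMITED` table definition.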
28 votes, 4 answers

How to restart yarn on AWS EMR

I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart YARN to bring the changes into effect. Is there a command I can use to do this?
nish • 6,230 • 17 • 60 • 113
28 votes, 2 answers

Compress file on S3

I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed. I know that by compressing it, it'll be about 2.2GB (gzip). How can I download this file locally as quickly as possible when transfer is the…
Matt Joiner • 100,604 • 94 • 332 • 495
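Since gzip shrinks this file roughly 8×, the usual trick is to compress it on an EC2 instance (or EMR node) in the same region as the bucket and then download only the compressed copy — in-region transfer from S3 is fast. A sketch of streaming compression in Python, so the 17.7 GB never has to fit in memory; the in-memory buffers below stand in for a boto3 streaming download and a local output file:

```python
import gzip
import io
import shutil

def gzip_stream(src, dst, chunk_size=1 << 20):
    """Compress a binary stream chunk by chunk without loading it all into memory."""
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, chunk_size)

# Stand-ins for the real streams: src would be s3_object["Body"],
# dst an open local file ("wb").
src = io.BytesIO(b"row,row,row\n" * 10000)
dst = io.BytesIO()
gzip_stream(src, dst)
print(f"{src.getbuffer().nbytes} bytes -> {dst.getbuffer().nbytes} bytes")
```

Another option is re-emitting the data from Hive with output compression enabled, so S3 only ever holds the ~2.2 GB version.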
27 votes, 3 answers

How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive over EMR and need to copy some files to all EMR instances. One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy them to HDFS. However, I haven't found a simple way to…
Tomer • 849 • 3 • 11 • 18
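The standard tool for this is s3-dist-cp, which copies from S3 into HDFS in parallel across the cluster; on emr-4.x and later it is invoked through `command-runner.jar`. A sketch of the step definition you would submit, e.g. via boto3's `add_job_flow_steps` — the bucket and paths are hypothetical:

```python
def s3_to_hdfs_step(src, dest):
    """Build an EMR step that runs s3-dist-cp (assumes an emr-4.x+ release
    where command-runner.jar is available)."""
    return {
        "Name": f"Copy {src} to HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }

# Hypothetical locations:
step = s3_to_hdfs_step("s3://my-bucket/input/", "hdfs:///input/")
print(step["HadoopJarStep"]["Args"])
```

For a one-off copy you can also SSH to the master node and run the same `s3-dist-cp` command directly.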
25 votes, 2 answers

SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

I am trying to run a simple SQL query on S3 events using Spark. I am loading ~30GB of JSON files as follows: val d2 =…
eexxoo • 280 • 1 • 4 • 10
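This error comes from Spark's long-standing cap of roughly 2 GB on a single shuffle or cached block (blocks are backed by byte arrays indexed with `Integer.MAX_VALUE`). The usual fix is more partitions, so no single partition approaches 2 GB. A back-of-envelope helper, assuming a target of ~128 MB per partition (the target size is a rule of thumb, not a Spark constant):

```python
import math

TWO_GB = 2 * 1024**3  # a single Spark block must stay below roughly this size

def partition_count(total_bytes, target_partition_mb=128):
    """Choose a repartition() count keeping partitions far below the ~2 GB block cap."""
    target = target_partition_mb * 1024**2
    assert target < TWO_GB
    return max(1, math.ceil(total_bytes / target))

print(partition_count(30 * 1024**3))  # ~30 GB of JSON -> 240 partitions
```

Passing that count to `repartition(n)` (or raising `spark.sql.shuffle.partitions`) typically makes the error go away.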
23 votes, 3 answers

How to handle changing parquet schema in Apache Spark

I have run into a problem where I have Parquet data as daily chunks in S3 (in the form of s3://bucketName/prefix/YYYY/MM/DD/), but I cannot read data from different dates in AWS EMR Spark because some column types do not match, and I get one of…
V. Samma • 2,308 • 4 • 25 • 33
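Spark can merge compatible Parquet schemas at read time (`spark.read.option("mergeSchema", "true")`), but genuinely conflicting column types across days usually have to be reconciled by hand, e.g. by casting each day to a common widened schema before unioning. A toy sketch of that widening step — the `int → long → double` ordering is an illustrative assumption, and non-numeric conflicts would need explicit casts:

```python
# Hypothetical numeric widening order; non-numeric conflicts need explicit handling.
WIDEN_RANK = {"int": 0, "long": 1, "double": 2}

def common_schema(schemas):
    """Merge per-day {column: type} schemas, widening numeric type conflicts."""
    merged = {}
    for schema in schemas:
        for col, typ in schema.items():
            if col not in merged:
                merged[col] = typ
            elif merged[col] != typ:
                merged[col] = max(merged[col], typ, key=WIDEN_RANK.__getitem__)
    return merged

day1 = {"id": "long", "price": "int"}
day2 = {"id": "long", "price": "double", "currency": "string"}
print(common_schema([day1, day2]))  # {'id': 'long', 'price': 'double', 'currency': 'string'}
```

Each daily DataFrame can then be cast to the merged schema before `unionAll`, which avoids the type-mismatch failure at read time.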
22 votes, 3 answers

Exporting Hive Table to a S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this: CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH…
seedhead • 3,444 • 4 • 27 • 38
21 votes, 2 answers

Why does Yarn on EMR not allocate all nodes to running Spark jobs?

I'm running a job on Apache Spark on Amazon Elastic MapReduce (EMR). Currently I'm running on emr-4.1.0, which includes Amazon Hadoop 2.6.0 and Spark 1.5.0. When I start the job, YARN has correctly allocated all the worker nodes to the Spark job…
retnuH • 1,365 • 1 • 10 • 18
21 votes, 4 answers

Spark resources not fully allocated on Amazon EMR

I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I…
Michel Lemay • 1,944 • 1 • 13 • 34
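Under-allocation usually comes from default executor settings: EMR's defaults leave most of each node idle unless you set `--num-executors`, `--executor-cores`, and `--executor-memory` (or enable EMR's `maximizeResourceAllocation`). A rough sizing sketch — the m3.xlarge figures used below (4 vCPUs, ~11 GB usable YARN memory per node) are assumptions to check against `yarn.nodemanager.resource.memory-mb` on your cluster:

```python
def executor_plan(worker_nodes, vcores_per_node, yarn_mem_mb_per_node,
                  cores_per_executor=2, overhead_fraction=0.10):
    """Rough executor sizing that tries to use a whole node's cores and memory,
    leaving room for the YARN memory overhead."""
    executors_per_node = max(1, vcores_per_node // cores_per_executor)
    mem_per_executor = int(yarn_mem_mb_per_node / executors_per_node / (1 + overhead_fraction))
    return {
        "num_executors": worker_nodes * executors_per_node,
        "executor_cores": cores_per_executor,
        "executor_memory_mb": mem_per_executor,
    }

# Assumed m3.xlarge numbers: 4 vcores, ~11 GB YARN memory per worker node.
print(executor_plan(worker_nodes=2, vcores_per_node=4, yarn_mem_mb_per_node=11264))
```

The resulting numbers map directly onto the `spark-submit` flags above; dynamic allocation (`spark.dynamicAllocation.enabled`) is the alternative when workloads vary.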
21 votes, 7 answers

Pyspark --py-files doesn't work

I'm using this as the documentation suggests (http://spark.apache.org/docs/1.1.1/submitting-applications.html), Spark version 1.1.0: ./spark/bin/spark-submit --py-files /home/hadoop/loganalysis/parser-src.zip \ /home/hadoop/loganalysis/ship-test.py and conf in…
C19 • 678 • 2 • 7 • 15
20 votes, 4 answers

AWS EMR Spark Python Logging

I'm running a very simple Spark job on AWS EMR and can't seem to get any log output from my script. I've tried printing to stderr: from pyspark import SparkContext import sys if __name__ == '__main__': sc =…
jarbaugh • 423 • 4 • 10
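Output from the driver and executors lands in the YARN container logs rather than the step's stdout; it can be retrieved with `yarn logs -applicationId <id>` on the master node, or from the cluster's S3 log URI if one was configured. Within Python, the stdlib logging module pointed at stderr is usually enough — a minimal sketch:

```python
import logging
import sys

def get_logger(name="emr-job", level=logging.INFO):
    """Logger writing to stderr; on EMR the output lands in the YARN container logs."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

log = get_logger()
log.info("job started")
```

Note that code running inside executors logs into each executor's own container log, not the driver's, so all containers' logs may need to be collected.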
20 votes, 5 answers

Spark on yarn mode end with "Exit status: -100. Diagnostics: Container released on a *lost* node"

I am trying to load a database with 1TB of data into Spark on AWS using the latest EMR. The running time is so long that it doesn't finish even in 6 hours, but after running for 6h30m I get an error announcing that the container was released on a lost node…
John Zeng • 914 • 2 • 7 • 20
19 votes, 2 answers

How do you delete an AWS EMR Cluster?

I've been playing around with AWS EMR and I now have a few clusters that are terminated and that I want to delete. However, there is no obvious option to delete them. How do I make them go away?
vy32 • 24,271 • 30 • 100 • 197