Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

2717 questions
16
votes
2 answers

Spark on Amazon EMR: "Timeout waiting for connection from pool"

I'm running a Spark job on a small three-server Amazon EMR 5 (Spark 2.0) cluster. My job runs for an hour or so, then fails with the error below. I can restart it manually and it works, processes more data, and eventually fails again. My Spark code is…
clay
  • 13,176
  • 19
  • 65
  • 150
15
votes
1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time; this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes…
Ilya Kisil
  • 1,693
  • 1
  • 9
  • 24
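Since the distinction in the question is between *submitted* and *running* queries, one workable pattern is to batch submissions so no more than the concurrency limit is in flight at once. A minimal sketch, assuming the default 20-query soft limit (the helper names are hypothetical; an Athena `client` would come from `boto3.client("athena")`):

```python
ATHENA_CONCURRENCY_LIMIT = 20  # default soft limit for DML/CTAS queries

def batch_queries(queries, limit=ATHENA_CONCURRENCY_LIMIT):
    """Split a list of query strings into batches that fit the limit."""
    return [queries[i:i + limit] for i in range(0, len(queries), limit)]

def submit_batch(client, batch, database, output_location):
    """Submit one batch via start_query_execution; return execution IDs."""
    return [
        client.start_query_execution(
            QueryString=q,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        for q in batch
    ]
```

Between batches, the caller would poll `get_query_execution` until each query finishes; exceeding the limit surfaces as a `TooManyRequestsException` from `start_query_execution`.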
15
votes
3 answers

Can't get a SparkContext in new AWS EMR Cluster

I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). I SSH into the master machine and run spark-shell or pyspark and get the following error: $ spark-shell log4j:ERROR setFile(null,true) call…
15
votes
4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison of AWS Glue against AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with the ETL job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1…
Yuva
  • 1,842
  • 3
  • 17
  • 42
15
votes
1 answer

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build my company's new data lake and am trying to find the best and most current option to work with. I found a pretty nice solution using EMR + S3 + Athena + Glue. The process that I did was: 1 - Run Apache Spark…
15
votes
3 answers

Dealing with a large gzipped file in Spark

I have a large (about 85 GB compressed) gzipped file from S3 that I am trying to process with Spark on AWS EMR (currently with an m4.xlarge master instance and two m4.10xlarge core instances, each with a 100 GB EBS volume). I am aware that gzip is a…
user4601931
  • 4,153
  • 4
  • 21
  • 36
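The usual issue here is that gzip is not a splittable codec, so the whole file is decompressed by a single task; a common mitigation is to repartition immediately after the first read so downstream stages parallelize. A sketch with a hypothetical helper to size the partition count (the path and the uncompressed-size estimate below are placeholders):

```python
def partitions_for(uncompressed_bytes, target_mb=128):
    """Rough partition count aiming at ~target_mb of data per task."""
    return max(1, uncompressed_bytes // (target_mb * 1024 * 1024))

# Assumes a SparkSession `spark`; ~850 GB uncompressed is a guess for an
# 85 GB gzip file:
# df = spark.read.json("s3://my-bucket/big-file.json.gz")
# df = df.repartition(partitions_for(850 * 1024 ** 3))
```

The first stage still runs on one executor no matter what; the repartition only helps everything after it.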
15
votes
4 answers

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

I am creating a job to parse massive amounts of server data and then upload it into a Redshift database. My job flow is as follows: grab the log data from S3; use either Spark DataFrames or Spark SQL to parse the data and write back out to…
flybonzai
  • 3,165
  • 7
  • 28
  • 62
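For automating steps on an existing cluster, boto3's EMR client can submit `spark-submit` invocations through `command-runner.jar`. A minimal sketch; the helper name, cluster id, and S3 paths are placeholders:

```python
def spark_submit_step(name, script_s3_path, extra_args=()):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *extra_args],
        },
    }

# Submitting to a running cluster (assumes AWS credentials are configured):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(
#     JobFlowId="j-XXXXXXXXXXXXX",
#     Steps=[spark_submit_step("parse-logs", "s3://my-bucket/jobs/parse.py")],
# )
```

`add_job_flow_steps` returns the new `StepIds`, which can then be polled with `describe_step` to drive the rest of the pipeline (e.g. the Redshift load).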
15
votes
5 answers

Spark UI on AWS EMR

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is running and processing data, but I am trying to find which port has been assigned to the web UI. I've tried port forwarding both 4040 and 8080 with no…
gallamine
  • 765
  • 2
  • 10
  • 25
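On EMR, Spark runs on YARN, so the live application UI is reached through the YARN ResourceManager proxy rather than a fixed port 4040, and finished applications show up in the Spark history server on port 18080. One common approach is SSH port forwarding from the master node; the key path and hostname below are placeholders:

```shell
# Forward the YARN ResourceManager UI (8088) and the Spark history
# server (18080) from the EMR master node to localhost.
ssh -i ~/my-key.pem -N \
    -L 8088:localhost:8088 \
    -L 18080:localhost:18080 \
    hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Then browse http://localhost:8088 and follow an application's
# "ApplicationMaster" link to reach the live Spark UI.
```

AWS also documents a dynamic-forwarding (SOCKS proxy) setup for the same purpose, which avoids listing every port.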
14
votes
6 answers

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code: ".. textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x) textRdd.collect().show() .." I got this error: An error was encountered: Invalid status code '400'…
anat
  • 505
  • 1
  • 6
  • 17
14
votes
1 answer

javax.servlet.ServletException: java.util.NoSuchElementException: None.get

I get this error in my program when running it with spark-submit on AWS EMR. It doesn't block my program entirely; the run kicks off after getting stuck there for 10-15 minutes. Any help will be highly appreciated. This issue was…
Suman Sushovan
  • 209
  • 2
  • 5
14
votes
4 answers

Running EMR Spark With Multiple S3 Accounts

I have an EMR Spark job that needs to read data from S3 in one account and write to another. I split my job into two steps: read data from S3 (no credentials required because my EMR cluster is in the same account); read data in the local HDFS…
jspooner
  • 10,146
  • 9
  • 54
  • 81
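If the second account's bucket is accessed through `s3a://` URIs, Hadoop 2.8+ supports per-bucket configuration, so one job can use different credentials for just that bucket without touching the cluster defaults. A sketch; the helper name, bucket name, and keys are placeholders:

```python
def per_bucket_s3a_conf(bucket, access_key, secret_key):
    """Hadoop per-bucket s3a settings: separate credentials for one
    bucket, leaving the cluster-wide defaults untouched."""
    prefix = f"fs.s3a.bucket.{bucket}."
    return {
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
    }

# Applying it in a running session (assumes a SparkSession `spark`):
# hc = spark.sparkContext._jsc.hadoopConfiguration()
# for k, v in per_bucket_s3a_conf("other-account-bucket",
#                                 "AKIA...", "secret").items():
#     hc.set(k, v)
```

For EMRFS (`s3://`) paths, the managed alternative is a cross-account bucket policy or an EMRFS assume-role mapping rather than embedding a second set of keys.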
14
votes
3 answers

Emrfs file sync with s3 not working

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error upon trying to write to the Parquet file format on S3 using sqlContext.write:…
sakurashinken
  • 2,851
  • 5
  • 23
  • 56
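Deleting objects directly in S3 leaves stale entries in the EMRFS consistent-view metadata, which is what typically triggers this error. One common fix, run on the master node, is to drop and rebuild the metadata for the affected prefix (the bucket and path below are placeholders):

```shell
# Remove the stale EMRFS metadata entries for the prefix, then rebuild
# them from what is actually in S3:
emrfs delete s3://my-bucket/output/
emrfs sync s3://my-bucket/output/
# `emrfs diff s3://my-bucket/output/` shows any remaining mismatches
# between the metadata store and S3.
```

The underlying rule: when consistent view is enabled, always delete job output through EMRFS (e.g. from the cluster) rather than directly in the S3 console.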
13
votes
5 answers

Folder won't delete on Amazon S3

I'm trying to delete a folder created as a result of a MapReduce job. Other files in the bucket delete just fine, but this folder won't delete. When I try to delete it from the console, the progress bar next to its status just stays at 0. Have…
bgcode
  • 24,347
  • 30
  • 92
  • 158
13
votes
3 answers

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

I'm building a Kafka ingest module in EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected, the file sink does not…
13
votes
3 answers

Pyspark - Load file: Path does not exist

I am new to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ …
ebertbm
  • 2,817
  • 8
  • 27
  • 43
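On EMR the default filesystem is HDFS, so a bare path like `/home/hadoop/...` is resolved against HDFS and fails with "Path does not exist". Qualifying the path with the `file://` scheme points Spark at the node-local filesystem instead. A minimal sketch (the helper name and file path are placeholders):

```python
def qualify_local(path):
    """Prefix a bare path with file:// so Spark reads the local
    filesystem rather than resolving the path on HDFS."""
    return path if "://" in path else "file://" + path

# Usage on the EMR master (assumes a SparkSession `spark`, e.g. in the
# pyspark shell):
# df = spark.read.csv(qualify_local("/home/hadoop/data.csv"), header=True)
```

This works in local or client mode, where the driver can see the master's disk; for cluster mode, the more robust option is to put the file on S3 or HDFS first so every node can reach it.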