Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

2717 questions
16
votes
2 answers

Spark on Amazon EMR: "Timeout waiting for connection from pool"

I'm running a Spark job on a small three-server Amazon EMR 5 (Spark 2.0) cluster. My job runs for an hour or so, then fails with the error below. I can restart it manually and it works, processes more data, and eventually fails again. My Spark code is…
clay
  • 13,176
  • 19
  • 65
  • 150
15
votes
1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time; this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes…
Ilya Kisil
  • 1,693
  • 1
  • 9
  • 24
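Since the distinction in the question is between *submitted* and *running* queries, one workable pattern is to batch submissions so no more than the concurrency limit is in flight at once. A minimal sketch, assuming the default 20-query soft limit (the helper names are hypothetical; an Athena `client` would come from `boto3.client("athena")`):

```python
ATHENA_CONCURRENCY_LIMIT = 20  # default soft limit for DML/CTAS queries

def batch_queries(queries, limit=ATHENA_CONCURRENCY_LIMIT):
    """Split a list of query strings into batches that fit the limit."""
    return [queries[i:i + limit] for i in range(0, len(queries), limit)]

def submit_batch(client, batch, database, output_location):
    """Submit one batch via start_query_execution; return execution IDs."""
    return [
        client.start_query_execution(
            QueryString=q,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        for q in batch
    ]
```

Between batches, the caller would poll `get_query_execution` until each query finishes; exceeding the limit surfaces as a `TooManyRequestsException` from `start_query_execution`.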
15
votes
3 answers

Can't get a SparkContext in new AWS EMR Cluster

I just set up an AWS EMR cluster (EMR version 5.18 with Spark 2.3.2). I SSH into the master machine and run spark-shell or pyspark and get the following error: $ spark-shell log4j:ERROR setFile(null,true) call…
15
votes
4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison of AWS Glue against AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with the ETL job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1…
Yuva
  • 1,842
  • 3
  • 17
  • 42
15
votes
1 answer

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build my company's new data lake and am trying to find the best and most current option to work with. I found a pretty nice solution using EMR + S3 + Athena + Glue. The process that I did was: 1 - Run Apache Spark…
15
votes
3 answers

Dealing with a large gzipped file in Spark

I have a large (about 85 GB compressed) gzipped file from S3 that I am trying to process with Spark on AWS EMR (currently with an m4.xlarge master instance and two m4.10xlarge core instances, each with a 100 GB EBS volume). I am aware that gzip is a…
user4601931
  • 4,153
  • 4
  • 21
  • 36
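The usual issue here is that gzip is not a splittable codec, so the whole file is decompressed by a single task; a common mitigation is to repartition immediately after the first read so downstream stages parallelize. A sketch with a hypothetical helper to size the partition count (the path and the uncompressed-size estimate below are placeholders):

```python
def partitions_for(uncompressed_bytes, target_mb=128):
    """Rough partition count aiming at ~target_mb of data per task."""
    return max(1, uncompressed_bytes // (target_mb * 1024 * 1024))

# Assumes a SparkSession `spark`; ~850 GB uncompressed is a guess for an
# 85 GB gzip file:
# df = spark.read.json("s3://my-bucket/big-file.json.gz")
# df = df.repartition(partitions_for(850 * 1024 ** 3))
```

The first stage still runs on one executor no matter what; the repartition only helps everything after it.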
15
votes
4 answers

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

I am creating a job to parse massive amounts of server data and then upload it into a Redshift database. My job flow is as follows: grab the log data from S3; use either Spark DataFrames or Spark SQL to parse the data and write back out to…
flybonzai
  • 3,165
  • 7
  • 28
  • 62
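For automating steps on an existing cluster, boto3's EMR client can submit `spark-submit` invocations through `command-runner.jar`. A minimal sketch; the helper name, cluster id, and S3 paths are placeholders:

```python
def spark_submit_step(name, script_s3_path, extra_args=()):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_s3_path, *extra_args],
        },
    }

# Submitting to a running cluster (assumes AWS credentials are configured):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(
#     JobFlowId="j-XXXXXXXXXXXXX",
#     Steps=[spark_submit_step("parse-logs", "s3://my-bucket/jobs/parse.py")],
# )
```

`add_job_flow_steps` returns the new `StepIds`, which can then be polled with `describe_step` to drive the rest of the pipeline (e.g. the Redshift load).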
15
votes
5 answers

Spark UI on AWS EMR

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is running and processing data, but I am trying to find which port has been assigned to the web UI. I've tried port forwarding both 4040 and 8080 with no…
gallamine
  • 765
  • 2
  • 10
  • 25
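On EMR, Spark runs on YARN, so the live application UI is reached through the YARN ResourceManager proxy rather than a fixed port 4040, and finished applications show up in the Spark history server on port 18080. One common approach is SSH port forwarding from the master node; the key path and hostname below are placeholders:

```shell
# Forward the YARN ResourceManager UI (8088) and the Spark history
# server (18080) from the EMR master node to localhost.
ssh -i ~/my-key.pem -N \
    -L 8088:localhost:8088 \
    -L 18080:localhost:18080 \
    hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Then browse http://localhost:8088 and follow an application's
# "ApplicationMaster" link to reach the live Spark UI.
```

AWS also documents a dynamic-forwarding (SOCKS proxy) setup for the same purpose, which avoids listing every port.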
14
votes
6 answers

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code: ".. textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x) textRdd.collect().show() .." I got this error: An error was encountered: Invalid status code '400'…
anat
  • 505
  • 1
  • 6
  • 17
14
votes
1 answer

javax.servlet.ServletException: java.util.NoSuchElementException: None.get

I get this error in my program when running it with spark-submit on AWS EMR. It doesn't block my program entirely; the run kicks off after getting stuck there for 10-15 minutes. Any help will be highly appreciated. This issue was…
Suman Sushovan
  • 209
  • 2
  • 5
14
votes
4 answers

Running EMR Spark With Multiple S3 Accounts

I have an EMR Spark job that needs to read data from S3 in one account and write to another. I split my job into two steps: read data from S3 (no credentials required because my EMR cluster is in the same account); read data in the local HDFS…
jspooner
  • 10,146
  • 9
  • 54
  • 81
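If the second account's bucket is accessed through `s3a://` URIs, Hadoop 2.8+ supports per-bucket configuration, so one job can use different credentials for just that bucket without touching the cluster defaults. A sketch; the helper name, bucket name, and keys are placeholders:

```python
def per_bucket_s3a_conf(bucket, access_key, secret_key):
    """Hadoop per-bucket s3a settings: separate credentials for one
    bucket, leaving the cluster-wide defaults untouched."""
    prefix = f"fs.s3a.bucket.{bucket}."
    return {
        prefix + "access.key": access_key,
        prefix + "secret.key": secret_key,
    }

# Applying it in a running session (assumes a SparkSession `spark`):
# hc = spark.sparkContext._jsc.hadoopConfiguration()
# for k, v in per_bucket_s3a_conf("other-account-bucket",
#                                 "AKIA...", "secret").items():
#     hc.set(k, v)
```

For EMRFS (`s3://`) paths, the managed alternative is a cross-account bucket policy or an EMRFS assume-role mapping rather than embedding a second set of keys.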
14
votes
3 answers

Emrfs file sync with s3 not working

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error upon trying to write to the Parquet file format on S3 using sqlContext.write:…
sakurashinken
  • 2,851
  • 5
  • 23
  • 56
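Deleting objects directly in S3 leaves stale entries in the EMRFS consistent-view metadata, which is what typically triggers this error. One common fix, run on the master node, is to drop and rebuild the metadata for the affected prefix (the bucket and path below are placeholders):

```shell
# Remove the stale EMRFS metadata entries for the prefix, then rebuild
# them from what is actually in S3:
emrfs delete s3://my-bucket/output/
emrfs sync s3://my-bucket/output/
# `emrfs diff s3://my-bucket/output/` shows any remaining mismatches
# between the metadata store and S3.
```

The underlying rule: when consistent view is enabled, always delete job output through EMRFS (e.g. from the cluster) rather than directly in the S3 console.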
13
votes
5 answers

Folder won't delete on Amazon S3

I'm trying to delete a folder created as a result of a MapReduce job. Other files in the bucket delete just fine, but this folder won't delete. When I try to delete it from the console, the progress bar next to its status just stays at 0. Have…
bgcode
  • 24,347
  • 30
  • 92
  • 158
13
votes
3 answers

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

I'm building a Kafka ingest module in EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected, the file sink does not…
13
votes
3 answers

Pyspark - Load file: Path does not exist

I am new to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one: spark = SparkSession \ .builder \ .appName("Protob Conversion to Parquet") \ …
ebertbm
  • 2,817
  • 8
  • 27
  • 43
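On EMR the default filesystem is HDFS, so a bare path like `/home/hadoop/...` is resolved against HDFS and fails with "Path does not exist". Qualifying the path with the `file://` scheme points Spark at the node-local filesystem instead. A minimal sketch (the helper name and file path are placeholders):

```python
def qualify_local(path):
    """Prefix a bare path with file:// so Spark reads the local
    filesystem rather than resolving the path on HDFS."""
    return path if "://" in path else "file://" + path

# Usage on the EMR master (assumes a SparkSession `spark`, e.g. in the
# pyspark shell):
# df = spark.read.csv(qualify_local("/home/hadoop/data.csv"), header=True)
```

This works in local or client mode, where the driver can see the master's disk; for cluster mode, the more robust option is to put the file on S3 or HDFS first so every node can reach it.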