31

I'm writing to see if anyone knows how to speed up S3 write times from Spark running on EMR.

My Spark job takes over 4 hours to complete; however, the cluster is only under load during the first 1.5 hours.

[screenshot: cluster load graph, showing load only during the first 1.5 hours]

I was curious about what Spark was doing during all this time. I looked at the logs and found many S3 mv commands, one for each file. Then, looking directly at S3, I saw that all my files were in a _temporary directory.

Second, I'm concerned about my cluster cost. It appears I only need to buy about 2 hours of compute for this specific task, but I end up buying up to 5 hours. I'm curious whether EMR Auto Scaling can help with cost in this situation.

Some articles discuss changing the file output committer algorithm but I've had little success with that.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Writing to the local HDFS is quick. I'm curious whether issuing a Hadoop command to copy the data to S3 would be faster.


jspooner
  • I found two great articles on this issue https://hortonworks.github.io/hdp-aws/s3-spark/ https://hortonworks.github.io/hdp-aws/s3-performance/index.html – jspooner Mar 16 '17 at 18:38
  • finally, which implementation did you use? I am stuck with the same problem. – Programmer Jun 14 '17 at 19:36
  • @JaspinderVirdee write your data to the local HDFS directory, then use `s3-dist-cp` to copy your data back to S3. Also, if your EMR cluster is missing the `s3-dist-cp` command, you have to have Hadoop listed in your create-cluster command, for example: `--applications Name=Hadoop Name=Spark Name=Ganglia Name=zeppelin` – jspooner Jun 14 '17 at 21:57
  • Do you mean that I should use S3 just for backup? For example: currently I am using Spark Streaming and my data, partitioned on the key "city", is saved to S3 directly. With every 1-minute stream batch, more data comes in and is appended under each city. How would I append this data to S3 using your strategy of saving locally and then copying to S3? – Programmer Jun 15 '17 at 05:42
  • The last step of my stream is saving to a filesystem (HDFS or S3). How can I push the changes to S3 after the data is successfully written to HDFS? Is it even possible to do with Scala/Python code, or is s3-dist-cp used from the shell only? – Programmer Jun 15 '17 at 10:17
  • @jspooner: I have got the same problem and with s3-dist-cp I get a "503 slow down exception". – ljofre Jul 09 '17 at 04:52
  • @ljofre I have not experienced that yet. How many nodes are in your cluster and how much data are you transferring? – jspooner Jul 10 '17 at 15:45
  • @jspooner my cluster has 21 r3.8xlarge nodes and about 100 GB of Parquet data – ljofre Jul 10 '17 at 20:46
  • @ljofre That cluster and data seem reasonable, but have you tried fewer nodes? Also consider looking at the S3 prefix options https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3 – jspooner Jul 10 '17 at 23:10
  • Which compression type (if any) were you using? Can you show us your full write command? – Ted Aug 15 '17 at 19:00

7 Answers

21

What you are seeing is a problem with the output committer and S3. The commit job applies fs.rename on the _temporary folder, and since S3 does not support rename, it means that a single request is now copying and deleting all the files from _temporary to their final destination.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") only works with Hadoop 2.7 or later. What it does is copy each file out of _temporary on task commit rather than on job commit, so the work is distributed and runs pretty fast.
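For reference, a minimal sketch of how this setting can be applied from a job, assuming Spark 2.x on Hadoop 2.7+ (the app name, bucket, and DataFrame below are placeholders, not the asker's actual job):

import org.apache.spark.sql.SparkSession

// Minimal sketch: turn on the v2 file output committer before writing.
// The bucket and path are placeholders.
val spark = SparkSession.builder().appName("s3-write-sketch").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

val df = spark.range(1000).toDF("id") // stand-in for the real dataset
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")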

If you use an older version of Hadoop, I would use Spark 1.6 and set:

sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

*Note that it does not work with speculation turned on or when writing in append mode.

**Also note that it is deprecated in Spark 2.0 (replaced by algorithm.version=2).

BTW, in my team we actually write with Spark to HDFS and use DistCp jobs (specifically s3-dist-cp) in production to copy the files to S3, but this is done for several other reasons (consistency, fault tolerance), so it is not strictly necessary; you can write to S3 pretty fast using what I suggested.
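As a rough sketch of that two-step pattern (stage the output on HDFS, then copy with s3-dist-cp); the paths below are placeholders, not our production layout:

// Sketch: write the job output to HDFS first (assumes an existing
// SparkSession `spark` and DataFrame `df`, as in the snippet above).
df.write.mode("overwrite").parquet("hdfs:///staging/my_table/")
// ...then copy it to S3 from the master node or as an EMR step, e.g.:
//   s3-dist-cp --src hdfs:///staging/my_table/ --dest s3://my-bucket/my_table/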

Tal Joffe
  • Faster but risky. Spark 2 not only removed the committer; if you use any committer with the word "direct" in it, you'll get told off and speculation disabled. – stevel Mar 16 '17 at 13:53
  • Risky, yes, but without it writing is so slow it is unusable... you can have a Spark job that finishes in 5 minutes and then writes for over an hour, and it is not scalable either. I don't know of anything better if you want to write to S3. – Tal Joffe Mar 16 '17 at 15:35
  • Ah it's not working for me because I'm using append mode! – jspooner Mar 16 '17 at 15:49
  • @SteveLoughran after reading your answer I get your point. I will probably try your suggestion myself to replace current solution. – Tal Joffe Mar 16 '17 at 16:26
  • s3-dist-cp is creating temporary files on HDFS which are not getting deleted on success, so I am losing space. Is there any parameter to tell s3-dist-cp to drop the temporary files? – loneStar Oct 16 '17 at 15:34
  • @Achyuth yes. --deleteOnSuccess – Tal Joffe Oct 16 '17 at 18:32
  • So does --deleteOnSuccess delete the source location or the tmp files created by s3-dist-cp? – loneStar Oct 16 '17 at 18:43
  • @Achyuth sorry, it deletes the sources; I didn't pay attention to the question... The temp files are moved to the final destination when the copy succeeds, regardless of the --deleteOnSuccess flag. You might still be left with empty folders, but I don't think that is something to be concerned about. In some cases we had a Python script that deletes all 0-sized folders daily to clean those up. – Tal Joffe Oct 16 '17 at 18:49
  • So best practices would be to write to HDFS, then copy to S3, correct? And do you then delete the data from HDFS so you aren't storing duplicate data on HDFS and S3? – wordsforthewise Feb 20 '20 at 20:25
  • @wordsforthewise I'm not sure if it is still a best practice; I have moved to different domains, but it should work OK. We had a step in the pipeline that runs after this process and cleans up unnecessary temp files. Since this is a known issue, I would look for newer versions of Spark and see if they have solved this problem already. Netflix also had a project to solve this that was replaced by a project called Iceberg. I don't know it, but it could offer some help: https://github.com/apache/incubator-iceberg – Tal Joffe Feb 24 '20 at 12:15
7

I had a similar use case where I used Spark to write to S3 and had a performance issue. The primary reason was that Spark was creating a lot of zero-byte part files, and renaming temp files to their actual file names was slowing down the write process. I tried the approaches below as workarounds:

  1. Write the output of Spark to HDFS and use Hive to write to S3. Performance was much better, as Hive was creating a smaller number of part files. The problem I had (and also had when using Spark) was that the delete action was not granted in the prod environment's policy for security reasons. The S3 bucket was KMS encrypted in my case.

  2. Write Spark output to HDFS, copy the HDFS files to local disk, and use aws s3 cp to push the data to S3. I had the second-best results with this approach. I created a ticket with Amazon and they suggested going with this one.

  3. Use s3-dist-cp to copy files from HDFS to S3. This worked with no issues, but was not performant.

Vikrame
  • Amazon may have made recent improvements to s3-dist-cp. On our EMR cluster, it has been quite performant, copying ~200GB from HDFS to S3 in under 2 minutes. – Jason Evans Apr 26 '17 at 14:22
7

The direct committer was pulled from Spark as it wasn't resilient to failures. I would strongly advise against using it.

There is work ongoing in Hadoop, s3guard, to add 0-rename committers, which will be O(1) and fault tolerant; keep an eye on HADOOP-13786.

Ignoring "the Magic committer" for now, the Netflix-based staging committer will ship first (hadoop 2.9? 3.0?)

  1. It writes the work to the local FS in task commit.
  2. It issues uncommitted multipart PUT operations to write the data, but does not materialize it.
  3. It saves the information needed to commit the PUT to HDFS, using the original "algorithm 1" file output committer.
  4. It implements a job commit which uses the file output commit of HDFS to decide which PUTs to complete and which to cancel.

Result: task commit takes data/bandwidth seconds, but job commit takes no longer than the time to do 1-4 GETs on the destination folder and a POST for every pending file, the latter being parallelized.

You can pick up the committer this work is based on from Netflix, and probably use it in Spark today. Do set the file commit algorithm to 1 (it should be the default) or it won't actually write the data.
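As a hedged sketch of that prerequisite (only the configuration flag mentioned above is shown; wiring in the Netflix committer itself is packaging-specific and omitted):

// Make sure the classic "algorithm 1" file output committer is in effect
// (it should be the default), since the staging committer's job commit
// relies on its HDFS commit step. Assumes an existing SparkSession `spark`.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "1")
// Registering the Netflix s3committer class is not shown here; see
// HADOOP-13786 and the s3a committer docs linked in the comments below.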

stevel
  • Steve, is there an Apache JIRA issue for including the Netflix S3 committer in Hadoop? I can't seem to find one. – Jonathan Kelly Mar 16 '17 at 20:01
  • It's HADOOP-13786. That integration isn't going to be anything you can get into your hands for a while, as it is being mixed with the S3a Phase II work. The original code should work today (netflix use it), and there's a Hadoop 2.8.0 RC out this week, which has all the read and write pipeline speedups – stevel Mar 17 '17 at 09:35
  • Though of course as LI says you are on the EMR team, you have your own s3 client. The netflix Staging committer should work there, the "Magic" one absolutely not, but that's taken second priority to the staging one https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md – stevel Mar 17 '17 at 09:42
  • Thank you, Steve! I just wanted to learn more about that committer, since I hadn't heard about it yet, and I wanted to be able to track the JIRA if one existed. – Jonathan Kelly Mar 20 '17 at 00:04
1

What do you see in the Spark output? If you see lots of rename operations, read this.

Niros
1

We experienced the same on Azure using Spark on WASB. We finally decided not to use the distributed storage directly with Spark. We did spark.write to a real hdfs:// destination and developed a specific tool that runs `hadoop copyFromLocal hdfs:// wasb://`. HDFS is then our temporary buffer before archiving to WASB (or S3).

0

How large are the file(s) you are writing? Having one core write a very large file is going to be much slower than splitting the file up and having multiple workers write out smaller files.
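For instance, a hedged sketch of spreading one large output across more tasks (the partition count and path are arbitrary examples, not a recommendation for the asker's data):

// Sketch: repartition so several executors each write a smaller part file
// instead of one core writing a single huge file. 64 is an arbitrary value.
// Assumes an existing SparkSession `spark` and DataFrame `df`.
val out = df.repartition(64)
out.write.parquet("s3a://my-bucket/output/")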

Ted
0

I had the same issue. I found a solution by changing the S3 protocol: originally I was using s3a:// to read and write the data, then I changed it to just s3:// and it works perfectly. My process actually appends data.
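A minimal sketch of that change, assuming this is on EMR (where s3:// is the EMRFS connector); the bucket and path are placeholders:

// Before: writing through the s3a:// connector
//   df.write.mode("append").parquet("s3a://my-bucket/events/")
// After: writing through EMR's s3:// (EMRFS). Assumes an existing DataFrame `df`.
df.write.mode("append").parquet("s3://my-bucket/events/")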