
I'm having a problem running a Spark application written in Java on AWS EMR. Locally, everything runs fine. When I submit a job to EMR, it always shows "Completed" within 20 seconds even though the job should take minutes. No output is produced and none of my log messages are printed.

I'm still confused as to whether it should be run as a Spark application or as a CUSTOM_JAR step type.

This is my main method:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
            .builder()
            .appName("RandomName")
            .getOrCreate();

    // process stuff
    String from_path = args[0];
    String to_path = args[1];
    Dataset<String> dataInput = spark.read().json(from_path).toJSON();
    // convertData is defined elsewhere; not included here
    JavaRDD<ResultingClass> map = dataInput.toJavaRDD().map(row -> convertData(row));

    Dataset<Row> dataFrame = spark.createDataFrame(map, ResultingClass.class);

    dataFrame
            .repartition(1)
            .write()
            .mode(SaveMode.Append)
            .partitionBy("year", "month", "day", "hour")
            .parquet(to_path);

    spark.stop();
}
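
For context, convertData and ResultingClass are not shown in the question; a minimal hypothetical sketch of what they could look like (assuming each input line is one JSON object carrying the partition fields used by partitionBy) is:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.Serializable;

// Hypothetical sketch only -- the real class is not shown in the question.
public class ResultingClass implements Serializable {
    private int year, month, day, hour;

    // Bean-style accessors are required by createDataFrame(rdd, ResultingClass.class)
    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }
    public int getMonth() { return month; }
    public void setMonth(int month) { this.month = month; }
    public int getDay() { return day; }
    public void setDay(int day) { this.day = day; }
    public int getHour() { return hour; }
    public void setHour(int hour) { this.hour = hour; }

    // Assumed shape: pull the partition fields out of each JSON row
    static ResultingClass convertData(String jsonRow) throws Exception {
        JsonNode node = new ObjectMapper().readTree(jsonRow);
        ResultingClass r = new ResultingClass();
        r.setYear(node.get("year").asInt());
        r.setMonth(node.get("month").asInt());
        r.setDay(node.get("day").asInt());
        r.setHour(node.get("hour").asInt());
        return r;
    }
}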

I've tried these:

1)

aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=Spark,Name=MyApp,Args=[--deploy-mode,cluster,--master,yarn, \
--conf,spark.yarn.submit.waitAppCompletion=false, \
--class,com.my.class.with.main.Foo,s3://mybucket/script.jar, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE --region us-west-2 --profile default

It completes in about 15 seconds with no errors, no output, and none of the log messages I've added.

2)

aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[--deploy-mode,cluster, \
--conf,spark.yarn.submit.waitAppCompletion=true, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default

This reads the parameters incorrectly: --deploy-mode arrives as the first argument to main() and cluster as the second, instead of the two bucket paths, because a CUSTOM_JAR step passes everything in Args straight to the main class.

3)

aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default

I get this: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession

When I include all dependencies (which I do not need to do locally), I get: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration. I do not want to hardcode "yarn" into the app.
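
As an aside, one way to avoid hardcoding the master (a sketch using org.apache.spark.SparkConf; only relevant when the jar is launched without spark-submit) is to set it purely as a fallback:

// Sketch: spark-submit (and the EMR Spark step type) supply spark.master;
// fall back to local[*] only for plain `java -jar` runs.
SparkConf conf = new SparkConf();
if (!conf.contains("spark.master")) {
    conf.setMaster("local[*]");
}
SparkSession spark = SparkSession
        .builder()
        .appName("RandomName")
        .config(conf)
        .getOrCreate();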

I find the AWS documentation very confusing as to the proper way to run this.

Update:

Running the command directly on the server does work, so the problem must be in the way I'm defining the CLI command.

spark-submit --class com.my.class.with.main.Foo \
    s3://mybucket/script.jar \
    "s3://partitioned-input-data/*/*/*/*/*.txt" \
    "s3://output-bucket/table-name"
  • Do you have SSH access to the EMR machines? – Thiago Baldim Nov 14 '17 at 05:20
  • I do have. I'm not at work ATM though. – Dusan Vasiljevic Nov 14 '17 at 07:53
  • I'm not sure I understand why you don't use `--master yarn` in your `Args` parameters... I'm sorry, this question is confusing! :/ – eliasah Nov 14 '17 at 10:26
  • You can't find your data in S3? – eliasah Nov 14 '17 at 10:29
  • @eliasah I didn't put `--master` because my arguments are being read as `main` method arguments (see point 2) and not `spark-submit` arguments. – Dusan Vasiljevic Nov 14 '17 at 19:05
  • @eliasah It's not S3, because I cannot even find the printing of the arguments that happens before the S3 access. My question is whether I should use 'Type=Spark' or 'Type=CUSTOM_JAR' for an application like this. For a custom jar it wants all the libs, including the Spark ones, but when I run this program locally on Spark it only needs the third-party libs. That's what confuses me. – Dusan Vasiljevic Nov 14 '17 at 19:10

1 Answer


Option 1) was working.

The step overview in the AWS console said the task had finished within 15 seconds, but in reality it was still running on the cluster. It took an hour to do the work, and I can see the result.

I do not know why the step misreports the result. I'm using emr-5.9.0 with Ganglia 3.7.2, Spark 2.2.0, and Zeppelin 0.7.2.
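
One likely explanation, though not confirmed here: with spark.yarn.submit.waitAppCompletion=false (as in option 1), the submission returns as soon as YARN accepts the application, so the step shows as completed while the job keeps running. A way to verify, assuming SSH access to the master node as discussed in the comments, is to list the running YARN applications:

yarn application -list -appStates RUNNING

The application should keep appearing there until it actually finishes.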
