
I used to run Spark locally, and distributing files to the nodes never caused me problems, but now that I am moving things to the Amazon cluster service, things start to break down. Basically, I am processing some IPs using the MaxMind GeoLiteCity.dat, which I placed on the local file system of the master (file:///home/hadoop/GeoLiteCity.dat).

Following a question from earlier, I used sc.addFile:

sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

and call it using something like:

val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)

This works when running locally on my computer, but it seems to fail on the cluster (I do not know the reason for the failure, and I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information on which step is failing).
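
For context, a stripped-down version of my driver code looks roughly like this (object and app names are placeholders, the actual lookup usage is omitted, and the IpLookups import assumes the scala-maxmind-iplookups package):

import com.snowplowanalytics.maxmind.iplookups.IpLookups
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object GeoJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("geo-lookup"))

    // distribute the MaxMind database from the master's local file system to the nodes
    sc.addFile("file:///home/hadoop/GeoLiteCity.dat")

    // build the lookup against the node-local copy that addFile placed there
    val ipLookups = IpLookups(
      geoFile = Some(SparkFiles.get("GeoLiteCity.dat")),
      memCache = false,
      lruCache = 20000)

    // ... map over an RDD of IP strings and look each one up with ipLookups ...
  }
}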

Do I have to somehow load the GeoLiteCity.dat onto HDFS? Are there other ways to distribute a local file from the master to the nodes without HDFS?

EDIT: Just to clarify how I run this: I wrote a JSON file that defines multiple steps. The first step runs a bash script which transfers GeoLiteCity.dat from Amazon S3 to the master:

#!/bin/bash
# copy the MaxMind database from S3 into the master's home directory
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat

After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.

GameOfThrows
  • Will loading the file into AWS S3 and using it from S3 work for you? http://stackoverflow.com/a/31580277/4057655 – sag Aug 14 '15 at 09:16
  • I uploaded the file to S3 and used a bash script to load the .dat from S3 to the master; this is what I do: `cd /home/hadoop; aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat` – GameOfThrows Aug 14 '15 at 09:24
  • Instead of copying the file onto the master, can you just read it from S3 itself and use it? Like `sc.textFile()` – sag Aug 14 '15 at 09:31
  • Yeah, I was reading the post you suggested; the question is, if I use `sc.textFile()`, would I have to place the ACCESS_KEY and SECRET_KEY into the application? For safety reasons, I would prefer not to. – GameOfThrows Aug 14 '15 at 09:35
  • You can export the S3 keys using `export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU; export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123; ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster`, instead of specifying them in your application. Refer to http://spark.apache.org/docs/latest/ec2-scripts.html and http://stackoverflow.com/a/24056830/4057655 – sag Aug 14 '15 at 09:41
  • Okay, I see. I've already exported my access keys to the environment variables, so what should I set the KEYS in `sc.textFile()` to? Do I have to do the .set on the SparkContext as suggested? – GameOfThrows Aug 14 '15 at 09:49

1 Answer


Instead of copying the file onto the master, load it into S3 and read it from there.

Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.

You need to provide the AWS Access Key ID and Secret Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set them programmatically, like:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)

Then you can just read the file as a text file, like:

sc.textFile("s3n://test/GeoLiteCity.dat")
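
Putting both snippets together, here is a minimal sketch (assuming a Scala job; the bucket name follows your example, and the keys are read from the environment variables mentioned above):

import org.apache.spark.{SparkConf, SparkContext}

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-from-s3"))

    // pick up the credentials from the environment instead of hard-coding them
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // read the object straight from S3; no copy onto the master is needed
    val data = sc.textFile("s3n://test/GeoLiteCity.dat")
    println(s"Read ${data.count()} records from S3")
  }
}

This way no AWS credentials end up in the application code itself.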

Additional reference: How to read input from S3 in a Spark Streaming EC2 cluster application, https://stackoverflow.com/a/30852341/4057655

sag