
For example, I have a folder:

/
  - test.py
  - test.yml

and the job is submitted to the Spark cluster with:

gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

import logging

with open('test.yml') as test_file:
    logging.info(test_file.read())

but I got the following exception:

IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

lucemia
    The first thing that comes to mind is to add the file to a distributed file system (like HDFS) which the cluster can access. I am sure others would provide a better solution. – Shagun Sodhani Jan 22 '16 at 05:22

3 Answers


Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:

  • getRootDirectory() - returns the root directory for distributed files
  • get(filename) - returns the absolute path to the file

I am not sure if there are any Dataproc-specific limitations, but something like this should work just fine:

import logging

from pyspark import SparkFiles

with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
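
If you prefer to build the path yourself, getRootDirectory() can be combined with the file name. A minimal sketch of that, assuming the same test.yml distributed via --files:

import logging
import os

from pyspark import SparkFiles

# Files shipped with --files / SparkContext.addFile are staged under this directory
root = SparkFiles.getRootDirectory()

with open(os.path.join(root, 'test.yml')) as test_file:
    logging.info(test_file.read())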
zero323

Currently, as Dataproc is no longer in beta, in order to directly access a file in Cloud Storage from the PySpark code, submitting the job with the --files parameter will do the job. SparkFiles is not required. For example:

gcloud dataproc jobs submit pyspark \
  --cluster *cluster name* --region *region name* \
  --files gs://<BUCKET NAME>/<FILE NAME> gs://<BUCKET NAME>/filename.py

When reading input from GCS via the Spark API, it works with the GCS connector.
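
A rough sketch of both points, assuming the test.yml from the question was the file passed with --files and using a placeholder gs:// path for a larger input:

import logging

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('read-config').getOrCreate()

# The file passed with --files is staged in the driver's working
# directory, so it can be opened by name without SparkFiles.
with open('test.yml') as test_file:
    logging.info(test_file.read())

# Larger inputs can be read straight from Cloud Storage through the
# GCS connector; bucket and object names here are placeholders.
df = spark.read.text('gs://<BUCKET NAME>/input.txt')
df.show()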

BabyPanda

Yep, Shagun is right.

Basically, when you submit a Spark job to the cluster, it does not serialize the file you want processed over to each worker. You have to do that yourself.

Typically, you will have to put the file in a shared file system like HDFS, S3 (Amazon), or any other DFS that all the workers can access. Once you do that and specify the file destination in your Spark script, the Spark job will be able to read and process it as you wish.
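
For example, a minimal sketch of reading such a shared file from the Spark script, assuming it has already been copied to a hypothetical hdfs:// location:

import logging

from pyspark import SparkContext

sc = SparkContext(appName='read-shared-file')

# The path must be reachable from every node; hdfs:///shared/config/test.yml
# is a placeholder for wherever you put the file on your DFS.
content = '\n'.join(sc.textFile('hdfs:///shared/config/test.yml').collect())
logging.info(content)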

However, having said this, copying the file to the same destination on ALL of your workers' and the master's file systems also works. For example, you can create a folder like /opt/spark-job/all-files/ on ALL Spark nodes, rsync the file to all of them, and then use the file in your Spark script. But please do not do this. A DFS or S3 is way better than this approach.

Winston Chen
    application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. from http://spark.apache.org/docs/latest/submitting-applications.html – Winston Chen Jan 22 '16 at 07:18