14

I am trying to run my Spark job on an Amazon EKS cluster. My Spark job requires some static data (reference data) on each data node/worker/executor, and this reference data is available in S3.

Can somebody kindly help me find a clean and performant solution to mount an S3 bucket on pods?

The S3 API is an option, and I am using it for my input records and output results. But the reference data is static, so I don't want to download it on each run/execution of my Spark job. On the first run the job will download the data, and upcoming jobs will check whether the data is already available locally, in which case there is no need to download it again.

Ajeet
  • To accomplish this: "I am using org.apache.hadoop.fs.FileUtil.copy in my Spark job to copy/download data from S3 to a provided local download location. And this download location is a 'K8S local mount volume', so all pods on a node will share a directory of the host node." And we can pass the volume name with spark-submit (a hostPath sketch of this follows below). – Ajeet Jan 25 '19 at 07:29
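A minimal sketch of that comment's approach, assuming a hostPath volume is what "K8S local mount volume" refers to: every executor pod scheduled on the same node mounts the same host directory, so data downloaded by one run can be reused by the next. The volume name reference-cache, the paths, and the image are placeholders.

# Hypothetical executor pod spec; all names, paths and the image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-with-cache
spec:
  containers:
    - name: spark-executor
      image: my-spark-image:latest           # placeholder Spark image
      volumeMounts:
        - name: reference-cache
          mountPath: /data/reference-cache   # the job checks this path before hitting S3
  volumes:
    - name: reference-cache
      hostPath:
        path: /mnt/reference-cache           # shared directory on the worker node
        type: DirectoryOrCreate

With Spark on Kubernetes the same volume can alternatively be declared through spark-submit, e.g. via the spark.kubernetes.executor.volumes.hostPath.<name>.mount.path and ...options.path configuration properties.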

2 Answers

6

We recently open-sourced a project that aims to automate these steps for you: https://github.com/IBM/dataset-lifecycle-framework

Basically you can create a dataset:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "iQkv3FABR0eywcEeyJAQ"
    secretAccessKey: "MIK3FPER+YQgb2ug26osxP/c8htr/05TVNJYuwmy"
    endpoint: "http://192.168.39.245:31772"
    bucket: "my-bucket-d4078283-dc35-4f12-a1a3-6f32571b0d62"
    region: "" #it can be empty

Then you will get a PVC that you can mount in your pods.
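For example, a pod could consume it like this. This is a sketch that assumes the framework creates a PVC named after the Dataset (example-dataset); check the project's documentation for the exact naming and labels it expects.

apiVersion: v1
kind: Pod
metadata:
  name: dataset-consumer
spec:
  containers:
    - name: app
      image: busybox                      # placeholder image
      command: ["sh", "-c", "ls /mnt/dataset && sleep 3600"]
      volumeMounts:
        - name: example-dataset
          mountPath: /mnt/dataset         # bucket contents are exposed here
  volumes:
    - name: example-dataset
      persistentVolumeClaim:
        claimName: example-dataset        # assumed to match the Dataset name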

Yiannis Gkoufas
  • Sadly I found it too hard to find the CRD yaml in the repo, or helm chart, and gave up... – GDev Sep 06 '20 at 23:49
  • Apologies @Gdev it's still under heavy development, can you please have a look in the installation wiki https://github.com/IBM/dataset-lifecycle-framework/wiki/Installation ? Here are also some example yamls, https://github.com/IBM/dataset-lifecycle-framework/tree/master/examples/templates – Yiannis Gkoufas Sep 08 '20 at 06:04

4

In general, you just don't do that. You should instead interact directly with the S3 API to retrieve/store what you need (probably via some tool like the AWS CLI).

As you run in AWS, you can have IAM configured in a way that your nodes can access particular data authorized at the "infrastructure" level, or you can provide S3 access tokens via secrets/configmaps/env etc.
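A minimal sketch of the secrets/env variant, assuming static access keys rather than an IAM role; the Secret name, bucket, paths and values are placeholders:

# Hypothetical Secret holding S3 credentials (placeholder values).
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIA-PLACEHOLDER"
  AWS_SECRET_ACCESS_KEY: "placeholder-secret-key"
---
# Pod that downloads the reference data into a volume with the AWS CLI.
apiVersion: v1
kind: Pod
metadata:
  name: s3-downloader
spec:
  containers:
    - name: sync
      image: amazon/aws-cli               # official AWS CLI image
      command: ["aws", "s3", "sync", "s3://my-reference-bucket/reference-data", "/data"]
      envFrom:
        - secretRef:
            name: s3-credentials          # exposes the keys as env vars the CLI reads
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      emptyDir: {}

With IAM roles (e.g. IAM Roles for Service Accounts on EKS) the Secret becomes unnecessary and the pod picks up credentials automatically.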

S3 is not a filesystem, so don't expect it to behave like one (even though there are FUSE clients that emulate a filesystem on top of it, this is rarely the right solution).

Radek 'Goblin' Pieczonka
  • Thank you @Radek for the prompt response. Yes, I agree the S3 API is an option and I am using it for my input records and output results. But the reference data is static, so I don't want to download it on each run/execution of my Spark job. On the first run the job will download the data, and upcoming jobs will check whether the data is already available locally, so there is no need to download it again. – Ajeet Aug 05 '18 at 13:56