16

I'm trying to read a txt file from S3 with Spark, but I'm getting this error:

No FileSystem for scheme: s3

This is my code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("first")
sc = SparkContext(conf=conf)
data = sc.textFile("s3://" + AWS_ACCESS_KEY + ":" + AWS_SECRET_KEY + "@/aaa/aaa/aaa.txt")

header = data.first()

This is the full traceback:

An error occurred while calling o25.partitions.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

How can I fix this?

Filipe Ferminiano
  • Did you try `s3n://`? Also, have a look at: https://stackoverflow.com/questions/42754276/naive-install-of-pyspark-to-also-support-s3-access – andrew_reece Oct 14 '17 at 07:26
  • I tried, but I still get the same error. I took a look at that question and it's the same thing I'm already doing. I'm using PySpark 2.2. – Filipe Ferminiano Oct 14 '17 at 14:40
  • In Scala, the additional "hadoop-aws" libraries are required for S3 access. Maybe you need them too: https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0 – pasha701 Oct 14 '17 at 19:59
  • I would suggest using a POSIX-compatible filesystem like http://juicefs.io that supports S3 as a backend. You just mount the filesystem and then use it like a local directory, so your code looks the same whether it runs locally or on a cloud instance. – satoru Nov 17 '18 at 00:20

2 Answers

13

If you are using a local machine you can use boto3:

import boto3

s3 = boto3.resource('s3')
# get a handle on the bucket that holds your file
bucket = s3.Bucket('yourBucket')
# get a handle on the object you want (i.e. your file)
obj = bucket.Object(key='yourFile.extension')
# get the object
response = obj.get()
# read the contents of the file, decode it, and split it into a list of lines
lines = response['Body'].read().decode('utf-8').split('\n')

(Do not forget to set up your AWS credentials first.)
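
If you still need the data in Spark afterwards, a minimal sketch (assuming the `sc` SparkContext from the question and the `lines` list from the snippet above) is to parallelize the lines into an RDD:

# Sketch only: turn the boto3 result into a Spark RDD.
# `sc` is the SparkContext from the question, `lines` comes from the snippet above.
rdd = sc.parallelize(lines)
header = rdd.first()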

Another clean solution, if you are running on an AWS EC2 instance, is to grant S3 permissions to the instance (for example via an IAM role) and launch pyspark with this command:

pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2

If you add other packages, make sure they follow the 'groupId:artifactId:version' format and are separated by commas.
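
Once the shell is launched with those packages on the classpath, reading the file with the s3a:// scheme should work. A minimal sketch (the fs.s3a.* keys are the standard hadoop-aws settings; the bucket name, object path, and credential variables are placeholders, and on an EC2 instance with an IAM role the two credential lines may be unnecessary):

# Sketch: set the s3a credentials on the Hadoop configuration, then read the file.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY)
data = sc.textFile("s3a://your-bucket/aaa/aaa.txt")
header = data.first()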

If you are using pyspark from a Jupyter notebook, this will work:

import os
import pyspark

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created,
# otherwise the extra packages are not picked up.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath)  # Parquet file read example
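
A plain text file (as in the question) can be read the same way once the packages are on the classpath; a one-line sketch with a placeholder bucket and key:

data = sc.textFile("s3a://yourBucket/yourFile.txt")  # text file read example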
HagaiA
gilgorio
  • So boto3 doesn't work on a Spark cluster for some reason, or what's the problem? I think I can wrap the boto3 output into a DataFrame, right? Otherwise I have the same issue with Spark 2.4.2 at the moment – kensai Apr 25 '19 at 10:56
  • Just note this doesn't necessarily work: the version you load within the pyspark script must match the version in the jars folder of your pyspark installation. My pyspark has hadoop* 2.7.3 files in jars, so I use that version for the submit args as well, and now I can simply use S3 (no a/n suffix required). – kensai Apr 25 '19 at 11:50
  • Sorry, but why do you have spaces? Isn't the correct command `pyspark --packages=com.amazonaws:aws-java-sdk:1.11.7755,org.apache.hadoop:hadoop-aws:3.2.1`? Note that this still fails for me with both s3 and s3a URLs. – rjurney May 06 '20 at 00:41
  • It worked for me with a space between the arg name and the arg value – gilgorio May 08 '20 at 18:35
  • Hi, when I tried `pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2` it gave me the error `[NOT FOUND ] junit#junit;4.11!junit.jar`, and the `No FileSystem for scheme: s3` error is still there. I'm using pyspark 2.4.4, any chance you know why? – Cecilia Feb 05 '21 at 09:36
1

If you're using a Jupyter notebook, you must add two files to the classpath for Spark:

/home/ec2-user/anaconda3/envs/ENV-XXX/lib/python3.6/site-packages/pyspark/jars

The two files are:

  • hadoop-aws-2.10.1-amzn-0.jar
  • aws-java-sdk-1.11.890.jar
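
Copying the jars into that directory is one option; another (my assumption, not part of this answer) is to point Spark at them explicitly through the `spark.jars` configuration when the session is created. A rough sketch with placeholder paths:

from pyspark.sql import SparkSession

# Sketch only: replace /path/to/... with wherever the two jars actually live.
jars = ",".join([
    "/path/to/hadoop-aws-2.10.1-amzn-0.jar",
    "/path/to/aws-java-sdk-1.11.890.jar",
])
spark = (SparkSession.builder
         .appName("s3-read")
         .config("spark.jars", jars)
         .getOrCreate())
df = spark.read.text("s3a://your-bucket/yourFile.txt")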
welkinwalker