Spark read file from S3 using sc.textFile ("s3n://...)

Question

Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

The IOException: No FileSystem for scheme: s3n error occurred with:

Spark 1.31 or 1.40 on dev machine (no Hadoop libs)
Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.60) which integrates Spark 1.2.1 out of the box
Using s3:// or s3n:// scheme

What is the cause of this error? Missing dependency, Missing configuration, or mis-use of sc.textFile()?

Or may be this is due to a bug that affects Spark build specific to Hadoop 2.60 as this post seems to suggest. I am going to try Spark for Hadoop 2.40 to see if this solves the issue.

score 44 · Accepted Answer · edited Sep 19 '16 at 10:21

44

Confirmed that this is related to the Spark build against Hadoop 2.60. Just installed Spark 1.4.0 "Pre built for Hadoop 2.4 and later" (instead of Hadoop 2.6). And the code now works OK.

sc.textFile("s3n://bucketname/Filename") now raises another error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

The code below uses the S3 URL format to show that Spark can read S3 file. Using dev machine (no Hadoop libs).

scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

scala> lyrics.count
res1: Long = 9

Even Better: the code above with AWS credentials inline in the S3N URI will break if the AWS Secret Key has a forward "/". Configuring AWS Credentials in SparkContext will fix it. Code works whether the S3 file is public or private.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count

edited Sep 19 '16 at 10:21

Prasad Khode

5,897
11
38
52

answered Jun 15 '15 at 18:24

Polymerase

5,067
6
35
54

1

Spark 1.6.0 with Hadoop 2.4 worked for me. Spark 1.6.0 with Hadoop 2.6 didn't. – Priyank Desai May 05 '16 at 22:06
1

@PriyankDesai For others with the same problem, see https://issues.apache.org/jira/browse/SPARK-7442 and the links in the comment section. – timss Sep 02 '16 at 07:59
See my answer below for the reason why it did not work with Hadoop 2.6 version. – Sergey Bahchissaraitsev Oct 27 '16 at 19:46
Adding following to my SparkContext solved my problem `code` sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")`code` – Shrikant Prabhu Sep 15 '17 at 21:35
Note you shouldn't check in the code with your secret key and access key to your code repository. Ideal way is to let your cluster environment assume your IAMRole which has access to S3. I removed access and secret keys code from my program but forgot to remove following piece of code when running on Amazon EMR sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem"), and program started to fail again with above error. – Shrikant Prabhu Oct 12 '17 at 17:57

Sergey Bahchissaraitsev · Answer 2 · 2016-08-07T20:27:34.077

Despite that this question has already an accepted answer, I think that the exact details of why this is happening are still missing. So I think there might be a place for one more answer.

If you add the required hadoop-aws dependency, your code should work.

Starting Hadoop 2.6.0, s3 FS connector has been moved to a separate library called hadoop-aws. There is also a Jira for that: Move s3-related FS connector code to hadoop-aws.

This means that any version of spark, that has been built against Hadoop 2.6.0 or newer will have to use another external dependency to be able to connect to the S3 File System.
Here is an sbt example that I have tried and is working as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

In my case, I encountered some dependencies issues, so I resolved by adding exclusion:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")

On other related note, I have yet to try it, but that it is recommended to use "s3a" and not "s3n" filesystem starting Hadoop 2.6.0.

The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.

score 17 · Answer 3 · answered Nov 20 '15 at 21:25

17

You can add the --packages parameter with the appropriate jar: to your submission:

bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py

answered Nov 20 '15 at 21:25

Andrew K

1,411
1
16
25

looked promising, but I get failed download for `file:/home/jcomeau/.m2/repository/asm/asm/3.2/asm-3.2.jar` when I do this with: `spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.83,org.apache.hadoop:hadoop-aws:2.7.3 merge.py`. any ideas? – jcomeau_ictx Jan 23 '17 at 21:59

score 9 · Answer 4 · answered May 10 '18 at 15:14

I had to copy the jar files from a hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.

Details:

Spark 2.3.0
Hadoop downloaded was 2.7.6
Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
- aws-java-sdk-1.7.4.jar
- hadoop-aws-2.7.6.jar

score 8 · Answer 5 · edited Sep 17 '16 at 11:21

8

This is a sample spark code which can read the files present on s3

val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
var jobInput = sparkContext.textFile("s3://" + s3_location)

edited Sep 17 '16 at 11:21

Prasad Khode

5,897
11
38
52

answered Jun 28 '15 at 16:48

kbt

139
4

score 7 · Answer 6 · answered Dec 20 '16 at 18:01

Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:

$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar

scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")

obviously, you need to have the jars in the path where you're running spark-shell from

I ran into this problem also with Spark 2.1.0 and added the latest aws requirements (spark.jars.packages org.apache.hadoop:hadoop-aws:2.7.3) to "spark-defaults.conf", did the trick. — Kieleth, Mar 22 '17 at 17:30

score 4 · Answer 7 · answered Oct 20 '16 at 14:48

There is a Spark JIRA, SPARK-7481, open as of today, oct 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb: need, along with tests.

And a Spark PR to match. This is how I get s3a support into my spark builds

If you do it by hand, you must get hadoop-aws JAR of the exact version the rest of your hadoop JARS have, and a version of the AWS JARs 100% in sync with what Hadoop aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...}

hadoop-aws-2.7.x.jar 
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar

Stick all of these into SPARK_HOME/jars. Run spark with your credentials set up in Env vars or in spark-default.conf

the simplest test is can you do a line count of a CSV File

val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()

Get a number: all is well. Get a stack trace. Bad news.

yes. The spark-hadoop-cloud dependency pulls in what you need. It's not included in ASF releases though. https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud — stevel, Aug 06 '19 at 12:11

score 2 · Answer 8 · answered Aug 13 '15 at 09:10

For Spark 1.4.x "Pre built for Hadoop 2.6 and later":

I just copied needed S3, S3native packages from hadoop-aws-2.6.0.jar to spark-assembly-1.4.1-hadoop2.6.0.jar.

After that I restarted spark cluster and it works. Do not forget to check owner and mode of the assembly jar.

score 2 · Answer 9 · answered Aug 22 '17 at 15:39

2

I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding hadoop-aws dependency.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

answered Aug 22 '17 at 15:39

PDerp15

68
4

1

in python: AttributeError: 'SparkContext' object has no attribute 'hadoopConfiguration' – Uri Goren Jan 18 '18 at 08:45
1

@UriGoren In Python, `hadoopConfiguration` can be accessed through the java implementation: `sc._jsc.hadoopConfiguration` – vaer-k Dec 18 '18 at 23:50

score 1 · Answer 10 · answered Jun 15 '15 at 17:42

S3N is not a default file format. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info I found here, https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce

score 1 · Answer 11 · answered Jul 18 '15 at 19:35

You probably have to use s3a:/ scheme instead of s3:/ or s3n:/ However, it is not working out of the box (for me) for the spark shell. I see the following stacktrace:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC.<init>(<console>:39)
        at $iwC.<init>(<console>:41)
        at <init>(<console>:43)
        at .<init>(<console>:47)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
        ... 68 more

What I think - you have to manually add the hadoop-aws dependency manually http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar But I have no idea how to add it to spark-shell properly.

You add the path of the jar to spark-shell with the `--jars` parameter, comma-separated. You'll also want to add the `aws-java-sdk-*-jar`. — Tristan Reid, Oct 29 '15 at 13:37

Abdul Mannan · Answer 12 · 2019-12-04T18:01:25.090

Download the hadoop-aws jar from maven repository matching your hadoop version.
Copy the jar to $SPARK_HOME/jars location.

Now in your Pyspark script, setup AWS Access Key & Secret Access Key.

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")

// where spark is SparkSession instance

For Spark scala:

spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")

score -1 · Answer 13 · answered Mar 29 '17 at 14:58

-1

USe s3a instead of s3n. I had similar issue on a Hadoop job. After switching from s3n to s3a it worked.

e.g.

s3a://myBucket/myFile1.log

answered Mar 29 '17 at 14:58

Gaj

21
1

Spark read file from S3 using sc.textFile ("s3n://...)

13 Answers13

Linked

Related