
Hello, I am new to PySpark and I have a dataframe that I created using the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.option("header",True).csv("input.csv")

I now want to write this df to S3, but everything I have tried from online sources has not helped.

I first tried to set up the configuration like this:

spark.sparkContext.hadoopConfiguration.set("fs.s3n.access.key", "my access key")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.secret.key", "my secret key")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.endpoint", "s3.amazonaws.com")

But for this I get the error:

AttributeError: 'SparkContext' object has no attribute 'hadoopConfiguration'

I also tried the following different methods to write:

df.write.option("header","true").csv("s3://mypath")
df.write.parquet("s3://mypath", mode="overwrite")
df.coalesce(1).write.format('csv').mode('overwrite').option("header", "false")\
.save("s3://mypath")

But for all these I get the same error:

: java.io.IOException: No FileSystem for scheme: s3

I am new to this and I really don't know what to do. Can anyone help me out?

1 Answer


Just change the configuration to go through the JVM-side SparkContext, _jsc, which is what exposes hadoopConfiguration() in PySpark:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.access.key", "my access key")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.secret.key", "my secret key")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3n.endpoint", "s3.amazonaws.com")
itIsNaz