
I have a NumPy array in PySpark that I would like to convert to a DataFrame so I can write it out as a CSV to view it.

I initially read the data in from a DataFrame; however, I had to convert it to a NumPy array in order to use numpy.random.normal(). Now I want to convert the data back to a DataFrame so I can write it out as a CSV.
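
For context, this is roughly the shape of what I'm doing; the path and the column name value are illustrative rather than my real ones, and spark is the session provided by the pyspark shell:

import numpy as np

# Read the original data (illustrative path and column name)
df = spark.read.csv("/mylocation/input.csv", header=True, inferSchema=True)

# Pull one numeric column out into a local NumPy array
zarr = np.array(df.select("value").collect()).flatten()

# The reason I needed NumPy: add Gaussian noise to the values
zarr = zarr + np.random.normal(loc=0.0, scale=1.0, size=zarr.shape)

# zarr is now the array I want to get back into a DataFrame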

I have tried the following directly on the array:

zarr.write.csv("/mylocation/inHDFS")

However, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'write'

Any ideas?


2 Answers


A NumPy array and a Spark DataFrame are completely different structures. The first is local and has no column names; the second is distributed (or distribution-ready in local mode) and has named, strongly typed columns.

I'd recommend converting the NumPy array to a pandas DataFrame first, as described in Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?, and then converting that to a Spark DataFrame using:

df = spark.createDataFrame(pandas_df)
df.write.csv('/hdfs/path')
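
For reference, a minimal end-to-end sketch of this route; the array and the column name "SOR" are stand-ins borrowed from the question's context, so adjust them to your data:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the array produced via numpy.random.normal() in the question
zarr = np.random.normal(loc=0.0, scale=1.0, size=100)

# Wrap the array in a pandas DataFrame; this is where the column name is chosen
pandas_df = pd.DataFrame(zarr, columns=["SOR"])

# Spark infers the schema (a single double column) from the pandas dtypes
df = spark.createDataFrame(pandas_df)
df.write.csv("/hdfs/path")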

Firstly, I needed to convert the NumPy array to an RDD as follows:

zrdd = spark.sparkContext.parallelize([zarr])

Then convert this to a DataFrame, with one row per array value, using the following (where we also define the column header):

df = zrdd.flatMap(lambda arr: [(float(v),) for v in arr]).toDF(["SOR"])

I could then write this out as normal, like so:

df.write.csv("/hdfs/mylocation")