
I have a NumPy array in PySpark that I would like to convert to a DataFrame so I can write it out as a CSV to view it.

I initially read the data in from a DataFrame; however, I had to convert it to a NumPy array in order to use numpy.random.normal(). Now I want to convert the data back to a DataFrame so I can write it out as a CSV.
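
For context, this is roughly the shape of what I'm doing; the path and the column name value are illustrative rather than my real ones, and spark is the session provided by the pyspark shell:

import numpy as np

# Read the original data (illustrative path and column name)
df = spark.read.csv("/mylocation/input.csv", header=True, inferSchema=True)

# Pull one numeric column out into a local NumPy array
zarr = np.array(df.select("value").collect()).flatten()

# The reason I needed NumPy: add Gaussian noise to the values
zarr = zarr + np.random.normal(loc=0.0, scale=1.0, size=zarr.shape)

# zarr is now the array I want to get back into a DataFrame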

I have tried the following directly on the array:

zarr.write.csv("/mylocation/inHDFS")

However, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'write'

Any ideas?


2 Answers


A NumPy array and a Spark DataFrame are completely different structures. The first is local and has no column names; the second is distributed (or distribution-ready in local mode) and has named, strongly typed columns.

I'd recommend converting the NumPy array to a pandas DataFrame first, as described in Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?, and then converting that to a Spark DataFrame using:

df = spark.createDataFrame(pandas_df)
df.write.csv('/hdfs/path')
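
For reference, a minimal end-to-end sketch of this route; the array and the column name "SOR" are stand-ins borrowed from the question's context, so adjust them to your data:

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the array produced via numpy.random.normal() in the question
zarr = np.random.normal(loc=0.0, scale=1.0, size=100)

# Wrap the array in a pandas DataFrame; this is where the column name is chosen
pandas_df = pd.DataFrame(zarr, columns=["SOR"])

# Spark infers the schema (a single double column) from the pandas dtypes
df = spark.createDataFrame(pandas_df)
df.write.csv("/hdfs/path")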

Firstly, I needed to convert the NumPy array to an RDD as follows:

zrdd = spark.sparkContext.parallelize([zarr])

Then convert this to a DataFrame, with one row per array value, using the following (where we also define the column header):

df = zrdd.flatMap(lambda arr: [(float(v),) for v in arr]).toDF(["SOR"])

I could then write this out as normal, like so:

df.write.csv("/hdfs/mylocation")