
I have some code that reads a parquet file and then displays it, like this:

from pyspark.sql import SQLContext

sc = spark.sparkContext
sqlContext = SQLContext(sc)

# read the parquet data and show the first 100 rows
lines = sqlContext.read.parquet("hdfs:////home/records/")
lines.take(100)

This works fine, but I want to create a CSV file from the output, which is this:

[Row(trans_key=1130, job_id=2005972, rec=1, old_id=833715, amount=2, temp_value=0.55, loc_id=31642),
 Row(trans_key=1230, job_id=2005972, rec=4, old_id=832715, amount=22, temp_value=0.99, loc_id=31642),
 Row(trans_key=1930, job_id=2905972, rec=5, old_id=831715, amount=32, temp_value=0.33, loc_id=31642),
 Row(trans_key=1430, job_id=2705972, rec=6, old_id=833775, amount=20, temp_value=0.10, loc_id=31642), ...]

I want to create a CSV file with a header row of column names followed by the comma-separated data, like this:

trans_key,job_id,rec,old_id,amount,temp_value,loc_id
1130,2005972,1,833715,2,0.55,31642
1230,2005972,4,832715,22,0.99,31642
1430,2705972,6,833775,20,0.10,31642

I am stuck on how to turn my results from the parquet file into a CSV file. Can you help me?


1 Answer

This should do it:

# writes a directory 'path/to/my.csv' containing a single CSV part file
lines.repartition(1).write.format('com.databricks.spark.csv').option('header', 'true').save('path/to/my.csv')
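
On Spark 2.0+ the CSV source is built in, so the external Databricks package is no longer needed. A minimal sketch, assuming lines is the DataFrame read above and that path/to/output is a hypothetical output location:

# built-in CSV writer (Spark 2+); repartition(1) yields a single part file
lines.repartition(1).write.csv('path/to/output', header=True)

If the result is small enough to collect on the driver (and pandas is available there), toPandas() produces a true single local file rather than a directory of part files:

# pull the rows to the driver and write one ordinary CSV file
lines.toPandas().to_csv('my.csv', index=False)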

  • Thanks, this helps a lot. Also, after submitting this, I did find answers elsewhere on your site. I will do a better job of researching first. – Steve McAffer Feb 02 '18 at 16:48