How to transform rdd to dataframe in pyspark 1.6.1?

Question

Are there any examples of how to transform an RDD to a DataFrame, and a DataFrame back to an RDD, in PySpark 1.6.1? Can toDF() not be used in 1.6.1?

For example, I have an RDD like this:

data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                       ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

Favio Vázquez · Accepted Answer · 2017-10-10T21:26:59.867

If for some reason you can't use the .toDF() method, the solution I propose is this:

data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                   ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))

This will create a DataFrame whose columns are named "_n", where n is the column number (starting at 1). If you want to rename the columns, I suggest you look at this post: How to change dataframe column names in pyspark?. But all you need to do is:

data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five")
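
With many columns, writing each "_n as Name" expression by hand gets tedious. As a rough sketch, the expressions can be generated from a list of names. Note that rename_exprs is a hypothetical helper, not part of PySpark, and it assumes the DataFrame still has the default "_1", "_2", ... column names described above:

```python
def rename_exprs(names):
    """Build selectExpr strings that rename Spark's default columns.

    Hypothetical helper (not a PySpark API). Assumes the DataFrame still
    has the default column names _1, _2, ... in positional order.
    """
    return ["_{0} as {1}".format(i, name) for i, name in enumerate(names, start=1)]


exprs = rename_exprs(["One", "Two", "Three", "Four", "Five"])
print(exprs)
```

The list can then be unpacked into the call, e.g. data.selectExpr(*exprs), which should give the same DataFrame as data_named above.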

Now let's see the DF:

data_named.show()

And this will output:

+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+

EDIT: Try again, because you should be able to use .toDF() in Spark 1.6.1.

score 0 · Answer 2 · answered Oct 10 '17 at 20:53

I do not see a reason why rdd.toDF() cannot be used in PySpark for Spark 1.6.1. Please check the Spark 1.6.1 Python docs for an example of toDF(): https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext

As per your requirement,

rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

# rdd to dataframe
df = rdd.toDF()
## can provide column names (one per column) like df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')

#dataframe to rdd
rdd2 = df.rdd
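
One thing that may explain the question: in PySpark 1.x, toDF() is attached to RDDs only when an SQLContext (or HiveContext) is created, so calling rdd.toDF() before that raises an AttributeError. Below is a rough sketch of the round trip that avoids toDF() altogether, assuming an existing sqlContext. The names rdd_to_named_df and df_to_tuples are hypothetical helpers, not PySpark functions:

```python
def rdd_to_named_df(sql_context, rdd, names):
    """RDD of tuples -> DataFrame, with column names given up front.

    Hypothetical helper. createDataFrame accepts a list of column names
    as its schema argument and infers the column types from the data.
    """
    return sql_context.createDataFrame(rdd, list(names))


def df_to_tuples(df):
    """DataFrame -> RDD of plain tuples.

    Hypothetical helper. df.rdd yields Row objects; Row is a tuple
    subclass, so tuple(row) drops the field names.
    """
    return df.rdd.map(tuple)
```

For example, df = rdd_to_named_df(sqlContext, rdd, ['One', 'Two', 'Three', 'Four', 'Five']) followed by rdd2 = df_to_tuples(df).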
