
Are there any examples of how to convert an RDD to a DataFrame, and a DataFrame back to an RDD, in PySpark 1.6.1? Can toDF() not be used in 1.6.1?

For example, I have an RDD like this:

data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                       ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])
thewaywewere
yanachen

2 Answers


If for some reason you can't use the .toDF() method, the solution I propose is this:

data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
                   ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))

This will create a DataFrame whose columns are named "_n", where n is the position of the column, starting at 1. If you want to rename the columns, I suggest you look at this post: How to change dataframe column names in pyspark?. But all you need to do is:

data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five")
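For wider DataFrames, writing each "_n as Name" expression by hand gets tedious. A minimal sketch of generating them from a list of names follows; rename_exprs is a hypothetical helper name, not part of PySpark, and it assumes the columns still carry the default "_1", "_2", ... names and that the new names are plain identifiers (no spaces or special characters).

```python
def rename_exprs(names):
    """Build '_n as Name' expressions for the default _1, _2, ... columns.

    rename_exprs is a hypothetical helper for illustration, not a PySpark API.
    """
    # enumerate from 1 because Spark's default column names start at _1
    return ["_{0} as {1}".format(i, name) for i, name in enumerate(names, start=1)]


exprs = rename_exprs(["One", "Two", "Three", "Four", "Five"])
print(exprs)
```

The resulting list can then be unpacked into the same call used above, for example data.selectExpr(*exprs).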

Now let's see the DF:

data_named.show()

And this will output:

+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+

EDIT: Try again, because you should be able to use .toDF() in Spark 1.6.1.

Favio Vázquez

I do not see a reason why rdd.toDF() cannot be used in PySpark for Spark 1.6.1. Please check the Spark 1.6.1 Python docs for an example of toDF(): https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext

As per your requirement,

rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

#rdd to dataframe
df = rdd.toDF()
# column names can be provided (one per column), e.g. df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')

#dataframe to rdd
rdd2 = df.rdd
joshi.n