
I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.

Currently I am building a CSV-format string, appending one row to it (separated by a newline) on each iteration of the loop. I am using the CSV format so that I can also save it as a text file for other processing.

File Header:

output_str="Col1,Col2,Col3,Col4\n"

Inside for loop:

output_str += "Val1,Val2,Val3,Val4\n"

I then create an RDD by splitting the string on newlines and convert it into a dataframe as follows.

output_rdd = sc.parallelize(output_str.split("\n")) 
output_df = output_rdd.map(lambda x: (x, )).toDF()

It creates a dataframe, but it only has 1 column. I know that is because of the map function, where I wrap each line in a tuple with only 1 item. What I need is a tuple with multiple items per row. So perhaps I should be calling the split() function on every line to get a list, but I have a feeling that there should be a much more straightforward way. Appreciate any help. Thanks.
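For reference, a minimal sketch of the more direct route: skip the string entirely and collect the rows as tuples. It assumes a SparkSession named spark (with only sc, sqlContext.createDataFrame works the same way); `records` and the placeholder values are hypothetical.

    # Sketch: collect each row as a tuple instead of appending to a CSV string.
    rows = []
    for record in records:                      # `records` stands in for whatever drives the loop
        rows.append(("Val1", "Val2", "Val3", "Val4"))   # placeholder values

    output_df = spark.createDataFrame(rows, ["Col1", "Col2", "Col3", "Col4"])

    # The DataFrame can still be written out as CSV for other processing:
    # output_df.write.csv("/path/to/output", header=True)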

Edit: To give more information: using Spark SQL, I have filtered my dataset to those rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract those values from column 3 whose corresponding flag in column 4 is set to 1 (here it is 0xcd):

Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00

So I am collecting the output at the driver and then parsing the last 2 columns, after which I am left with regular strings that I want to put back into a dataframe. I am not sure whether I can achieve the same using Spark SQL to parse the output in the manner I want.
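For concreteness, a small sketch of the driver-side parsing described above, assuming the flag "0x01" in the 4th field marks the value to keep from the 3rd field:

    # Driver-side parsing of one pipe-delimited line.
    line = "Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00"

    fields = line.split("|")
    values = fields[2].split(",")   # ['0xab', '0xcd', '0xef']
    flags  = fields[3].split(",")   # ['0x00', '0x01', '0x00']

    selected = [v for v, f in zip(values, flags) if f == "0x01"]
    # selected -> ['0xcd']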

Nikhil Utane
  • I achieved it using pandas pd.read.csv() function. (Courtesy: [How to create a Pandas DataFrame from String](https://stackoverflow.com/questions/22604564/how-to-create-a-pandas-dataframe-from-string)) – Nikhil Utane Jun 19 '17 at 05:18
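A minimal sketch of the pandas route mentioned in the comment, assuming a SparkSession named spark:

    import io
    import pandas as pd

    # Parse the accumulated CSV string with pandas, then hand it to Spark.
    output_str = "Col1,Col2,Col3,Col4\nVal1,Val2,Val3,Val4\n"   # as built in the loop

    pdf = pd.read_csv(io.StringIO(output_str))
    output_df = spark.createDataFrame(pdf)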

2 Answers


Yes, indeed your current approach seems a little too complicated... Building a large string on the Spark driver and then parallelizing it with Spark does not perform well.

First of all, the question is where you are getting your input data from. In my opinion, you should use one of the existing Spark readers to read it; for example, spark.read.csv for delimited files or spark.read.text for raw lines.

In the next step you can preprocess it using the Spark DataFrame or RDD API, depending on your use case.
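A sketch of that flow, assuming the original input is a delimited text file (the paths, column name, and filter condition are placeholders) and a SparkSession named spark:

    # Read the raw input with one of the built-in readers.
    df = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)

    # ...or keep each line as a single string column named `value`:
    lines_df = spark.read.text("/path/to/input.txt")

    # Then preprocess with the DataFrame API, e.g. keep only the problem rows
    # (the condition below is a placeholder).
    problem_df = df.filter(df["Col4"] == "0x01")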

Piotr Kalański
  • Thanks for your response. I am analyzing a huge data file and the input is the output of that processing. Basically I have parsed thousands of lines and identified the problem set which is relatively small. This set I need to store in a file and also do further processing on. Should I then just save it to a file and read it back? However there should be some way to achieve the same without using a file. Thanks. – Nikhil Utane Jun 17 '17 at 04:33
  • Added more information to the question as to why I am doing it this way. – Nikhil Utane Jun 17 '17 at 04:43
  • No, you shouldn't save it in file. You can do all processing using Spark. I suggest using Spark SQL functions: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#module-pyspark.sql.functions. For example for split you can use: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.split – Piotr Kalański Jun 17 '17 at 10:46
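A sketch of the Spark-SQL-only route suggested in the comment above, assuming the raw line sits in a single string column named `value` (e.g. from spark.read.text) in a DataFrame called raw_df, and that the flag "0x01" marks the values to keep:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Split the pipe-delimited line, then split columns 3 and 4 on commas.
    parsed = (raw_df
              .withColumn("parts", F.split(F.col("value"), r"\|"))
              .withColumn("values", F.split(F.col("parts").getItem(2), ","))
              .withColumn("flags", F.split(F.col("parts").getItem(3), ",")))

    # Spark 2.1 has no built-in array zip, so a small UDF pairs values with flags.
    flagged = F.udf(
        lambda values, flags: [v for v, f in zip(values, flags) if f == "0x01"],
        ArrayType(StringType()),
    )

    result = parsed.withColumn("flagged_values", flagged("values", "flags"))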

A bit late, but currently you're applying a map that creates a tuple for each row with the whole string as its only element. Instead, you probably want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements, you can replace:

output_df = output_rdd.map(lambda x: (x, )).toDF()

with

output_df = output_rdd.map(lambda x: x.split(",")).toDF()
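One caveat: splitting the original string on "\n" also yields the header line and a trailing empty string. A sketch (reusing the header from the question) that filters those out and names the columns:

    header = "Col1,Col2,Col3,Col4"

    output_df = (output_rdd
                 .filter(lambda x: x and x != header)   # drop the empty line and the header
                 .map(lambda x: x.split(","))           # split each row on the CSV delimiter
                 .toDF(header.split(",")))              # use the header fields as column names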
user1993951