
I have seen and tried many existing StackOverflow posts on this issue, but none of them work. I guess my Java heap space is not as large as expected for my large dataset: the dataset contains 6.5M rows, and my Linux instance has 64GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I think making a dictionary from a PySpark dataframe should not be very costly. Please advise me if there is any other way to compute it.

I just want to make a Python dictionary from my PySpark dataframe. This is the content of my PySpark dataframe:

property_sql_df.show() shows:

+--------------+------------+--------------------+--------------------+
|            id|country_code|                name|    hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
|  BOND-9129450|          US|Scotron Home w/Ga...|90cb0946cf4139e12...|
|  BOND-1742850|          US|Sited in the Mead...|d5c301f00e9966483...|
|  BOND-3211356|          US|NEW LISTING - Com...|811fa26e240d726ec...|
|  BOND-7630290|          US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
|  BOND-7175508|          US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+

What I want is to make a dictionary with hash_of_cc_pn_li as the key and a list of ids as the value.

Expected Output

{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"]
  "d5c301f00e9966483": ["BOND-1742850","BOND-7630290"]
}

What I have tried so far:

%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]

What I get now on the console:

java.lang.OutOfMemoryError: Java heap space

and this error in the Jupyter notebook output:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
Always Sunny
  • Not sure if it ultimately helps, but as a first step you might try to distribute the list collection as `df.groupby("hash_of_cc").agg(collect_list("id")).show()`. And then take another hard look and decide whether you really need to collect your full dataframe to the driver. – mazaneicha Jul 26 '20 at 18:00

4 Answers


making a dictionary from a PySpark dataframe should not be very costly

This is true in terms of runtime, but it will easily take up a lot of space, especially since property_sql_df.collect() loads your entire dataframe into driver memory. At 6.5M rows, you'll already hit 65GB if each row takes 10KB (about 10K characters), and we haven't even gotten to the dictionary yet.
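
As a rough sanity check on that figure (the 10KB per row is just an assumed average, not something measured on your data):

rows = 6_500_000
bytes_per_row = 10_000  # assumed ~10K characters per row
print(f"~{rows * bytes_per_row / 1e9:.0f} GB pulled onto the driver")  # ~65 GB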

First, you can collect just the columns you need (e.g. not name). Second, you can do the aggregation upstream in Spark, which will save some space depending on how many ids there are per hash_of_cc_pn_li:

from pyspark.sql.functions import collect_set

rows = property_sql_df.groupBy("hash_of_cc_pn_li") \
  .agg(collect_set("id").alias("ids")) \
  .collect()

duplicate_property_list = {row.hash_of_cc_pn_li: row.ids for row in rows}
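
As a usage sketch (my addition, not part of the original approach): if the point of duplicate_property_list is really de-duplication, you can filter down to hashes shared by more than one id afterwards.

# Keep only hashes that map to more than one id, i.e. the likely duplicates.
duplicates_only = {h: ids for h, ids in duplicate_property_list.items() if len(ids) > 1}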
Eric Doi
  • Thanks for the elaborated answer sir. – Always Sunny Jul 26 '20 at 18:13
  • Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? – Always Sunny Jul 26 '20 at 18:17
  • `df.toPandas()` is even worse than `df.collect()` in terms of how much space it's going to take up. What to do instead depends on what you're trying to do after this. If, say, the point is just to have a csv file with these mappings, then you can just go directly from the dataframe (after using `collect_list`/`collect_set`) and save it to a file via `df.coalesce(1).write.option('sep',';').csv(path)`. – Eric Doi Jul 26 '20 at 18:27
  • No luck, same issue occurring – Always Sunny Jul 27 '20 at 05:39
  • followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror – Always Sunny Jul 27 '20 at 06:29

Here's how to make a sample DataFrame with your data:

data = [
    ("BOND-9129450", "90cb"),
    ("BOND-1742850", "d5c3"),
    ("BOND-3211356", "811f"),
    ("BOND-7630290", "d5c3"),
    ("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])

Let's aggregate the data in a Spark DataFrame to limit the number of rows that are collected on the driver node. We'll use the two_columns_to_dictionary function defined in quinn to create the dictionary.

import quinn
from pyspark.sql import functions as F

agg_df = df.groupBy("hash_of_cc_pn_li").agg(F.max("hash_of_cc_pn_li").alias("hash"), F.collect_list("id").alias("id"))
res = quinn.two_columns_to_dictionary(agg_df, "hash", "id")
print(res) # => {'811f': ['BOND-3211356'], 'd5c3': ['BOND-1742850', 'BOND-7630290'], '90cb': ['BOND-9129450', 'BOND-7175508']}
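
If you'd rather not add the quinn dependency, a rough equivalent (a sketch of what the helper amounts to, not quinn's actual implementation) is a plain dict comprehension over the collected aggregate:

# Collect only the two aggregated columns and zip them into a dict on the driver.
res = {row["hash"]: row["id"] for row in agg_df.select("hash", "id").collect()}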

This might work on a relatively small, 6.5-million-row dataset, but it won't work on a huge dataset. "I think making a dictionary from a PySpark dataframe should not be very costly" is only true for DataFrames that are really tiny. Making a dictionary from a PySpark DataFrame is actually very expensive.

PySpark is a cluster computing framework that benefits from having data spread out across the nodes in a cluster. When you call collect(), all of the data is moved to the driver node and the worker nodes don't help. You'll get an OutOfMemory exception whenever you try to move too much data to the driver node.

It's probably best to avoid the dictionary entirely and figure out a different way to solve the problem. Great question.

Powers
  • Thanks for the elaborated answer sir. – Always Sunny Jul 26 '20 at 18:13
  • Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? – Always Sunny Jul 26 '20 at 18:17
  • @AlwaysSunny - Yep, toPandas also collects all the data on the driver node and should be avoided unless the dataset is tiny. – Powers Jul 26 '20 at 18:41
  • No luck, same issue occurring – Always Sunny Jul 27 '20 at 05:39
  • @AlwaysSunny - Yep, not surprising, that's a common error whenever you try to collect a large dataset. Feel free to ask another question that explains what you're trying to accomplish with the dictionary, and I can see if there is another way to solve your problem. – Powers Jul 27 '20 at 05:43
  • Ok sir, got your point. But with just 5000 records it is causing the same error. – Always Sunny Jul 27 '20 at 06:07
  • followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror – Always Sunny Jul 27 '20 at 06:29

From Spark 2.4 on, we can use the built-in groupBy, collect_list, map_from_arrays, and to_json functions for this case.

Example:

df.show()
#+------------+-----------------+
#|          id| hash_of_cc_pn_li|
#+------------+-----------------+
#|BOND-9129450|90cb0946cf4139e12|
#|BOND-7175508|90cb0946cf4139e12|
#|BOND-1742850|d5c301f00e9966483|
#|BOND-7630290|d5c301f00e9966483|
#+------------+-----------------+
from pyspark.sql.functions import col, collect_list, map_from_arrays, to_json

df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
selectExpr("to_json(map_from_arrays(array(hash_of_cc_pn_li),array(id))) as output").\
show(10,False)
#+-----------------------------------------------------+
#|output                                               |
#+-----------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"]}|
#|{"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+-----------------------------------------------------+

To get a single dict, use another agg with collect_list:

df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
agg(to_json(map_from_arrays(collect_list(col("hash_of_cc_pn_li")),collect_list(col("id")))).alias("output")).\
show(10,False)
#+---------------------------------------------------------------------------------------------------------+
#|output                                                                                                   |
#+---------------------------------------------------------------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"],"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+---------------------------------------------------------------------------------------------------------+
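
To turn that single JSON row into the Python dict the question asks for, you can parse it on the driver. A minimal sketch, assuming the chained expression above is assigned to a hypothetical variable agg_json_df instead of ending in .show():

import json

# agg_json_df is a hypothetical name for the one-row aggregated dataframe built above.
duplicate_property_list = json.loads(agg_json_df.first()["output"])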
Shu
  • Thanks for the elaborated answer sir. – Always Sunny Jul 26 '20 at 18:13
  • I am using spark 3.0 so I can use collect_list – Always Sunny Jul 26 '20 at 18:15
  • Just a quick question the same error happened when I try to convert my pyspark dataframe to pandas using df.toPandas(), what should I do in that case? – Always Sunny Jul 26 '20 at 18:17
  • @AlwaysSunny, Try to avoid converting to a pandas dataframe and, if possible, use Spark built-in functions to get the expected output; converting to pandas collects the data and causes driver failure. https://stackoverflow.com/questions/39811054/pyspark-simple-re-partition-and-topandas-fails-to-finish-on-just-600-000-rows – Shu Jul 26 '20 at 18:27
  • No luck, same issue occurring – Always Sunny Jul 27 '20 at 05:39
  • followup question: https://stackoverflow.com/questions/63109775/creating-dictionary-in-loop-a-small-pyspark-dataframe-showing-outofmemoryerror – Always Sunny Jul 27 '20 at 06:29

Adding the accepted answer from the linked post for posterity. That answer solves the problem by leveraging the write.json method, preventing the collection of a too-large dataset to the driver:

https://stackoverflow.com/a/63111765/12378881
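
A minimal sketch of that idea, assuming the column names from the question (the linked answer's exact code may differ):

from pyspark.sql import functions as F

# Aggregate in Spark and write the mapping out as JSON on the cluster,
# so the full result is never collected to the driver. The path is a placeholder.
(df.groupBy("hash_of_cc_pn_li")
   .agg(F.collect_list("id").alias("ids"))
   .write.mode("overwrite")
   .json("/tmp/hash_to_ids"))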

napoleon_borntoparty