
My problem is as follows:

  • I have a large dataframe called details containing 900K rows, and another one, named attributes, containing 80M rows.

  • Both have a column A on which I would like to do a left-outer join, the left dataframe being details.

  • There are only 75K unique entries in column A in the dataframe details. The dataframe attributes has 80M unique entries in column A.

What is the best possible way to achieve the join operation?

What have I tried?

  • The simple join, i.e. details.join(attributes, "A", how="left_outer"), just times out (or runs out of memory).

  • Since there are only 75K unique entries in column A in details, we don't care about the rest of the rows in attributes. So, first I filter those out using:

    # Collect the distinct join keys to the driver and build one big IN filter
    uniqueA = details.select('A').distinct().collect()
    uniqueA = list(map(lambda x: x.A, uniqueA))  # extract the key values (list() needed in Python 3)
    attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
    

    I thought this would work because the attributes table comes down from 80M rows to a mere 75K rows. However, the join still takes forever (and never completes).

  • Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Though I don't know how to bring all the matching data onto the same partition, I figured repartitioning might help. So here it goes:

    details_repartitioned = details.repartition("A")
    attributes_repartitioned = attributes.repartition("A")
    
  • The above operation brings down the number of partitions in attributes from 70K to 200. The number of partitions in details is about 1100.

    details_attributes = details_repartitioned.join(
        broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
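
    For reference, a minimal sketch of the same attempt with the broadcast hint dropped (broadcasting an 80M-row table is likely to exhaust memory; a broadcast hint only makes sense on the small side) and combined with the filtered table from the earlier bullet:

    # Repartition both sides on the join key so matching keys land in the same
    # shuffle partitions, then join without broadcasting the large table.
    details_repartitioned = details.repartition("A")
    attributes_repartitioned = attributes_filtered.repartition("A")

    details_attributes = details_repartitioned.join(
        attributes_repartitioned, "A", how="left_outer")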
    

After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.

P.S. I have already seen this question, but it does not answer my question.

Autonomous

1 Answer


The details table has 900K rows with 75K distinct entries in column A. I think the filter on column A you have tried is the right direction. However, the collect followed by the map, and the resulting filter

attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))

are too expensive. An alternative approach would be:

from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # action that materialises the persisted keys and breaks the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")
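
Since uniqueA is a single column with only ~75K rows, a further option (a sketch, not part of the original answer) is to hint Spark to broadcast it, so the inner join becomes a broadcast hash join and the 80M-row attributes table is filtered in place rather than shuffled:

from pyspark.sql.functions import broadcast

# Broadcast the small distinct-key dataframe; each task reading attributes can
# then drop non-matching keys locally, with no shuffle of the large table.
attrJoined = attributes.join(broadcast(uniqueA), "A", "inner")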

Also, you probably need to set the shuffle partitions (spark.sql.shuffle.partitions) correctly if you haven't done that yet.
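
For example, assuming spark is the active SparkSession (5000 is the value that eventually worked in the comments below; treat it as a starting point rather than a magic number):

# Either form works; derive the value from the total shuffle size rather than guessing.
spark.conf.set("spark.sql.shuffle.partitions", 5000)
spark.sql("SET spark.sql.shuffle.partitions=5000")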

One problem that could occur in your dataset is skew: it could happen that, among the 75K unique values, only a few join with a large number of rows in the attributes table. In that case the join could take much longer and may never finish.

To resolve that you need to find the skewed values of column A and process them separately.
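
A rough sketch of that idea (the 1M-row threshold, the variable names, and the plain split-and-union strategy are illustrative assumptions, not part of the original answer):

from pyspark.sql import functions as F

# 1. Find the values of A that account for a disproportionate number of rows.
key_counts = attrJoined.groupBy("A").agg(F.count("*").alias("cnt"))
skewed_keys = [r["A"] for r in key_counts.filter("cnt > 1000000").collect()]

# 2. Join the well-behaved keys normally.
normal_part = (details.filter(~F.col("A").isin(skewed_keys))
                      .join(attrJoined.filter(~F.col("A").isin(skewed_keys)),
                            "A", "left_outer"))

# 3. Join the few skewed keys separately (e.g. with more shuffle partitions or
#    a salting scheme) and union the results back together.
skewed_part = (details.filter(F.col("A").isin(skewed_keys))
                      .join(attrJoined.filter(F.col("A").isin(skewed_keys)),
                            "A", "left_outer"))

result = normal_part.unionByName(skewed_part)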

Avishek Bhattacharya
  • Could you elaborate on what you mean by setting the shuffle partitions correctly, and how to do that? – Autonomous Apr 20 '18 at 07:07
  • You can do that while doing a spark-submit. You can pass --conf spark.sql.shuffle.partitions=5000, or you can set it from Spark SQL: spark.sql("set spark.sql.shuffle.partitions=5000") – Avishek Bhattacharya Apr 20 '18 at 07:09
  • Thank you. It works. Probably the real trick was to set the partitions to 5000. Now I want to see if the filtering to bring down 77M to 70K was really necessary. One more question: to break the DAG lineage, is it enough to do `uniqueA.count`, or do we have to call count as in `uniqueA.count()`? – Autonomous Apr 20 '18 at 08:04
  • In Scala, uniqueA.count and uniqueA.count() mean the same thing; not sure about Python – Avishek Bhattacharya Apr 20 '18 at 08:24
  • Ok... then mostly it is ‘count()’. – Autonomous Apr 20 '18 at 08:36
  • Actually it looks to me like breaking the DAG lineage was the most important step. I have two questions. 1. I have already read about what the DAG lineage is, but why is it important to break it? 2. Is there any more efficient method than `count()` to break the DAG lineage? (Because I don't really need to count the elements.) – Autonomous Apr 22 '18 at 17:37
  • Yes, breaking the DAG is the most important part. To break the DAG you need to perform an action on the DataFrame, like count, collect, save, etc., because every operation in Spark is lazy except actions. I found count easier than an explicit save. Not sure if there is any alternate way to achieve that – Avishek Bhattacharya Apr 22 '18 at 17:47
  • I understand Spark operations are lazy. In other words, let's say I let my operations build up, i.e. increasing the DAG lineage (however, not long enough to cause a [stackoverflow](https://stackoverflow.com/questions/34461804/stackoverflow-due-to-long-rdd-lineage)). Now, my question is: in this case, why would periodically breaking the DAG lineage be important? – Autonomous Apr 22 '18 at 17:52
  • Probably that is because if a stage at the bottom of a long DAG fails, Spark needs to re-compute the whole DAG again. This generates subsequent failures and eventually fails the job. However, if we break the DAG, a stage failure doesn't require a complete re-computation. This could be the reason, although I am not completely sure. – Avishek Bhattacharya Apr 22 '18 at 17:58
  • @Avishek Bhattacharya -- nice.. this trick also works. a vote up for ques and answer :-) – vikrant rana May 02 '19 at 15:05
  • @AvishekBhattacharya: Please help me, I am facing the same problem. Can you please tell me how you decided on the shuffle partition number, i.e. 5000? I have around 10 million records, and while performing joins it throws the error org.apache.spark.shuffle.MetadataFetchFailedException. I have given --conf spark.sql.shuffle.partitions=400 – adarsh2109 Dec 04 '19 at 07:50
  • One thing is to look at the total amount of shuffled data. Say it is 700 GB. This is in serialized form. You want to give each executor approximately 100 MB, as after deserialization the size increases in memory. So you make 700*1024/100, roughly 7K shuffle partitions. You need to run some experiments to hit the sweet spot, but more or less that's the trick I used. – Avishek Bhattacharya Dec 04 '19 at 07:57
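
A back-of-the-envelope version of that sizing heuristic (all numbers are illustrative, and spark is assumed to be the active SparkSession):

# Read the total shuffle size for the stage off the Spark UI, then aim for
# roughly 100 MB of serialized data per shuffle partition.
total_shuffle_gb = 700
target_mb_per_partition = 100
shuffle_partitions = total_shuffle_gb * 1024 // target_mb_per_partition  # ~7168
spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)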