What happens when I join two JDBC tables? The docs say Spark 2.2 will start 2 stages to read the table data and 1 stage to do the join, and that during the shuffle between stage 1 and stage 2 Spark uses HashPartitioner to partition the data. But how does Spark calculate the hash value?
The situation is that I haven't set any column in any Spark configuration, and it turns out Spark has a severe data skew problem.

Here is my Spark conf: (screenshot not preserved)

no123ff
  • Possible duplicate of [How does HashPartitioner work?](https://stackoverflow.com/q/31424396/6910411). – zero323 Feb 11 '18 at 11:43

1 Answer

The hash partitioner works off the join key. Spark hashes the join key and takes the result modulo the number of shuffle partitions (`spark.sql.shuffle.partitions`, 200 by default), so every row with the same key lands in the same partition. If you're running into data skew issues, filter the skewed keys out into a separate DataFrame. Do a broadcast join (using a hint) on the skewed-keys DataFrame, do a regular join on the un-skewed keys, and then union the two results.
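To make the "hash, then mod" idea concrete, here is a rough plain-Python sketch (not Spark code). It assumes `zlib.crc32` as a stand-in hash, since Spark uses its own hash function internally, and the key names, row counts, and the partition count of 8 are all made up for illustration. It only shows why one hot key piles up in a single partition and how you might pick out the keys to handle separately.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Hash the join key, then take it modulo the partition count."""
    # zlib.crc32 is an assumed stand-in hash, chosen because it is deterministic.
    # Spark's real hash function is different, but the hash-then-mod shape is the same.
    h = zlib.crc32(str(key).encode("utf-8"))
    return h % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows end up in each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skewed_keys(keys, threshold):
    """Return the keys that occur more than `threshold` times."""
    return {k for k, n in Counter(keys).items() if n > threshold}


# Hypothetical data: one hot key with 10,000 rows plus 1,000 distinct keys.
keys = ["hot"] * 10_000 + [f"user_{i}" for i in range(1_000)]

sizes = partition_sizes(keys, 8)
hot_partition = partition_for("hot", 8)
hot = skewed_keys(keys, 1_000)

print("rows per partition:", sizes)
print("partition holding the hot key:", hot_partition)
print("keys to split off for the broadcast join:", hot)
```

Every row with key `"hot"` maps to the same partition, so that partition is much larger than the others no matter how many partitions you configure. The keys returned by `skewed_keys` are the ones you would filter into the separate DataFrame for the broadcast join.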

Joe Widen