How does Spark HashPartitioner work with a JDBC data source?

Question

What happens when I join two JDBC tables? The docs say Spark 2.2 will start 2 stages to read the table data and 1 stage to do the join. When shuffling between stage 1 and stage 2, Spark uses HashPartitioner to partition the data. But how does Spark calculate the hash value?
My situation is that I haven't set any column in any Spark configuration, and it turns out Spark has a severe data skew problem.

Here is my Spark conf:

Possible duplicate of [How does HashPartitioner work?](https://stackoverflow.com/q/31424396/6910411). — zero323, Feb 11 '18 at 11:43

score 1 · Answer 1 · answered Feb 12 '18 at 03:12

The hash partitioner works off the join key. It hashes the join key and then takes the result modulo `spark.sql.shuffle.partitions` (200 by default), so every row with the same key value lands in the same partition. If you're running into data skew issues, filter the skewed keys out into a separate dataframe. Do a broadcast join (using a hint) on the skewed-keys dataframe, do a regular join on the un-skewed keys, and then union the two results.
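To see why a single hot key causes skew, here is a minimal pure-Python sketch of the hash-then-mod idea. It is not Spark code: Python's built-in `hash()` stands in for Spark's own hash function (so real partition indexes will differ), and the helper names `partition_for`, `partition_sizes`, and `hot_keys`, as well as the sample data, are made up for illustration.

```python
from collections import Counter


def partition_for(key: int, num_partitions: int) -> int:
    """Pick a partition index in [0, num_partitions) for a join key.

    Assumption: Python's hash() is only a stand-in for Spark's hash.
    Python's % already returns a non-negative result for a positive modulus.
    """
    return hash(key) % num_partitions


def partition_sizes(keys, num_partitions: int):
    """Count how many rows end up in each partition after the shuffle."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def hot_keys(keys, threshold: int):
    """Return the keys that occur more than `threshold` times."""
    freq = Counter(keys)
    return {k for k, c in freq.items() if c > threshold}


# Hypothetical join column: keys 0..99 appear once, key 7 appears 1000 extra times.
keys = [7] * 1000 + list(range(100))

sizes = partition_sizes(keys, 4)
skewed = hot_keys(keys, 100)

# Rows whose key is not hot spread evenly once the hot key is handled separately.
rest_sizes = partition_sizes([k for k in keys if k not in skewed], 4)

print(sizes)       # one partition holds almost all rows
print(skewed)      # candidates for the broadcast join
print(rest_sizes)  # remaining rows are balanced
```

All rows for key 7 go to one partition regardless of how many partitions there are, which is why raising the partition count alone does not help. Splitting out the keys reported by `hot_keys` corresponds to the "filter out the skewed keys" step in the answer.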
