
I was doing a basic repartition on a Dataset. I have data like below in the file test.csv:

abc,1
def,2
ghi,3
jkl,4
mno,5

I am reading it into a DataFrame like this:

import org.apache.spark.sql.functions.col

val df = spark.read.csv("test.csv")
val repart = df.repartition(5, col("_c1"))
repart.write.csv("/home/partfiles/")

After writing the data it has created 5 part files, which is correct. But only three of those part files contain data, as shown below.

part00000 - empty
part00001 - jkl,4
part00002 - empty
part00003 - ghi,3
part00004 - abc,1
            def,2
            mno,5

But since I have repartitioned based on the 2nd column and all the values are different, ideally it should create 5 different part files.
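One way to see which partition each row actually ends up in (a quick sketch reusing the df from above; the extra "pid" column name is just for illustration) is to tag the rows with spark_partition_id() before writing:

import org.apache.spark.sql.functions.{col, spark_partition_id}

// Tag each row with the id of the partition it was hashed into,
// so the mapping from _c1 values to part files can be inspected.
df.repartition(5, col("_c1"))
  .withColumn("pid", spark_partition_id())
  .show()

This confirms that only three distinct partition ids show up for the five rows.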
As per the Dataset API documentation:

Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.

Then I googled a bit and found this great article on partitioning (How does HashPartitioner work?).
As mentioned in that article, Datasets use the Murmur3 hash algorithm. So I wrote a small piece of code to get the hash values, based on this SO question (How can I use Scala's MurmurHash implementation: scala.util.MurmurHash3?).

class Murmur3 {
  import scala.util.hashing.{MurmurHash3 => MH3}

  // hash the strings "1" to "5", the same values that appear in column _c1
  val values = (1 to 5).map(_.toString)
  val result = values.map(n => (n, MH3.stringHash(n, MH3.stringSeed)))

  def resultVal(): Unit = {
    // pair each value with its hash and hash % 5 (the partition I expect it to land in)
    val dn = result.map(d => d._1 -> (d._2, d._2 % 5))
    dn.foreach(println)
  }
}

This gives me the values below. The output format is (number, (hash value, hash value % 5)):

(1,(-1672130795,0))
(2,(382493853,3))
(3,(1416458177,2))
(4,(1968144336,1))
(5,(2100358791,1))

Based on this data it should generate 4 non-empty part files, so how did only 3 part files end up with data? Please let me know how hash partitioning works in the case of a Dataset.

whoisthis

1 Answer


The mistake you've made is the assumption that hashing is done on a Scala String. In practice, Spark hashes the unsafe byte representation directly.

So the expression is equivalent to

import org.apache.spark.sql.functions.{hash, when}

Seq("1", "2", "3", "4", "5").toDF.select(
  // keep positive remainders as-is, shift negative ones up by 5
  // (Spark's non-negative modulo of the Murmur3 hash)
  when(hash($"value") % 5 > 0, hash($"value") % 5)
    .otherwise(hash($"value") % 5 + 5)
).show
// +-----------------------------------------------------------------------------------------+
// |CASE WHEN ((hash(value) % 5) > 0) THEN (hash(value) % 5) ELSE ((hash(value) % 5) + 5) END|
// +-----------------------------------------------------------------------------------------+
// |                                                                                        4|
// |                                                                                        4|
// |                                                                                        3|
// |                                                                                        1|
// |                                                                                        4|
// +-----------------------------------------------------------------------------------------+

which gives the observed distribution.
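For what it's worth, the non-negative modulo can also be computed in one step with pmod (a sketch, assuming the same spark-shell session with implicits in scope):

import org.apache.spark.sql.functions.{hash, lit, pmod}

// pmod(hash(value), 5) is the non-negative remainder used to map
// each row to one of the 5 partitions.
Seq("1", "2", "3", "4", "5").toDF
  .select(pmod(hash($"value"), lit(5)))
  .show

It should yield the same three distinct partition ids (1, 3 and 4), which is why only three of the five part files end up with data.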

Alper t. Turker