I was doing a basic repartition on a Dataset. I have data like below in the file test.csv:
abc,1
def,2
ghi,3
jkl,4
mno,5
I am reading it into a DataFrame like this:
import org.apache.spark.sql.functions.col

val df = spark.read.csv("test.csv")
val repart = df.repartition(5, col("_c1"))
repart.write.csv("/home/partfiles/")
After the write, it created 5 part files, which is correct. But only three of those part files actually contain data:
part00000 - empty
part00001 - jkl,4
part00002 - empty
part00003 - ghi,3
part00004 - abc,1
            def,2
            mno,5
But since I repartitioned on the second column, and all of its values are different, ideally it should have created 5 non-empty part files.
As per the Dataset API documentation for repartition:
Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.
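Taken literally, that description suggests something like the sketch below. This is my own illustration, not Spark's actual code; the nonNegativeMod helper mirrors what the RDD HashPartitioner does, so that a negative hash still maps to a valid partition index:

```scala
import scala.util.hashing.{MurmurHash3 => MH3}

// Map a (possibly negative) hash value to a partition index in [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

// Illustrative only: hash each key, then bucket it into one of 5 partitions.
val numPartitions = 5
(1 to 5).map(_.toString).foreach { key =>
  val h = MH3.stringHash(key, MH3.stringSeed)
  println(s"$key -> partition ${nonNegativeMod(h, numPartitions)}")
}
```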
I then searched around and found this helpful article on partitioning (How does HashPartitioner work?).
As mentioned in that article, the Dataset API uses the Murmur3Hash algorithm. So I wrote a small piece of code to compute the hash values, based on this SO question (How can I use Scala's MurmurHash implementation: scala.util.MurmurHash3?).
class Murmur3 {
  import scala.util.hashing.{MurmurHash3 => MH3}

  // Hash the strings "1" to "5" with Scala's MurmurHash3.
  val values = (1 to 5).map(_.toString)
  val result = values.map(n => (n, MH3.stringHash(n, MH3.stringSeed)))

  def resultVal(): Unit = {
    // Note: Scala's % can return a negative remainder for a negative hash.
    val dn = result.map(d => d._1 -> (d._2, d._2 % 5))
    dn.foreach(println)
  }
}
This gives me the following output, in the form (number, (hashvalue, hashvalue % 5)):
(1,(-1672130795,0))
(2,(382493853,3))
(3,(1416458177,2))
(4,(1968144336,1))
(5,(2100358791,1))
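The bucket count implied by that table can be double-checked in a few lines. These are the hash values printed above; a non-negative modulo is used so the negative hash still lands in 0..4:

```scala
// Hash values from the output above, reduced modulo 5 with a
// non-negative remainder.
val hashes = Seq(-1672130795, 382493853, 1416458177, 1968144336, 2100358791)
val buckets = hashes.map(h => ((h % 5) + 5) % 5).toSet
println(buckets)       // the distinct partition indices
println(buckets.size)  // 4 distinct buckets
```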
Based on this data, it should generate 4 non-empty part files. So how did only 3 part files end up with data? Please explain how hash partitioning works in the case of a Dataset.