I understand that the partitionBy
function partitions my data. If I use rdd.partitionBy(100),
it will partition my data by key into 100 parts, i.e. data associated with the same key will be grouped together in the same partition.
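To show what I mean, here is my mental model of the default (hash) partitioner as a pure-Python sketch; no Spark needed to run it, and `partition_of`/`buckets` are my own names, not Spark APIs:

```python
# Sketch of what I believe rdd.partitionBy(100) does under the hood:
# each record goes to partition hash(key) % num_partitions, so all
# records sharing a key land in the same partition (and one
# partition can hold many different keys).
num_partitions = 100

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

def partition_of(key, n=num_partitions):
    return hash(key) % n

buckets = {}
for key, value in records:
    buckets.setdefault(partition_of(key), []).append((key, value))
```

After this, both ("a", 1) and ("a", 3) sit in the same bucket, since they share the key "a".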
- Is my understanding correct?
- Is it advisable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
- What if my data is not in (key, value) format? Can I still use this function?
- Let's say my data is serial_number_of_student,student_name. In this case, can I partition my data by student_name instead of the serial_number?
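For that last bullet, the workaround I'm considering is to re-key the records first, so the name becomes the key. A plain-Python sketch of what I'd expect `rdd.map(lambda r: (r[1], r)).partitionBy(4)` to do (the data and names here are made up):

```python
num_partitions = 4

# Rows are (serial_number, student_name) tuples.
rows = [(1, "alice"), (2, "bob"), (3, "alice"), (4, "carol")]

# Step 1: make student_name the key, like
# rdd.map(lambda r: (r[1], r)) would in PySpark.
keyed_by_name = [(name, (serial, name)) for serial, name in rows]

# Step 2: hash-partition on the new key, like .partitionBy(4).
partitions = {}
for name, row in keyed_by_name:
    partitions.setdefault(hash(name) % num_partitions, []).append(row)
```

If this is right, both "alice" rows end up in the same partition even though their serial numbers differ. Is that the idiomatic way to do it?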