I am a bit confused about the number of partitions and their size in Spark. Some tutorials say the number of partitions equals the number of HDFS blocks (64 MB or 128 MB), while others say it equals the number of cores in the cluster. So if my data is 1 GB in size, stored in HDFS with a 128 MB block size, and the cluster has, suppose, 10 cores, what will be the number of partitions in this case? Is it 8 or 10?

Thanks in advance
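For reference, the block arithmetic behind the "8" in the question can be sketched in plain Python (assuming a 1 GiB file and the 128 MiB HDFS block size stated above):

```python
import math

file_size_mib = 1024    # 1 GiB file, expressed in MiB
block_size_mib = 128    # HDFS block size from the question

# HDFS splits the file into ceil(size / block_size) blocks
num_blocks = math.ceil(file_size_mib / block_size_mib)
print(num_blocks)  # 8
```

In practice you can check the actual partition count of an RDD with `rdd.getNumPartitions()` after loading the file.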
Asked by A B
- When you import from HDFS, it is the number of blocks in HDFS. – Indrajit Swain Feb 02 '18 at 06:50
- Referring to https://techmagie.wordpress.com/2015/12/19/understanding-spark-partitioning/, which mentions that for sc.textFile(), the partition count is sc.defaultParallelism or the number of file blocks, whichever is greater. So for my scenario, is the answer 10? – A B Feb 02 '18 at 09:28
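The rule quoted from the blog post in the comment above can be sketched as follows. This is a sketch of that quoted rule only, assuming defaultParallelism is 10 on a 10-core cluster (defaultParallelism depends on the cluster manager and configuration); note that in Spark's API, sc.textFile()'s minPartitions argument actually defaults to min(defaultParallelism, 2), so the block count usually dominates:

```python
import math

# 8 blocks for a 1 GiB file with 128 MiB HDFS blocks
num_blocks = math.ceil(1024 / 128)

# Assumption: defaultParallelism = 10 on the 10-core cluster in the question
default_parallelism = 10

# Rule as quoted from the linked blog post: whichever is greater
num_partitions = max(default_parallelism, num_blocks)
print(num_partitions)  # 10 under this quoted rule
```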