Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (spark 1.6) and didn't found a method for that, or am I just missed it? (In case of JavaRDD there's a getNumPartitions() method.)
5 Answers
You need to call getNumPartitions()
on the DataFrame's underlying RDD, e.g., df.rdd.getNumPartitions()
. In the case of Scala, this is a parameterless method: df.rdd.getNumPartitions
.
![](../../users/profiles/4601931.webp)
- 4,153
- 4
- 21
- 36
-
3minus the (), so not entirely correct - at least not with SCALA mode – thebluephantom Aug 07 '18 at 19:36
-
3Does this cause a *conversion* (_expensive_) from `DF` to `RDD` ? – StephenBoesch Jan 16 '19 at 20:58
-
4This IS expensive – StephenBoesch Jan 19 '19 at 06:34
-
@javadba Do you have an answer that doesn't appeal to the RDD API? – user4601931 Jun 02 '19 at 02:21
-
No I do not: and it is unfortunate that spark does not manage the metadata better along the lines of hive. Your answer is correct but also is my observation that this is costly. – StephenBoesch Jun 02 '19 at 03:07
-
And about Hive tables, make sense to convert to RDD to get number of partitions? Supposing same result in Hive `show patitions tatbleName`, but `rdd.getNumPartions` better performance (and more elegant). – Peter Krauss Nov 24 '19 at 16:14
dataframe.rdd.partitions.size
is another alternative apart from df.rdd.getNumPartitions()
or df.rdd.length
.
let me explain you this with full example...
val x = (1 to 10).toList
val numberDF = x.toDF(“number”)
numberDF.rdd.partitions.size // => 4
To prove that how many number of partitions we got with above... save that dataframe as csv
numberDF.write.csv(“/Users/Ram.Ghadiyaram/output/numbers”)
Here is how the data is separated on the different partitions.
Partition 00000: 1, 2
Partition 00001: 3, 4, 5
Partition 00002: 6, 7
Partition 00003: 8, 9, 10
Update :
@Hemanth asked a good question in the comment... basically why number of partitions are 4 in above case
Short answer : Depends on cases where you are executing. since local[4] I used, I got 4 partitions.
Long answer :
I was running above program in my local machine and used master as local[4] based on that it was taking as 4 partitions.
val spark = SparkSession.builder()
.appName(this.getClass.getName)
.config("spark.master", "local[4]").getOrCreate()
If its spark-shell in master yarn I got the number of partitions as 2
example : spark-shell --master yarn
and typed same commands again
scala> val x = (1 to 10).toList
x: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val numberDF = x.toDF("number")
numberDF: org.apache.spark.sql.DataFrame = [number: int]
scala> numberDF.rdd.partitions.size
res0: Int = 2
- here 2 is default parllelism of spark
- Based on hashpartitioner spark will decide how many number of partitions to distribute. if you are running in
--master local
and based on yourRuntime.getRuntime.availableProcessors()
i.e.local[Runtime.getRuntime.availableProcessors()]
it will try to allocate those number of partitions. if your available number of processors are 12 (i.e.local[Runtime.getRuntime.availableProcessors()])
and you have list of 1 to 10 then only 10 partitions will be created.
NOTE:
If you are on a 12-core laptop where I am executing spark program and by default the number of partitions/tasks is the number of all available cores i.e. 12. that means
local[*]
ors"local[${Runtime.getRuntime.availableProcessors()}]")
but in this case only 10 numbers are there so it will limit to 10
keeping all these pointers in mind I would suggest you to try on your own
![](../../users/profiles/647053.webp)
- 30,194
- 13
- 83
- 107
-
Thanks for the great answer. I am curious why a list of 10 numbers was divided into 4 partitions when converted to a DF. Can you kindly provide some explanation, please? – Hemanth Jul 30 '19 at 17:47
convert to RDD then get the partitions length
DF.rdd.partitions.length
![](../../users/profiles/199148.webp)
- 7,136
- 9
- 36
- 66
![](../../users/profiles/7800693.webp)
- 227
- 1
- 3
val df = Seq(
("A", 1), ("B", 2), ("A", 3), ("C", 1)
).toDF("k", "v")
df.rdd.getNumPartitions
![](../../users/profiles/7082221.webp)
- 3,246
- 16
- 30
-
Please read this [how-to-answer](http://stackoverflow.com/help/how-to-answer) for providing quality answer. – thewaywewere Apr 22 '17 at 17:35
One more Interesting way to get number of partitions is 'using mapPartitions' transformation. Sample Code -
val x = (1 to 10).toList
val numberDF = x.toDF()
numberDF.rdd.mapPartitions(x => Iterator[Int](1)).sum()
Spark experts are welcome to comment on its performance.
![](../../users/profiles/11420984.webp)
- 873
- 1
- 6
- 13