Methods taken into consideration (Spark 2.2.1):

- `DataFrame.repartition` (the two implementations that take `partitionExprs: Column*` parameters)
- `DataFrameWriter.partitionBy`

Note: This question doesn't ask about the difference between these methods.
From the docs of `partitionBy`:

> If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a `Dataset` by year and then month, the directory layout would look like:
>
> - year=2016/month=01/
> - year=2016/month=02/
From this, I infer that the order of column arguments will decide the directory layout; hence it is relevant.
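To make that inference concrete, here is a minimal sketch (the data, column names, and output paths are all made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-order").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame with year/month columns
val df = Seq((2016, 1, "a"), (2016, 2, "b")).toDF("year", "month", "value")

// Argument order maps to directory nesting:
// /tmp/by_year_month/year=2016/month=1/ and .../year=2016/month=2/
df.write.partitionBy("year", "month").parquet("/tmp/by_year_month")

// Reversing the order nests the other way:
// /tmp/by_month_year/month=1/year=2016/ etc.
df.write.partitionBy("month", "year").parquet("/tmp/by_month_year")
```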
From the docs of `repartition`:

> Returns a new `Dataset` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as number of partitions. The resulting `Dataset` is hash partitioned.
As per my current understanding, `repartition` decides the degree of parallelism in handling the `DataFrame`. With this definition, the behaviour of `repartition(numPartitions: Int)` is straightforward, but the same can't be said about the other two implementations of `repartition` that take `partitionExprs: Column*` arguments.
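For concreteness, this is the overload I mean, using the hypothetical `df` from the sketch above:

```scala
import org.apache.spark.sql.functions.col

// The partition count comes from spark.sql.shuffle.partitions (200 by
// default); rows are assigned to partitions by hashing the expression
// values, so all rows with the same (year, month) land in one partition.
val repartitioned = df.repartition(col("year"), col("month"))
println(repartitioned.rdd.getNumPartitions) // 200 unless the config was changed
```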
All things said, my doubts are the following:
- Like the `partitionBy` method, is the order of column inputs relevant in the `repartition` method too? If the answer to the above question is:
  - No: does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a SQL query with `GROUP BY` on the same columns?
  - Yes: please explain the behaviour of the `repartition(columnExprs: Column*)` method.
- What is the relevance of having both `numPartitions: Int` and `partitionExprs: Column*` arguments in the third implementation of `repartition`? (A sketch of that overload follows below.)
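For reference, a sketch of that third overload, again on the hypothetical `df` from above:

```scala
// Fixes the partition count explicitly instead of falling back on
// spark.sql.shuffle.partitions, while still hash partitioning by the
// given column expressions.
val fixed = df.repartition(10, col("year"), col("month"))
println(fixed.rdd.getNumPartitions) // 10
```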