DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but I have spent some time on Parquet support in Spark 2.2+ and hope my answer can help us all get closer to the right answer.
Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?
I use Spark 2.3.0-SNAPSHOT that I built today straight from master.
The parquet data source format is handled by ParquetFileFormat, which is a FileFormat. If I'm correct, the reading part is handled by the buildReaderWithPartitionValues method (which overrides FileFormat's).
buildReaderWithPartitionValues is used exclusively when the FileSourceScanExec physical operator is requested for so-called input RDDs (that are actually a single RDD) to generate internal rows when WholeStageCodegenExec is executed.
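As a quick sanity check of the above, you can review the physical plan of a simple parquet query in spark-shell. This is a minimal sketch: the path and the id column are hypothetical, and the plan output below is only a rough approximation.

val df = spark.read.parquet("s3a://my-bucket/my-table")  // hypothetical dataset
df.select("id").where("id > 100").explain()
// The plan should contain a FileScan parquet node (i.e. FileSourceScanExec
// backed by ParquetFileFormat) marked with * for whole-stage codegen, roughly:
// *Project [id#0L]
// +- *Filter (isnotnull(id#0L) && (id#0L > 100))
//    +- *FileScan parquet [id#0L] ... PushedFilters: [IsNotNull(id), GreaterThan(id,100)] ...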
With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.
When you look at the following line you can be assured that we're on the right track.
// Try to push down filters when filter push-down is enabled.
That code path depends on the spark.sql.parquet.filterPushdown Spark property, which is turned on by default.

spark.sql.parquet.filterPushdown: Enables Parquet filter push-down optimization when set to true.
That leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate iff the filters are defined.
if (pushed.isDefined) {
ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}
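For anyone who wants to exercise that code path themselves, here is a minimal sketch (the dataset path is hypothetical); with the property at its default of true, the filter below should end up in ParquetInputFormat.setFilterPredicate:

import spark.implicits._

// filterPushdown is true by default; set explicitly here only for clarity.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

val events = spark.read.parquet("s3a://my-bucket/events")  // hypothetical dataset
// The Catalyst filter should be translated to a parquet-level predicate
// and pushed down to the parquet readers.
events.where($"id" > 100).count()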
The code gets a bit more interesting later, when the filters are used as the code falls back to parquet-mr (rather than using the so-called vectorized parquet decoding reader). That's the part I don't really understand (beyond what I can see in the code).
Please note that the vectorized parquet decoding reader is controlled by the spark.sql.parquet.enableVectorizedReader Spark property, which is turned on by default.
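If you want to force the parquet-mr code path (the one where I can see the filters actually being used), a sketch of flipping the property off:

// Sketch: disabling the vectorized reader should make Spark fall back
// to parquet-mr for decoding (the property is true by default).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")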
TIP: To know which branch of the if expression is used, enable the DEBUG logging level for the org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.
To see all the pushed-down filters, you could turn on the INFO logging level for the org.apache.spark.sql.execution.FileSourceScanExec logger. You should see the following in the logs:
INFO Pushed Filters: [pushedDownFilters]
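Both tips can be applied programmatically in spark-shell; this is a sketch using the log4j 1.2 API that Spark 2.x ships with (the equivalent entries in conf/log4j.properties should work too):

import org.apache.log4j.{Level, Logger}

// DEBUG to see which branch of the if expression is taken,
// INFO to see the "Pushed Filters" line quoted above.
Logger.getLogger("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.sql.execution.FileSourceScanExec").setLevel(Level.INFO)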
I do hope that, even if it's not close to being a definitive answer, it has helped a little, and that someone picks up where I left off to make it one soon. Hope dies last :)