10

Just wondering if Parquet predicate pushdown also works on S3, not only on HDFS, specifically if we use Spark (non-EMR).

Further explanation would be helpful, since it may require some understanding of distributed file systems.

rendybjunior
  • 522
  • 6
  • 19
  • This AWS release guide https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html says S3 Select does pushdown filtering only on CSV, JSON and gzip files. It does not mention Parquet files. I was looking around to see if Snappy-compressed Parquet is supported. – Dorjee May 17 '20 at 17:00

5 Answers

14

I was wondering this myself, so I just tested it out. We use EMR clusters with Spark 1.6.1.

  • I generated some dummy data in Spark and saved it as a Parquet file locally as well as on S3.
  • I created multiple Spark jobs with different kinds of filters and column selections, and ran each test once against the local file and once against the S3 file (see the sketch after this list).
  • I then used the Spark History Server to see how much data each job read as input.
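
Roughly, the test looked like this (a sketch only: the bucket name, paths and columns are placeholders, and it is written against the newer SparkSession API rather than the Spark 1.6 SQLContext we used at the time):

// Sketch of the test: generate dummy data, write Parquet locally and to S3,
// then run the same filtered/pruned job against both copies.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pushdown-test").getOrCreate()
import spark.implicits._

// Generate some dummy data.
val df = spark.range(0, 10000000L).select(
  $"id",
  (rand() * 100).cast("int").as("value"),
  concat(lit("row_"), $"id".cast("string")).as("label"))

df.write.mode("overwrite").parquet("file:///tmp/dummy_parquet")
df.write.mode("overwrite").parquet("s3a://my-test-bucket/dummy_parquet")  // placeholder bucket

// Same job against both copies: a column selection plus a filter.
// Compare the "Input Size" reported in the Spark History Server for each run.
Seq("file:///tmp/dummy_parquet", "s3a://my-test-bucket/dummy_parquet").foreach { path =>
  val n = spark.read.parquet(path)
    .select("value")          // column pruning
    .filter($"value" < 10)    // predicate that should be pushed down
    .count()
  println(s"$path -> $n rows")
}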

Results:

  • For the local Parquet file: the column selection and filters were pushed down to the read, as the input size was reduced whenever the job contained filters or a column selection.
  • For the S3 Parquet file: the input size was always the same as for the job that processed all of the data. None of the filters or column selections were pushed down to the read; the Parquet file was always loaded completely from S3, even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.

I will add more details about the tests and results when I have time.

user1355682
  • 171
  • 1
  • 4
5

Yes. Filter pushdown does not depend on the underlying file system. It only depends on the spark.sql.parquet.filterPushdown setting and the type of filter (not all filters can be pushed down).

See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
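
As a quick sketch (the path and column names below are placeholders), you can verify the setting and see which predicates were handed to the Parquet reader in the physical plan:

// Illustrative check: enable the setting and look for PushedFilters in the plan.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("filter-pushdown-check")
  .config("spark.sql.parquet.filterPushdown", "true")  // on by default
  .getOrCreate()
import spark.implicits._

val df = spark.read.parquet("s3a://some-bucket/some_table")  // placeholder path
val filtered = df.filter($"id" > 100).select("id", "value")

// The physical plan lists the predicates pushed to the reader, e.g.
// "PushedFilters: [IsNotNull(id), GreaterThan(id,100)]".
filtered.explain()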

Daniel Darabos
  • 25,678
  • 9
  • 94
  • 106
  • 8
    According to Emily Curtin at Spark Summit, it does depend on the "file system" (in this case object store), as S3 doesn't support random access. https://youtu.be/_0Wpwj_gvzg?t=1307 – andresp Sep 02 '17 at 14:08
  • Thanks! And another upvoted answer also says I'm wrong. I looked up the code again in Spark 2.2.0 and it still does not seem to depend on the file system. But it could somehow indirectly depend on it. – Daniel Darabos Sep 04 '17 at 12:36
  • 2
    But S3 does have random access: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html#ExampleGetRangeRequestHeaders And Hortonworks talks about filter pushdown on S3: https://hortonworks.github.io/hdp-aws/s3-spark/index.html#reading-orc-and-parquet-datasets – Daniel Darabos Sep 04 '17 at 12:36
  • Interesting. Maybe S3 support for random access was added after the talk? – andresp Sep 04 '17 at 13:23
  • I think it's been there for a while https://stackoverflow.com/questions/36436057/s3-how-to-do-a-partial-read-seek-without-downloading-the-complete-file – rendybjunior Mar 14 '18 at 04:25
2

Here are the keys I'd recommend for S3A work:

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false

spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000

spark.sql.hive.metastorePartitionPruning true

For committing the work, use the S3A "zero-rename committer" (Hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe.
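
As a sketch, one way to apply these keys per application (they can equally go into spark-defaults.conf, as in this answer, or be passed with --conf on spark-submit):

// Setting the recommended keys on the SparkSession builder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-tuning")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()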

stevel
  • 9,897
  • 1
  • 31
  • 43
  • What's the best way to set these values? I use Spark in an AWS EMR Cluster and I have set these (summary-metadata and algorithm version) in my Scala spark script. And I have been using filterPushdown and mergeSchema only as options for Parquet read/write. But I want to somehow get rid of the _$folder$ files which are written to S3. – V. Samma Feb 10 '17 at 13:03
  • EMR? beyond my knowledge, I'm afraid. – stevel Jun 25 '20 at 16:49
  • EMR is Elastic MapReduce service on AWS. But I'm not even sure if this question is relevant. I haven't used it for years now and your reply wasn't the fastest either :D – V. Samma Jun 26 '20 at 17:06
  • Sorry, I meant "I don't know anything about EMR internals and especially its closed source connector to s3". And yeah, not that timely. That was from spark.defaults conf file BTW – stevel Jun 28 '20 at 18:23
1

Recently I tried this with Spark 2.4, and it seems that predicate pushdown works with S3.

This is the Spark SQL query:

explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;

And here is the relevant part of the output:

PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...

This clearly shows that PushedFilters is not empty.

Note: the table used was created on top of AWS S3.
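
The same check can also be done from the DataFrame API; here is a sketch (the site value below is a placeholder, not the original data):

// Inspect the executed plan for PartitionFilters / PushedFilters.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("plan-check")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val plan = spark.table("default.my_table")
  .where($"month" === "2009-04" && $"site" === "http://example.com/page.html")
  .limit(100)
  .queryExecution
  .executedPlan

// The file scan node in the printed plan carries the PartitionFilters
// and PushedFilters entries shown above.
println(plan.toString)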

Ashish
  • 5,535
  • 2
  • 22
  • 25
0

Spark uses the same Parquet libraries over HDFS and S3, so the same logic works. (And in Spark 1.6 they've added an even faster shortcut for flat-schema Parquet files.)

Arnon Rotem-Gal-Oz
  • 23,410
  • 2
  • 43
  • 66