One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns and skip the rest.

Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.

Does anyone know whether Spark's default Parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.

Jacek Laskowski
conradlee
    I ask this because I've noticed that some of the features that Spark/Parquet advertise aren't properly implemented yet, such as the predicate pushdown that enables only certain partitions to be read. I found that surprising and started wondering how much of Parquet/Spark actually works as advertised. – conradlee Sep 26 '16 at 12:40

4 Answers


This needs to be broken down:

  1. Does the Parquet code get the predicates from Spark? Yes.
  2. Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
  3. Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8 on the classpath and must set the property spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access.

Hadoop 2.7 and earlier handle the aggressive seek() around the file badly: they always initiate a GET from the current offset to the end of the file, get surprised by the next seek, have to abort that connection, open a new TCP/HTTPS 1.1 connection (slow, CPU heavy), and then do it all again, repeatedly. Random IO hurts when bulk loading things like .csv.gz, but is critical to getting ORC/Parquet performance.
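To make the difference concrete, here is a toy cost model in Python. It is only a sketch of the behaviour described above, not the S3A connector's actual code: the function name `simulate`, the policy labels, and the example offsets are all made up for illustration, and a real connector has further optimizations (such as skipping forward over short gaps) that the model ignores.

```python
def simulate(file_size, reads, policy):
    """Toy cost model for a seek-heavy reader over an HTTP object store.

    reads  : list of (offset, length) tuples, in the order they are issued
    policy : "sequential" opens a GET from the offset to end-of-file and
             aborts the connection when the next read is not contiguous;
             "random" issues one bounded ranged GET per read.
    All names and policies here are illustrative assumptions.
    """
    gets = aborts = bytes_requested = 0
    stream_pos = None  # next byte the open stream would deliver; None = nothing open
    for offset, length in reads:
        if policy == "random":
            # One small ranged GET per read: nothing is left open to abort.
            gets += 1
            bytes_requested += length
            continue
        if stream_pos == offset:
            # Contiguous read: keep draining the stream that is already open.
            stream_pos += length
            continue
        if stream_pos is not None and stream_pos < file_size:
            # A stream with unread bytes is open: it has to be aborted.
            aborts += 1
        gets += 1
        bytes_requested += file_size - offset  # GET offset..end-of-file
        stream_pos = offset + length
    return {"gets": gets, "aborts": aborts, "bytes_requested": bytes_requested}


# A columnar-style access pattern: the footer first, then two column chunks.
reads = [(992, 8), (100, 50), (600, 50)]
print(simulate(1000, reads, "sequential"))
print(simulate(1000, reads, "random"))
```

In this model the sequential policy asks the store for far more bytes than the reader consumes and throws away a connection on each backward or long forward seek, which is the pattern the paragraph above describes.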

You don't get the speedup on Hadoop 2.7's hadoop-aws JAR. If you need it, you need to update hadoop*.jar and dependencies, or build Spark from scratch against Hadoop 2.8.

Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections &c. Helps you work out what's going on.

2018-04-13 warning: Do not try to drop the Hadoop 2.8+ hadoop-aws JAR on the classpath along with the rest of the hadoop-2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.

stevel

DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but have spent some time on parquet support in Spark 2.2+ and am hoping that my answer can help us all to get closer to the right answer.


Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?

I use Spark 2.3.0-SNAPSHOT that I built today right from the master.

parquet data source format is handled by ParquetFileFormat which is a FileFormat.

If I'm correct, the reading part is handled by buildReaderWithPartitionValues method (that overrides the FileFormat's).

buildReaderWithPartitionValues is used exclusively when FileSourceScanExec physical operator is requested for so-called input RDDs that are actually a single RDD to generate internal rows when WholeStageCodegenExec is executed.

With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.

When you look at the following line, you can be assured that we're on the right track.

// Try to push down filters when filter push-down is enabled.

That code path depends on spark.sql.parquet.filterPushdown Spark property that is turned on by default.

spark.sql.parquet.filterPushdown: Enables Parquet filter push-down optimization when set to true.

That leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate iff the filters are defined.

if (pushed.isDefined) {
  ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}

The code gets more interesting a bit later, where the filters are used once the code falls back to parquet-mr (rather than using the so-called vectorized parquet decoding reader). That's the part I don't really understand (except what I can see in the code).

Please note that the vectorized parquet decoding reader is controlled by spark.sql.parquet.enableVectorizedReader Spark property that is turned on by default.

TIP: To know what part of the if expression is used, enable DEBUG logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.

In order to see all the pushed-down filters, you could turn on the INFO logging level for the org.apache.spark.sql.execution.FileSourceScanExec logger. You should see the following in the logs:

INFO Pushed Filters: [pushedDownFilters]

I do hope that, even if it's not close to being a definitive answer, it has helped a little and someone picks it up where I left off to make it one soon. Hope dies last :)

Jacek Laskowski

Spark's Parquet reader is just like any other InputFormat:

  1. None of the InputFormats has anything special for S3. The input formats can read from LocalFileSystem, HDFS and S3; no special optimization is done for any of them.

  2. Depending on the columns you ask for, the Parquet InputFormat will selectively read those columns for you.

  3. If you want to be dead sure (although predicate pushdown works in the latest Spark version), manually select the columns and write the transformations and actions yourself, instead of depending on SQL.
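As a rough illustration of what "selectively read the columns" means at the file level, here is a toy Python sketch. It is not the Parquet format or the parquet-mr reader: it borrows only the general layout of a Parquet file, which ends with a metadata footer, a 4-byte little-endian footer length, and a magic number (`PAR1` in real files). The `TOY1` magic, the JSON footer (real Parquet metadata is Thrift-encoded), and the helper names `write_toy_file` and `read_columns` are assumptions made up for this example.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical magic bytes; real Parquet files end with b"PAR1"


def write_toy_file(columns):
    """Serialize {name: bytes} as column chunks + JSON footer + footer length + magic."""
    buf = io.BytesIO()
    index = {}
    for name, data in columns.items():
        index[name] = (buf.tell(), len(data))  # where this column chunk lives
        buf.write(data)
    footer = json.dumps(index).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # 4-byte little-endian footer length
    buf.write(MAGIC)
    return buf.getvalue()


def read_columns(f, wanted):
    """Read only the wanted columns from a seekable file; return (data, bytes_read)."""
    f.seek(-8, io.SEEK_END)  # the last 8 bytes hold footer length + magic
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a toy columnar file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    f.seek(-(8 + footer_len), io.SEEK_END)
    index = json.loads(f.read(footer_len))
    bytes_read = 8 + footer_len
    result = {}
    for name in wanted:
        offset, length = index[name]
        f.seek(offset)  # jump straight to the chunk, skipping other columns
        result[name] = f.read(length)
        bytes_read += length
    return result, bytes_read


blob = write_toy_file({"a": b"x" * 1000, "b": b"y" * 10, "c": b"z" * 1000})
data, bytes_read = read_columns(io.BytesIO(blob), ["b"])
print(len(blob), bytes_read)  # the reader touches far fewer bytes than the file holds
```

On a local disk or HDFS each `seek` is cheap. On S3 each one has to become an HTTP ranged GET, which is why the behaviour of the filesystem connector matters as much as that of the input format.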

KrazyGautam
    Thanks for the answer, but even after reading it, it's still unclear whether recent spark distributions truly support predicate pushdown. I'm looking for an answer that either dives down into the particular implementation of the input reader invoked when reading parquet from s3, or performs an empirical test. See http://stackoverflow.com/a/41609999/189336 -- there's a surprising result indicating filter pushdown is broken on s3. – conradlee Mar 21 '17 at 09:59
    Pay attention to Spark versions. There were problems with predicate pushdown in earlier versions, but starting from 2.something (and 2.2 for sure) this was fixed. – Igor Berman Nov 24 '17 at 12:43

No, predicate pushdown is not fully supported. This, of course, depends on:

  • Specific use case
  • Spark version
  • S3 connector type and version

In order to check your specific use case, you can enable the DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" during S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:

17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"
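If the log is long, a small script can pull out the Content-Range headers and flag responses that covered the whole object. This is a minimal sketch that assumes the wire-log line format shown above; the helper name `summarize_ranges` is made up for illustration.

```python
import re

# Matches headers such as: Content-Range: bytes 0-7472093/7472094
CONTENT_RANGE = re.compile(r"Content-Range: bytes (\d+)-(\d+)/(\d+)")


def summarize_ranges(log_text):
    """Return one (first_byte, last_byte, object_size, whole_object) tuple per header."""
    summary = []
    for match in CONTENT_RANGE.finditer(log_text):
        first, last, total = map(int, match.groups())
        # A response spanning byte 0 through the last byte returned the entire file.
        whole_object = first == 0 and last == total - 1
        summary.append((first, last, total, whole_object))
    return summary


sample = (
    '17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << '
    '"Content-Range: bytes 0-7472093/7472094[\\r][\\n]"'
)
print(summarize_ranges(sample))  # [(0, 7472093, 7472094, True)]
```

For the log excerpt above the flag is True: the response covered the entire 7,472,094-byte file, so no selective read took place for that request.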

Here's an example of an issue report that was opened recently because Spark 2.1 was unable to calculate COUNT(*) of all the rows in a dataset based on the metadata stored in the Parquet file: https://issues.apache.org/jira/browse/SPARK-21074

Michael Spector
  • Michael, it's not so much Spark as the version of Hadoop JARs bundled with it; those in HDP and CDH do "lazy" seeks, and, if you enable random IO, highly efficient columnar data reads. Regarding SPARK-21074, that JIRA awaits your experience after upgrading; if you don't get an answer it'll probably get closed as a "fixed/cannot reproduce" – stevel Dec 07 '17 at 15:09