TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform well in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files.

How can I read such a large number of files efficiently?


1 Answer


When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());
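
AvroIO works the same way; as a sketch, assuming a hypothetical Avro-generated class MyRecord:

PCollection<MyRecord> records = p.apply(AvroIO.read(MyRecord.class)  // MyRecord is a made-up Avro class
    .from("gs://some-bucket/many/avro-files/*")
    .withHintMatchesManyFiles());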

Using this hint causes the transforms to execute in a way optimized for reading a large number of files: the number of files that can be read is practically unlimited, and the pipeline will most likely run faster, more cheaply, and more reliably than without the hint.

However, it may perform worse than without the hint if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred files).

Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll() respectively. These are more flexible and scalable versions of read() that take a PCollection<String> of filepatterns (where each String is a filepattern), with the same caveat: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
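
For illustration, a minimal sketch of using TextIO.readAll() directly on a PCollection of filepatterns (the specific bucket paths are made up):

PCollection<String> filepatterns = p.apply(
    Create.of("gs://some-bucket/many/files/*",    // each element is one filepattern
              "gs://some-bucket/other/files/*"));
PCollection<String> lines = filepatterns.apply(TextIO.readAll());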
