TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform well in current Apache Beam runners when reading a filepattern that expands into a very large number of files, for example 1M files.

How can I read such a large number of files efficiently?


1 Answer


When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());
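
AvroIO works the same way; as a sketch, assuming a hypothetical Avro-generated class MyRecord:

PCollection<MyRecord> records = p.apply(AvroIO.read(MyRecord.class)  // MyRecord is a made-up Avro class
    .from("gs://some-bucket/many/avro-files/*")
    .withHintMatchesManyFiles());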

Using this hint causes the transforms to execute in a way optimized for reading a large number of files: the number of files that can be read is practically unlimited, and the pipeline will most likely run faster, more cheaply, and more reliably than without the hint.

However, it may perform worse than without the hint if the filepattern actually matches only a small number of files (for example, a few dozen or a few hundred files).

Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll() respectively. These are more flexible and scalable versions of read() that take a PCollection<String> of filepatterns (where each String is a filepattern), with the same caveat: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
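
For illustration, a minimal sketch of using TextIO.readAll() directly on a PCollection of filepatterns (the specific bucket paths are made up):

PCollection<String> filepatterns = p.apply(
    Create.of("gs://some-bucket/many/files/*",    // each element is one filepattern
              "gs://some-bucket/other/files/*"));
PCollection<String> lines = filepatterns.apply(TextIO.readAll());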
