
I tried to read a subset of columns from a 'table' using spark_read_parquet:

temp <- spark_read_parquet(sc, name = "mytable", columns = c("Col1", "Col2"),
                           path = "/my/path/to/the/parquet/folder")

But I got the error:

Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....

Is my syntax right? I tried googling for a real code example that uses the columns argument, but couldn't find one.

(And my apologies in advance... I don't really know how to give a reproducible example that involves Spark and the cloud.)


1 Answer


TL;DR This is not how the columns argument works. When used like this, it renames the columns, so its length has to match the number of columns in the input (here 54, not 2).
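For illustration only, a minimal sketch of what columns is actually for (the path, table name, and new names below are hypothetical, and the file is assumed to have exactly two columns): it supplies one new name per existing column.

# Hypothetical two-column file; columns renames the columns,
# it does not select a subset of them.
renamed <- spark_read_parquet(
  sc, name = "mytable_renamed", path = "/tmp/two_column_parquet",
  columns = c("new_name_1", "new_name_2")
)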

To read just a subset of columns, read the data lazily and select afterwards (please note memory = FALSE; it is crucial for this to work correctly):

spark_read_parquet(
  sc, name = "mytable", path = "/tmp/foo", 
  memory = FALSE
) %>% select(Col1, Col2) 

optionally followed by

... %>% 
  sdf_persist()
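Put together, a minimal sketch (the path and column names are placeholders) that reads lazily, prunes the columns, and only then caches the pruned result:

library(sparklyr)
library(dplyr)

subset_tbl <- spark_read_parquet(
  sc, name = "mytable", path = "/tmp/foo",
  memory = FALSE                    # keep the read lazy so pruning happens before caching
) %>%
  select(Col1, Col2) %>%            # project only the columns that are needed
  sdf_persist(storage.level = "MEMORY_ONLY")  # cache just the pruned result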

If you have a character vector, you can use rlang:

library(rlang)

cols <- c("Col1", "Col2")

spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
  select(!!!lapply(cols, parse_quosure))
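On newer rlang versions parse_quosure() has been deprecated in favour of parse_quo(); an equivalent sketch of the same idea is to splice the names in as symbols instead:

library(rlang)

cols <- c("Col1", "Col2")

spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
  select(!!!syms(cols))    # convert the character vector to symbols and splice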
  • That's great. So, is there a way to load this into memory while reading the subset of columns? – qoheleth Aug 17 '18 at 22:49
  • This is where you use `sdf_persist`, or, if you want to restrict yourself to memory, `sdf_persist(storage.level = "MEMORY_ONLY")`. It is lazy, but otherwise does the same thing. However, please keep in mind that indiscriminate caching can degrade performance more often than it improves it. – zero323 Aug 18 '18 at 02:01