
I tried to read a subset of columns from a 'table' using spark_read_parquet:

temp <- spark_read_parquet(sc, name = "mytable", columns = c("Col1", "Col2"),
                           path = "/my/path/to/the/parquet/folder")

But I got the error:

Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....

Is my syntax right? I tried googling for a real code example that uses the columns argument, but couldn't find one.

(And my apologies in advance... I don't really know how to give a reproducible example that involves Spark and the cloud.)


1 Answer


TL;DR This is not how the columns argument works. When used like this, it renames the columns, so its length has to match the number of columns in the input (here 54, not 2).
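For illustration only, a minimal sketch of what columns is actually for (the path, table name, and new names below are hypothetical, and the file is assumed to have exactly two columns): it supplies one new name per existing column.

# Hypothetical two-column file; columns renames the columns,
# it does not select a subset of them.
renamed <- spark_read_parquet(
  sc, name = "mytable_renamed", path = "/tmp/two_column_parquet",
  columns = c("new_name_1", "new_name_2")
)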

To read just a subset of columns, read the data lazily and select afterwards (please note memory = FALSE; it is crucial for this to work correctly):

spark_read_parquet(
  sc, name = "mytable", path = "/tmp/foo", 
  memory = FALSE
) %>% select(Col1, Col2) 

optionally followed by

... %>% 
  sdf_persist()
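Put together, a minimal sketch (the path and column names are placeholders) that reads lazily, prunes the columns, and only then caches the pruned result:

library(sparklyr)
library(dplyr)

subset_tbl <- spark_read_parquet(
  sc, name = "mytable", path = "/tmp/foo",
  memory = FALSE                    # keep the read lazy so pruning happens before caching
) %>%
  select(Col1, Col2) %>%            # project only the columns that are needed
  sdf_persist(storage.level = "MEMORY_ONLY")  # cache just the pruned result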

If you have a character vector, you can use rlang:

library(rlang)

cols <- c("Col1", "Col2")

spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
  select(!!!lapply(cols, parse_quosure))
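On newer rlang versions parse_quosure() has been deprecated in favour of parse_quo(); an equivalent sketch of the same idea is to splice the names in as symbols instead:

library(rlang)

cols <- c("Col1", "Col2")

spark_read_parquet(sc, name = "mytable", path = "/tmp/foo", memory = FALSE) %>% 
  select(!!!syms(cols))    # convert the character vector to symbols and splice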
  • That's great. So, is there a way to load this into memory while reading the subset of columns? – qoheleth Aug 17 '18 at 22:49
  • This is where you use `sdf_persist`, or, if you want to restrict yourself to memory, `sdf_persist(storage.level = "MEMORY_ONLY")`. It is lazy, but otherwise does the same thing. However, please keep in mind that indiscriminate caching can degrade performance more often than it improves it. – zero323 Aug 18 '18 at 02:01