203

Is there a more succinct way to get one column of a dplyr tbl as a vector, from a tbl with database back-end (i.e. the data frame/table can't be subset directly)?

require(dplyr)
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
iris2$Species
# NULL

That would have been too easy, so

collect(select(iris2, Species))[, 1]
# [1] "setosa"     "setosa"     "setosa"     "setosa"  etc.

But it seems a bit clumsy.

smci
  • 26,085
  • 16
  • 96
  • 138
nacnudus
  • 5,420
  • 5
  • 30
  • 46

7 Answers7

213

With dplyr >= 0.7.0, you can use pull to get a vector from a tbl.


library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
vec <- pull(iris2, Species)
head(vec)
#> [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
Lorenz Walthert
  • 3,219
  • 1
  • 13
  • 20
104

As per the comment from @nacnudus, it looks like a pull function was implemented in dplyr 0.6:

iris2 %>% pull(Species)

For older versions of dplyr, here's a neat function to make pulling out a column a bit nicer (easier to type, and easier to read):

pull <- function(x,y) {x[,if(is.name(substitute(y))) deparse(substitute(y)) else y, drop = FALSE][[1]]}

This lets you do either of these:

iris2 %>% pull('Species')
iris2 %>% pull(Species)
iris2 %>% pull(5)

Resulting in...

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

And it also works fine with data frames:

> mtcars %>% pull(5)
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43
[28] 3.77 4.22 3.62 3.54 4.11

A nice way to do this in v0.2 of dplyr:

iris2 %>% select(Species) %>% collect %>% .[[5]]

Or if you prefer:

iris2 %>% select(Species) %>% collect %>% .[["Species"]]

Or if your table isn't too big, simply...

iris2 %>% collect %>% .[["Species"]]
slhck
  • 30,965
  • 24
  • 125
  • 174
Tommy O'Dell
  • 6,553
  • 13
  • 51
  • 67
  • 2
    I like your pull function. I'd just add one simplification for cases where there's only one variable: `pull % pull()` – Rappster Jan 15 '16 at 12:18
  • 7
    You can also use the `magrittr` exposition operator (`%$%`) to pull a vector from a data frame. i.e. `iris2 %>% select(Species) %>% collect() %$% Species`. – seasmith Jan 07 '17 at 18:44
  • @Luke1018 you should create an answer out of this comment – rrs Mar 29 '17 at 03:37
  • `pull()` will be implemented in dplyr version 0.6 https://github.com/tidyverse/dplyr/commit/0b9aabf6c06c9cd3b784b155044d497d4b93df3e – nacnudus Apr 23 '17 at 11:01
75

You can also use unlist which I find easier to read because you do not need to repeat the name of the column or specify the index.

iris2 %>% select(Species) %>% unlist(use.names = FALSE)
StanislawSwierc
  • 2,301
  • 15
  • 22
  • 1
    This seems the most versatile method as it works identically with vectors and data.frames, i.e. it enables functions to be more agnostic. – geotheory Mar 14 '17 at 10:03
  • I was just looking for an answer to this exact question and `unlist` is precisely what I needed. Thanks! – Andrew Brēza May 19 '17 at 14:29
  • `unlist` can also extract values from multiple columns (combining all values into a single vector), while `dplyr::pull` is limited to a single column. – filups21 Apr 20 '18 at 12:26
21

I would use the extract2 convenience function from magrittr:

library(magrittr)
library(dplyr)

iris2 %>%
  select(Species) %>%
  extract2(1)  
Hugh
  • 13,468
  • 6
  • 48
  • 90
  • Did you mean to use `collect()` between `select` and `extract2`? – nacnudus Nov 26 '14 at 09:26
  • 10
    `use_series(Species)` is perhaps even more readable. Thanks for alerting me to these functions, there are several other handy ones where that came from. – nacnudus Nov 26 '14 at 09:27
20

I'd probably write:

collect(select(iris2, Species))[[1]]

Since dplyr is designed for working with tbls of data, there's no better way to get a single column of data.

hadley
  • 94,313
  • 27
  • 170
  • 239
  • Can't say fairer than that. It arose interactively in the console when I tried using unique(table$column) to check for spurious values. – nacnudus Feb 08 '14 at 10:49
  • 4
    @nacnudus for that case you could also do `group_by(column) %.% tally()` – hadley Feb 08 '14 at 15:00
  • 13
    An argument `drop = TRUE` to `dplyr::select` would be amazing for the quite many use cases where we actually need to extract the vectors. – Antoine Lizée Apr 19 '16 at 03:38
  • This was the only way I could get a column out of my Sparklyr sdf. Pull wasn't working for me on version 0.7.8. – Meep Feb 07 '19 at 22:42
17

@Luke1018 proposed this solution in one of the comments:

You can also use the magrittr exposition operator (%$%) to pull a vector from a data frame.

For example:

iris2 %>% select(Species) %>% collect() %$% Species

I thought it deserved its own answer.

rrs
  • 8,457
  • 4
  • 24
  • 34
  • I was looking for this. – Diego-MX Jun 15 '17 at 17:00
  • How would I do this if I want to pass not the colname itself but a string variable that contains it? – mzuba Aug 07 '17 at 22:16
  • @mzuba `tibble(x = 1:10, y = letters[1:10]) %>% select_("x") %>% unlist()` and you could also add another `%>% unname()` at the end if you want, but for my purposes I haven't found that last pipe chain link to be necessary. You could also specify `use.names = FALSE` in the `unlist()` command, which does the same thing as also adding `unname()` onto the pipe chain. – Mark White Jun 19 '18 at 14:37
  • 1
    @mzuba I would use the `pull` command now. My solution was written prior to `dplyr` version 0.6. – rrs Jun 20 '18 at 13:12
  • 1
    Note that `%$%` works on any list, whereas `pull()` doesn't – wint3rschlaefer Jun 26 '18 at 14:48
5

If you are used to using square brackets for indexing, another option is to just to wrap the usual indexing approach in a call to deframe(), e.g.:

library(tidyverse)

iris2 <- as_tibble(iris)

# using column name
deframe(iris2[, 'Sepal.Length'])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

# using column number
deframe(iris2[, 1])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

That and pull() are both pretty good ways of getting a tibble column.

Keith Hughitt
  • 4,233
  • 5
  • 40
  • 50