Extract a dplyr tbl column as a vector

Question

Is there a more succinct way to get one column of a dplyr tbl as a vector, from a tbl with database back-end (i.e. the data frame/table can't be subset directly)?

require(dplyr)
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
iris2$Species
# NULL

That would have been too easy, so

collect(select(iris2, Species))[, 1]
# [1] "setosa"     "setosa"     "setosa"     "setosa"  etc.

But it seems a bit clumsy.

is `collect(iris2)$Species` less clumsy? – CJ Yetman May 25 '17 at 12:39 — CJ Yetman, May 25 '17 at 12:39

Lorenz Walthert · Accepted Answer · 2021-04-22T20:55:27.497

With dplyr >= 0.7.0, you can use pull to get a vector from a tbl.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
db <- src_sqlite(tempfile(), create = TRUE)
iris2 <- copy_to(db, iris)
vec <- pull(iris2, Species)
head(vec)
#> [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

score 104 · Answer 2 · edited Dec 05 '17 at 17:48

104

As per the comment from @nacnudus, it looks like a pull function was implemented in dplyr 0.6:

iris2 %>% pull(Species)

For older versions of dplyr, here's a neat function to make pulling out a column a bit nicer (easier to type, and easier to read):

pull <- function(x,y) {x[,if(is.name(substitute(y))) deparse(substitute(y)) else y, drop = FALSE][[1]]}

This lets you do either of these:

iris2 %>% pull('Species')
iris2 %>% pull(Species)
iris2 %>% pull(5)

Resulting in...

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

And it also works fine with data frames:

> mtcars %>% pull(5)
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43
[28] 3.77 4.22 3.62 3.54 4.11

A nice way to do this in v0.2 of dplyr:

iris2 %>% select(Species) %>% collect %>% .[[5]]

Or if you prefer:

iris2 %>% select(Species) %>% collect %>% .[["Species"]]

Or if your table isn't too big, simply...

iris2 %>% collect %>% .[["Species"]]

edited Dec 05 '17 at 17:48

slhck

30,965
24
125
174

answered Jul 14 '14 at 06:42

Tommy O'Dell

6,553
13
51
67

2

I like your pull function. I'd just add one simplification for cases where there's only one variable: `pull % pull()` – Rappster Jan 15 '16 at 12:18
7

You can also use the `magrittr` exposition operator (`%$%`) to pull a vector from a data frame. i.e. `iris2 %>% select(Species) %>% collect() %$% Species`. – seasmith Jan 07 '17 at 18:44
@Luke1018 you should create an answer out of this comment – rrs Mar 29 '17 at 03:37
`pull()` will be implemented in dplyr version 0.6 https://github.com/tidyverse/dplyr/commit/0b9aabf6c06c9cd3b784b155044d497d4b93df3e – nacnudus Apr 23 '17 at 11:01

score 75 · Answer 3 · answered Jun 27 '15 at 10:02

75

You can also use unlist which I find easier to read because you do not need to repeat the name of the column or specify the index.

iris2 %>% select(Species) %>% unlist(use.names = FALSE)

answered Jun 27 '15 at 10:02

StanislawSwierc

2,301
15
22

1

This seems the most versatile method as it works identically with vectors and data.frames, i.e. it enables functions to be more agnostic. – geotheory Mar 14 '17 at 10:03
I was just looking for an answer to this exact question and `unlist` is precisely what I needed. Thanks! – Andrew Brēza May 19 '17 at 14:29
`unlist` can also extract values from multiple columns (combining all values into a single vector), while `dplyr::pull` is limited to a single column. – filups21 Apr 20 '18 at 12:26

Hugh · Answer 4 · 2014-11-26T09:20:05.437

21

I would use the extract2 convenience function from magrittr:

library(magrittr)
library(dplyr)

iris2 %>%
  select(Species) %>%
  extract2(1)

edited Nov 26 '14 at 09:20

answered Nov 26 '14 at 04:12

Hugh

13,468
6
48
90

Did you mean to use `collect()` between `select` and `extract2`? – nacnudus Nov 26 '14 at 09:26
10

`use_series(Species)` is perhaps even more readable. Thanks for alerting me to these functions, there are several other handy ones where that came from. – nacnudus Nov 26 '14 at 09:27

score 20 · Answer 5 · answered Feb 07 '14 at 13:33

20

I'd probably write:

collect(select(iris2, Species))[[1]]

Since dplyr is designed for working with tbls of data, there's no better way to get a single column of data.

answered Feb 07 '14 at 13:33

hadley

94,313
27
170
239

Can't say fairer than that. It arose interactively in the console when I tried using unique(table$column) to check for spurious values. – nacnudus Feb 08 '14 at 10:49
4

@nacnudus for that case you could also do `group_by(column) %.% tally()` – hadley Feb 08 '14 at 15:00
13

An argument `drop = TRUE` to `dplyr::select` would be amazing for the quite many use cases where we actually need to extract the vectors. – Antoine Lizée Apr 19 '16 at 03:38
This was the only way I could get a column out of my Sparklyr sdf. Pull wasn't working for me on version 0.7.8. – Meep Feb 07 '19 at 22:42

score 17 · Answer 6 · answered Mar 30 '17 at 17:50

17

@Luke1018 proposed this solution in one of the comments:

You can also use the magrittr exposition operator (%$%) to pull a vector from a data frame.

For example:

iris2 %>% select(Species) %>% collect() %$% Species

I thought it deserved its own answer.

answered Mar 30 '17 at 17:50

rrs

8,457
4
24
34

I was looking for this. – Diego-MX Jun 15 '17 at 17:00
How would I do this if I want to pass not the colname itself but a string variable that contains it? – mzuba Aug 07 '17 at 22:16
@mzuba `tibble(x = 1:10, y = letters[1:10]) %>% select_("x") %>% unlist()` and you could also add another `%>% unname()` at the end if you want, but for my purposes I haven't found that last pipe chain link to be necessary. You could also specify `use.names = FALSE` in the `unlist()` command, which does the same thing as also adding `unname()` onto the pipe chain. – Mark White Jun 19 '18 at 14:37
1

@mzuba I would use the `pull` command now. My solution was written prior to `dplyr` version 0.6. – rrs Jun 20 '18 at 13:12
1

Note that `%$%` works on any list, whereas `pull()` doesn't – wint3rschlaefer Jun 26 '18 at 14:48

score 5 · Answer 7 · answered Feb 01 '20 at 16:45

If you are used to using square brackets for indexing, another option is to just to wrap the usual indexing approach in a call to deframe(), e.g.:

library(tidyverse)

iris2 <- as_tibble(iris)

# using column name
deframe(iris2[, 'Sepal.Length'])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

# using column number
deframe(iris2[, 1])

# [1] 5.1 4.9 4.7 4.6 5.0 5.4

That and pull() are both pretty good ways of getting a tibble column.

Extract a dplyr tbl column as a vector

7 Answers7

Linked

Related