Questions tagged [dplyr]

Use this tag for questions relating to functions from the dplyr package, such as group_by, summarize, filter, and select.

The dplyr package for R is the next iteration of the plyr package. It has three main goals:

Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.

Provide fast performance for in-memory data by writing key pieces in C++.

Use the same interface to work with data no matter where it's stored, whether in a data.frame, a data.table or a database.
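As a rough illustration of the verbs named in the tag description (select, filter, group_by, summarize), here is a minimal sketch. It assumes the dplyr package is installed and uses the mtcars data set that ships with base R; the column choice and the 100-horsepower threshold are arbitrary and only for illustration.

```r
library(dplyr)

# mtcars ships with base R, so no external data is needed.
result <- mtcars %>%
  select(mpg, cyl, hp) %>%   # keep only three columns
  filter(hp > 100) %>%       # keep cars with more than 100 horsepower (arbitrary cutoff)
  group_by(cyl) %>%          # one group per cylinder count
  summarize(
    mean_mpg = mean(mpg),    # average fuel economy within each group
    n_cars   = n()           # number of cars in each group
  )

print(result)
```

The same chain of verbs is intended to work on other backends as well, which is the point of the third goal above; that part is not shown here.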

Repositories

Vignettes

Some vignettes have been moved to other related packages.

Tibbles (from tibble package)
Databases (from dbplyr package)
Introduction to dplyr
Adding a new SQL backend (from dbplyr package)
Programming with dplyr
Two-table verbs
Window functions and grouped mutate/filter

Other resources

Related tags

R's plyr, magrittr, tidyr, tidyverse and data.table packages
Python's pandas library

24676 questions


2 answers

Complete column with group_by and complete

I've got a little problem using dplyr's group_by function. After doing this: datasetALL %>% group_by(YEAR,Region) %>% summarise(count_number = n()) here is the result: YEAR Region count_number 1 1946 1 …

r dplyr tidyr

asked Apr 19 '17 at 16:51

Ben


3 answers

dplyr for rowwise quantiles

I have a df of strata, each of which has 1000 samples from a posterior distribution of the estimates from that stratum. mydf <- as.data.frame(lapply(seq(1, 1000), rnorm, n=100)) colnames(mydf) <- paste('s', seq(1, ncol(mydf)), sep='') I want to…

r dplyr

asked Apr 18 '17 at 19:08

wylbur


1 answer

How to pass column names into a function dplyr

I'm trying to create a simple summary function to speed up the reporting of multiple columns of data for use in an R Markdown file. var1 is a categorical column of data, t_var is an integer representing the quarter of data, and dt is the full…

r function dplyr

asked Apr 16 '17 at 13:59

elksie5000


2 answers

Combine select and mutate

Quite often, I find myself manually combining select() and mutate() functions within dplyr. This is usually because I'm tidying up a dataframe, want to create new columns based on the old columns, and only want to keep the new columns. For example, if…

r dplyr

asked Apr 12 '17 at 16:39

mdpead


2 answers

Group by aggregate dynamic column name matching

Is it possible to group_by using regex match on column names using dplyr? library(dplyr) # dplyr_0.5.0; R version 3.3.2 (2016-10-31) # dummy data set.seed(1) df1 <- sample_n(iris, 20) %>% mutate(Sepal.Length = round(Sepal.Length), …

r dplyr aggregate

asked Apr 05 '17 at 10:54

zx8754


1 answer

R dplyr method to replace all empty factors with NA

Instead of writing and reading a dataframe to fill all empty factors in this method, na.strings=c("","NA") I wanted to just apply a function to all the columns and substitute the empties with NA. I've selected the factor columns so far but don't…

r dplyr

asked Mar 28 '17 at 02:58

Ricky


1 answer

Using dplyr to group_by and conditionally mutate a dataframe by group

I'd like to use dplyr functions to group_by and conditionally mutate a df. Given this sample data: A B C D 1 1 1 0.25 1 1 2 0 1 2 1 0.5 1 2 2 0 1 3 1 0.75 1 3 2 0.25 2 1 1 0 2 1 2 0.5 2 2 1 …

r group-by dplyr

asked Mar 23 '17 at 15:25

ucsbcoding


0 answers

dplyr summarise evaluates custom function twice?

I am using dplyr's group_by and summarise functions with a custom-made aggregate function, and have observed some strange behavior. It seems like the aggregate function is evaluated twice for each group. Here is a minimal example: aggFun <- function(x) {…

r dplyr

asked Feb 26 '17 at 16:15

Øystein S


2 answers

Faster coding than using for loop

Suppose I have the following data frame: set.seed(36) n <- 300 dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500))) d <- dat[order(dat$y),] For each value of d$y<=300, I have to create a variable res in which the…

r dplyr

asked Feb 24 '17 at 03:07

user 31466


2 answers

How to use data.table within functions and loops?

While assessing the utility of data.table (vs. dplyr), a critical factor is the ability to use it within functions and loops. For this, I've modified the code snippet used in this post: data.table vs dplyr: can one do something well the other can't…

r function loops data.table dplyr

asked Feb 21 '17 at 19:16

IVIM


2 answers

Tracking which group fails in a dplyr chain

How can I find out which group failed when using group_by in a dplyr-type chain? Take for example: library(dplyr) data(iris) iris %>% group_by(Species) %>% do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>% mutate(Slope =…

r dplyr

asked Feb 16 '17 at 18:02

boshek


4 answers

Remove the first N rows from each factor level in an R data.frame

With the dat below, how can I make a new data frame subset that includes all values except the first five rows for each IndID? Said differently, I want a new data frame with the first 5 rows for each IndID excluded. set.seed(123) dat <-…

r dplyr greatest-n-per-group

asked Feb 14 '17 at 23:12

B. Davis


1 answer

Filling "implied missing values" in a data frame that has varying observations per time unit

I have a large dataset with spatiotemporal data. Each set of coordinates is associated with an id (player id in a computer game). Unfortunately the coordinates for each id aren't logged at every time unit. If a reading is not available for a…

r merge dplyr data-manipulation tidyr

asked Feb 10 '17 at 17:53

Lauler


1 answer

ggplot: How to make the x/time-axis of a time-series plot only the time-component, not the date?

Consider the following example: library(lubridate) library(tidyverse) library(scales) library(ggplot2) dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'), ymd_hms('2008-01-04 00:01:00'), …

r ggplot2 dplyr lubridate

asked Feb 08 '17 at 17:56

ℕʘʘḆḽḘ


2 answers

top_n versus order in R

I am having trouble understanding the output from dplyr's top_n function. Can anybody help? n=10 df = data.frame(ref=sample(letters,n),score=rnorm(n)) require(dplyr) print(dplyr::top_n(df,5,score)) print(df[order(df$score,decreasing =…

r dplyr

asked Jan 25 '17 at 11:28

PM.
