Questions tagged [dplyr]

Use this tag for questions relating to functions from the dplyr package, such as group_by, summarize, filter, and select.

The dplyr package is the next iteration of the plyr package. It has three main goals:

  1. Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.
  2. Provide fast performance for in-memory data by writing key pieces in C++.
  3. Use the same interface to work with data no matter where it's stored, whether in a data.frame, a data.table or a database.
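
As a rough illustration of the verbs named in the tag description, here is a minimal sketch that chains filter, group_by, summarize, and select on an in-memory data frame. The sales data, its column names, and the threshold of 10 are hypothetical and exist only for this example; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  units  = c(10, 20, 5, 15, 40)
)

totals <- sales %>%
  filter(units >= 10) %>%                  # keep rows with at least 10 units
  group_by(region) %>%                     # form one group per region
  summarize(total_units = sum(units)) %>%  # collapse each group to one row
  select(region, total_units)              # keep only the columns of interest

print(totals)
```

The same chain is intended to work against other backends (the third goal above), although that typically requires an additional package such as dbplyr or dtplyr.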

24676 questions
5 votes • 3 answers

A more elegant way to compute within-group proportions in dplyr?

Given a data_frame df <- data_frame(X = c('A', 'A', 'B', 'B', 'B'), Y = c('M', 'N', 'M', 'M', 'N')), I need to come up with a data_frame that tells us that 50% of A's are M, 50% of A's are N, 67% of B's are M, and 33% of B's are N. I have a little…
crf • 1,371 • 3 • 10 • 21

5 votes • 2 answers

Conditional mutate cumsum dplyr

I have towns (from A to D), which have different populations, and are at different distances. The objective is to add up the total population living within the circle of radius (distance XY) where X is a town in the centre of the circle and Y any…
JPV • 313 • 1 • 9

5 votes • 3 answers

Calculating age using mutate with lubridate functions

I would like to calculate age based on birth date. If I use lubridate, I would just run the following as in Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date as.period(new_interval(start…
HNSKD • 1,378 • 1 • 11 • 22

5 votes • 1 answer

Using mutate with dates gives numerical values

I am using the lubridate and dplyr packages to work with date variables and to create a new date variable, respectively. library(lubridate) library(dplyr) Let df be my dataframe. I have two variables date1 and date2. I want to create a new variable…
HNSKD • 1,378 • 1 • 11 • 22

5 votes • 1 answer

Sampling different numbers of rows by group in dplyr tidyverse

I'd like to sample rows from a data frame by group. But here's the catch, I'd like to sample a different number of records based on data from another table. Here is my reproducible data: df <- data_frame( Stratum = rep(c("High","Medium","Low"),…
Zafar • 1,699 • 12 • 27

5 votes • 5 answers

R: calculate the number of occurrences of a specific event in a specified time future

My simplified data looks like this: set.seed(1453); x = sample(0:1, 10, TRUE) date = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20', '2016-01-20', '2016-01-25', '2016-01-26', …
Kasia Kulma • 1,442 • 10 • 34

5 votes • 4 answers

Counting new values not occurring earlier and not occurring in last group

I am trying to count the number of unique "new" users per month. A new user is one that has not appeared before (since the beginning). I am also trying to count the number of unique users not appearing last month. The original data looks like library(dplyr) …
user3482393 • 217 • 4 • 12

5 votes • 2 answers

Strip trailing spaces from factor labels using dplyr chain

I have a dataframe loaded that has trailing white spaces in the factor labels. I am trying to remove those trailing spaces in every factor in the dataframe but have been unsuccessful so far. Reproducible example lvls <- c('a ', 'b ', …
Wietze314 • 5,675 • 1 • 17 • 35

5 votes • 2 answers

Fit a different model for each row of a list-columns data frame

What is the best way to fit different model formulae that vary by the row of a data frame with the list-columns data structure in tidyverse? In R for Data Science, Hadley presents a terrific example of how to use the list-columns data structure and…
LmW. • 1,246 • 9 • 15

5 votes • 1 answer

Running out of heap space in sparklyr, but have plenty of memory

I am getting heap space errors on even fairly small datasets. I can be sure that I'm not running out of system memory. For example, consider a dataset containing about 20M rows and 9 columns, and that takes up 1GB on disk. I am playing with it on a…
David Bruce Borenstein • 1,323 • 1 • 13 • 31

5 votes • 1 answer

SE filter_ by function taking multiple columns

I would like to filter a data frame to leave only the complete cases based on selected columns. This is easy to do with NSE filter(): library(dplyr) dd <- data.frame( id = 1:4, var1 = c(1, 2, NA, 4), var2 = c(1, NA, 3, 4), var3 = c(1, NA,…
mdlincoln • 304 • 2 • 11

5 votes • 1 answer

Using `map()` in nested data frame

I am having some problems using the map() function along with the nest() function. I have some data set up like the following: counter counter date_time total 1 06032013 2013-06-03 16:00:00 476 2 06032013 2013-06-03 17:00:00 …
thus__ • 351 • 2 • 13

5 votes • 1 answer

dplyr Exclude row

I am looking for a dplyr equivalent of SELECT user_id, item FROM users WHERE user_id NOT IN (1, 5, 6, 7, 11, 17, 18); -- admin accounts. I can use users %>% filter(user_id != 1) but can't imagine using multiple && all the way. Is there a…
Young Ha Kim • 107 • 1 • 2 • 6

5 votes • 2 answers

Substitute for mutate (dplyr package) in python pandas

Is there a Python pandas function similar to R's dplyr::mutate(), which can add a new column to grouped data by applying a function on one of the columns of the grouped data? Below is the detailed explanation of the problem: I generated sample data…
saurav shekhar • 529 • 6 • 17

5 votes • 1 answer

Joining list of data.frames from map() call

Is there a "tidyverse" way to join a list of data.frames (a la full_join(), but for >2 data.frames)? I have a list of data.frames as a result of a call to map(). I've used Reduce() to do something like this before, but would like to merge them as…