6

I have a dataframe df:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108  -8.00   Easy       Easy      Easy         Easy    
3207   3.00   Hard       Easy      Match        Match
3350   5.78   Hard       Easy      Hard         Hard
3961   10.00  Easy       NA        Hard         Hard
4021   10.00  Easy       Easy      NA           Hard


1. userID is factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match' or NA

I want to create new columns per userID that contain the counts of occurrence for each possible state of the Task_ feature. For the above toy example, the required output would be three new columns to be appended at the end of the df as below:

userID Hard Match Easy
3108   0    0     4
3207   1    2     1
3350   3    0     1
3961   2    0     1
4021   1    0     2

Update: This question is not a duplicate, an associated part of the original question has been moved to: R How to counting the factors in ordered sequence

Sandy
  • 511
  • 2
  • 11
  • Thanks @RonakShah I have posted the second part here: https://stackoverflow.com/questions/58745713/r-how-to-counting-the-factors-in-ordered-sequence – Sandy Nov 07 '19 at 09:33

5 Answers5

4

You can compare the dataframe df to each value in a map* or *apply function, compute the row-wise sums of the resulting boolean matrix, then combine the output with the original dataframe:

library(dplyr)
library(purrr)

facs <- c("Easy", "Match", "Hard")

bind_cols(df, set_names(map_dfc(facs, ~ rowSums(df == ., na.rm = T)), facs))

#### OUTPUT ####

  userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta Easy Match Hard
1   3108 -8.00       Easy      Easy         Easy       Easy    4     0    0
2   3207  3.00       Hard      Easy        Match      Match    1     2    1
3   3350  5.78       Hard      Easy         Hard       Hard    1     0    3
4   3961 10.00       Easy      <NA>         Hard       Hard    1     0    2
5   4021 10.00       Easy      Easy         <NA>       Hard    2     0    1
RoCh
  • 58
  • 3
3
library(data.table)
DT <- fread("userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108  -8.00   Easy       Easy      Easy         Easy    
3207   3.00   Hard       Easy      Match        Match
3350   5.78   Hard       Easy      Hard         Hard
3961   10.00  Easy       NA        Hard         Hard
4021   10.00  Easy       Easy      NA           Hard
")

DT.melt <- melt( DT, id.vars = "userID", measure.vars = patterns( task = "^Task_") )
dcast( DT.melt, userID ~ value, fun.aggregate = length )

#    userID NA Easy Hard Match
# 1:   3108  0    4    0     0
# 2:   3207  0    1    1     2
# 3:   3350  0    1    3     0
# 4:   3961  1    1    2     0
# 5:   4021  1    2    1     0
Wimpel
  • 16,956
  • 1
  • 15
  • 34
2

Answer to the first part can be obtained by using apply row-wise and count the occurrence of factor level in each row using table

cbind(df[1], t(apply(df[-c(1, 2)], 1, function(x) 
           table(factor(x, levels = c("Easy", "Hard", "Match"))))))


#  userID Easy Hard Match
#1   3108    4    0     0
#2   3207    1    1     2
#3   3350    1    3     0
#4   3961    1    2     0
#5   4021    2    1     0

In tidyverse, we can convert the data to long format, drop NA values, count occurrence of userID and value and get the data back to wide format.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with("Task"), values_drop_na = TRUE) %>%
  count(userID, value) %>%
  pivot_wider(names_from = value, values_from = n, values_fill = list(n = 0))

data

df <- structure(list(userID = c(3108L, 3207L, 3350L, 3961L, 4021L), 
Score = c(-8, 3, 5.78, 10, 10), Task_Alpha = structure(c(1L, 
2L, 2L, 1L, 1L), .Label = c("Easy", "Hard"), class = "factor"), 
Task_Beta = structure(c(1L, 1L, 1L, NA, 1L), .Label = "Easy", class = "factor"), 
Task_Charlie = structure(c(1L, 3L, 2L, 2L, NA), .Label = c("Easy", 
"Hard", "Match"), class = "factor"), Task_Delta = structure(c(1L, 
3L, 2L, 2L, 2L), .Label = c("Easy", "Hard", "Match"), class = "factor")), 
class = "data.frame", row.names = c(NA, -5L))
Ronak Shah
  • 286,338
  • 16
  • 97
  • 143
  • Thanks @Ronak, would your `dplyr` based solution exclude the entries where all `Task_` features are NA? – Sandy Nov 07 '19 at 09:59
  • @Sandy Yes it drops `NA` values but I don't think it would have any impact on the final output since we count only the state of `Task` and not the `Task` itself. – Ronak Shah Nov 07 '19 at 10:05
  • Thanks! Just another quick question please: how should I update your answer to have a `mutate()` command? I want to append these three columns at the end of the original dataframe `df`. – Sandy Nov 07 '19 at 10:09
  • 1
    @Sandy Add a `left_join` in the beginning do `df %>% left_join(df %>% pivot_longer(cols = starts_with(.......rest of the code` – Ronak Shah Nov 07 '19 at 10:12
2

Another option using Rfast::rowTabulate

v <- c('Hard', 'Match', 'Easy', NA)
DT[, (v) := as.data.table(Rfast::rowTabulate(matrix(match(as.matrix(.SD), v), nrow=.N))), 
    .SDcols=Task_Alpha:Task_Delta]

output:

   userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta Hard Match Easy NA
1:   3108 -8.00       Easy      Easy         Easy       Easy    0     0    4  0
2:   3207  3.00       Hard      Easy        Match      Match    1     2    1  0
3:   3350  5.78       Hard      Easy         Hard       Hard    3     0    1  0
4:   3961 10.00       Easy      <NA>         Hard       Hard    2     0    1  1
5:   4021 10.00       Easy      Easy         <NA>       Hard    1     0    2  1

data from Wimpel:

library(data.table)
DT <- fread("userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
    3108  -8.00   Easy       Easy      Easy         Easy    
    3207   3.00   Hard       Easy      Match        Match
    3350   5.78   Hard       Easy      Hard         Hard
    3961   10.00  Easy       NA        Hard         Hard
    4021   10.00  Easy       Easy      NA           Hard
    ")

Would be interesting to know how fast this approach works on actual dataset and if actual dataset is large.


edit: Added timings

library(data.table)
set.seed(0L)
nr <- 1e6
v <- c('Hard', 'Match', 'Easy', NA)
DT <- data.table(userID=1:nr, Task_Alpha=sample(v, nr, TRUE),
    Task_Beta=sample(v, nr, TRUE), Task_Charlie=sample(v, nr, TRUE),
    Task_Delta=sample(v, nr, TRUE))
df <- as.data.frame(DT)

mtd0 <- function() {
    t(apply(df[-1L], 1L, function(x)
        table(factor(x, levels = c("Easy", "Hard", "Match")))))
}

mtd1 <- function() {
    DT.melt <- melt( DT, id.vars = "userID", measure.vars = patterns( task = "^Task_") )
    dcast( DT.melt, userID ~ value, fun.aggregate = length )
}

mtd2 <- function() {
    DT[, Rfast::rowTabulate(matrix(match(as.matrix(.SD), v), nrow=.N)),
        .SDcols=Task_Alpha:Task_Delta]
}

bench::mark(mtd0(), mtd1(), mtd2(), check=FALSE)

timings:

# A tibble: 3 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory                 time     gc              
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>                 <list>   <list>          
1 mtd0()        54.7s    54.7s    0.0183     137MB    1.70      1    93      54.7s <int[,3] [1,000,000 x 3]> <df[,3] [107,168 x 3]> <bch:tm> <tibble [1 x 3]>
2 mtd1()         2.4s     2.4s    0.417      398MB    0.833     1     2       2.4s <df[,5] [1,000,000 x 5]>  <df[,3] [12,517 x 3]>  <bch:tm> <tibble [1 x 3]>
3 mtd2()      252.8ms  264.4ms    3.78       107MB    3.78      2     2    528.7ms <int[,4] [1,000,000 x 4]> <df[,3] [6,509 x 3]>   <bch:tm> <tibble [2 x 3]>
chinsoon12
  • 23,550
  • 4
  • 20
  • 30
1

If you are using base R, then the following may help you:

df <- cbind(df,as.data.frame(sapply(c('Hard','Match','Easy'), function(v) rowSums(df == v, na.rm = T))))

which outputs:

> df
  userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta Hard Match Easy
1   3108 -8.00       Easy      Easy         Easy       Easy    0     0    4
2   3207  3.00       Hard      Easy        Match      Match    1     2    1
3   3350  5.78       Hard      Easy         Hard       Hard    3     0    1
4   3961 10.00       Easy      <NA>         Hard       Hard    2     0    1
5   4021 10.00       Easy      Easy         <NA>       Hard    1     0    2
ThomasIsCoding
  • 53,240
  • 4
  • 13
  • 45