1

I have a problem with a data frame in R, I have some data with two dimensions and one metric, but some combination of categories don't have data. My data look like these:

          interestAffinityCategory userGender users
1                 Music Lovers       male   198
2                 Music Lovers     female   190
3  News Junkies & Avid Readers       male   134
4  News Junkies & Avid Readers     female   115
5                  Sports Fans       male   109
6                 Movie Lovers       male   108
7                 Technophiles       male    93
8                    TV Lovers       male    88
9                    TV Lovers     female    79
10                Technophiles     female    70

Example, Sport Fans, only have data for male gender. I need all the categories, even with a 0 value in the users column. Like: Sport Fans, female, 0 How my data need to be: (line 8 and 6)

      interestAffinityCategory userGender users
1                 Music Lovers       male   198
2                 Music Lovers     female   190
3  News Junkies & Avid Readers       male   134
4  News Junkies & Avid Readers     female   115
5                  Sports Fans       male   109
6                  Sports Fans     female   0
7                 Movie Lovers       male   108
8                 Movie Lovers     female   0
9                 Technophiles       male   93
10                    TV Lovers       male  88
11                    TV Lovers     female  79
12                Technophiles     female    70

I tried to find a solution, but I only find similar cases, but with only one dimension, and it didn't work for me.

Ps.: This data is from the Google Analytics API, I want to get the top 10 categories, and make a graph with visits by gender, but for it, I need to show data for all mix of categories and gender, even with 0 visits.

Alex
  • 19,061
  • 10
  • 50
  • 66
Diego Gaona
  • 448
  • 4
  • 18
  • Have attempted to use the [include-empty-rows](https://developers.google.com/analytics/devguides/reporting/core/v3/reference#includeEmptyRows) parameter? – Matt Mar 11 '16 at 22:20
  • @Matt, unfortunately this don't work... this parameter is set by default, even so I tried to set for true, but the results are the same. I think this don't work for this type of data in Analytics. – Diego Gaona Mar 14 '16 at 13:38

2 Answers2

4

You should use the complete function from tidyr. The first argument is your data, second and third are the columns that you want to find all possible comibnations (if you have more, you can just list them one by one), and fill is a list with the default values to fill in.

complete(data, interestAffinityCategory, userGender, fill=list(users=0))
Xiongbing Jin
  • 9,945
  • 3
  • 36
  • 34
1

You could create a data frame of all combinations of categories with users set to zero. Then you can combine the two data frames and for each combination of categories keep the maximum value for users.

You can create a data frame with all combinations using expand.grid():

all_levels_0 <- expand.grid(levels(data$interestAffinityCategory), levels(data$userGender))
all_levels_0$users <- 0
names(all_levels_0) <- names(data)
head(all_levels_0)
##        interestAffinityCategory  userGender users
## 1                  Movie Lovers      female     0
## 2                  Music Lovers      female     0
## 3   News Junkies & Avid Readers      female     0
## 4                   Sports Fans      female     0
## 5                  Technophiles      female     0
## 6                  Technophiles      female     0

(This assumes that data$interestAffinityCategory and data$userGender are both factors. If they are characters, use unique() instead of levels().)

For the second step, I use the dplyr package:

library(dplyr)
all_levels <- bind_rows(data, all_levels_0) %>%
              group_by(interestAffinityCategory, userGender) %>%
              summarise(users = max(users))
head(all_levels)
## Source: local data frame [6 x 3]
## Groups: interestAffinityCategory [3]
## 
##        interestAffinityCategory  userGender users
##                          (fctr)      (fctr) (dbl)
## 1                  Movie Lovers      female     0
## 2                  Movie Lovers        male   108
## 3                  Music Lovers      female   190
## 4                  Music Lovers        male   198
## 5   News Junkies & Avid Readers      female   115
## 6   News Junkies & Avid Readers        male   134

If you prefer not to use dplyr, you can do the same with rbind() and aggregate() from base R:

combined <- rbind(data, all_levels_0)
all_levels <- aggregate(users ~ interestAffinityCategory + userGender,
                        data = combined, FUN = max)
head(all_levels)
##        interestAffinityCategory  userGender users
## 1                  Movie Lovers      female     0
## 2                  Music Lovers      female   190
## 3   News Junkies & Avid Readers      female   115
## 4                   Sports Fans      female     0
## 5                  Technophiles      female    70
## 6                  Technophiles      female     0

(This orders the rows differently, so the first few rows are not the same as in the dplyr example.)

Stibu
  • 13,247
  • 5
  • 50
  • 67
  • Thanks! Interesting solution. I liked more the solution with tidyr because is more simple, but your solution uses some interesting points. – Diego Gaona Mar 14 '16 at 13:47