0

I have a tibble of 65524 observations where one variable is an ID for an household and the other is factor where the value of 1 is assigned if the age of the person in the household is less then 15y.o., 2 is assigned if the age is between 15 and 64, and 3 is assigned if the age of the person is 65 or older. The tibble look like this

> head(df, 15)
# A tibble: 15 x 2
   hh.id age.cat  
   <dbl> <dbl+lbl>
 1 11009 2        
 2 11009 2        
 3 11009 2        
 4 11009 2        
 5 11009 2        
 6 11009 1        
 7 11009 1        
 8 11009 1        
 9 11018 2        
10 11018 1        
11 11018 1        
12 11018 1        
13 11018 1        
14 11018 2        
15 11018 2

I need to create a variable to estimate the dependency ratio of each household. Something similar to this

 > head(df, 15)
# A tibble: 15 x 3
   hh.id age.cat  dep.ratio
   <dbl> <dbl+lbl><dbl>
 1 11009 2        0.60
 2 11009 2        0.60
 3 11009 2        0.60
 4 11009 2        0.60
 5 11009 2        0.60
 6 11009 1        0.60
 7 11009 1        0.60
 8 11009 1        0.60
 9 11018 2        1.25
10 11018 1        1.25
11 11018 1        1.25
12 11018 1        1.25
13 11018 1        1.25
14 11018 2        1.25
15 11018 2        1.25

I thought that using dplyr::mutate and dplyr::group_by would work

df <- df %>%
  dplyr::group_by(hh.id) %>%
  dplyr::mutate(dep.ratio = (length(which(df$age.cat == 1)) + length(which(df$age.cat == 3)))/length(which(df$age.cat == 2)))

However, I do not get the estimates per each group (i.e. per each household), but I get the overall dependency ratio for the whole sample, repeated for each observation.

# A tibble: 15 x 3
# Groups:   hh.id [2]
   hh.id age.cat   dep.ratio
   <dbl> <dbl+lbl>     <dbl>
 1 11009 2              1.02
 2 11009 2              1.02
 3 11009 2              1.02
 4 11009 2              1.02
 5 11009 2              1.02
 6 11009 1              1.02
 7 11009 1              1.02
 8 11009 1              1.02
 9 11018 2              1.02
10 11018 1              1.02
11 11018 1              1.02
12 11018 1              1.02
13 11018 1              1.02
14 11018 2              1.02
15 11018 2              1.02

I then considered using tapply, but I could not write a function which conditions on the values of hh.id. Finally, I also tried aggregate, but without any luck.

Any suggestion is welcome.

Thanks

Manolo

Frank
  • 63,401
  • 8
  • 85
  • 161
Manolo
  • 63
  • 7
  • You don't want to refer to the whole dataset if working with groups, so take out the `df$` everywhere in your solution and simply refer to the variable names directly. Then I think it works as you want (if the 1.25 should really be 1.33). You can streamline things a bit with `sum(age.cat %in% c(1, 3))/sum(age.cat == 2)`, but it isn't necessary. – aosmith Oct 05 '18 at 19:55
  • @aosmith thank you, it worked perfectly after I removed the 'df$'. thank you also for suggesting a lighter code with 'sum'. – Manolo Oct 06 '18 at 06:13

1 Answers1

0

Here's an option:

ratiodf<- df %>% group_by(hh.id,age.cat) %>% 
  summarize(n=n()) %>% 
  spread(age.cat,n) %>% 
  mutate(ratio=(`1`+`3`)/`2`)

This will give you something like this:

# A tibble: 2 x 4
# Groups:   hh.id [2]
  hh.id   `1`   `2` ratio
  <int> <int> <int> <dbl>
1 11009     3     5  0.6 
2 11018     4     3  1.33

If you need to keep the data from the original df (e.g., other columns), you can continue with a left_join:

left_join(df, ratiodf[,c(-2:-3)], by="hh.id")

Which will result in this:

   hh.id age.cat    ratio
1  11009       2 0.600000
2  11009       2 0.600000
3  11009       2 0.600000
4  11009       2 0.600000
5  11009       2 0.600000
6  11009       1 0.600000
7  11009       1 0.600000
8  11009       1 0.600000
9  11018       2 1.333333
10 11018       1 1.333333
11 11018       1 1.333333
12 11018       1 1.333333
13 11018       1 1.333333
14 11018       2 1.333333
15 11018       2 1.333333

The reason your code didn't work is that as soon as your put in the base format of df$hh.id etc., you bypass the tidyverse grouping and get the full column.

iod
  • 6,861
  • 2
  • 13
  • 30
  • thank you @iod i also tried your code but i get an error message Error in summarize(., n = n()) : argument "by" is missing, with no default . I look at the documentation for n() , but could not find anything on its argument by. The dplyr pipe worked finally after your suggestion of dropping the df$, but I am interested in watching your suggested code working too. – Manolo Oct 06 '18 at 06:18
  • There's a chance either summarize or n are masked by another package. Try typing dplyr:: before both. The code works for me as is... – iod Oct 06 '18 at 11:51
  • Try seeing what conflicts() gives you. – iod Oct 06 '18 at 11:53