Calculate multi-group row-wise proportion in R

Question

I have a data.frame that is grouped by folder, z_stack_id and contains the counts for each binary_layer. The "primary" layers are FITC, TRITC, and Cy5. I have already calculated intersections elsewhere (). My goal is to calculate the proportions within z_stack, within folder (and other grouping variable if in case needed). I was hoping to use dplyr::group_by(...) %>% summarise(my_custom_fancy_function). But I am uncertain how to go about making such function.

The expected outputs for the function would be the proportions for each main layer, grouped by folder/z_stack_id/(...). For example, for Cy5

FITC_Cy5/Cy5, TRITC_Cy5/Cy5 , Triple/Cy5

Notice that Triple does not always have counts, so I would need to fill the groups first (currently working on it).

 my_df
# A tibble: 13 x 4
   folder              z_stack_id binary_layer n_blobs
   <chr>                    <dbl> <chr>          <int>
 1 20180601_122650_896       1.00 Cy5              959
 2 20180601_122650_896       1.00 FITC              16
 3 20180601_122650_896       1.00 TRITC            499
 4 20180601_122650_896       2.00 Cy5              225
 5 20180601_122650_896       2.00 FITC             157
 6 20180601_122650_896       2.00 TRITC             19
 7 20180601_122650_896       1.00 FITC_Cy5           5
 8 20180601_122650_896       1.00 FITC_TRITC         2
 9 20180601_122650_896       1.00 TRITC_Cy5        301
10 20180601_122650_896       2.00 FITC_Cy5          34
11 20180601_122650_896       2.00 FITC_TRITC         8
12 20180601_122650_896       2.00 Triple             4
13 20180601_122650_896       2.00 TRITC_Cy5          8

dput(my_df)
structure(list(folder = c("20180601_122650_896", "20180601_122650_896", 
"20180601_122650_896", "20180601_122650_896", "20180601_122650_896", 
"20180601_122650_896", "20180601_122650_896", "20180601_122650_896", 
"20180601_122650_896", "20180601_122650_896", "20180601_122650_896", 
"20180601_122650_896", "20180601_122650_896"), z_stack_id = c(1, 
1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2), binary_layer = c("Cy5", 
"FITC", "TRITC", "Cy5", "FITC", "TRITC", "FITC_Cy5", "FITC_TRITC", 
"TRITC_Cy5", "FITC_Cy5", "FITC_TRITC", "Triple", "TRITC_Cy5"), 
    n_blobs = c(959L, 16L, 499L, 225L, 157L, 19L, 5L, 2L, 301L, 
    34L, 8L, 4L, 8L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -13L), .Names = c("folder", "z_stack_id", 
"binary_layer", "n_blobs"))

UPDATE

I made an example of the Cy5 calculations by hand. Notice most of the results on the prop_main_Cy5 column are spurious. The only ones who make sense are those in which Cy5 value for that z_stack_id is the total (e.g, FITC_TRITC/Cy5 makes no sense)

1               folder z_stack_id binary_layer n_blobs Cy5 prop_main_channel_Cy5
2  20180601_122650_896          1          Cy5     959 959                     1
3  20180601_122650_896          1         FITC      16 959           0.016684046
4  20180601_122650_896          1        TRITC     499 959           0.520333681
5  20180601_122650_896          2          Cy5     225 225                     1
6  20180601_122650_896          2         FITC     157 225           0.697777778
7  20180601_122650_896          2        TRITC      19 225           0.084444444
8  20180601_122650_896          1     FITC_Cy5       5 959           0.005213764
9  20180601_122650_896          1   FITC_TRITC       2 959           0.002085506
10 20180601_122650_896          1    TRITC_Cy5     301 959           0.313868613
11 20180601_122650_896          2     FITC_Cy5      34 225           0.151111111
12 20180601_122650_896          2   FITC_TRITC       8 225           0.035555556
13 20180601_122650_896          2       Triple       4 225           0.017777778
14 20180601_122650_896          2    TRITC_Cy5       8 225           0.035555556

A very similar question was answered here: https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr You can adapt that solution to fit several levels of grouping. — TTNK, Jun 07 '18 at 20:28
@TTNK The proposed `group_by(folder, z_stack_id) %>% mutate(n/sum(n))` alternative would give me the partials for the total group. What I don't understand is how to calculate *within* level. Instead of `sum(n)` I need something to tell `within each factor level combination I want` — Matias Andina, Jun 07 '18 at 20:41
@chinsoon12 Do you mean if I have a sample of my desired output? — Matias Andina, Jun 07 '18 at 22:30
yeah what output are you expecting using the data you provided in OP? — chinsoon12, Jun 07 '18 at 22:31

score 0 · Answer 1 · answered Jun 07 '18 at 21:46

Perhaps something like this, using prop.table(), then using ungroup() and group_by to go up levels of aggregation?

library(tidyverse)

my_df %>% 
  group_by(folder, z_stack_id) %>% 
  mutate(prop_binary_layer = n_blobs/sum(n_blobs)) %>% 
  ungroup %>% 
  group_by(folder) %>% 
  mutate(prop_z_stack_id = n_blobs/sum(n_blobs))


#> # A tibble: 13 x 6
#> # Groups: folder [1]
#>    folder              z_stack_id binary_layer n_blobs prop_bina~ prop_z_~
#>    <chr>                    <dbl> <chr>          <int>      <dbl>    <dbl>
#>  1 20180601_122650_896       1.00 Cy5              959    0.538   0.429   
#>  2 20180601_122650_896       1.00 FITC              16    0.00898 0.00715 
#>  3 20180601_122650_896       1.00 TRITC            499    0.280   0.223   
#>  4 20180601_122650_896       2.00 Cy5              225    0.495   0.101   
#>  5 20180601_122650_896       2.00 FITC             157    0.345   0.0702  
#>  6 20180601_122650_896       2.00 TRITC             19    0.0418  0.00849 
#>  7 20180601_122650_896       1.00 FITC_Cy5           5    0.00281 0.00224 
#>  8 20180601_122650_896       1.00 FITC_TRITC         2    0.00112 0.000894
#>  9 20180601_122650_896       1.00 TRITC_Cy5        301    0.169   0.135   
#> 10 20180601_122650_896       2.00 FITC_Cy5          34    0.0747  0.0152  
#> 11 20180601_122650_896       2.00 FITC_TRITC         8    0.0176  0.00358 
#> 12 20180601_122650_896       2.00 Triple             4    0.00879 0.00179 
#> 13 20180601_122650_896       2.00 TRITC_Cy5          8    0.0176  0.00358

This answer is not producing the expected results. I never need to `sum` on the denominator. Example of expected result: for z_stack_id == 1 >> TRITC_Cy5/Cy5 = 301/959 = 0.313 ; FITC_Cy5/Cy5 = 5/959 = 0.005 ; Triple/Cy5 = 0/959 = 0 — Matias Andina, Jun 07 '18 at 21:56
Am I understanding correctly, and FITC_Cy5 is a subgroup of Cy5? — TTNK, Jun 07 '18 at 22:04
FITC_Cy5 is the intersection of FITC and Cy5 channels calculated elsewhere. I care about the proportion of FITC_Cy5 out of the total Cy5 or FITC_Cy5 out of the total FITC. With the numbers in the given data frame it's possible to make the venn diagram for the 3 sets (TRITC, FITC, and Cy5) — Matias Andina, Jun 07 '18 at 22:08

Matias Andina · Accepted Answer · 2018-06-12T17:09:19.530

This is how I ended up doing it. It's a combination of splitting the data and within each folder level doing some filtering, and a little name change to be able to rejoin later. Once each z_stack_id has the proper value for each channel (FITC_blobs, TRITC_blobs, Cy5_blobs) we can bind_rows and do the proportions. This method still gives spurious proportions, but they can be filtered out somewhat easily.

I had to do some column renaming because my real data had different columns than the ones posted in the simplified question. I condensed it into a function.

calculate_blob_proportions <- function(dataframe){




  dataframe <- dataframe %>% ungroup()

# prepare a list

      li <- list()


  for (i in unique(dataframe$folder)){
    # Get each folder
    my_df <- dataframe %>% filter(folder == i) %>%
      mutate(filename_cells = ifelse(is.na(filename_cells),
                                     filename_coloc,
                                     filename_cells)) %>%
      rename(filename = filename_cells) %>%
      select(-filename_coloc)

    Cy5 <- filter(my_df, binary_layer=="Cy5") %>%
      rename(Cy5_blobs = n_blobs) %>%
      select(-binary_layer, -filename) %>% 
      left_join(my_df)

    TRITC <- filter(my_df, binary_layer=="TRITC") %>%
      rename(TRITC_blobs = n_blobs) %>%
      select(-binary_layer, -filename) %>% 
      left_join(my_df)

    FITC <- filter(my_df, binary_layer=="FITC") %>%
      rename(FITC_blobs = n_blobs) %>%
      select(-binary_layer, -filename) %>% 
      left_join(my_df)


    li[[i]] <- left_join(Cy5,left_join(TRITC,FITC)) %>%
      select(RatID, folder, filename, z_stack_id,
             binary_layer, n_blobs,
             FITC_blobs, TRITC_blobs, Cy5_blobs)

  }


  df_out <- bind_rows(li) %>% 
            mutate(FITC_prop = n_blobs/FITC_blobs,
                   TRITC_prop = n_blobs/TRITC_blobs,
                   Cy5_prop = n_blobs/Cy5_blobs)

  return(df_out)

}

Calculate multi-group row-wise proportion in R

UPDATE

2 Answers2