R - Frequencies of a categorical variable in multiple dataframes

Question

I have 18 dataframes, which they look like this:

I need to go through column 7 and count its frequency in the 17 remaining dataframes. I have tried different methods such as the count function from ddply but I am not sure whether they work with multiple dataframes.

My desired output is the frequency of each gene in all dataframes (each gene appears only once or less in a dataframe):

The output of dput:

> dput(head(df1, 20))
structure(list(V1 = 1:20, V2 = structure(c(16L, 14L, 13L, 32L, 
18L, 28L, 1L, 2L, 31L, 25L, 15L, 5L, 23L, 24L, 35L, 7L, 9L, 20L, 
29L, 10L), .Label = c("BMPR1B", "C9", "CACNB2", "CLEC12A", "DEFB126", 
"DYNC2H1", "EDDM3A", "FAM47C", "FZD8", "GPR1", "GSDMC", "H2AFY2", 
"HOMER1", "HSPB3", "HTR7", "IFNA1", "IFNA6", "IL1RL1", "INSL5", 
"LIMCH1", "LINC00478", "LRCH2", "MAGEH1", "MGP", "MMP8", "NFIB", 
"PDE1A", "PLSCR2", "PPP1R42", "PRKY", "PTENP1", "RGAG1", "SSBP2", 
"STEAP4", "SV2C", "TLR2", "TMSB15A", "TOX"), class = "factor"), 
    V3 = structure(c(16L, 14L, 13L, 32L, 18L, 28L, 1L, 2L, 31L, 
    25L, 15L, 5L, 23L, 24L, 35L, 7L, 9L, 20L, 29L, 10L), .Label = c("BMPR1B", 
    "C9", "CACNB2", "CLEC12A", "DEFB126", "DYNC2H1", "EDDM3A", 
    "FAM47C", "FZD8", "GPR1", "GSDMC", "H2AFY2", "HOMER1", "HSPB3", 
    "HTR7", "IFNA1", "IFNA6", "IL1RL1", "INSL5", "LIMCH1", "LINC00478", 
    "LRCH2", "MAGEH1", "MGP", "MMP8", "NFIB", "PDE1A", "PLSCR2", 
    "PPP1R42", "PRKY", "PTENP1", "RGAG1", "SSBP2", "STEAP4", 
    "SV2C", "TLR2", "TMSB15A", "TOX"), class = "factor"), V4 = structure(c(16L, 
    14L, 13L, 32L, 18L, 28L, 1L, 2L, 31L, 25L, 15L, 5L, 23L, 
    24L, 35L, 7L, 9L, 20L, 29L, 10L), .Label = c("BMPR1B", "C9", 
    "CACNB2", "CLEC12A", "DEFB126", "DYNC2H1", "EDDM3A", "FAM47C", 
    "FZD8", "GPR1", "GSDMC", "H2AFY2", "HOMER1", "HSPB3", "HTR7", 
    "IFNA1", "IFNA6", "IL1RL1", "INSL5", "LIMCH1", "LINC00478", 
    "LRCH2", "MAGEH1", "MGP", "MMP8", "NFIB", "PDE1A", "PLSCR2", 
    "PPP1R42", "PRKY", "PTENP1", "RGAG1", "SSBP2", "STEAP4", 
    "SV2C", "TLR2", "TMSB15A", "TOX"), class = "factor"), V5 = c(46L, 
    49L, 90L, 93L, 99L, 150L, 272L, 373L, 472L, 478L, 489L, 540L, 
    661L, 855L, 889L, 947L, 971L, 1002L, 1007L, 1080L), V6 = c(0.732, 
    0.717, 0.635, 0.633, 0.624, 0.545, 0.449, 0.374, 0.3, 0.295, 
    0.284, 0.244, 0.16, 0.0366, 0.0121, -0.0197, -0.0334, -0.0546, 
    -0.056, -0.0982), V7 = c(0.0345, 0.0903, 0.12, 0.169, 0.216, 
    0.233, 0.205, 0.183, 0.155, 0.176, 0.193, 0.186, 0.136, 0.0379, 
    0.0215, -0.00676, -0.0162, -0.0275, -0.0252, -0.0551), V8 = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = c("NO", "YES"), class = "factor")), .Names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), row.names = c(NA, 
20L), class = "data.frame")
> dput(head(df2, 20))
structure(list(V1 = 1:20, V2 = structure(c(5L, 16L, 9L, 11L, 
12L, 13L, 6L, 10L, 20L, 14L, 3L, 8L, 17L, 4L, 1L, 21L, 15L, 18L, 
7L, 22L), .Label = c("ACAT2", "ADRA2A", "AKR1B1", "BRCA2", "CCNE2", 
"CDC6", "CDO1", "CYP2C9", "EZH2", "GSG2", "HAUS8", "HIST2H2BF", 
"HMGN2", "KIF18A", "MEIS2", "ORC6", "OTC", "PECR", "REG1A", "SKA1", 
"SYCP3", "TOX"), class = "factor"), V3 = structure(c(5L, 16L, 
9L, 11L, 12L, 13L, 6L, 10L, 20L, 14L, 3L, 8L, 17L, 4L, 1L, 21L, 
15L, 18L, 7L, 22L), .Label = c("ACAT2", "ADRA2A", "AKR1B1", "BRCA2", 
"CCNE2", "CDC6", "CDO1", "CYP2C9", "EZH2", "GSG2", "HAUS8", "HIST2H2BF", 
"HMGN2", "KIF18A", "MEIS2", "ORC6", "OTC", "PECR", "REG1A", "SKA1", 
"SYCP3", "TOX"), class = "factor"), V4 = structure(c(5L, 16L, 
9L, 11L, 12L, 13L, 6L, 10L, 20L, 14L, 3L, 8L, 17L, 4L, 1L, 21L, 
15L, 18L, 7L, 22L), .Label = c("ACAT2", "ADRA2A", "AKR1B1", "BRCA2", 
"CCNE2", "CDC6", "CDO1", "CYP2C9", "EZH2", "GSG2", "HAUS8", "HIST2H2BF", 
"HMGN2", "KIF18A", "MEIS2", "ORC6", "OTC", "PECR", "REG1A", "SKA1", 
"SYCP3", "TOX"), class = "factor"), V5 = c(31L, 68L, 145L, 170L, 
204L, 222L, 235L, 279L, 355L, 556L, 646L, 726L, 789L, 807L, 954L, 
1396L, 1399L, 1639L, 1711L, 1776L), V6 = c(0.774, 0.681, 0.55, 
0.528, 0.5, 0.488, 0.478, 0.443, 0.387, 0.23, 0.17, 0.12, 0.0757, 
0.0619, -0.0229, -0.279, -0.281, -0.418, -0.452, -0.491), V7 = c(0.0747, 
0.135, 0.16, 0.209, 0.25, 0.298, 0.348, 0.377, 0.383, 0.306, 
0.28, 0.253, 0.229, 0.228, 0.155, -0.042, -0.0103, -0.0857, -0.07, 
-0.046), V8 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("NO", 
"YES"), class = "factor")), .Names = c("V1", "V2", "V3", "V4", 
"V5", "V6", "V7", "V8"), row.names = c(NA, 20L), class = "data.frame")

Post example datasets if you want help. Post the output of `dput(head(df1, 20))` and the same with `df2`. At least two datasets, since you mention 18 of them. Please read [How to ask a good question](https://stackoverflow.com/help/how-to-ask). — Rui Barradas, Oct 11 '17 at 16:46
What is your desired output? Something like `table(Genes1, Genes2)`? Or `table(Genes2[Genes2 %in% Genes1])`? Something else? — Gregor Thomas, Oct 11 '17 at 17:30
*Can you please run the two commands in my previous comment and tell me if either of them are right?* And do you want a count *for each* of the seventeen remaining data frames, or do you want a single cumulative count including all of the remaining data frames? That is, is `table(c(Genes1, Genes2))` what you want?? Or maybe `table(Genes1)` and `table(Genes2)` separately? — Gregor Thomas, Oct 11 '17 at 17:36
Okay, in that example in your last comment, is `CCNE2` 31 because if we look in all the rows in all the data frames, it occurs 31 times? — Gregor Thomas, Oct 11 '17 at 17:41
Error in table(Genes1, Genes2) : all arguments must have the same length. I want a single cumulative count including all of the remaining data frames. — user91, Oct 11 '17 at 17:44
@Gregor CCNE2 occurs in the example 1 time only in the first dataframe. I was just trying to give an example. It shouldn't appear more than 18 times though. So each one of these genes occurs only once here. All 17 dataframes vary in raws number. — user91, Oct 11 '17 at 17:53
Yes, it's clear now. In you description and code it looked like you might be trying to get a separate count for each data frame, or perhaps looking for pair-wise counts between data frames. — Gregor Thomas, Oct 11 '17 at 17:55
I aplogise for the misleading question. I appreciate your help. — user91, Oct 11 '17 at 18:02

score 3 · Accepted Answer · answered Oct 11 '17 at 17:53

Okay. First let's get your data frame in a nice, easy-to-use list:

data_list = list(df1, df2)

We want to extract the "V2" column from each data frame in the list make a table of the values. We'll need to convert them all from factor class to character class so they can be combined before the counting starts:

# extract column named V2
res = lapply(data_list, '[[', 'V2')
# convert to character
res = lapply(res, as.character)
# make a frequency table
res = table(unlist(res))
res
  #   ACAT2    AKR1B1    BMPR1B     BRCA2        C9     CCNE2      CDC6      CDO1    CYP2C9 
  #       1         1         1         1         1         1         1         1         1 
  # DEFB126    EDDM3A      EZH2      FZD8      GPR1      GSG2     HAUS8 HIST2H2BF     HMGN2 
  #       1         1         1         1         1         1         1         1         1 
  #  HOMER1     HSPB3      HTR7     IFNA1    IL1RL1    KIF18A    LIMCH1    MAGEH1     MEIS2 
  #       1         1         1         1         1         1         1         1         1 
  #     MGP      MMP8      ORC6       OTC      PECR    PLSCR2   PPP1R42    PTENP1     RGAG1 
  #       1         1         1         1         1         1         1         1         1 
  #    SKA1      SV2C     SYCP3       TOX 
  #       1         1         1         1

"But what's a good way to get all my data frames in a list?" That's answered very thoroughly in this answer.

Rui Barradas · Answer 2 · 2017-10-12T09:29:47.003

1

My approach is different from Gregor's. If I understand the problem correctly, try the following.

count_genes <- function(DF1, DF_list){
    genes <- lapply(DF_list, `[[`, 2)
    genes <- unlist(lapply(genes, as.character))
    res <- sapply(DF1[[2]], function(x) sum(genes %in% as.character(x)))
    names(res) <- DF1[[2]]
    res
}

df_list <- list(df2)

count_genes(df1, df_list)

edited Oct 12 '17 at 09:29

answered Oct 11 '17 at 18:29

Rui Barradas

44,483
8
22
48

Thanks Rui, I got this output. I applied it on 4 dataframes. I need to represent also which gene do they refer to. > df_list > count_genes(df1, df_list) [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 > count_genes(df2,df_list) [1] 2 2 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1 > count_genes(df3,df_list) [1] 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 > count_genes(df4,df_list) [1] 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 [49] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 – user91 Oct 12 '17 at 08:33
@MarJamil OK, I'll edit the answer. Same function, a bit changed. – Rui Barradas Oct 12 '17 at 09:12
'names' attribute [151] must be the same length as the vector [38]. I think because res is shorter than genes. I am sorry I am very beginner. Just started two days ago. – user91 Oct 12 '17 at 09:28
This is great! It works and the best part is that I am able to visualize the frequency of a variable in a single dataframe with the rest, separately. Thanks a million Rui. – user91 Oct 12 '17 at 09:55

R - Frequencies of a categorical variable in multiple dataframes

2 Answers2