2

I am looping through different data.tables and the variables in the data.table. But I'm having trouble referencing the variables inside of the for loop

dt1 <- data.table(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.table(a1 = c(1,43,1), a2 = c(52,4,1))

For each datatable, I want to find the average of each variable for observations where that variable != 1. Below is my attempt which doesn't work:

dtname = 'dt'
ind  = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    df1 <- df %>%
      filter(varname!=1) %>%
      summarise(varname = mean(varname))
    print(df1)
    }
   }

The desired output is to take and print the average of a1 = c(2,3) in dt1, the average of a2 = (4,5,2) in dt1, the average of a1 = c(43) in dt2, the average of a2 = c(54,4) in dt2.

What am I doing wrong here? In general, how should I reference a variable inside of a for loop (varname) that is pieced together by using the looping index (v) and something else?

Amazonian
  • 307
  • 6
  • 19
  • 2
    Why are you mixing `data.table` syntax with `dplyr`/verse syntax? If you want to work with data.tables, you should really learn about how to do this. Otherwise, it's pointless to have them around. Something like `df1 – lmo Mar 16 '18 at 01:16
  • well done for posting an example data. It is not really clear to me what you are looking to achieve. It sounds like something that can easily be done using a vectorised R-approach. A desired output would help – tjebo Mar 16 '18 at 01:17
  • Tjebo I just added the desired output – Amazonian Mar 16 '18 at 01:21
  • @Imo I tried df1 – Amazonian Mar 16 '18 at 01:23
  • I have a partial answer, use `eval(as.name(varname))` however this does not work with `summarise` for the first instance of varname. this is the code for the relevant part: `df1 % filter(eval(as.name(varname))!=1) %>% summarise_(varname = mean(eval(as.name(varname)))) print(df1)` – Amar Mar 16 '18 at 01:37
  • 1
    @lmo The plan is for dplyr syntax to work with data.tables, fwiw: https://github.com/hadley/dtplyr Anyway, a somewhat data.tabley way: `rbindlist(list(dt1, dt2), id=TRUE)[, dcast(.SD, .id ~ ., fun.agg = function(x) mean(x[x!=1]), value.var = names(dt1))]` – Frank Mar 16 '18 at 01:38
  • @Amar when I tried your method I got the following error: Error in filter_impl(.data, dots) : object 'a1' not found – Amazonian Mar 16 '18 at 01:45

4 Answers4

3

For a purely data.table way, I would combine the different data.tables and compute the averages:

# Concatenate the data.tables: 
all_dt <- rbind("dt1" = dt1, "dt2" = dt2, idcol = "origin")
all_dt
#    origin a1 a2
# 1:    dt1  1  4
# 2:    dt1  2  5
# 3:    dt1  3  2
# 4:    dt2  1 52
# 5:    dt2 43  4
# 6:    dt2  1  1

# Melt so that "a1" and "a2" are labels in a group column:
all_dt <- melt(all_dt, id.vars="origin")
all_dt
#     origin variable value
#  1:    dt1       a1     1
#  2:    dt1       a1     2
#  3:    dt1       a1     3
#  4:    dt2       a1     1
#  5:    dt2       a1    43
#  6:    dt2       a1     1
#  7:    dt1       a2     4
#  8:    dt1       a2     5
#  9:    dt1       a2     2
# 10:    dt2       a2    52
# 11:    dt2       a2     4
# 12:    dt2       a2     1

# Compute averages by each data.table and column group, ignoring 1s:
all_dt[value != 1, .(mean = mean(value)), by = .(origin, variable)]
#    origin variable      mean
# 1:    dt1       a1  2.500000
# 2:    dt2       a1 43.000000
# 3:    dt1       a2  3.666667
# 4:    dt2       a2 28.000000
Scott Ritchie
  • 9,228
  • 2
  • 24
  • 61
  • 1
    One downside is that when all rows for a var are 1, you'll have it omitted instead of NA. – Frank Mar 16 '18 at 01:44
1

I figured out a solution based on the comments of @Amar and @Scott Richie

for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    df1 <- df[eval(as.name(varname))!=1, .(mean = 
                                 mean(eval(as.name(varname))))]
    print(df1)

   }
 }

Thanks EVERYONE!

Amazonian
  • 307
  • 6
  • 19
0

Would go for a vectorised approach. You are using R!

One possible way:

require(dplyr)

dt1[dt1==1] <- NA #replace 1 with NA

dt1 %>% summarise_all(mean, na.rm = TRUE) #mean of all columns. 

   a1       a2
1 2.5 3.666667
tjebo
  • 12,885
  • 4
  • 34
  • 61
  • The actual problem I'm facing is more complicated than this one, where I do need to loop through data.tables and variables. I need to first filter out the observations where one variable ==0, and then I need to do some aggregation wrt another variable for the remaining observations. – Amazonian Mar 16 '18 at 01:40
  • I guess this might all be solved with vectorisation. The filter step easily done with dplyr::filter. Aggregation probably with summarise. But I probably do not entirely understand your problem – tjebo Mar 16 '18 at 01:42
0

It is not very clear what you are trying to do, but if you want to replace all of the rows in the dataframe with the mean of the previous data frame's columns, I would suggest using a dataframe type instead as it is easier to index. Here is code that should work:

dt1 <- data.frame(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.frame(a1 = c(1,43,1), a2 = c(52,4,1))

dtname = 'dt'
ind  = c('1', '2')
for (d in ind){
  df <- get(paste0('dt', d, sep=''))
  for (i in 1:nrow(df)){
    for (j in 1:ncol(df)){
      if (df[i,j] !=1){
        df[,j]<- mean(df[,j])
      }
     }
    print(df)
  }
}

The reason your code was not working before was because the variables were being treated like strings, not actual variables. You can see this by printing the data type of variances:

dtname = 'dt'
ind  = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    print(class(varname))
  }
}

Which just returns "character"

Another solution using variable names and the dataframe type would be to index the df as follows:

df[["varname"]]


Here are two helpful links for this kind of operation:
* link 1: How to find the mean of a column
* link 2: Data frames

adono
  • 18
  • 6