
I am finding it very difficult to understand how to use lapply on imputed datasets in R.

Here is the code for an example dataset (with 6 variables: "Ozone", "Solar.R", "Wind", "Temp", "Month", "Day"):

library(mice)

data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA

tempData <- mice(data, m = 5, maxit = 50, meth = 'pmm', seed = 500)

After that, let's run a linear regression on the imputed datasets.

> reg1 <- with(tempData, lm(Ozone ~ Wind))
> reg1
call :
with.mids(data = tempData, expr = lm(Ozone ~ Wind))

call1 :
mice(data = data, m = 5, method = "pmm", maxit = 50, seed = 500)

nmis :
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       7       5       0       0 

analyses :
[[1]]

Call:
lm(formula = Ozone ~ Wind)

Coefficients:
(Intercept)         Wind  
     92.401       -5.067  
[...] 

So far, so good.

Now I am wondering whether I can use lapply to run the regression for several dependent variables and store the results in a list. My failed attempt is below.

> variables_subset<-c("Ozone","Solar.R", "Temp")
> models<-lapply(tempData[,variables_subset],
+                function(x) (with(tempData, lm(x ~  Wind))))
Error in tempData[, variables_subset] : incorrect number of dimensions

Is there a way to make this code work?

Saja01

1 Answer

I am not sure whether this is exactly what you were trying to do, but here are a few suggestions:

  • tempData is not a data frame but a mids object, so it cannot be subset with [, columns]. tempData$data is a data frame, but it holds the original data with the missing values, not the imputed values.
  • I use reformulate to build the formula that is passed to lm.
  • Instead of looping over the column values in lapply, I loop over the column names, which also makes it easy to construct the formula.
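
The first point can be checked directly. This is a minimal sketch, assuming the mice package is installed; maxit is reduced to 5 and printFlag is switched off only to keep the example quick and quiet, which is not what the question used.

```r
library(mice)

# Rebuild the example object from the question (shorter run, assumption: maxit = 5)
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
tempData <- mice(data, m = 5, maxit = 5, meth = "pmm", seed = 500,
                 printFlag = FALSE)

class(tempData)               # "mids"
is.data.frame(tempData)       # FALSE, so tempData[, cols] fails
is.data.frame(tempData$data)  # TRUE
anyNA(tempData$data)          # TRUE: this is the original data, NAs included
names(tempData$imp)           # one entry per variable, holding the imputed values
```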

So try:

variables_subset <- c("Ozone", "Solar.R", "Temp")
lapply(variables_subset, function(x)
  lm(reformulate("Wind", x), data = tempData$data))

#[[1]]

#Call:
#lm(formula = reformulate("Wind", x), data = tempData$data)

#Coefficients:
#(Intercept)         Wind  
#     99.166       -5.782  


#[[2]]

#Call:
#lm(formula = reformulate("Wind", x), data = tempData$data)

#Coefficients:
#(Intercept)         Wind  
#   189.5896      -0.3649  


#[[3]]

#Call:
#lm(formula = reformulate("Wind", x), data = tempData$data)

#Coefficients:
#(Intercept)         Wind  
#     89.982       -1.142  

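To get a nested list of models fitted to the imputed datasets, you can try the code below. Note that inc = TRUE also keeps the original, un-imputed data as .imp = 0, so the outer list has six elements ("0" to "5"); use inc = FALSE to keep only the five imputed datasets.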

dat <- mice::complete(tempData, "long", inc = TRUE)

model_list <- lapply(split(dat, dat$.imp), function(x) {
  lapply(variables_subset, function(y)
    lm(reformulate("Wind", y), data = x))
})
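
The comments ask for the opposite nesting: one list per dependent variable, each holding one fit per imputed dataset. Here is a minimal sketch of that layout, assuming the mice package is installed. The names imputed, models and pooled are illustrative, and maxit is reduced to 5 only to keep the example quick. The last step uses as.mira() and pool() to combine the fits with Rubin's rules, as mentioned in the comments; treat it as one possible approach rather than the only one.

```r
library(mice)

# Rebuild the example object from the question (shorter run, assumption: maxit = 5)
data <- airquality
data[4:10, 3] <- NA
data[1:5, 4] <- NA
tempData <- mice(data, m = 5, maxit = 5, meth = "pmm", seed = 500,
                 printFlag = FALSE)

variables_subset <- c("Ozone", "Solar.R", "Temp")

# List of the five completed data frames (the original data is not included)
imputed <- mice::complete(tempData, "all")

# Outer list: dependent variable; inner list: one lm fit per imputed dataset
models <- lapply(setNames(variables_subset, variables_subset), function(y) {
  lapply(imputed, function(d) lm(reformulate("Wind", response = y), data = d))
})

models$Ozone[[1]]  # fit for Ozone on the first imputed dataset

# Pool the five fits for each dependent variable with Rubin's rules
pooled <- lapply(models, function(fits) pool(as.mira(fits)))
summary(pooled$Ozone)
```

Because the outer list is named, a single result can be picked with, for example, models$Temp[[3]] or pooled$Solar.R.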
Ronak Shah
  • Thanks for pointing out what the problem was. But is there a way to get something like a list within a list? What I want is a list of results for each dependent variable (Ozone, Solar, Temp). Each of these lists should then contain another list (with one entry for each of the five imputed datasets). Sorry for not having been clear enough in my previous post. – Saja01 Jul 23 '20 at 10:45
  • @Saja01 Where are those 5 imputed datasets in `tempData` ? – Ronak Shah Jul 23 '20 at 11:34
  • tempData contains 5 imputed datasets based on a dataframe with NAs (because I set m = 5 in the mice function). The results for the complete dataset can later be combined with Rubin's rules for each dependent variable. – Saja01 Jul 24 '20 at 12:19
  • `tempData$data` has only one dataframe. – Ronak Shah Jul 24 '20 at 12:26
  • I think these are in tempData$imp, not tempData$data. Data is the original dataset with no imputation. – Saja01 Jul 24 '20 at 12:36
  • @Saja01 Ok...I think you can get them using `complete` function. Check updated answer. – Ronak Shah Jul 24 '20 at 13:42