Repeat a procedure with generated random data and save the results in data frames in R

Question

I create random data in R like that:

data<-matrix(rnorm(100*5,mean=0,sd=1), 100, 5) 
colnames(data) <- c("X1", "X2", "X3", "X4", "X5")
data <- as.data.frame(data)
a <- 5 
b <- 0.8
c <- 100

and then i want to "play" with the correlations of those data and do something like the following

data[,2] <- a*data[,1] - b*rnorm(c)
data[,3] <- a*data[,1] + b*rnorm(c)
data[,4] <- a*data[,1] - b*rnorm(c)

after that i perform the following code

data<-matrix(rnorm(100*5,mean=0,sd=1), 100, 5) 
colnames(data) <- c("X1", "X2", "X3", "X4", "X5")
data <- as.data.frame(data)
a <- 5 
b <- 0.8
c <- 100

data[,2] <- a*data[,1] - b*rnorm(c)
data[,3] <- a*data[,1] + b*rnorm(c)
data[,4] <- a*data[,1] - b*rnorm(c)

library(glmnet)
library(coefplot)

A <- as.matrix(data)
set.seed(1)
results <- lapply(seq_len(ncol(A)), function(i) {
  list(
    cvfit = cv.glmnet(A[, -i] , A[, i] , standardize = TRUE , type.measure = "mse" , nfolds = 10 , alpha = 1)
  )
})

lam <- as.data.frame(`names<-`(
  lapply(results, function(x) (x$cvfit$lambda.min)), 
  paste0("X", seq_along(results))
))
    
    sigma<- matrix(rnorm(1*5,mean=0,sd=1), 1, 5) 
colnames(sigma) <- c("X1", "X2", "X3", "X4", "X5")
as.vector(sigma)
sub1.sigma <- subset(sigma, select = sigma <= sum(lam))
sub2.sigma <- subset(sigma, select = sigma <= 2*sum(lam))
sub3.sigma <- subset(sigma, select = sigma <= 3*sum(lam))

which results in a vector 1x5 called sigma and 3 vectors sub1.sigma, sub2.sigma, sub3.sigma like the following

   > sigma
      X1      X2        X3 X4 X5        
     38.64019 624.4896  0  0  0 
> sub1.sigma
        X1  X3  X4  X5       
1 38.64019  0   0   0 
 
> sub2.sigma
        X1  X3 X4 X5          
1 38.64019  0  0  0 

> sub3.sigma
        X1  X3 X4 X5         
1 38.64019  0  0  0

The generated data are random and i usually use a set.seed() to produce the same results. I want, if it's possible without modify the main code, to run my code 100 times (with different data each time) and save in 4 dataframes the correspanding results sigma sub1.sigma, sub2.sigma, sub3.sigma in order to compare the them. Is there any way to achieve that in R?

Based on comments i manage to create the following but still doesn't seem to give the desired results. FIrst of all code[1:10] display 10 vectors which represent what? the sigma? are those the sigma of each run? how can i make it calculate the sub.sigma also?

set.seed(2021)

code <- replicate(10,{
data<-matrix(rnorm(100*5,mean=0,sd=1), 100, 5) 
colnames(data) <- c("X1", "X2", "X3", "X4", "X5")
data <- as.data.frame(data)
a <- 5 
b <- 0.8
c <- 100

data[,2] <- a*data[,1] - b*rnorm(c)
data[,3] <- a*data[,1] + b*rnorm(c)
data[,4] <- a*data[,1] - b*rnorm(c)

library(glmnet)
library(coefplot)

A <- as.matrix(data)
set.seed(1)
results <- lapply(seq_len(ncol(A)), function(i) {
  list(
    cvfit = cv.glmnet(A[, -i] , A[, i] , standardize = TRUE , type.measure = "mse" , nfolds = 10 , alpha = 1)
  )
})

lam <- as.data.frame(`names<-`(
  lapply(results, function(x) (x$cvfit$lambda.min)), 
  paste0("X", seq_along(results))
))

sigma<- matrix(rnorm(1*5,mean=0,sd=1), 1, 5) 
colnames(sigma) <- c("X1", "X2", "X3", "X4", "X5")
as.vector(sigma)
sub1.sigma <- subset(sigma, select = sigma <= sum(lam))
sub2.sigma <- subset(sigma, select = sigma <= 2*sum(lam))
sub3.sigma <- subset(sigma, select = sigma <= 3*sum(lam))
}, simplify = FALSE)

code[1:10]
sigmas <- as.data.frame(do.call(rbind,lapply(code, sigma)))

`replicate(100, ...)`? It's likely better to replicate and return the results into a list, which you can then later reduce into frames. See https://stackoverflow.com/a/24376207/3358227 — r2evans, Jan 06 '21 at 19:17
I tried ti apply your suggestion but still keep struggling. I edit it and added the main code, can you give me a suggestion on how to use the `replicate`? — nickolakis, Jan 06 '21 at 19:37

score 1 · Accepted Answer · answered Jan 06 '21 at 19:55

I'm a fan of keeping models around for various reasons, so I'll start with a list of models-run. In this case, replicate(n, ..., simplify=FALSE) returns a list of whatever we need it to. (For the record, that is equivalent to lapply(seq_len(n), function(ign) ...).)

(Side note: I don't have glmnet installed, so I'll mimic that with a simple/absurd lm model.)

set.seed(2021)
models <- replicate(10, {
  zany_numbers <- runif(32) # nrow(mtcars)
  lm(zany_numbers ~ mpg + disp + cyl, data = mtcars)
}, simplify = FALSE)
models[1:2]
# [[1]]
# Call:
# lm(formula = zany_numbers ~ mpg + disp + cyl, data = mtcars)
# Coefficients:
# (Intercept)          mpg         disp          cyl  
#    8.85e-02     2.01e-02    -5.29e-05     2.41e-02  
# [[2]]
# Call:
# lm(formula = zany_numbers ~ mpg + disp + cyl, data = mtcars)
# Coefficients:
# (Intercept)          mpg         disp          cyl  
#    -0.52302      0.02485     -0.00122      0.13071

From here, we can make any sort of frame you want.

coefs <- as.data.frame(do.call(rbind, lapply(models, coef)))
coefs
#    (Intercept)       mpg      disp     cyl
# 1       0.0885  0.020114 -5.29e-05  0.0241
# 2      -0.5230  0.024847 -1.22e-03  0.1307
# 3       0.0856  0.014215  1.18e-03 -0.0225
# 4       0.4899  0.012013  1.08e-03 -0.0876
# 5      -0.6926  0.024653 -1.16e-03  0.1499
# 6       0.0862  0.010497 -5.02e-04  0.0389
# 7       0.8358 -0.008419 -6.64e-04 -0.0141
# 8       0.3679 -0.000198  1.44e-03 -0.0391
# 9       0.4360  0.011994 -9.59e-05 -0.0303
# 10      0.2276  0.003659 -1.14e-03  0.0651

(You might need to clean up names there.)

You can replace do.call(rbind, ...) with data.table::rbindlist(...) or dplyr::bind_rows(...) if you prefer.

From this models, and using the same list-of-frames do.call(rbind, ...) follow-up, you can generate companion frames, such as

otherstats <- as.data.frame(do.call(rbind, lapply(models, function(mdl) summary(mdl)[c("r.squared", "adj.r.squared")])))
otherstats
#    r.squared adj.r.squared
# 1      0.104       0.00745
# 2      0.144        0.0523
# 3      0.044       -0.0584
# 4      0.202         0.117
# 5      0.149        0.0573
# 6     0.0639       -0.0364
# 7     0.0586       -0.0422
# 8      0.137        0.0446
# 9      0.241          0.16
# 10    0.0814        -0.017

i tried to incorporate your code to mine but still can;t figure out what i do wrong. FIrst of all code[1:10] display 10 vectors? are those the `sigma` of each run? Why `sub.sigma` are not calculate? — nickolakis, Jan 06 '21 at 20:35
I don't know, sorry, your code has confused me, and since I don't have `glmnet`, I can't easily adapt your code. — r2evans, Jan 06 '21 at 21:21

Repeat a procedure with generated random data and save the results in data frames in R

1 Answers1