Whenever I deal with multiple frames of the same structure, I tend to put them into a list and do "one thing" to everything in that list. For background on this approach, see "How do I make a list of data frames?"
In this example, there are a couple of ways to proceed. I don't have your data, so I'll use mtcars:
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat)) # whole number of training rows (21 here)
n <- 3 # 100 for you?
Reproducibility is important, but hard-coding set.seed can be problematic (academically, at least), so here's a randomly-generated seed that we track/store:
seed <- sample(.Machine$integer.max, size=1L)
seed
# [1] 558990070
I like to store the indices for easy recall later.
set.seed(seed)
inds <- replicate(n, sample(nrow(dat), size=ntrain), simplify=FALSE)
str(inds)
# List of 3
# $ : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
# $ : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
# $ : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
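The point of storing the seed is that the same draws can be regenerated later. A minimal sketch of that check, assuming the seed value printed above; `draw_inds` is a hypothetical helper introduced only for this illustration:

```r
# Setup repeated so this block runs on its own.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
n <- 3
seed <- 558990070L  # the stored seed from above

# Hypothetical helper: reset the RNG, then draw n sets of training indices.
draw_inds <- function(seed) {
  set.seed(seed)
  replicate(n, sample(nrow(dat), size = ntrain), simplify = FALSE)
}

inds_a <- draw_inds(seed)
inds_b <- draw_inds(seed)
identical(inds_a, inds_b)
# [1] TRUE
```

The indices themselves can differ between R versions (the default sampling algorithm changed in R 3.6.0), so it is worth recording the R version alongside the seed.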
Now these can be used easily to generate your training and test sets:
trains <- lapply(inds, function(i) dat[i,,drop=FALSE])
tests <- lapply(inds, function(i) dat[-i,,drop=FALSE])
str(tests)
# List of 3
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
# ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
# ..$ disp: num [1:11] 225 360 147 141 168 ...
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
# ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
# ..$ disp: num [1:11] 160 360 147 276 276 ...
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
# ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
# ..$ disp: num [1:11] 160 360 225 141 168 ...
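A quick sanity check that each train/test pair really partitions the data can be sketched as follows. This assumes row names are unique (true for mtcars) and uses an arbitrary seed that applies to this sketch only:

```r
# Setup repeated so this block runs on its own.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
set.seed(42)  # arbitrary, for this sketch only
inds <- replicate(3, sample(nrow(dat), size = ntrain), simplify = FALSE)
trains <- lapply(inds, function(i) dat[i, , drop = FALSE])
tests  <- lapply(inds, function(i) dat[-i, , drop = FALSE])

# For each pair: sizes add up to the full data, and no row is in both sets.
ok <- mapply(function(tr, te) {
  nrow(tr) + nrow(te) == nrow(dat) &&
    length(intersect(rownames(tr), rownames(te))) == 0L
}, trains, tests)
all(ok)
# [1] TRUE
```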
Alternatively, you can generate both train/test in each element, though I don't know if this adds much value:
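One way to build such a list, as a sketch that matches the element names (ind, train, test) in the str output that follows; the exact construction is my assumption:

```r
# Setup repeated so this block runs on its own.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
set.seed(558990070L)  # the stored seed from above
inds <- replicate(3, sample(nrow(dat), size = ntrain), simplify = FALSE)

# Each element keeps the indices together with the train and test frames.
both <- lapply(inds, function(i) {
  list(ind   = i,
       train = dat[i, , drop = FALSE],
       test  = dat[-i, , drop = FALSE])
})
```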
str(both)
# List of 3
# $ :List of 3
# ..$ ind : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 15.5 21.4 10.4 10.4 19.7 33.9 21.5 22.8 15.2 21 ...
# .. ..$ cyl : num [1:21] 8 4 8 8 6 4 4 4 8 6 ...
# .. ..$ disp: num [1:21] 318 121 472 460 145 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
# .. ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
# .. ..$ disp: num [1:11] 225 360 147 141 168 ...
# $ :List of 3
# ..$ ind : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 18.1 17.8 14.7 13.3 15.5 22.8 10.4 21.4 19.2 21.5 ...
# .. ..$ cyl : num [1:21] 6 6 8 8 8 4 8 6 6 4 ...
# .. ..$ disp: num [1:21] 225 168 440 350 318 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
# .. ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
# .. ..$ disp: num [1:11] 160 360 147 276 276 ...
# $ :List of 3
# ..$ ind : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 15.2 27.3 21.4 21.5 15.2 19.2 33.9 14.7 21.4 30.4 ...
# .. ..$ cyl : num [1:21] 8 4 6 4 8 6 4 8 4 4 ...
# .. ..$ disp: num [1:21] 304 79 258 120 276 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
# .. ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
# .. ..$ disp: num [1:11] 160 360 225 141 168 ...
From here, it's just a matter of running your model against the data:
library(randomForest)
results <- lapply(trains, function(x) randomForest(mpg~., data=x, ...))
(where ... are your other model parameters). Then something like:
# predict() takes the new data via newdata=; data= would be silently ignored
validation <- mapply(function(result, test) predict(result, newdata=test, ...),
                     results, tests, SIMPLIFY=FALSE)
(You can certainly do more than just predict, perhaps checking yhat or similar.)
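As one example of "more than just predict", here is a sketch that computes the test-set RMSE for each replicate. To keep it runnable without extra packages, lm() stands in for randomForest(); that swap, the arbitrary seed, and the choice of RMSE as the metric are all my assumptions, not part of your setup:

```r
# Setup repeated so this block runs on its own.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
set.seed(42)  # arbitrary, for this sketch only
inds <- replicate(3, sample(nrow(dat), size = ntrain), simplify = FALSE)
trains <- lapply(inds, function(i) dat[i, , drop = FALSE])
tests  <- lapply(inds, function(i) dat[-i, , drop = FALSE])

# Stand-in model: lm() instead of randomForest().
results <- lapply(trains, function(x) lm(mpg ~ ., data = x))

# Predict each held-out set with the model fitted on its matching training set.
validation <- mapply(function(result, test) predict(result, newdata = test),
                     results, tests, SIMPLIFY = FALSE)

# Root mean squared error per replicate (one number per train/test split).
rmse <- mapply(function(yhat, test) sqrt(mean((test$mpg - yhat)^2)),
               validation, tests)
rmse
```

With randomForest the same pattern should apply, since its predict method also takes newdata.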