Best way to store many training datasets in R

Question

I want to randomly split my dataset in R 100 times and see which training and testing sets give the best model result. How should I store these datasets so I can compare the prediction results? Should I create a separate variable for each training and testing set, or store them in an array? I'm pretty new to R, so I don't know the best way to do this. I'm using RStudio 1.1.423.

This is how I randomize the data, using the holdout function from the rminer package:

library(rminer)
H <- holdout(myData$salary, ratio = 2/3, mode = "random")
trainData <- myData[H$tr,]
testData <- myData[H$ts,]

trainData and testData are the variables I created to store the training and testing data; myData is my dataset.

Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), perhaps brushing up on https://stackoverflow.com/help/how-to-ask and https://stackoverflow.com/help/mcve. Then come back and edit your question. — r2evans, Apr 14 '18 at 05:17
If you read https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207, you'll see how to do *something* to a list of frames. This process can be adapted to *make* a list of frames with something like `replicate(100, dat, simplify=FALSE)` (and then *do* something with it using that same link). — r2evans, Apr 14 '18 at 05:44
@r2evans thanks! that's giving me new insight since I'm new here. — Andira Gita, Apr 14 '18 at 06:14

score 0 · Accepted Answer · answered Apr 14 '18 at 16:43

Whenever I deal with multiple frames of the same structure, I tend to put them into a list and do "one thing" to everything in that list. A good reference for this can be found here: [How do I make a list of data frames?](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207)

In this example, there are a couple of ways to proceed. I don't have your data, so I'll use mtcars:

dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat)) # whole number of training rows (21 of 32)
n <- 3 # 100 for you?

Reproducibility is important, but hard-coding set.seed can be problematic (academically, at least), so here's a randomly-generated seed that we track/store:

(seed <- sample(.Machine$integer.max, size=1L))
# [1] 558990070

I like to store the indices for easy recall later.

set.seed(seed)
inds <- replicate(n, sample(nrow(dat), size=ntrain), simplify=FALSE)
str(inds)
# List of 3
#  $ : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
#  $ : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
#  $ : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
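As a quick check of why tracking the seed is useful, here is a minimal sketch showing that the same seed regenerates the same index lists. The helper name `make_inds` and the seed value `42L` are assumptions made up for this illustration; in practice you would reuse whatever seed you stored.

```r
# Sketch: re-running with a stored seed reproduces the same splits.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))  # 21 of 32 rows
n <- 3

seed <- 42L  # stand-in for the stored seed (assumption for this demo)

# Hypothetical helper: reset the RNG, then draw n sets of training row indices.
make_inds <- function(seed) {
  set.seed(seed)
  replicate(n, sample(nrow(dat), size = ntrain), simplify = FALSE)
}

inds_a <- make_inds(seed)
inds_b <- make_inds(seed)

# TRUE: both runs produced exactly the same index lists
identical(inds_a, inds_b)
```

So saving just the seed (plus `n` and `ntrain`) is enough to rebuild every training and test set later, without keeping 100 copies of the data on disk.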

Now these can be used easily to generate your training and test sets:

trains <- lapply(inds, function(i) dat[i,,drop=FALSE])
tests <- lapply(inds, function(i) dat[-i,,drop=FALSE])
str(tests)
# List of 3
#  $ :'data.frame': 11 obs. of  3 variables:
#   ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
#   ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
#   ..$ disp: num [1:11] 225 360 147 141 168 ...
#  $ :'data.frame': 11 obs. of  3 variables:
#   ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
#   ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
#   ..$ disp: num [1:11] 160 360 147 276 276 ...
#  $ :'data.frame': 11 obs. of  3 variables:
#   ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
#   ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
#   ..$ disp: num [1:11] 160 360 225 141 168 ...
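If you want to convince yourself that each train/test pair really is a clean split, a small sanity check along these lines may help. It is only a sketch: it rebuilds the lists with an arbitrary seed so that it runs on its own, and it relies on `mtcars` having unique row names.

```r
# Sketch: verify that each train/test pair partitions the original rows.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
set.seed(1)  # arbitrary seed, only for this illustration
inds <- replicate(3, sample(nrow(dat), size = ntrain), simplify = FALSE)

trains <- lapply(inds, function(i) dat[i, , drop = FALSE])
tests  <- lapply(inds, function(i) dat[-i, , drop = FALSE])

# For each split: no row appears in both sets, and together they cover dat.
ok <- mapply(function(tr, te) {
  length(intersect(rownames(tr), rownames(te))) == 0 &&
    setequal(c(rownames(tr), rownames(te)), rownames(dat))
}, trains, tests)

ok  # one logical value per split
```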

Alternatively, you can generate both train/test in each element, though I don't know if this adds much value:

both <- lapply(inds, function(i) list(ind=i, train=dat[i,,drop=FALSE], test=dat[-i,,drop=FALSE]))
str(both)
# List of 3
#  $ :List of 3
#   ..$ ind  : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
#   ..$ train:'data.frame': 21 obs. of  3 variables:
#   .. ..$ mpg : num [1:21] 15.5 21.4 10.4 10.4 19.7 33.9 21.5 22.8 15.2 21 ...
#   .. ..$ cyl : num [1:21] 8 4 8 8 6 4 4 4 8 6 ...
#   .. ..$ disp: num [1:21] 318 121 472 460 145 ...
#   ..$ test :'data.frame': 11 obs. of  3 variables:
#   .. ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
#   .. ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
#   .. ..$ disp: num [1:11] 225 360 147 141 168 ...
#  $ :List of 3
#   ..$ ind  : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
#   ..$ train:'data.frame': 21 obs. of  3 variables:
#   .. ..$ mpg : num [1:21] 18.1 17.8 14.7 13.3 15.5 22.8 10.4 21.4 19.2 21.5 ...
#   .. ..$ cyl : num [1:21] 6 6 8 8 8 4 8 6 6 4 ...
#   .. ..$ disp: num [1:21] 225 168 440 350 318 ...
#   ..$ test :'data.frame': 11 obs. of  3 variables:
#   .. ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
#   .. ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
#   .. ..$ disp: num [1:11] 160 360 147 276 276 ...
#  $ :List of 3
#   ..$ ind  : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
#   ..$ train:'data.frame': 21 obs. of  3 variables:
#   .. ..$ mpg : num [1:21] 15.2 27.3 21.4 21.5 15.2 19.2 33.9 14.7 21.4 30.4 ...
#   .. ..$ cyl : num [1:21] 8 4 6 4 8 6 4 8 4 4 ...
#   .. ..$ disp: num [1:21] 304 79 258 120 276 ...
#   ..$ test :'data.frame': 11 obs. of  3 variables:
#   .. ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
#   .. ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
#   .. ..$ disp: num [1:11] 160 360 225 141 168 ...
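Since nested lists can be awkward when you are new to R, here is a small sketch of how you might label the elements of such a list and pull pieces back out. The `split1`, `split2`, ... names are an assumption for illustration, and the list is rebuilt here with an arbitrary seed so the snippet runs on its own.

```r
# Sketch: name the splits, then extract pieces with [[ ]] and $.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
set.seed(1)  # arbitrary seed, only for this illustration
inds <- replicate(3, sample(nrow(dat), size = ntrain), simplify = FALSE)

both <- lapply(inds, function(i) {
  list(ind = i,
       train = dat[i, , drop = FALSE],
       test  = dat[-i, , drop = FALSE])
})
names(both) <- paste0("split", seq_along(both))  # optional labels

# The test set of the second split
second_test <- both[["split2"]]$test

# Number of training rows in every split, as a named vector
sizes <- sapply(both, function(b) nrow(b$train))
sizes
```

Keeping everything for one split in a single element means one `lapply()` over `both` has access to the indices, the training set, and the test set at the same time.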

From here, it's just a matter of running your model against the data:

library(randomForest)
results <- lapply(trains, function(x) randomForest(mpg~., data=x, ...))

(where ... are your other model parameters). Then something like:

validation <- mapply(function(result, test) predict(result, newdata=test, ...),
                     results, tests, SIMPLIFY=FALSE)

(You can certainly do more than just predict, perhaps checking yhat or similar.)
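To compare the splits, one option is to reduce each set of predictions to a single error number and collect those in a data frame. The sketch below is only an illustration under some assumptions: it uses `lm()` as a stand-in model so that it runs with base R alone, it uses RMSE on the test set as the score, and the seed and `n <- 5` are arbitrary. Swap in your own model and metric.

```r
# Sketch: score every split and find the one with the lowest test error.
dat <- mtcars[1:3]
ntrain <- floor((2/3) * nrow(dat))
n <- 5
set.seed(2018)  # arbitrary seed, only for this illustration
inds <- replicate(n, sample(nrow(dat), size = ntrain), simplify = FALSE)

trains <- lapply(inds, function(i) dat[i, , drop = FALSE])
tests  <- lapply(inds, function(i) dat[-i, , drop = FALSE])

# lm() is a stand-in for your real model (assumption)
fits  <- lapply(trains, function(x) lm(mpg ~ ., data = x))
preds <- mapply(function(fit, test) predict(fit, newdata = test),
                fits, tests, SIMPLIFY = FALSE)

# Root mean squared error of each model on its own test set
rmse <- mapply(function(pred, test) sqrt(mean((test$mpg - pred)^2)),
               preds, tests)

# One row per split, sorted from best to worst
scores <- data.frame(split = seq_len(n), rmse = rmse)
scores[order(scores$rmse), ]

best <- which.min(rmse)     # position of the lowest-error split
best_inds <- inds[[best]]   # row indices of its training set
```

Because the indices are stored, `inds[[best]]` is all you need to recreate the best-scoring training and test sets afterwards.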
