Formula with dynamic number of variables

Question

Suppose there is a data.frame foo_data_frame and one wants to regress the target column Y on some of the other columns. Usually a formula and a model function are used for that. For example:

linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)

That works well if the formula is coded statically. If one wants to loop over several models with a fixed number of predictors (say, 2), it can be done like this:

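# Loop over all unordered pairs of columns i < j.
# The outer loop stops at factor_number - 1: otherwise seq(i + 1, factor_number)
# would count downwards on the last iteration.
for (i in seq_len(factor_number - 1)) {
  for (j in seq(i + 1, factor_number)) {
    linear_model <- lm(Y ~ F1 + F2, list(Y=foo_data_frame$Y,
                                         F1=foo_data_frame[[i]],
                                         F2=foo_data_frame[[j]]))
    # analyze linear_model further...
  }
}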

My question is how to achieve the same effect when the number of variables changes dynamically at run time?

for (number_of_factors in seq_len(5)) {
   # Then loop over the subsets of cardinality number_of_factors.
   for (factors_subset in all_subsets_with_fixed_cardinality) {
     # Here I want to fit a model with the factors from factors_subset.
     linear_model <- lm(Does R provide something to write here?)
   }
}

Thanks! your middle example made me realise I didn't need the solution to your question and could do something much simpler! — Mark Adamson, Jan 29 '16 at 09:34

score 108 · Accepted Answer · edited Jan 04 '17 at 22:01

See ?as.formula, e.g.:

factors <- c("factor1", "factor2")
as.formula(paste("y~", paste(factors, collapse="+")))
# y ~ factor1 + factor2

where factors is a character vector containing the names of the factors you want to use in the model. You can pass the resulting formula to lm, e.g.:

set.seed(0)
y <- rnorm(100)
factor1 <- rep(1:2, each=50)
factor2 <- rep(3:4, 50)
lm(as.formula(paste("y~", paste(factors, collapse="+"))))

# Call:
# lm(formula = as.formula(paste("y~", paste(factors, collapse = "+"))))

# Coefficients:
# (Intercept)      factor1      factor2  
#    0.542471    -0.002525    -0.147433
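
To connect this back to the loop in the question, the subsets themselves can be generated with combn. The following is only a sketch under assumptions: the data frame d, its columns x1 to x4, and the models list are hypothetical names made up for illustration.

```r
# Hypothetical toy data: response y plus four candidate predictors.
set.seed(1)
d <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50),
                x3 = rnorm(50), x4 = rnorm(50))
candidates <- setdiff(names(d), "y")

models <- list()
for (k in seq_along(candidates)) {
  # simplify = FALSE makes combn() return a list with one subset per element.
  for (subset_k in combn(candidates, k, simplify = FALSE)) {
    f <- as.formula(paste("y ~", paste(subset_k, collapse = " + ")))
    models[[paste(subset_k, collapse = "+")]] <- lm(f, data = d)
  }
}

length(models)  # 15 non-empty subsets of 4 predictors
```

Each fitted model is stored under a name such as "x1+x3", so it can be looked up later for comparison.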

score 68 · Answer 2 · edited Sep 22 '14 at 11:24


An oft forgotten function is reformulate. From ?reformulate:

reformulate creates a formula from a character vector.

A simple example:

listoffactors <- c("factor1","factor2")
reformulate(termlabels = listoffactors, response = 'y')

will yield this formula:

y ~ factor1 + factor2

Although not explicitly documented, you can also add interaction terms:

listofintfactors <- c("(factor3","factor4)^2")
reformulate(termlabels = c(listoffactors, listofintfactors), 
    response = 'y')

will yield:

y ~ factor1 + factor2 + (factor3 + factor4)^2
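
As a possible alternative spelling, the whole interaction expression can be passed as one term label instead of being split across two strings. This is a small sketch; factor3 and factor4 are just the placeholder names used above, and terms() is only used here to show how the formula expands.

```r
listoffactors <- c("factor1", "factor2")

# Assumption: the interaction expression is given as a single term label.
f <- reformulate(c(listoffactors, "(factor3 + factor4)^2"), response = "y")
f

# Expanded terms: main effects first, then the pairwise interaction.
attr(terms(f), "term.labels")
```

The result should be the same formula as before; the only difference is that each element of the character vector is a complete term on its own.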

edited Sep 22 '14 at 11:24 by landroni · answered Nov 14 '12 at 00:50 by mnel

@JorisMeys And it's so much nicer as it allows adding interaction terms! I've been looking for a similar solution for years.. – landroni Sep 22 '14 at 11:37
What if the x variables contain spaces? Say "factor 1" , "factor 2" etc.. – axiom Jan 22 '19 at 08:24
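
On names containing spaces: one way that seems to work is to wrap each name in backticks before building the formula. A minimal sketch, assuming a hypothetical data frame d created with check.names = FALSE so the spaces survive:

```r
# Hypothetical data with non-syntactic column names.
d <- data.frame(y = c(1.2, 0.7, 3.1, 2.4, 4.0, 3.3),
                `factor 1` = c(1, 2, 3, 4, 5, 6),
                `factor 2` = c(2, 1, 4, 3, 6, 7),
                check.names = FALSE)

vars <- c("factor 1", "factor 2")

# Quote every name with backticks so it parses as a single symbol.
f <- reformulate(sprintf("`%s`", vars), response = "y")
fit <- lm(f, data = d)

all.vars(f)
```

The same backtick quoting should also work with the paste/as.formula approach, since both end up parsing the text as R code.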

score 11 · Answer 3 · answered Feb 09 '11 at 23:25


Another option could be to use a matrix in the formula:

Y <- rnorm(10)
foo <- matrix(rnorm(100), 10, 10)
factors <- c(1, 5, 8)  # column indices of foo to use as predictors

lm(Y ~ foo[, factors])

answered Feb 09 '11 at 23:25 by Sacha Epskamp

+1, but be aware that this doesn't allow interaction effects. For that, one can construct a model matrix as well (see `?model.matrix`) – Joris Meys Feb 09 '11 at 23:39
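
The model.matrix route could look roughly like this. It is a sketch with made-up data: d, chosen and the column names x1 to x3 are hypothetical, and the right-hand side is built from strings only to keep the column choice dynamic.

```r
set.seed(1)
d <- data.frame(Y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
chosen <- c("x1", "x3")

# One-sided formula: main effects plus pairwise interactions of the chosen columns.
rhs <- reformulate(sprintf("(%s)^2", paste(chosen, collapse = " + ")))
X <- model.matrix(rhs, data = d)

# X already contains an intercept column, so the formula drops lm's own with "- 1".
fit <- lm(d$Y ~ X - 1)

colnames(X)
```

The coefficients should match those of lm(Y ~ (x1 + x3)^2, data = d); only their names differ because they are prefixed with the matrix name.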

score 4 · Answer 4 · answered Feb 10 '11 at 00:30


You don't actually need a formula. This works:

lm(data_frame[c("Y", "factor1", "factor2")])

as does this:

v <- c("Y", "factor1", "factor2")
do.call("lm", list(bquote(data_frame[.(v)])))
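
A quick check of why this works, using made-up data (the values in data_frame below are hypothetical): a data frame passed where a formula is expected is converted with its first column as the response, so the order of the selected names matters.

```r
set.seed(1)
# Y is deliberately not the first column of the original data frame.
data_frame <- data.frame(factor1 = rnorm(20), Y = rnorm(20), factor2 = rnorm(20))

v <- c("Y", "factor1", "factor2")  # the response must come first in the selection
formula(data_frame[v])             # Y ~ factor1 + factor2

fit_a <- lm(data_frame[v])
fit_b <- lm(Y ~ factor1 + factor2, data = data_frame)
all.equal(coef(fit_a), coef(fit_b))
```

Since v is an ordinary character vector, it can be built at run time from any subset of column names.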

answered Feb 10 '11 at 00:30 by G. Grothendieck

+1 Very correct, but again, you'd have to use model.matrix to construct a matrix with interaction effects. – Joris Meys Feb 10 '11 at 08:59

score 1 · Answer 5 · edited Nov 22 '16 at 18:00

I generally solve this by changing the name of my response column. It is easier to do dynamically, and possibly cleaner.

library(data.table)  # provides setnames()
library(gbm)

model_response <- "response_field_name"
setnames(model_data_train, model_response, "response")  # renames the column by reference
model_gbm <- gbm(response ~ ., data=model_data_train, ...)
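
The same renaming idea can be tried without data.table or gbm. This is a sketch in base R with lm as a stand-in model; the data and the column names x1 and x2 are invented for illustration.

```r
set.seed(1)
# Hypothetical training data; the response column name is only known as a string.
model_data_train <- data.frame(response_field_name = rnorm(20),
                               x1 = rnorm(20), x2 = rnorm(20))
model_response <- "response_field_name"

# Rename the response column so a fixed formula can be used.
names(model_data_train)[names(model_data_train) == model_response] <- "response"

# "." expands to every remaining column of the data frame.
fit <- lm(response ~ ., data = model_data_train)
names(coef(fit))  # "(Intercept)" "x1" "x2"
```

To restrict the predictors, subset the columns of the data frame before fitting rather than changing the formula.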

edited Nov 22 '16 at 18:00 · answered Nov 22 '16 at 17:30 by bibzzzz
