Is it at all possible to use the lm() function with a matrix? Or maybe, the correct question is: "Is it possible to dynamically create formulas in R?"

I am creating a function whose output is a matrix. The number of columns in the matrix is not fixed: it depends on the user's inputs. I want to fit an OLS model using the data in the matrix:

  • The first column represents the dependent variable.
  • The other columns are the independent variables.

Using the lm function requires a formula, which presupposes knowing the number of explanatory variables in advance, and in my case I do not.

Is there any solution other than estimating the equation manually with the OLS formula?

Reproducible example:

> # When user 1 uses the function, he obtains m1
> m1 <- replicate(5, rnorm(50))
> colnames(m1) <- c("dep", paste0("ind", 1:(ncol(m1)-1)))
> head(m1)
            dep       ind1        ind2       ind3       ind4
[1,]  0.5848705  0.3602760 -0.95493403 -1.7278030 -0.1914170
[2,]  1.7167604 -0.1035825  0.31026183 -1.5071415 -1.2748600
[3,] -0.1326187 -0.5669026  0.01819749  0.8346880 -0.6304498
[4,] -0.7381232  0.4612792 -0.36132404 -0.1183131 -0.7446985
[5,]  0.9919123 -1.3228248 -0.44728270  0.6571244 -0.4895385
[6,] -0.8010111  0.8307584 -0.16106804  0.3069870 -0.3834583
> 
> # When user 2 uses the function, he obtains m2
> m2 <- replicate(6, rnorm(50))
> colnames(m2) <- c("dep", paste0("ind", 1:(ncol(m2)-1)))
> head(m2)
            dep       ind1       ind2         ind3       ind4       ind5
[1,]  1.2936031 -0.8060085  0.5020699 -1.699123234  1.0205626  1.0787888
[2,]  1.2357370  0.5973699 -1.2134283 -0.928040354 -0.3037920 -0.1251678
[3,]  0.5292583  0.1063213 -1.3036526  0.395886937 -0.1280863  1.1423532
[4,]  0.9234484 -0.4505604  1.2796922  0.424705893 -0.5547274 -0.3794037
[5,] -0.8016376  1.1362677 -1.1935238 -0.004460092 -1.4449704 -0.3739311
[6,]  0.4385867  0.5671138  0.4493617 -2.277925642 -0.8626944 -0.6880523

User 1 will estimate the linear model with:

lm(dep ~ ind1 + ind2 + ind3 + ind4, data = m1)

Meanwhile user 2 has an extra independent variable and will estimate the linear model in the following way:

lm(dep ~ ind1 + ind2 + ind3 + ind4 + ind5, data = m2)

Once again, is there any way I can create the formula dynamically?

SavedByJESUS
  • `lm(dep ~ ., data =m1)` – Khashaa Apr 18 '15 at 00:53
  • `dep ~ .` is bad style because it will pick up any extra or derived columns you create, possibly causing data leakage. – smci Apr 18 '15 at 00:55
  • Near-duplicate: [Formula with dynamic number of variables](http://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables) – smci Apr 18 '15 at 00:58
  • Thank you for the link. The solution is in the `reformulate` function. – SavedByJESUS Apr 18 '15 at 01:03
  • You don't need that if you just want to use a matrix slice: `lm(m1[,'dep'] ~ m1[,2:5])` – smci Apr 18 '15 at 01:12
  • I just realized that the problem with this is that the names of the columns are affected and are prefixed by `m1[,2:5]` in the regression output. – SavedByJESUS Apr 18 '15 at 02:40
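
For reference, the `reformulate` route mentioned in the comments can be sketched as follows. This is a minimal sketch, not the asker's actual function: the seed and the names `df1`, `predictors`, and `f` are assumptions added for illustration. It converts the matrix to a data frame first, since `lm()` expects a data frame (or list/environment) for `data`, and it keeps the plain `ind1`, `ind2`, ... names in the output.

```r
# Illustrative data, built the same way as m1 in the question
set.seed(1)  # assumption: seed added only so the sketch is reproducible
m1 <- replicate(5, rnorm(50))
colnames(m1) <- c("dep", paste0("ind", 1:(ncol(m1) - 1)))

# lm() needs a data frame for `data`, so convert the matrix first
df1 <- as.data.frame(m1)

# Every column except the response becomes a predictor,
# however many columns the matrix happens to have
predictors <- setdiff(colnames(df1), "dep")

# reformulate() builds dep ~ ind1 + ind2 + ... from the character vector
f <- reformulate(predictors, response = "dep")

fit <- lm(f, data = df1)
names(coef(fit))  # "(Intercept)" "ind1" "ind2" "ind3" "ind4"
```

Because the predictor list is taken from the column names explicitly, later adding a derived column to `df1` only affects the model if it is also added to `predictors`.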

1 Answer


Yes. In fact, the formula interface gets slower as the number of columns grows, so the matrix interface is preferable for wide data.

Is there any way I can create the formula dynamically?

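Sure, you look up the matrix columns either directly by a vector of column indices, or indirectly by converting column names into column indices, e.g. match(cols_you_want, colnames(mat)) for exact names or grep(pattern, colnames(mat)) for a pattern. (Use colnames() here: names() returns NULL for a matrix.)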

But in your case, you don't need to bother with a lookup since you already have a straightforward column-naming scheme: in m1, dep is column 1 and ind1...ind4 correspond to column indices 2..5.

lm(m1[,'dep'] ~ m1[,2:5])

# or in general (leave out column 1, which holds dep)
lm(m1[,'dep'] ~ m1[,colIndicesVector])  # e.g. colIndicesVector <- c(2,3,5)
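
A slightly fuller sketch of the same idea, under a few assumptions: the seed, the selection in `wanted`, and the names `idx`, `y`, `X`, and `fit_raw` are hypothetical and only for illustration. Assigning the slice to a short name first keeps the coefficient labels readable, and `lm.fit()` is shown as one way to skip formula handling altogether.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
m1 <- replicate(5, rnorm(50))
colnames(m1) <- c("dep", paste0("ind", 1:(ncol(m1) - 1)))

# Hypothetical selection of predictors by name, converted to column indices
wanted <- c("ind1", "ind3", "ind4")
idx <- match(wanted, colnames(m1))  # 2 4 5

y <- m1[, "dep"]
X <- m1[, idx, drop = FALSE]  # drop = FALSE keeps a matrix even for one column

# A matrix on the right-hand side: labels are the matrix name plus column name
fit <- lm(y ~ X)
names(coef(fit))  # "(Intercept)" "Xind1" "Xind3" "Xind4"

# lm.fit() takes the design matrix directly; the intercept column
# has to be added by hand because no formula is involved
fit_raw <- lm.fit(x = cbind(Intercept = 1, X), y = y)
names(fit_raw$coefficients)  # "Intercept" "ind1" "ind3" "ind4"
```

Note that `lm.fit()` returns a plain list rather than an `lm` object, so methods such as `summary()` are not available for `fit_raw`.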
smci
  • Thank you for your answer, but would you please show me how this will work inside the `lm` function? – SavedByJESUS Apr 18 '15 at 00:59
  • The syntax is `lm(m1[,'dep'] ~ m1[,2:5])`. As opposed to `lm(dep ~ ind1 + ind2 + ind3 + ind4, data = m1)` – smci Apr 18 '15 at 01:11