2

I know from "Formula with dynamic number of variables" that I can use as.formula to build a regression formula dynamically, and paste to include many variables in it.

#This Works:
glm(as.formula(paste0("vs~am")) , mtcars , family = binomial)
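
For the many-variable case, a minimal sketch of what I mean might look like this. The `predictors` vector is a made-up example, not my real column list:

```r
# Hypothetical predictor list; any column names from mtcars would do.
predictors <- c("am", "wt")

# Collapse the names into "am + wt" and prepend the response.
f <- as.formula(paste("vs ~", paste(predictors, collapse = " + ")))

# Fit on the plain data.frame, outside data.table.
fit <- glm(f, data = mtcars, family = binomial)
coef(fit)
```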

I am trying to use data.table because I am working with a large dataset. I know from "Using data.table to create a column of regression coefficients" that I can run a regression model in the j argument.

#So Does This:
m <- data.table( mtcars )
m[ , glm(vs~am, family = binomial) ]

I can't figure out how to use as.formula inside data.table. I am trying to include many columns as independent variables.

#This breaks
m[ , glm(as.formula(paste0("vs~am")), family = binomial) ]
MatthewR
maybe: `m[ , glm(eval(parse(text="vs~am")), family = binomial)]` or `m[ , glm(eval(substitute(as.formula("vs~am"))), family = binomial)]` or `m[ , glm(eval(expression(as.formula("vs~am"))), family = binomial)]` or `m[ , glm(eval(quote(as.formula("vs~am"))), family = binomial)]`. See also https://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia?noredirect=1&lq=1 – chinsoon12 Apr 02 '19 at 00:28

3 Answers

3

Within the data.table, the data can be specified as .SD:

library(data.table)
out2 <- m[ , glm(as.formula(paste0("vs", "~am")), family = binomial, data = .SD) ]

Alternatively, use reformulate instead of paste:

m[, glm(reformulate("am", "vs"), family = binomial, data = .SD)]
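
The same pattern extends to several predictors. The sketch below assumes a hypothetical `predictors` vector and uses `.SDcols` to limit `.SD` to the columns the model needs. My understanding, offered as a likely explanation rather than a confirmed one, is that data.table only exposes to `j` the columns it can see named in the expression, so a formula built from a string hides them, while `data = .SD` hands the columns to `glm` explicitly.

```r
library(data.table)

m <- as.data.table(mtcars)
predictors <- c("am", "wt")   # hypothetical choice of columns

# reformulate() builds vs ~ am + wt from the character vector;
# .SDcols restricts .SD to the response plus the predictors.
fit <- m[, glm(reformulate(predictors, response = "vs"),
               family = binomial, data = .SD),
         .SDcols = c("vs", predictors)]
coef(fit)
```

Restricting `.SDcols` is optional, but on a wide table it avoids passing unused columns to the model.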
akrun
2

I'm not entirely sure how to "capture" the data from within data.table. However, we can wrap the model in a function and pass the data.table to it. This is admittedly not the best solution:

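# Build the formula x ~ y from strings and fit a binomial glm on df.
myformula <- function(x, y, df, ...) {
  f1 <- as.formula(paste0(x, "~", y))
  #to_remove<-setdiff(names(df),y)#This was to be used if I used this with .SD
  # quote(df) passes the data by name, so the stored call stays short
  do_this <- do.call("glm", list(f1, quote(df), family = "binomial", ...))
  do_this
}
myformula("vs", "am", m)  # response first: fits vs ~ am, as in the question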
NelsonGon
1

I'm not sure if you're just wanting a column with the coefficients repeated all the way down or individual predictions for each row of data, or something else, but it looks like you just need to specify where the data are coming from:

m$amcoef <- m[ , glm(as.formula(paste("vs~am")), family = binomial, data=m)$coefficients["am"] ]

for the coefficients repeated all the way down, which returns

    mpg cyl disp  hp drat    wt  qsec vs am gear carb    amcoef
 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.6931472
 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.6931472
 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.6931472
 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.6931472
 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.6931472
 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.6931472

or

m$ampred <- m[ , predict(glm(as.formula(paste("vs~am")), family = binomial, data=m),  newdata=m) ]

to fit the model on the full dataset and then apply it to each row of data (-0.5390 + 0.6931 for am = 1, -0.5390 for am = 0, both on the log-odds scale), which returns:

    mpg cyl disp  hp drat    wt  qsec vs am gear carb     ampred
 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4  0.1541507
 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  0.1541507
 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  0.1541507
 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 -0.5389965
 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 -0.5389965
 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 -0.5389965

This fits the model inside the data.table call. As far as I can tell, without `by` the `j` expression is evaluated only once, so the model is not refit for each row, but I would still fit each glm you're interested in outside of the data.table (one time) and then just call the glm objects to get the row-specific values:

mod1 <- glm(as.formula(paste("vs~am")), family = binomial, data=m)
m$ampred1 <- m[ , predict(mod1, newdata=m) ]

Not sure if this would hinder the dynamics you're looking for though.
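
If it helps, here is a sketch of the same fit-once idea using data.table's `:=` to add the columns by reference instead of `m$... <-`. The column names `ampred_link` and `ampred_prob` are made up for illustration, and `type = "response"` is included only to show the probability scale next to the default log-odds:

```r
library(data.table)

m <- as.data.table(mtcars)

# Fit once, outside the data.table.
mod1 <- glm(reformulate("am", response = "vs"), family = binomial, data = m)

# Add predictions by reference on both scales.
m[, ampred_link := predict(mod1, newdata = m)]                     # log-odds
m[, ampred_prob := predict(mod1, newdata = m, type = "response")]  # probability

head(m[, .(vs, am, ampred_link, ampred_prob)])
```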

cgrafe