I have a formula that contains some terms and a data frame (the output of an earlier model.frame() call) that contains all of those terms and some more. I want the subset of the model frame that contains only the variables that appear in the formula.

ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4,
                 y=1:4,
                 z=1:4,
                 check.names=FALSE)

The desired result is fr minus the z column (fr[,1:4] is cheating -- I need a programmatic solution ...)

Some strategies that don't work:

fr[all.vars(ff)]
## Error in `[.data.frame`(fr, all.vars(ff)) : undefined columns selected

(because all.vars() returns the bare variable name "Reaction", not the term "log(Reaction)")

stripwhite <- function(x) gsub("(^ +| +$)","",x)
vars <- stripwhite(unlist(strsplit(as.character(ff)[-1],"\\+")))
fr[vars]
## Error in `[.data.frame`(fr, vars) : undefined columns selected

(because splitting on + spuriously splits the log(1+Days) term).
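To see where each attempt goes wrong, it can help to look at the intermediate names next to names(fr). A small diagnostic sketch, using only base R and the ff and fr defined above (the variable names av and vars are just for illustration):

```r
ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4, y=1:4, z=1:4,
                 check.names=FALSE)

## all.vars() returns the bare symbols, not the transformed terms
av <- all.vars(ff)
av                       # "Reaction" "Days" "x" "y"
setdiff(av, names(fr))   # names with no matching column: "Reaction" "Days"

## splitting the deparsed formula on "+" cuts log(1 + Days) in two
stripwhite <- function(x) gsub("(^ +| +$)","",x)
vars <- stripwhite(unlist(strsplit(as.character(ff)[-1],"\\+")))
vars                     # "log(Reaction)" "log(1" "Days)" "x" "y"
```

Both failures come from treating the formula as either a bag of symbols or a flat string, rather than as a nested call.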

I've been thinking about walking down the parse tree of the formula:

ff[[3]]       ## log(1 + Days) + x + y
ff[[3]][[1]]  ## `+`
ff[[3]][[2]]  ## log(1 + Days) + x

but I haven't got a solution put together, and it seems like I'm going down a rabbit hole. Ideas?
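For what it's worth, the parse-tree idea can be made to work for the simple additive case. The following is only a sketch: split_plus() is a made-up helper (not a base R function) that recurses through binary `+` calls and treats everything else as a leaf, so it does not handle `*`, `:`, `-`, `|` or parentheses around sums.

```r
ff <- log(Reaction) ~ log(1+Days) + x + y

## split_plus() is a hypothetical helper, not part of base R.
## It recurses only through binary `+` calls; anything else is a leaf.
split_plus <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_plus(e[[2]]), split_plus(e[[3]]))
  } else {
    list(e)
  }
}

## response first, then the right-hand-side terms
pieces <- c(list(ff[[2]]), split_plus(ff[[3]]))
term_names <- vapply(pieces,
                     function(e) paste(deparse(e), collapse = ""),
                     character(1))
term_names   # "log(Reaction)" "log(1 + Days)" "x" "y"
```

Note that deparse() writes log(1 + Days) with spaces, so these strings still differ from the column name `log(1+Days)` in the example fr.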

Karolis Koncevičius
Ben Bolker
  • Seems like the main variable that's causing you problems is `log(1+Days)`. Do you have to call it that or could you just use a different name? – Thomas Aug 02 '13 at 13:18
  • What about `attr(terms.formula(ff), "term.labels")`? – Roman Luštrik Aug 02 '13 at 13:19
  • I'm trying to come up with a general solution. Therefore, anything that could show up in a `model.frame()` generated from a legal formula has to be handled. That's part of the problem. – Ben Bolker Aug 02 '13 at 13:19
  • Or `rownames(attr(terms.formula(ff), "factors"))` to get the DV as well. – Thomas Aug 02 '13 at 13:21
  • Very nice. Do you guys know this magic by heart or did you just browse through `str(ff)` looking for something that would help? – Ben Bolker Aug 02 '13 at 13:22
  • `?formula` lists `terms.formula`. :) – Roman Luštrik Aug 02 '13 at 13:22
  • And now I know that "háček" is officially called a "caron" in English -- I was looking for the compose-key sequence for š (it's Compose-c-s) so I could reference Roman (and Thomas) in my code – Ben Bolker Aug 02 '13 at 13:30
  • Can you guarantee that any variable (or function turned into a variable as in your example) will exist in `fr` ? It sort of seems suspicious to me that you're creating the `ff` manually but it always works out that `fr` "covers" the variable list. What I'm leading up to is that it looks like you've previously created a bunch of variable names and functions, so there may be a simpler way to save those as character strings, and then load these char strings into an `as.formula` or similar call when creating `ff` . – Carl Witthoft Aug 02 '13 at 14:37
  • the slightly larger context is that `fr` was generated from a formula; that formula was then manipulated (some terms dropped), and I want to extract the portion of the model frame that corresponds to the non-dropped terms. I'm reasonably happy with the answer I got here, which is magical but fairly straightforward. – Ben Bolker Aug 02 '13 at 17:05

2 Answers

This should work:

> fr[gsub(" ","",rownames(attr(terms.formula(ff), "factors")))]
  log(Reaction) log(1+Days) x y
1             1           1 1 1
2             2           2 2 2
3             3           3 3 3
4             4           4 4 4

And props to Roman Luštrik for pointing me in the right direction.
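The gsub(" ","",...) step assumes the column names of fr contain no spaces at all. A slightly more defensive variant, sketched here with a made-up helper squash(), strips whitespace on both sides before matching and fails loudly if a term has no column:

```r
ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1+Days)`=1:4,
                 x=1:4, y=1:4, z=1:4,
                 check.names=FALSE)

## row names of the "factors" matrix: response plus predictors, deparsed
vars <- rownames(attr(terms(ff), "factors"))
vars   # "log(Reaction)" "log(1 + Days)" "x" "y"

## squash() is a hypothetical helper: drop all whitespace before comparing
squash <- function(s) gsub("[[:space:]]+", "", s)
idx <- match(squash(vars), squash(names(fr)))
stopifnot(!anyNA(idx))      # every term must correspond to a column

fr_sub <- fr[idx]
names(fr_sub)   # "log(Reaction)" "log(1+Days)" "x" "y"
```

If fr really is the output of model.frame(), the names should already agree and the whitespace handling becomes a no-op.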

Edit: Looks like you could pull it out of the "variables" attribute as well:

fr[gsub(" ","",attr(terms(ff),"variables")[-1])]

Edit 2: Found first problem case, involving I() or offset():

ff <- I(log(Reaction)) ~ I(log(1+Days)) + x + y
fr[gsub(" ","",attr(terms(ff),"variables")[-1])]

Those would be pretty easy to correct with regex, though. BUT, if you had situations like in the question where a variable is called, e.g., log(x) and is used in a formula alongside something like I(log(y)) for variable y, this will get really messy.
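When the data frame genuinely comes from model.frame(), as described in the question, the I() case may be less of a problem, because model.frame() names its columns from the deparsed terms, I() wrapper included. A sketch with made-up data (d, full and reduced are illustrative names, and the values are arbitrary):

```r
## made-up raw data, for illustration only
d <- data.frame(Reaction = c(250, 260, 270, 300), Days = 0:3,
                x = 1:4, y = 4:1, z = c(2, 1, 4, 3))

full    <- I(log(Reaction)) ~ I(log(1+Days)) + x + y + z
reduced <- I(log(Reaction)) ~ I(log(1+Days)) + x + y   # z dropped

mf <- model.frame(full, data = d)
names(mf)   # "I(log(Reaction))" "I(log(1 + Days))" "x" "y" "z"

## no gsub() needed: both sides use the same deparsed spelling
vars <- rownames(attr(terms(reduced), "factors"))
mf_sub <- mf[vars]
names(mf_sub)   # "I(log(Reaction))" "I(log(1 + Days))" "x" "y"
```

The mismatch only appears when the column names were typed by hand, as in the example fr.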

Thomas
  • thanks. I can't accept this for another few minutes. the `gsub(...)` won't be necessary in my case, I think -- the mismatch in white space won't be there. I introduced it accidentally in setting up the example. – Ben Bolker Aug 02 '13 at 13:24
  • @BenBolker Yea, it would probably be good to test this on some other formula constructions to see if it's general... – Thomas Aug 02 '13 at 13:27
  • but your original answer, `rownames(attr(terms.formula(ff), "factors"))`, seems to work fine on your problem case. – Ben Bolker Aug 02 '13 at 13:32

It looks to me like the only problem is the lack of a space in the name of the second column of fr. Rename it with a space and pull the columns in this way:

ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1 + Days)`=1:4,
                 x=1:4,
                 y=1:4,
                 z=1:4,
                 check.names=FALSE)


fr[labels(terms(ff))]

If you believe the only difference between the two will always be whitespace (in the question, the deparsed terms of ff contain spaces where the names of fr don't), then the above solution holds. I like labels(terms(x)) a bit more, though, because it seems a bit more abstract.

fr[gsub(pattern = ' ', replacement = '', x = labels(terms(ff)))]
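One caveat worth checking before relying on this: labels(terms(ff)) returns the right-hand-side term labels only, so the response column is not selected, and interaction terms would show up as labels such as "a:b" that are not columns. A sketch of one way to add the response back, assuming the formula has a left-hand side (with_resp is an illustrative name):

```r
ff <- log(Reaction) ~ log(1+Days) + x + y
fr <- data.frame(`log(Reaction)`=1:4,
                 `log(1 + Days)`=1:4,
                 x=1:4, y=1:4, z=1:4,
                 check.names=FALSE)

tt <- terms(ff)
labels(tt)                    # "log(1 + Days)" "x" "y"  (no response)

## prepend the deparsed left-hand side; assumes ff has a response
with_resp <- c(deparse(ff[[2]]), labels(tt))
names(fr[with_resp])          # "log(Reaction)" "log(1 + Days)" "x" "y"

## term labels are not always variables:
labels(terms(y ~ a * b))      # "a" "b" "a:b"
```

For a purely additive formula like this one, with_resp is the same vector as rownames(attr(tt, "factors")).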
rcorty