
I'm trying to do some pre/post comparisons, a small sample below:

library(data.table)

dataset <- data.table(
    Key1 = c("p","q","r"),
    Key2 = c("a","b","c"),
    a_pre = c(3,2,6),
    b_pre = c(2,6,3),
    a_post = c(1,2,3),
    b_post = c(1,4,2)
    # etc.
)

dataset[,a_compare := a_pre/a_post]
dataset[,b_compare := b_pre/b_post]
#etc.

The problem is that I have far more quantities than just a and b, and their number is sometimes variable, so manually coding each comparison is not an option. I'm trying to avoid eval(parse()).

Assume that I have the names of the quantities c("a","b", etc.). My current thought process is along these lines:

loop through the quantity names {
    grep on colnames(dataset) for each quantity name
    use the matches to subset the pre and post columns, including the keys
    send this subset to a function that calculates pre/post irrespective of the specific quantity
    merge the function's result back into the original dataset
}
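Written out literally, that loop might look like this (a sketch only; `qty_names` is an assumed list of quantity names, not something I have in this exact form):

```r
library(data.table)

dataset <- data.table(
  Key1 = c("p","q","r"),
  Key2 = c("a","b","c"),
  a_pre = c(3,2,6), b_pre = c(2,6,3),
  a_post = c(1,2,3), b_post = c(1,4,2)
)

qty_names <- c("a", "b")  # assumed: the known quantity names

for (q in qty_names) {
  # grep the pre/post columns for this quantity
  cols <- grep(paste0("^", q, "_(pre|post)$"), names(dataset), value = TRUE)
  # subset the keys plus the matched columns
  sub_dt <- dataset[, c("Key1", "Key2", cols), with = FALSE]
  # generic pre/post comparison, independent of the specific quantity
  sub_dt[, (paste0(q, "_compare")) := get(paste0(q, "_pre")) / get(paste0(q, "_post"))]
  # merge the new column back into the original dataset
  dataset <- merge(dataset,
                   sub_dt[, c("Key1", "Key2", paste0(q, "_compare")), with = FALSE],
                   by = c("Key1", "Key2"))
}
```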

I feel there must be a better way to do this. Any ideas?

Thomas
TheComeOnMan
  • @Thomas, thanks for pointing it out, but in my opinion this isn't specific to data.table. It's just that I use data.table and not data.frame which is why the syntax is the data.table syntax. – TheComeOnMan Sep 26 '13 at 15:42
  • Might be a lot easier if you create a list variable. Like `pplist – Carl Witthoft Sep 26 '13 at 15:44
  • I second @CarlWitthoft's opinion that the problem is your data organization. What seems most natural to me is convert it to long format with two grouping variables (a,b) and (pre,post). – joran Sep 26 '13 at 15:50
  • Please see this question as the answers should fit your needs: [Data.table meta-programming](http://stackoverflow.com/questions/15790743/data-table-meta-programming) – Ricardo Saporta Sep 26 '13 at 15:59
  • @CarlWitthoft, I haven't used apply functions until now. I'm trying to understand sapply but I'm unable to see how I can use it in this case. As for the loop, are you suggesting something that boils down to this - `pplist$compare$a – TheComeOnMan Sep 26 '13 at 16:20
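The long-format idea from the comments could be sketched as follows (assuming a data.table version that provides `melt`, `tstrsplit` and `dcast` methods):

```r
library(data.table)

dataset <- data.table(
  Key1 = c("p","q","r"), Key2 = c("a","b","c"),
  a_pre = c(3,2,6), b_pre = c(2,6,3),
  a_post = c(1,2,3), b_post = c(1,4,2)
)

# one row per key/quantity/period combination
long <- melt(dataset, id.vars = c("Key1", "Key2"))
# split "a_pre" into quantity "a" and period "pre"
long[, c("qty", "period") := tstrsplit(as.character(variable), "_")]
# back to wide on the period only, then compare
cmp <- dcast(long, Key1 + Key2 + qty ~ period, value.var = "value")
cmp[, compare := pre / post]
```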

3 Answers


It's pretty standard to use `get` rather than `eval(parse(...))` here:

v = c("a", "b")
dataset[, paste0(v, "_compare") :=
            lapply(v, function(x) get(paste0(x, "_pre")) / get(paste0(x, "_post")))]
eddi

I find a for loop easier to write and read:

basevars = c("a","b","c","d")
for (i in basevars)
    DT[, paste0(i,"_compare"):=get(paste0(i,"_pre"))/get(paste0(i,"_post"))]

I've never really known why R couldn't just define `+` to work on strings. It's an error currently, so it's not like it's used or anything:

> "a"+"b"
Error in "a" + "b" : non-numeric argument to binary operator

Otherwise you could simply do :

for (i in basevars)
    DT[, i+"_compare" := get(i+"_pre")/get(i+"_post")]
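Short of redefining `+`, a user-defined infix operator gives similar readability today. A minimal sketch (the operator name `%+%` is my choice, not anything built in):

```r
library(data.table)

# user-defined infix operator for string concatenation
`%+%` <- function(e1, e2) paste0(e1, e2)

DT <- data.table(a_pre = c(3,2,6), a_post = c(1,2,3),
                 b_pre = c(2,6,3), b_post = c(1,4,2))

for (i in c("a", "b"))
    DT[, (i %+% "_compare") := get(i %+% "_pre") / get(i %+% "_post")]
```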
Matt Dowle
  • +1. I just investigated...my attempt to define `+.character` was refused with `the method for function ‘+’ and signature e1="character", e2="character" is sealed and cannot be re-defined`. Annoying...and here's a ref http://stackoverflow.com/questions/4730551/making-a-string-concatenation-operator-in-r – Frank Sep 26 '13 at 18:02
  • @Frank That sealed error is the one I've seen before. And I knew `%+%` could be defined but didn't like that. But, I didn't think of redefining "+" itself, as that link has. That's not a bad idea! – Matt Dowle Sep 26 '13 at 19:15
  • redefining "+" as in the comment by Taylor Arnold there can be a terribly bad idea if you intend to make some computations. A simple 1 + 1 will be much slower than before. – lebatsnok Sep 27 '13 at 09:52
  • @lebatsnok Good point, but I'm not sure that's a given. The particular "+" that's in the linked answer will be awful, yes, but a faster one is straightforward. R has to check each time whether you have defined "+" or not anyway (iiuc) so I don't think the mere existence of your own "+" will slow it down. Easily testable to see if you're right. – Matt Dowle Sep 27 '13 at 10:08
  • I found myself using your answer and so I'm accepting your answer over @eddi and Frank. I hope it isn't bad manners to take away the accepted tag from someone else's answer. – TheComeOnMan Sep 27 '13 at 11:02
  • @Codoremifa I've been surprised when it happens to me (the previously accepted answerer will see a -15, so initially it feels like 3 downvotes). Explaining why in a comment as you have done is a nice thing to do. But it is encouraged to move the tick to the best answer. – Matt Dowle Sep 27 '13 at 11:13
  • @lebatsnok it's true that a single "+" operation will become much slower, but can you come up with an example where that time difference would matter (let's define "matter" here as larger than 100 millis for the entire computation)? – eddi Sep 27 '13 at 20:02
  • @eddi strictly speaking, yes, as you worded it. lebatsnok's example is `1+1`. If I loop that 1e6 times (rather than a vectorized call) I get a difference of 3.7 seconds (1.1s vs 3.8s when I redefine "+"). But that isn't enough to prove it's a terribly bad idea, only that it's a bad idea to loop `+` a million times in a vector language (since x+1 takes 0.008 seconds when x is length 1e6). Now I'm wondering if there's any better way to redefine `+` that doesn't impact 1e6 calls to it (C call, compiled, in a namespace). And even if that matters, as you say. – Matt Dowle Sep 28 '13 at 00:31
  • @eddi - it depends on what you do. if you have no big data and don't write your own functions which use "+" then it's quite safe to redefine "+". And even then you can define your own "+" in a global environment whereas having your serious functions in a package which would use "+" as defined in base R. So ok maybe it's not a terribly bad idea if you use some precautions - and if you use string concatenation a lot. I've found paste0 to be quite handy for that purpose. – lebatsnok Sep 28 '13 at 00:44
  • @lebatsnok What's big data got to do with this? And of course we write functions including "+". One good "unsafe" argument would be that base or other packages relied on "+" being an error for strings. And if that code suddenly started to paste strings instead then the earth would stop spinning. Or if "+" was redefined completely and started doing different things for NA, NaN, then yes of course. But the suggestion is just to paste strings, and then call safe and sound "+" in base if no strings are passed to it. – Matt Dowle Sep 28 '13 at 00:55
  • @Matthew - +'ing big data takes more time than +'ing small data so you lose more time with making `+` slower?! And then it's one thing to redefine "+" in your workspace - there are really no serious objections as none of the base packages (nor any other serious packages, for that matter) would still use "+" as defined in base package. That's the whole point of namespaces. If the idea was to redefine "+" in base package then I think it should be done differently (allowing to define methods for primitive functions for "classes" like character or function). I'd like that (unlikely) change. – lebatsnok Sep 28 '13 at 15:08
  • I once tried to define a method of "[" for functions so that one could conveniently use a function with different set of default values. - Something like `lapply(a_list, mean[na.rm=TRUE, trim=0.1])` instead of using anonymous wrapper functions. This was impossible for the same reason and when I redefined `[` to use UseMethod then the rest of the code which used `[` a lot, became terribly slow. – lebatsnok Sep 28 '13 at 15:15
  • @lebatsnok You wrote "`+`ing big data takes more time than `+`ing small data". Please read all the comments above through slowly and hopefully you'll see why this comes across as an odd thing to write. We've been discussing a single vectorized call to '+' vs repeated calls to '+' over and over, for example. Do you know what we mean by vectorized code? – Matt Dowle Sep 28 '13 at 17:14
  • @matthew. Of course adding A+B takes more time when length(A)==10000000 than when length(A)==1. That is what I was talking about and I am surprised you deny this. What matters practically is not percentage of time saved but how much time you can save in terms of minutes or hours. If your 1-hour computation is taking 50% more then this is bad news whereas a 1000% increase on a 100-milliseconds-computation may easily go unnoticed. Do you understand the difference between 30 minutes and 3 seconds? – lebatsnok Sep 28 '13 at 17:31
  • @lebatsnok I give up. I'm done. You clearly don't understand the difference between a loop at C level (a vectorized call) and a loop at R level. It's the loop at R level which can be affected by defining your own "+", but a vectorized call to a single "+" won't be affected at all. – Matt Dowle Sep 28 '13 at 18:18
  • @matthew. You're right, this is pointless. Thanks btw for the data.table package (which looks like something I might use in near future) but I disagree with your last comment. `+`(A,B) will always be affected by defining a less efficient `+` - but for longer vectors, the decrease in speed will be less noticeable. You seem to be comparing above an R-level loop such as `result – lebatsnok Sep 28 '13 at 22:13
  • I don't understand why you brought up the R-level loops which would be clearly less efficient than vectorized computations. This has nothing to do with the issue of computations on big data sets taking more time than computations on small data sets. But again, we're not being constructive here. Let's stop. Bye. – lebatsnok Sep 28 '13 at 22:15
  • @lebatsnok Well you're getting close to understanding now. What is "+"? It's a bit of R code which then calls a C level loop. Right? Ok, so let's take this "+" and add an extra line of R code before it calls the C loop. (That extra line tests for character and calls paste instead). It matters not a jot to the C level loop whether that extra line of R code was there or not before it started looping. You're saying the longer the vector is, the more noticeable the decrease in speed will be. But once the C level loop is running it doesn't matter how it got called. Does that make sense now? – Matt Dowle Sep 28 '13 at 22:26
  • @matthew. well, of course at the C level it doesn't matter. However the biggest time consumers are often at the R level. "+" as a primitive function bypasses the R level almost completely whereas when you add an extra `if` or `UseMethod` at the R level then it slows things down. One of the difficulties is that while you can define methods for `+` and `[` and a few other primitive functions, they behave differently from S3-generic functions. This makes them more efficient but less flexible. Redefining + or [ or another primitive function, you gain some flexibility but loose some efficiency. – lebatsnok Sep 28 '13 at 22:38
  • BTW The R internals manual says, "However, for reasons of convenience and also efficiency /.../ , the primitive functions are exceptions that can be accessed directly. And of course, primitive functions are needed for basic operations—for example .Internal is itself a primitive. Note that primitive functions make no use of R code, and hence are very different from the usual interpreted functions." `+` is initially a primitive function but when redefined, it will become an ordinary function hence you lose some efficiency. see http://cran.r-project.org/doc/manuals/R-ints.html#Special-primitives – lebatsnok Sep 28 '13 at 22:47
  • @lebatsnok Do you still think that "`+`ing big data takes more time than '+'ing small data so you lose more time with making '+' slower"? I hope you now agree that a single vectorized call to '+' on large data spends a negligible amount of time getting to the C loop, -vs- being in the C loop. Only when you make many calls to '+' does the overhead of calling '+' come into it. "Big data" does not imply calling '+' many times. – Matt Dowle Sep 30 '13 at 09:13

What about something like

foo <- dataset[,grep("_pre", names(dataset))] / dataset[,grep("_post", names(dataset))]
names(foo) <- sub("pre", "comp", names(foo))

(I reformatted your data.table as a data.frame - no idea about data.tables, although I'm sure they're highly useful.)

lebatsnok
  • This doesn't work with a data.table because you can't use a 'column number' as such. I'm quite a fan of data.table and you should have a look at it too. – TheComeOnMan Sep 26 '13 at 16:28
  • @Codoremifa Of course, you can: `dataset[,grep("_pre", names(dataset)), with=FALSE] / dataset[,grep("_post", names(dataset)), with=FALSE]` – Roland Sep 26 '13 at 16:57
  • I worry about the repetition of `dataset` here, 4 times. See [here](http://stackoverflow.com/a/10758086/403310) for why variable name repetition can sometimes bite. – Matt Dowle Sep 27 '13 at 11:03