The key to making this computation fast is to do it in log-probability space so that the product for each branch of the tree is a sum that can be computed as the inner sum of a matrix multiply. In this manner, all the branches can be computed together in vectorized fashion.
First, we construct an enumeration of all branches. For this, we use the intToBin
function from R.utils
package:
library(R.utils)
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
where n
is the number of Bernoulli variables. For your example, n=5
:
matrix(enum.branches, nrow=n)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
##[1,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1" "0"
##[3,] "0" "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1" "0"
##[4,] "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0"
##[5,] "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0"
## [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
##[1,] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1"
##[3,] "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1"
##[4,] "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1"
##[5,] "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1"
results in a matrix where each column is the outcomes from a branch of the probability tree.
Now, use that to construct a matrix of log probabilities of the same size as enum.branches
where the value is log(p)
if enum.branches=="1"
and log(1-p)
otherwise. For your data, with p <- c(0.5, 0.12, 0.7, 0.8, .02)
, this is:
logp <- matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n)
Then, sum the log-probabilities and take the exponential to get the product of the probabilities:
result <- exp(rep(1,n) %*% logp)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##[1,] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05
[,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
##[1,] 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872 0.000528 0.103488 0.002112
[,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
##[1,] 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05 0.014112 0.000288 0.008232 0.000168
[,31] [,32]
##[1,] 0.032928 0.000672
The result
will be in the same order as the numeration of branches in enum.branches
.
We can encapsulate the computation into a function:
enum.prob.product <- function(n, p) {
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
exp(rep(1,n) %*% matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n))
}
Timing this with 19
independent Bernoulli variables:
n <- 19
p <- runif(n)
system.time(enum.prob.product(n,p))
## user system elapsed
## 24.064 1.470 26.082
This is on my 2 GHz MacBook (circa 2009). It should be noted that the computation itself is quite fast; it is the enumeration of the branches of the probability tree (I would guess the unlist
within that) that is taking the bulk of the time. Any suggestions from the community on another approach to do that will be appreciated.