2

Suppose I have a series of n probabilities for success of independent Bernoulli trials, p1 to pn such that p1 != p2 != ... != pn. Give each trial a unique name.

    p <- c(0.5, 0.12, 0.7, 0.8, .02)
    a <- c("A","B","C","D","E")

I know from searching stack exchange (e.g., here and here) that I can find the cdf, pmf, etc. using the Poisson Binomial distribution function.

What I'm interested in is the exact probability of every possible combination of success and failures. (E.g. If I drew a probability tree, I want to know the probability at the end of each branch.)

    all <- prod(p)
    all
    [1] 0.000672
    o1 <- (0.5 * (1-0.12) * 0.7 * 0.8 * .02)
    o1
    [1] 0.004928
    o2 <- (0.5 * 0.12 * (1-0.7) * 0.8 * .02)
    o2
    [1] 0.000288

...for all 2^5 possible combinations of success/failure.

What's an efficient way to go about this in R?

In the case of my actual data set, the number of trials is 19, so we're talking about 2^19 total paths on the probability tree.

Community
  • 1
  • 1
paqmo
  • 3,439
  • 1
  • 9
  • 20

2 Answers2

1

The key to making this computation fast is to do it in log-probability space so that the product for each branch of the tree is a sum that can be computed as the inner sum of a matrix multiply. In this manner, all the branches can be computed together in vectorized fashion.

First, we construct an enumeration of all branches. For this, we use the intToBin function from R.utils package:

library(R.utils)
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))

where n is the number of Bernoulli variables. For your example, n=5:

matrix(enum.branches, nrow=n)
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
##[1,] "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"   "0"   "0"   "0"   "0"   "0"   "0"   "1"  
##[2,] "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "1"  "1"   "1"   "1"   "1"   "1"   "1"   "1"   "0"  
##[3,] "0"  "0"  "0"  "0"  "1"  "1"  "1"  "1"  "0"  "0"   "0"   "0"   "1"   "1"   "1"   "1"   "0"  
##[4,] "0"  "0"  "1"  "1"  "0"  "0"  "1"  "1"  "0"  "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"  
##[5,] "0"  "1"  "0"  "1"  "0"  "1"  "0"  "1"  "0"  "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"  
##     [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
##[1,] "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"  
##[2,] "0"   "0"   "0"   "0"   "0"   "0"   "0"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"  
##[3,] "0"   "0"   "0"   "1"   "1"   "1"   "1"   "0"   "0"   "0"   "0"   "1"   "1"   "1"   "1"  
##[4,] "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"   "0"   "1"   "1"  
##[5,] "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"  

results in a matrix where each column is the outcomes from a branch of the probability tree.

Now, use that to construct a matrix of log probabilities of the same size as enum.branches where the value is log(p) if enum.branches=="1" and log(1-p) otherwise. For your data, with p <- c(0.5, 0.12, 0.7, 0.8, .02), this is:

logp <- matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n)

Then, sum the log-probabilities and take the exponential to get the product of the probabilities:

result <- exp(rep(1,n) %*% logp)
##         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]   [,10]
##[1,] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05
        [,11]    [,12]    [,13]    [,14]    [,15]    [,16]    [,17]    [,18]    [,19]    [,20]
##[1,] 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872 0.000528 0.103488 0.002112
        [,21]    [,22]    [,23]    [,24]    [,25]   [,26]    [,27]    [,28]    [,29]    [,30]
##[1,] 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05 0.014112 0.000288 0.008232 0.000168
        [,31]    [,32]
##[1,] 0.032928 0.000672

The result will be in the same order as the numeration of branches in enum.branches.

We can encapsulate the computation into a function:

enum.prob.product <- function(n, p) {
  enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
  exp(rep(1,n) %*% matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n))
}

Timing this with 19 independent Bernoulli variables:

n <- 19
p <- runif(n)
system.time(enum.prob.product(n,p))
##   user  system elapsed 
## 24.064   1.470  26.082 

This is on my 2 GHz MacBook (circa 2009). It should be noted that the computation itself is quite fast; it is the enumeration of the branches of the probability tree (I would guess the unlist within that) that is taking the bulk of the time. Any suggestions from the community on another approach to do that will be appreciated.

aichao
  • 6,680
  • 3
  • 11
  • 16
  • Awesome, this worked like a charm! As a bonus, I also learned that the exponential of the sum of log-probabilities is equal to the product of probabilities. Neat. FYI, on my macbook air running dual 1.3 GHz: 14.255, 2.001, 17.652. Not bad! – paqmo Oct 07 '16 at 02:42
1

Just try this in base R:

p <- c(0.5, 0.12, 0.7, 0.8, .02)
a <- c("A","B","C","D","E")
n <- length(p)
apply(expand.grid(replicate(n,list(0:1)))[n:1], 1, 
                  function(x) prod(p[which(x==1)])*prod(1-p[which(x==0)]))

#[1] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 0.000072 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872
#[18] 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 0.000072 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672
Sandipan Dey
  • 17,896
  • 2
  • 31
  • 42
  • Great solution and faster than @aichao nice solution, since it eliminates the 'enumeration' step. – paqmo Oct 11 '16 at 15:25