
I want to sample from the binomial distribution B(n,p), but with an additional constraint: the sampled value must lie in the range [a,b] (instead of the usual 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in the range [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = (nCx) * p^x * (1-p)^(n-x) as

total = 0
for i in range(a, b + 1):
    total += bin(i)

f(x) = bin(x) / total

One way of sampling from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as the pmf calculation would easily get very time-consuming.
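For concreteness, here is a naive sketch of that approach (assuming `bin(i)` computes the binomial pmf as above):

import random

def naive_truncated_sample(a, b):
    # invert the truncated CDF by a linear search; bin(i) is the pmf from above
    total = sum(bin(i) for i in range(a, b + 1))
    u = random.random()
    acc = 0.0
    for i in range(a, b + 1):
        acc += bin(i) / total
        if u <= acc:
            return i
    return b  # guard against floating point round-off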

The values of n, x, a, b are quite large in my case, and this way of computing the pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.

What's a nice/efficient way to achieve this?

Black Jack 21
  • I thought the range of the binomial distribution has to be from 0 to n. Can you provide me a link for the further math behind this question? I want to read about this a little more. – Kevin Choi Oct 03 '20 at 19:54
  • In addition, if you are having trouble optimizing your code, I recommend using NumPy to vectorize your calculation; it's 100x faster than a for-loop. – Kevin Choi Oct 03 '20 at 19:57
  • I've found a bug-free and efficient solution that calculates the `pmf` and `cdf` for millions of items. It requires `scipy.stats.binom`. This is new to me, because until now I only knew `scipy.special.comb`, which is misleading. – mathfux Oct 03 '20 at 22:39
  • @KevinChoi I am simulating a binomial RV `x` in my project to get another RV `y` which is a function of `x`. However, `y` also has range `[0,n]`. In order to ensure that `y = f(x)` is within the acceptable range, a condition needs to be imposed on `x`, which boils down to choosing `x` binomially between `[a,b]`. – Black Jack 21 Oct 04 '20 at 05:09

2 Answers


This is a way to compute all the values of `bin` in a pretty short time:

from scipy.special import comb
import numpy as np

def distribution(n, p=0.5):
    # evaluate the pmf of B(n, p) at every x in 0..n in one vectorized pass
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)

It can be done in a quarter of a microsecond for n=1000.

Sample run:

>>> distribution(4)
array([0.0625, 0.25  , 0.375 , 0.25  , 0.0625])

You can sum specific parts of this array like so:

>>> np.sum(distribution(4)[2:4])
0.625

Remark: For n>1000 the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.

Bugfix

You can use scipy.stats.binom equivalently:

import numpy as np
from scipy.stats import binom

def distribution(n, p):
    # stable pmf evaluation that avoids overflow for large n
    return binom.pmf(np.arange(n + 1), n, p)

This does the same as the above-mentioned method quite efficiently (n=1000000 in a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which calculates the cumulative sum of binom.pmf. Then subtracting the ath item of this array from the bth gives an output which is very close to what you expect.
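For completeness, here is one way to turn this pmf into actual samples from the truncated distribution. This is only a sketch of the idea, assuming a and b are inclusive bounds; it renormalizes the pmf over [a, b] and draws with numpy's Generator.choice:

import numpy as np
from scipy.stats import binom

def truncated_binomial(n, p, a, b, size=1, rng=None):
    # restrict the pmf of B(n, p) to the inclusive range [a, b]
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(a, b + 1)
    pmf = binom.pmf(support, n, p)
    pmf /= pmf.sum()  # renormalize so the truncated pmf sums to 1
    return rng.choice(support, size=size, p=pmf)

For example, truncated_binomial(1000, 0.5, 490, 510, size=5) returns five samples, all in [490, 510].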

mathfux
  • Wow, this really seems to be quite fast. I wonder how they go about calculating `nCr` within the `binom` class so efficiently. – Black Jack 21 Oct 04 '20 at 05:11

Another way would be to use the CDF and its inverse, something like:

from scipy import stats

dist = stats.binom(100, 0.5)

# limit ourselves to [60, 100]: take the CDF just below 60 as the lower
# bound so that 60 itself keeps its probability mass
lo, hi = dist.cdf([59, 100])

# invert the CDF at a uniform point in [lo, hi] to draw a sample
x = dist.ppf(stats.uniform(lo, hi - lo).rvs())

This should give us values in the range. Note that due to floating point precision, this might give you values outside of what you want; it gets worse above the mean of the distribution.
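If staying inside the bounds matters, one pragmatic guard (an addition on top of the snippet above, not part of the original approach) is to clamp the result:

import numpy as np

x = int(np.clip(x, 60, 100))  # force the sample back into the intended range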

Note that for large values you might as well use the normal approximation.
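As a minimal sketch of that idea, assuming a continuity correction of 0.5 and scipy.stats.truncnorm for the truncated normal (the parameter values here are illustrative):

import numpy as np
from scipy import stats

n, p, a, b = 1_000_000, 0.5, 499_000, 501_000  # illustrative values
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

# truncnorm takes its bounds in standard-normal units
lo, hi = (a - 0.5 - mu) / sigma, (b + 0.5 - mu) / sigma
samples = np.rint(stats.truncnorm(lo, hi, loc=mu, scale=sigma).rvs(10)).astype(int)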

Sam Mason