3

I am playing with a mixed multinomial discrete choice model in Tensorflow Probability. The function should take an input of a choice among 3 alternatives. The chosen alternative is specified by CHOSEN (a # observationsx3 tensor). Below is an update to the code to reflect my progress on the problem (but the problem remains).

I currently get the error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [6768,3] vs. [1,3,6768] [Op:Mul]

with the traceback suggesting the issue is in the call to log_prob() for the final component of the joint distrubtion (i.e., tfp.Independent(tfp.Multinomial(...))

The main components are (thank you to Padarn Wilson for helping to fix the joint distribution definition):

@tf.function
def affine(x, kernel_diag, bias=tf.zeros([])):
  """`kernel_diag * x + bias` with broadcasting."""
  kernel_diag = tf.ones_like(x) * kernel_diag
  bias = tf.ones_like(x) * bias
  return x * kernel_diag + bias

def mmnl_func():
    adj_AV_train = (tf.ones(num_idx) - AV[:,0]) * tf.constant(-9999.)
    adj_AV_SM = (tf.ones(num_idx) - AV[:,1]) * tf.constant(-9999.)
    adj_AV_car = (tf.ones(num_idx) - AV[:,2]) * tf.constant(-9999.)

    return tfd.JointDistributionSequential([
        tfd.Normal(loc=0., scale=1e5),  # mu_b_time
        tfd.HalfCauchy(loc=0., scale=5),  # sigma_b_time
        lambda sigma_b_time,mu_b_time: tfd.MultivariateNormalDiag(  # b_time
        loc=affine(tf.ones([num_idx]), mu_b_time[..., tf.newaxis]),
        scale_diag=sigma_b_time*tf.ones(num_idx)),
        tfd.Normal(loc=0., scale=1e5), # a_train
        tfd.Normal(loc=0., scale=1e5), # a_car
        tfd.Normal(loc=0., scale=1e5), # b_cost
        lambda b_cost,a_car,a_train,b_time: tfd.Independent(tfd.Multinomial(
          total_count=1,
          logits=tf.stack([
              affine(DATA[:,0], tf.gather(b_time, IDX[:,0], axis=-1), (a_train + b_cost * DATA[:,1] + adj_AV_train)),
              affine(DATA[:,2], tf.gather(b_time, IDX[:,0], axis=-1), (b_cost * DATA[:,3] + adj_AV_SM)),
              affine(DATA[:,4], tf.gather(b_time, IDX[:,0], axis=-1), (a_car + b_cost * DATA[:,5] + adj_AV_car))
          ], axis=1)
        ),reinterpreted_batch_ndims=1)
    ])

@tf.function
def mmnl_log_prob(mu_b_time, sigma_b_time, b_time, a_train, a_car, b_cost):
    return mmnl_func().log_prob(
      [mu_b_time, sigma_b_time, b_time, a_train, a_car, b_cost, CHOICE])

# Based on https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args
# change constant values to tf.constant()
nuts_samples = tf.constant(1000)
nuts_burnin = tf.constant(500)
num_chains = tf.constant(1)
## Initial step size
init_step_size= tf.constant(0.3)
# Set the chain's start state.
initial_state = [
    tf.zeros([num_chains], dtype=tf.float32, name="init_mu_b_time"),
    tf.zeros([num_chains], dtype=tf.float32, name="init_sigma_b_time"),
    tf.zeros([num_chains, num_idx], dtype=tf.float32, name="init_b_time"),
    tf.zeros([num_chains], dtype=tf.float32, name="init_a_train"),
    tf.zeros([num_chains], dtype=tf.float32, name="init_a_car"),
    tf.zeros([num_chains], dtype=tf.float32, name="init_b_cost")
]

## NUTS (using inner step size averaging step)
##
@tf.function
def nuts_sampler(init):
    nuts_kernel = tfp.mcmc.NoUTurnSampler(
      target_log_prob_fn=mmnl_log_prob, 
      step_size=init_step_size,
      )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
  inner_kernel=nuts_kernel,
  num_adaptation_steps=nuts_burnin,
  step_size_getter_fn=lambda pkr: pkr.step_size,
  log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
  step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
       )

    samples_nuts_, stats_nuts_ = tfp.mcmc.sample_chain(
  num_results=nuts_samples,
  current_state=initial_state,
  kernel=adapt_nuts_kernel,
  num_burnin_steps=tf.constant(100),
  parallel_iterations=tf.constant(5))
    return samples_nuts_, stats_nuts_

samples_nuts, stats_nuts = nuts_sampler(initial_state)
Jason Hawkins
  • 545
  • 1
  • 8
  • 21

2 Answers2

0

More than likely this is an issue with your initial state and number of chains. You can try to initialize your kernel outside of the sampler call:

nuts_kernel = tfp.mcmc.NoUTurnSampler(
      target_log_prob_fn=mmnl_log_prob, 
      step_size=init_step_size,
      )
    adapt_nuts_kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
  inner_kernel=nuts_kernel,
  num_adaptation_steps=nuts_burnin,
  step_size_getter_fn=lambda pkr: pkr.step_size,
  log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio,
  step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size)
       )

and then do

nuts_kernel.bootstrap_results(initial_state)

and investigate the shapes of logL, and proposal states are being returned.

Another thing to do is to feed your initial state into your log-likelihood/posterior and see if the dimensions of the returned log-likelihoods match what you think it should be (if you are doing multiple chains then maybe it should be returning # chains log likelihoods).

It is my understanding that the batch dimension (# chains) has to be the first one in all your vectorized calculations.

The very last part of my blog post on tensorflow and custom likelihoods has working code for an example that does this.

Rob Hicks
  • 42
  • 5
  • Thanks for the feedback. Your blog was actually one of my initial sources. I was able to get some help from the TensorFlow Probability team. Using JointDistributionSequentialAutoBatched() helps build the distributions with the correct batch shapes. I know have some sampling problems. Just working to fix those before posting a complete response. – Jason Hawkins May 11 '20 at 17:02
0

I was able to get reasonable results from my model. Thank you to everyone for the help! The following points helped solve the various issues.

  1. Use of JointDistributionSequentialAutoBatched() to produce consistent batch shapes. You need tf-nightly installed for access.

  2. More informative priors for hyperparameters. The exponential transformation in the Multinomial() distribution means that uninformative hyperparameters (i.e., with sigma = 1e5) mean you quickly have large positive numbers entering the exp(), leading to infinities.

  3. Setting the step size, etc. was also important.

  4. I found an answer by Christopher Suter to a recent question on the Tensorflow Probability forum specifying a model from STAN useful. I took the use of taking a sample from my prior as a starting point for the initial likelihood parameters useful.

  5. Despite JointDistributionSequentialAutoBatched() correcting the batch shapes, I went back and corrected my joint distribution shapes so that printing log_prob_parts() gives consistent shapes (i.e., [10,1] for 10 chains). I still get a shape error without using JointDistributionSequentialAutoBatched() but the combination seem to work.

  6. I separated my affine() into two functions. They do the same thing but remove retracing warnings. Basically, affine() was able to broadcast the inputs but they differed and it was easier to write two functions that setup the inputs with consistent shapes. Differently shaped inputs causes Tensorflow to trace the function multiple times.

Jason Hawkins
  • 545
  • 1
  • 8
  • 21