Suppose a rental car service operates in your city. Drivers can drop off and pick up cars anywhere inside the city limits, and you can find and rent cars using an app. Suppose you wish to find the probability that you can find a rental car within a short distance of your home address at any time of day. Over three days, you look at the app and find the following numbers of cars within a short distance of your home address:

\mathbf{x} = [3, 4, 1]

If we assume the data comes from a Poisson distribution, we can compute the maximum likelihood estimate of the parameter of the model, which is

\lambda = \frac{3+4+1}{3} \approx 2.67\,.

Using this maximum likelihood estimate, we can compute the probability that there will be at least one car available on a given day:

p(x>0 | \lambda \approx 2.67) = 1 - p(x=0 | \lambda \approx 2.67) = 1 - \frac{2.67^0 e^{-2.67}}{0!} \approx 0.93\,.

This is the Poisson distribution that is
the most likely to have generated the observed data \mathbf{x}. But the data could also have come from another Poisson distribution, e.g., one with \lambda = 3, or \lambda = 2, etc. In fact, there are infinitely many Poisson distributions that could have generated the observed data, and with relatively few data points we should be quite uncertain about which exact Poisson distribution it was. Intuitively, we should instead take a weighted average of p(x>0 | \lambda) over all those Poisson distributions, with each one weighted by how probable it is given the data \mathbf{x} we have observed. Generally, this quantity is known as the
posterior predictive distribution

p(x|\mathbf{x}) = \int_\theta p(x|\theta)\,p(\theta|\mathbf{x})\,d\theta\,,

where x is a new data point, \mathbf{x} is the observed data, and \theta are the parameters of the model. Using Bayes' theorem we can expand

p(\theta|\mathbf{x}) = \frac{p(\mathbf{x}|\theta)\,p(\theta)}{p(\mathbf{x})}\,,

and therefore

p(x|\mathbf{x}) = \int_\theta p(x|\theta)\,\frac{p(\mathbf{x}|\theta)\,p(\theta)}{p(\mathbf{x})}\,d\theta\,.

Generally, this integral is hard to compute. However, if we choose a conjugate prior distribution p(\theta), a closed-form expression can be derived. This is the posterior predictive column in the tables below. Returning to our example, if we pick the
Gamma distribution as our prior distribution over the rate of the Poisson distributions, then the posterior predictive is the negative binomial distribution, as can be seen from the table below. The Gamma distribution is parameterized by two hyperparameters \alpha and \beta, which we have to choose. By looking at plots of the Gamma distribution, we pick \alpha = \beta = 2, which seems to be a reasonable prior for the average number of cars. The choice of prior hyperparameters is inherently subjective and based on prior knowledge.

Given the prior hyperparameters \alpha and \beta, we can compute the posterior hyperparameters

\alpha' = \alpha + \sum_i x_i = 2 + 3 + 4 + 1 = 10

and

\beta' = \beta + n = 2 + 3 = 5\,.

Given the posterior hyperparameters, we can finally compute the posterior predictive of

p(x>0|\mathbf{x}) = 1 - p(x=0|\mathbf{x}) = 1 - NB\left(0\,|\,10, \frac{5}{1+5}\right) \approx 0.84\,.

This much more conservative estimate reflects the uncertainty in the model parameters, which the posterior predictive takes into account.

== Table of conjugate distributions ==
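Before turning to the tables, the worked example from the previous section can be reproduced numerically. The following is a minimal sketch using scipy (the variable names are illustrative); it computes both the maximum likelihood plug-in estimate and the closed-form negative binomial posterior predictive, and sanity-checks the latter by numerically integrating p(x=0|\lambda) against the Gamma posterior:

```python
from scipy.integrate import quad
from scipy.stats import gamma, nbinom, poisson

x = [3, 4, 1]  # observed counts over three days
n = len(x)

# Maximum likelihood estimate of the Poisson rate, and the plug-in p(x > 0)
lam = sum(x) / n                      # 8/3 ≈ 2.67
p_mle = 1 - poisson.pmf(0, lam)
print(f"MLE plug-in estimate: {p_mle:.2f}")   # ≈ 0.93

# Gamma(alpha, beta) prior on the rate; conjugate update of the hyperparameters
alpha, beta = 2, 2
alpha_post = alpha + sum(x)           # alpha' = 10
beta_post = beta + n                  # beta'  = 5

# Closed form: the posterior predictive is NB(alpha', beta' / (1 + beta'))
p_post = 1 - nbinom.pmf(0, alpha_post, beta_post / (1 + beta_post))
print(f"Posterior predictive: {p_post:.2f}")  # ≈ 0.84

# Sanity check: integrate p(x=0 | lambda) against the Gamma posterior directly
p0, _ = quad(
    lambda l: poisson.pmf(0, l) * gamma.pdf(l, a=alpha_post, scale=1 / beta_post),
    0, float("inf"),
)
assert abs((1 - p0) - p_post) < 1e-6
```

The assertion confirms that the closed-form negative binomial result agrees with the posterior predictive integral evaluated by quadrature.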