Robust Bayesian Modeling

Yau Group meeting

Tammo Rukat

February 2, 2016

Bayesian Modeling

What is a Bayesian Model?

  • A joint distribution of parameters \(\beta\) and data \(x\).
  • We usually think of exchangeable models (a generative sketch follows this list): $$p(\beta,x) = \underbrace{p(\beta|\alpha)}_{\text{prior}}\;\prod_{i=1}^{n} \underbrace{p(x_i|\beta)}_{\text{likelihood}}$$

[Figure: gm1.png — graphical model of the exchangeable model]
  • Generalise this to accommodate most common models:
    • conditional models: $$p(x_i|\beta) \rightarrow p(x_i|\mathbf{y}_i,\beta) \stackrel{\text{e.g.}}{=}\mathcal{N}(x_i|\mathbf{w}^T \mathbf{y}_i,\sigma)$$
    • or latent variable models: $$p(x_i|\beta) = \sum_{z_i} p(x_i,z_i|\beta) \stackrel{\text{e.g.}}{=} \sum_k \pi_k \mathcal{N}(x_i|\mu_k,\sigma_k)$$
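To make the notation concrete, here is a minimal generative sketch of the exchangeable model above, assuming a Normal prior on a Normal mean (the numbers are arbitrary illustrations, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hyperparameter alpha: here the (mean, std) of the Normal prior on beta.
alpha = (0.0, 10.0)

# Draw the parameter once from the prior p(beta | alpha) ...
beta = rng.normal(*alpha)

# ... then draw all data points i.i.d. from the likelihood p(x_i | beta).
x = rng.normal(loc=beta, scale=1.0, size=100)
```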

Interlude: What is the difference between parameters and latent variables?

  • Joint distribution: $$p(\beta,\mathbf{z},\mathbf{x}) = \underbrace{p(\beta|\alpha)}_{\text{parameters}} \prod_{i=1}^n \underbrace{p(z_i|\gamma)}_{\text{latent variables}} p(x_i|\beta,z_i)$$
  • Practical notion:
    • Number of latent variables grows with number of data points.
    • Number of parameters stays fixed.
  • Formally, there is no real difference between the two.

A useful distinction: Local vs global variables

$$p(\beta,\mathbf{z},\mathbf{x}) = \underbrace{p(\beta|\alpha)}_{\text{global}} \prod_{i=1}^n \underbrace{p(z_i|\gamma)}_{\text{local}} p(x_i|\beta,z_i)$$

  • The distinction is determined by conditional independence: given the global variables, each pair \((x_i,z_i)\) is independent of the rest of the data, $$p(x_i,z_i|x_{-i},z_{-i},\beta) = p(x_i,z_i|\beta)$$

Example: Gaussian mixture model

  • Which are the local and which are the global variables?

$$\;p(x_i|z_i) = \prod\limits_k \mathcal{N}(x_i|\mu_k,\sigma_k)^{z_{ik}}; \;\;\;\;\;\; p(z_i) = \prod\limits_k \pi_k^{z_{ik}}$$

  • global: means \(\mu_k\), standard deviations \(\sigma_k\), mixture proportions \(\pi_k\);
  • local: cluster assignments \(z_i\) (see the sketch below).
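A minimal generative sketch of this mixture model (component values are arbitrary illustrations): the global variables are drawn once and shared across the data set, while one local assignment is drawn per data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Global variables: one set, shared by all data points.
pi = np.array([0.5, 0.3, 0.2])     # mixture proportions pi_k
mu = np.array([-3.0, 0.0, 4.0])    # component means mu_k
sigma = np.array([1.0, 0.5, 2.0])  # component standard deviations sigma_k

# Local variables: one cluster assignment z_i per data point.
n = 500
z = rng.choice(len(pi), size=n, p=pi)      # z_i ~ Categorical(pi)
x = rng.normal(loc=mu[z], scale=sigma[z])  # x_i | z_i ~ N(mu_{z_i}, sigma_{z_i})
```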

Robust Bayesian Modeling

Motivation

  • Wang and Blei (2015), "A General Method for Robust Bayesian Modeling"

"All models are wrong, but some are useful." (George Box)

  • Robustness: Inference should be insensitive to small deviations from the model assumptions.
  • Wang and Blei introduce a general framework for robust Bayesian modeling.

Key idea: Localisation of global parameters

  • Classical model: $$ p(\beta,\mathbf{x}) = p(\beta|\alpha) \prod\limits_{i=1}^n p(x_i|\beta) $$
    • All data points are drawn from the same parameter.
    • The hyperparameter \(\alpha\) is usually fixed.
  • Robust model: $$ p(\boldsymbol{\beta},\mathbf{x}) = \prod\limits_{i=1}^n p(\beta_i|\alpha) p(x_i|\beta_i) $$
    • Every data point is assumed drawn from an individual realisation of the parameter, which is itself drawn from the prior.
    • Outliers are explained by variation in the parameters (see the sketch below).
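A minimal sketch of the two generative processes, assuming Normal observations with a Normal prior on a location parameter (all numbers arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
alpha = (0.0, 2.0)  # hyperparameter: (mean, std) of the prior on beta

# Classical model: one global beta, shared by every data point.
beta = rng.normal(*alpha)
x_classical = rng.normal(loc=beta, scale=1.0, size=n)

# Robust model: an individual beta_i per data point, each drawn from the prior.
beta_i = rng.normal(*alpha, size=n)
x_robust = rng.normal(loc=beta_i, scale=1.0)

# The robust sample has extra spread: parameter variation can absorb outliers.
print(x_classical.std(), x_robust.std())
```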

Graphical Model for Localisation

  • Classic model – global \(\beta\)

[Figure: gm1.png — graphical model with one global \(\beta\)]

  • Robust model – local \(\beta\)

[Figure: gm2.png — graphical model with local \(\beta_i\)]

  • We now need to fit the hyperparameter \(\alpha\).
  • Fixing \(\alpha\) would make the data points fully independent, so nothing would be shared between them; fitting \(\alpha\) is what couples the data.

Example: Normal observation model

  • Localise the variance parameter and use its conjugate inverse-Gamma prior:

\(\begin{align} p(x_i|\alpha) &= \int p(x_i|\beta_i)\, p(\beta_i|\alpha)\, d\beta_i \\ &= \int \mathcal{N}(x_i|\mu,\sigma_i^2)\; \text{Gam}^{-1}(\sigma_i^2|\alpha)\, d\sigma_i^2 \end{align}\)

  • Any guesses?

$$ p(x_i|\alpha) = \text{Student-t}(x_i|\mu, (\lambda,\nu)=f(\alpha) ) $$

[Figure: student_t.png — Student-t density]
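For concreteness, here is the standard marginalisation behind this result, writing the inverse-Gamma hyperparameters as shape \(a\) and scale \(b\), i.e. \(\alpha=(a,b)\) (these symbols are introduced only for this derivation, and Student-t parametrisation conventions vary):

\(\begin{align} p(x_i|\alpha) &= \int_0^\infty \mathcal{N}(x_i|\mu,\sigma_i^2)\;\text{Gam}^{-1}(\sigma_i^2|a,b)\, d\sigma_i^2 \\ &= \frac{b^a\,\Gamma(a+\frac{1}{2})}{\Gamma(a)\sqrt{2\pi}}\left(b+\frac{(x_i-\mu)^2}{2}\right)^{-(a+\frac{1}{2})} \\ &= \text{Student-t}\!\left(x_i\,\middle|\,\mu,\;\lambda=\frac{a}{b},\;\nu=2a\right) \end{align}\)

The heavier tails come from averaging Normals over all plausible variances, which is exactly how the localised model absorbs outliers.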

2nd key idea: Empirical Bayes

  • Estimate hyperparameters via maximum marginal likelihood (a sketch follows this list): $$ \hat{\alpha}=\text{arg max}_{\alpha} \sum\limits_{i=1}^{n} \log \int p(x_i|\beta_i)\, p(\beta_i|\alpha)\, d\beta_i $$
  • aka evidence approximation $$ \text{evidence} = p(x_i|\alpha) = \int p(x_i|\beta_i) p(\beta_i|\alpha) d\beta_i $$
  • Here we use the data to determine the prior; is that legitimate?
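Since the evidence of the localised Normal/inverse-Gamma model is the Student-t derived above, empirical Bayes here reduces to fitting a Student-t by maximum likelihood. A minimal sketch (toy data; the parametrisation \(\nu=2a\), \(\text{scale}^2=b/a\) and the optimiser choice are assumptions, not from the paper):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = stats.t.rvs(df=4, loc=0.0, scale=1.5, size=1000, random_state=rng)  # toy data

def neg_log_evidence(params):
    """Negative log marginal likelihood sum_i log p(x_i | alpha), alpha = (a, b)."""
    a, b = np.exp(params)  # optimise on the log scale to keep a, b positive
    return -stats.t.logpdf(x, df=2 * a, loc=0.0, scale=np.sqrt(b / a)).sum()

res = minimize(neg_log_evidence, x0=[0.0, 0.0], method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"alpha_hat: a = {a_hat:.2f}, b = {b_hat:.2f} (nu = {2 * a_hat:.2f})")
```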

Performance

Linear Regression

  • Training data (sketched below): \(\begin{align} y_i|x_i &\sim \mathcal{N}(\omega^T x_i + b, \sigma_i + 0.02) \\ \sigma_i &\sim \text{Gamma}(k,1) \end{align}\)
  • Test data: \(\begin{align} y_i|x_i \sim \mathcal{N}(\omega^T x_i + b, 0.02) \end{align}\)
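A sketch of this data-generating process (dimension, weights, and the Gamma shape \(k\) are arbitrary; the second argument of \(\mathcal{N}\) is read as a standard deviation, matching the slides' convention):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 100, 5, 1.0           # data size, dimension, Gamma shape (all arbitrary)
w, b = rng.normal(size=d), 0.5  # regression weights omega and intercept b

X = rng.normal(size=(n, d))
sigma = rng.gamma(shape=k, scale=1.0, size=n)            # sigma_i ~ Gamma(k, 1)
y_train = rng.normal(loc=X @ w + b, scale=sigma + 0.02)  # noisy, heteroscedastic
y_test = rng.normal(loc=X @ w + b, scale=0.02)           # clean test observations
```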

[Figure: lin_reg_error.png (with legend.png) — test-set prediction error for linear regression]

Logistic Regression

$$ y_i | x_i \sim \text{Bernoulli}(\sigma(\omega^T x_i)) $$

[Figure: log_reg_error.png (with legend.png) — test-set prediction error for logistic regression]

The posterior predictive

  • Classical Bayesian model: $$p(x_i|\mathbf{x},\alpha) = \int p(x_i|\beta)\,p(\beta|\mathbf{x},\alpha)d\beta$$
    • Gives the correct predictive distribution only if the data actually come from the model.
  • Robust Bayesian model: $$p(x_i|\hat{\alpha}) = \int p(x_i|\beta_i)\,p(\beta_i|\hat{\alpha}) d\beta_i$$
    • Gives the correct predictive distribution independent of model mismatch.
  • If we want to make predictions under the model, which one should we choose? (A toy comparison follows this list.)
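As a toy illustration of the difference (the Normal/inverse-Gamma setup from earlier; \(\hat{\alpha}=(a,b)\), the evaluation point, and the stand-in posterior samples are all placeholder assumptions): both predictives are integrals, so each can be approximated by a Monte Carlo average of the likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, a_hat, b_hat = 0.0, 3.0, 2.0  # placeholder values for mu and alpha_hat
x_new = 2.5                       # arbitrary point at which to evaluate the predictives

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Robust predictive: average the likelihood over beta_i ~ p(beta_i | alpha_hat).
var_prior = 1.0 / rng.gamma(shape=a_hat, scale=1.0 / b_hat, size=50_000)  # inv-Gamma draws
robust_pred = normal_pdf(x_new, mu, var_prior).mean()

# Classical predictive: average the likelihood over posterior samples of beta
# (a stand-in posterior here; in practice these would come from MCMC or similar).
var_posterior = rng.gamma(shape=20.0, scale=0.05, size=50_000)
classical_pred = normal_pdf(x_new, mu, var_posterior).mean()
print(robust_pred, classical_pred)
```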

References

  • Wang and Blei (2015), "A General Method for Robust Bayesian Modeling"
  • Gelman et al. (2014), "Bayesian Data Analysis", 3rd edition
  • Murphy (2012), "Machine Learning: A Probabilistic Perspective"
  • Carlin and Louis (2000), "Empirical Bayes: Past, Present and Future"