Robust Bayesian Modeling

Yau Group meeting

Tammo Rukat

February 2, 2016

Bayesian Modeling

What is a Bayesian Model?

  • A joint distribution of parameters \(\beta\) and data \(x\).
  • We usually think of exchangeable models (a generative sketch follows this list): $$p(\beta,x) = \underbrace{p(\beta|\alpha)}_{\text{prior}}\;\prod_{i=1}^{n} \underbrace{p(x_i|\beta)}_{\text{likelihood}}$$

[Figure: gm1.png — graphical model of the exchangeable model]
  • Generalise this to accommodate most common models:
    • conditional models: $$p(x_i|\beta) \rightarrow p(x_i|\mathbf{y}_i,\beta) \stackrel{\text{e.g.}}{=}\mathcal{N}(x_i|\mathbf{w}^T \mathbf{y}_i,\sigma)$$
    • or latent variable models: $$p(x_i|\beta) = \sum_{z_i} p(x_i,z_i|\beta) \stackrel{\text{e.g.}}{=} \sum_k \pi_k \mathcal{N}(x_i|\mu_k,\sigma_k)$$
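To make the notation concrete, here is a minimal generative sketch of the exchangeable model above, assuming a Normal prior on a Normal mean (the numbers are arbitrary illustrations, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hyperparameter alpha: here the (mean, std) of the Normal prior on beta.
alpha = (0.0, 10.0)

# Draw the parameter once from the prior p(beta | alpha) ...
beta = rng.normal(*alpha)

# ... then draw all data points i.i.d. from the likelihood p(x_i | beta).
x = rng.normal(loc=beta, scale=1.0, size=100)
```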

Interlude: What is the difference between parameters and latent variables?

  • Joint distribution: $$p(\beta,\mathbf{z},\mathbf{x}) = \underbrace{p(\beta|\alpha)}_{\text{parameters}} \prod_{i=1}^n \underbrace{p(z_i|\gamma)}_{\text{latent variables}} p(x_i|\beta,z_i)$$
  • Practical notion:
    • Number of latent variables grows with number of data points.
    • Number of parameters stays fixed.
  • Formally, there is no real difference between the two.

A useful distinction: Local vs global variables

$$p(\beta,\mathbf{z},\mathbf{x}) = \underbrace{p(\beta|\alpha)}_{\text{global}} \prod_{i=1}^n \underbrace{p(z_i|\gamma)}_{\text{local}} p(x_i|\beta,z_i)$$

  • The distinction is determined by conditional independence: given the global variables, each pair \((x_i,z_i)\) is independent of the rest of the data, $$p(x_i,z_i|x_{-i},z_{-i},\beta) = p(x_i,z_i|\beta)$$

Example: Gaussian mixture model

  • Which are the local and which are the global variables?

$$\;p(x_i|z_i) = \prod\limits_k \mathcal{N}(x_i|\mu_k,\sigma_k)^{z_{ik}}; \;\;\;\;\;\; p(z_i) = \prod\limits_k \pi_k^{z_{ik}}$$

  • global: means \(\mu_k\), standard deviations \(\sigma_k\), mixture proportions \(\pi_k\);
  • local: cluster assignments \(z_i\) (see the sketch below).
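A minimal generative sketch of this mixture model (component values are arbitrary illustrations): the global variables are drawn once and shared across the data set, while one local assignment is drawn per data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Global variables: one set, shared by all data points.
pi = np.array([0.5, 0.3, 0.2])     # mixture proportions pi_k
mu = np.array([-3.0, 0.0, 4.0])    # component means mu_k
sigma = np.array([1.0, 0.5, 2.0])  # component standard deviations sigma_k

# Local variables: one cluster assignment z_i per data point.
n = 500
z = rng.choice(len(pi), size=n, p=pi)      # z_i ~ Categorical(pi)
x = rng.normal(loc=mu[z], scale=sigma[z])  # x_i | z_i ~ N(mu_{z_i}, sigma_{z_i})
```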

Robust Bayesian Modeling

Motivation

  • Wang and Blei (2015), "A General Method for Robust Bayesian Modeling"

"All models are wrong, but some are useful." (George Box)

  • Robustness: Inference should be insensitive to small deviations from the model assumptions.
  • Wang and Blei introduce a general framework for robust Bayesian modeling.

Key idea: Localisation of global parameters

  • Classical model: $$ p(\beta,\mathbf{x}) = p(\beta|\alpha) \prod\limits_{i=1}^n p(x_i|\beta) $$
    • All data points are drawn from the same parameter.
    • The hyperparameter \(\alpha\) is usually fixed.
  • Robust model: $$ p(\boldsymbol{\beta},\mathbf{x}) = \prod\limits_{i=1}^n p(\beta_i|\alpha) p(x_i|\beta_i) $$
    • Every data point is assumed drawn from an individual realisation of the parameter, which is itself drawn from the prior.
    • Outliers are explained by variation in the parameters (see the sketch below).
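A minimal sketch of the two generative processes, assuming Normal observations with a Normal prior on a location parameter (all numbers arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
alpha = (0.0, 2.0)  # hyperparameter: (mean, std) of the prior on beta

# Classical model: one global beta, shared by every data point.
beta = rng.normal(*alpha)
x_classical = rng.normal(loc=beta, scale=1.0, size=n)

# Robust model: an individual beta_i per data point, each drawn from the prior.
beta_i = rng.normal(*alpha, size=n)
x_robust = rng.normal(loc=beta_i, scale=1.0)

# The robust sample has extra spread: parameter variation can absorb outliers.
print(x_classical.std(), x_robust.std())
```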

Graphical Model for Localisation

  • Classic model – global \(\beta\)

[Figure: gm1.png — graphical model with one global \(\beta\)]

  • Robust model – local \(\beta\)

[Figure: gm2.png — graphical model with local \(\beta_i\)]

  • We now need to fit the hyperparameter \(\alpha\).
  • Fixing \(\alpha\) would make the data points fully independent, so nothing would be shared between them; fitting \(\alpha\) is what couples the data.

Example: Normal observation model

  • Localise the variance parameter and use its conjugate inverse-Gamma prior:

\(\begin{align} p(x_i|\alpha) &= \int p(x_i|\beta_i)\, p(\beta_i|\alpha)\, d\beta_i \\ &= \int \mathcal{N}(x_i|\mu,\sigma_i^2)\; \text{Gam}^{-1}(\sigma_i^2|\alpha)\, d\sigma_i^2 \end{align}\)

  • Any guesses?

$$ p(x_i|\alpha) = \text{Student-t}(x_i|\mu, (\lambda,\nu)=f(\alpha) ) $$

[Figure: student_t.png — Student-t density]
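For concreteness, here is the standard marginalisation behind this result, writing the inverse-Gamma hyperparameters as shape \(a\) and scale \(b\), i.e. \(\alpha=(a,b)\) (these symbols are introduced only for this derivation, and Student-t parametrisation conventions vary):

\(\begin{align} p(x_i|\alpha) &= \int_0^\infty \mathcal{N}(x_i|\mu,\sigma_i^2)\;\text{Gam}^{-1}(\sigma_i^2|a,b)\, d\sigma_i^2 \\ &= \frac{b^a\,\Gamma(a+\frac{1}{2})}{\Gamma(a)\sqrt{2\pi}}\left(b+\frac{(x_i-\mu)^2}{2}\right)^{-(a+\frac{1}{2})} \\ &= \text{Student-t}\!\left(x_i\,\middle|\,\mu,\;\lambda=\frac{a}{b},\;\nu=2a\right) \end{align}\)

The heavier tails come from averaging Normals over all plausible variances, which is exactly how the localised model absorbs outliers.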

2nd key idea: Empirical Bayes

  • Estimate hyperparameters via maximum marginal likelihood (a sketch follows this list): $$ \hat{\alpha}=\text{arg max}_{\alpha} \sum\limits_{i=1}^{n} \log \int p(x_i|\beta_i)\, p(\beta_i|\alpha)\, d\beta_i $$
  • aka evidence approximation $$ \text{evidence} = p(x_i|\alpha) = \int p(x_i|\beta_i) p(\beta_i|\alpha) d\beta_i $$
  • Here we use the data to determine the prior; is that legitimate?
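Since the evidence of the localised Normal/inverse-Gamma model is the Student-t derived above, empirical Bayes here reduces to fitting a Student-t by maximum likelihood. A minimal sketch (toy data; the parametrisation \(\nu=2a\), \(\text{scale}^2=b/a\) and the optimiser choice are assumptions, not from the paper):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = stats.t.rvs(df=4, loc=0.0, scale=1.5, size=1000, random_state=rng)  # toy data

def neg_log_evidence(params):
    """Negative log marginal likelihood sum_i log p(x_i | alpha), alpha = (a, b)."""
    a, b = np.exp(params)  # optimise on the log scale to keep a, b positive
    return -stats.t.logpdf(x, df=2 * a, loc=0.0, scale=np.sqrt(b / a)).sum()

res = minimize(neg_log_evidence, x0=[0.0, 0.0], method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"alpha_hat: a = {a_hat:.2f}, b = {b_hat:.2f} (nu = {2 * a_hat:.2f})")
```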

Performance

Linear Regression

  • Training data (sketched below): \(\begin{align} y_i|x_i &\sim \mathcal{N}(\omega^T x_i + b, \sigma_i + 0.02) \\ \sigma_i &\sim \text{Gamma}(k,1) \end{align}\)
  • Test data: \(\begin{align} y_i|x_i \sim \mathcal{N}(\omega^T x_i + b, 0.02) \end{align}\)
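A sketch of this data-generating process (dimension, weights, and the Gamma shape \(k\) are arbitrary; the second argument of \(\mathcal{N}\) is read as a standard deviation, matching the slides' convention):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 100, 5, 1.0           # data size, dimension, Gamma shape (all arbitrary)
w, b = rng.normal(size=d), 0.5  # regression weights omega and intercept b

X = rng.normal(size=(n, d))
sigma = rng.gamma(shape=k, scale=1.0, size=n)            # sigma_i ~ Gamma(k, 1)
y_train = rng.normal(loc=X @ w + b, scale=sigma + 0.02)  # noisy, heteroscedastic
y_test = rng.normal(loc=X @ w + b, scale=0.02)           # clean test observations
```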

[Figure: lin_reg_error.png (with legend.png) — test-set prediction error for linear regression]

Logistic Regression

$$ y_i | x_i \sim \text{Bernoulli}(\sigma(\omega^T x_i)) $$

[Figure: log_reg_error.png (with legend.png) — test-set prediction error for logistic regression]

The posterior predictive

  • Classical Bayesian model: $$p(x_i|\mathbf{x},\alpha) = \int p(x_i|\beta)\,p(\beta|\mathbf{x},\alpha)d\beta$$
    • Gives the correct predictive distribution only if the data actually come from the model.
  • Robust Bayesian model: $$p(x_i|\hat{\alpha}) = \int p(x_i|\beta_i)\,p(\beta_i|\hat{\alpha}) d\beta_i$$
    • Gives the correct predictive distribution independent of model mismatch.
  • If we want to make predictions under the model, which one should we choose? (A toy comparison follows this list.)
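As a toy illustration of the difference (the Normal/inverse-Gamma setup from earlier; \(\hat{\alpha}=(a,b)\), the evaluation point, and the stand-in posterior samples are all placeholder assumptions): both predictives are integrals, so each can be approximated by a Monte Carlo average of the likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, a_hat, b_hat = 0.0, 3.0, 2.0  # placeholder values for mu and alpha_hat
x_new = 2.5                       # arbitrary point at which to evaluate the predictives

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Robust predictive: average the likelihood over beta_i ~ p(beta_i | alpha_hat).
var_prior = 1.0 / rng.gamma(shape=a_hat, scale=1.0 / b_hat, size=50_000)  # inv-Gamma draws
robust_pred = normal_pdf(x_new, mu, var_prior).mean()

# Classical predictive: average the likelihood over posterior samples of beta
# (a stand-in posterior here; in practice these would come from MCMC or similar).
var_posterior = rng.gamma(shape=20.0, scale=0.05, size=50_000)
classical_pred = normal_pdf(x_new, mu, var_posterior).mean()
print(robust_pred, classical_pred)
```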

References

  • Wang and Blei (2015), "A General Method for Robust Bayesian Modeling"
  • Gelman et al. (2014), "Bayesian Data Analysis", 3rd edition
  • Murphy (2012), "Machine Learning: A Probabilistic Perspective"
  • Carlin and Louis (2000), "Empirical Bayes: Past, Present and Future"