The Hamming Machine

Tammo Rukat

September 7, 2016

Introducing the Hamming Machine

Notation

  • Indices
    • \({d = 1\ldots D\; \text{- features}}\)
    • \({n = 1\ldots N\; \text{- observations/specimens}}\)
    • \({l = 1\ldots L\; \text{- latent variables}}\)
    • \({k = 1\ldots K\; \text{- layers}}\)
  • Variables
    • \({x_{nd}\; \text{- observations}}\)
    • \({u_{ld}\; \text{- parameters}}\)
    • \({z_{nl}\; \text{- latent variables}}\)

Model derivation

  • Construct a conditional probability distribution based on the Hamming distance between two binary vectors, \({h(\mathbf{x},\mathbf{u})}\), and a dispersion parameter \({\lambda}\): $$ p(\mathbf{x}|\mathbf{u}) \propto \exp\left[ -\lambda \, h(\mathbf{x},\mathbf{u}) \right] $$
  • Each observation \({\mathbf{x}}\) is generated from a subset of binary codes \({\mathbf{u}_{l{=}1\ldots L}}\), selected by a vector of binary latent variables \({\mathbf{z}}\): $$ p(\mathbf{x}|\mathbf{U},\mathbf{z},\lambda) \propto \prod\limits_l p(\mathbf{x}|\mathbf{u}_l,\lambda)^{z_l} = \prod\limits_d \exp\left[- \sum_l z_l \lambda h(x_d,u_{ld}) \right]$$
  • Normalising the likelihood for binary observations yields a logistic sigmoid: $$ p(x_d = 1|\mathbf{z}, \mathbf{u}_{1\ldots L}, \lambda) = \frac{1}{1+\exp\left[-\lambda \sum\limits_l z_l (2u_{ld} - 1) \right]} = \sigma\left[\lambda \sum_l z_l \tilde{u}_{ld} \right]$$
  • Here we define the mapping from \({\{0,1\}}\) to \({\{{-}1,1\}\,}\): \(\;\;{\tilde{u} = 2u{-}1}\) (see the numerical sketch after this list).
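
A minimal NumPy sketch (not part of the slides; sizes, seed and \({\lambda}\) are arbitrary) that evaluates \({p(x_d{=}1|\mathbf{z},\mathbf{U},\lambda) = \sigma[\lambda \sum_l z_l \tilde{u}_{ld}]}\) and checks it against normalising the Hamming-distance form directly:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 7          # number of codes and features (toy sizes)
lam = 2.0            # dispersion parameter lambda

U = rng.integers(0, 2, size=(L, D))   # binary codes u_{ld}
z = rng.integers(0, 2, size=L)        # binary latent selector z_l
U_tilde = 2 * U - 1                   # map {0,1} -> {-1,1}

# p(x_d = 1 | z, U, lambda) = sigma(lambda * sum_l z_l * u~_{ld})
p_x1 = 1.0 / (1.0 + np.exp(-lam * (z @ U_tilde)))

def unnorm(x):
    # exp(-lambda * sum_l z_l * h(x, u_{ld})) for a scalar x in {0, 1}
    return np.exp(-lam * ((U != x) * z[:, None]).sum(axis=0))

# normalising the Hamming-distance form gives the same probabilities
p_check = unnorm(1) / (unnorm(1) + unnorm(0))
assert np.allclose(p_x1, p_check)
print(p_x1)
```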

Graphical model representation

  • $$ p(\mathbf{x}_{n}|\mathbf{z}_n,\mathbf{U},\lambda) = \prod_d \text{Ber} \left( x_{nd} |\sigma \left[ \lambda \sum\limits_{l=1}^L z_{nl} \tilde{u}_{ld} \right] \right)$$

single_layer_network.png

Toy example

sampler_002.png

Toy example

animation.gif

Synthetic example: Calculator digits

digits.png

  • Each digit is composed of a subset of 7 distinct bars.

Noiseless calculator digits

nonsparse_noisefree_data.png compressed data

nonsparse_noisefreeU0.png inferred codes

codes_recon_new.png uncompressed inferred codes

nonsparse_noisefreeZ0.png inferred latent variables

  • What about 3, 8 and 9?
    • \({``7 + 2 + 5 = 3"}\)
    • \({``7 + 2 + 5 + 6 + 1 = 3"}\)
    • Here \({+}\) stands for combining the selected codes, which in the noiseless limit amounts to a majority vote over their \({\pm 1}\) entries (see the sketch below).
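
The following sketch illustrates why these combinations work, assuming the inferred codes coincide with the digit shapes and a standard 7-segment encoding (the segment order and the \({\lambda}\) value are my own choices, not taken from the slides):

```python
import numpy as np

# Hypothetical 7-segment encoding (segments ordered: top, top-right,
# bottom-right, bottom, bottom-left, top-left, middle); the bars in the
# slides may be ordered differently.
SEGMENTS = {
    1: [0, 1, 1, 0, 0, 0, 0],
    2: [1, 1, 0, 1, 1, 0, 1],
    3: [1, 1, 1, 1, 0, 0, 1],
    5: [1, 0, 1, 1, 0, 1, 1],
    6: [1, 0, 1, 1, 1, 1, 1],
    7: [1, 1, 1, 0, 0, 0, 0],
}

def combine(digits, lam=10.0):
    """Noiseless-limit reconstruction sigma(lam * sum_l z_l u~_{ld}) rounded,
    i.e. a majority vote over the selected codes' {-1, +1} entries."""
    U_tilde = 2 * np.array([SEGMENTS[d] for d in digits]) - 1
    p = 1.0 / (1.0 + np.exp(-lam * U_tilde.sum(axis=0)))
    return (p > 0.5).astype(int).tolist()

print(combine([7, 2, 5]))          # segments of the digit 3
print(combine([7, 2, 5, 6, 1]))    # also the segments of the digit 3
print(SEGMENTS[3])                 # reference pattern
```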

Other perfect solutions

snas714.png

snas715.png

cbar.png

Reconstruction Error

calc_dist.png

Denoising

  • Calculator digits with 10% noise.

recon_example32.png Denoised digits

recon_example42.png Denoised digits

recon_example_snas.png Corresponding codes

The multi-layer Hamming Machine

twolayer_hm.png

The joint density factorises into terms of the form p(layer | parents)

With \({\mathbf{z}^{[0]}_n = \mathbf{x}_n}\) and \({L^{[0]} = D}\), this reads $$ p(\mathbf{Z}^{[0:K]},\mathbf{U}^{[1:K]},\lambda) = p(\mathbf{Z}^{[K]}) \prod_{k=0}^{K-1} p(\mathbf{Z}^{[k]}|\mathbf{Z}^{[k{+}1]},\mathbf{U}^{[k{+}1]},\lambda^{[k{+}1]})\, p(\mathbf{U}^{[k{+}1]})\, p(\lambda^{[k{+}1]}) $$
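
A minimal sketch of this factorisation for two hidden layers, treating the priors on \({\mathbf{Z}^{[K]}}\), \({\mathbf{U}}\) and \({\lambda}\) as constant for simplicity; shapes and helper names are illustrative, not taken from the slides:

```python
import numpy as np

def log_sigmoid(a):
    # numerically stable log(sigma(a)) = -log(1 + exp(-a))
    return -np.logaddexp(0.0, -a)

def layer_loglik(Z_below, Z_above, U, lam):
    """log p(Z^[k] | Z^[k+1], U^[k+1], lambda^[k+1]) for one layer,
    with Bernoulli( sigma(lam * Z_above @ (2U - 1)) ) units."""
    logits = lam * Z_above @ (2 * U - 1)     # shape (N, L_k)
    Z_tilde = 2 * Z_below - 1                # {0,1} -> {-1,1}
    return log_sigmoid(Z_tilde * logits).sum()

# Toy two-layer stack (assumed sizes, not the slides' experiment)
rng = np.random.default_rng(0)
N, D, L1, L2 = 5, 7, 4, 3
X  = rng.integers(0, 2, (N, D))    # Z^[0] = data
Z1 = rng.integers(0, 2, (N, L1))
Z2 = rng.integers(0, 2, (N, L2))
U1 = rng.integers(0, 2, (L1, D))
U2 = rng.integers(0, 2, (L2, L1))
lam1, lam2 = 1.5, 1.5

# log-joint up to the (here constant) priors p(Z^[2]), p(U), p(lambda)
log_joint = layer_loglik(X, Z1, U1, lam1) + layer_loglik(Z1, Z2, U2, lam2)
print(log_joint)
```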

Inference and learning

Abbreviations

  • Observation count matrix (see the sketch after this list): $$ a_{nd} = \tilde{x}_{nd} \sum\limits_{l = 1}^{L} z_{nl} \tilde{u}_{ld} $$ $$ a_{nd} \in \{-L,\ldots,-1,0,1,\ldots,L \} $$
  • Enables us to write the likelihood: $$ \mathcal{L}(\mathbf{U},\mathbf{Z},\lambda) = \prod_{n,d} \sigma \left[ \lambda a_{nd} \right] $$
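
A small NumPy sketch of these two quantities on random toy data (sizes and \({\lambda}\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, L = 6, 7, 4
lam = 2.0

X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U = rng.integers(0, 2, (L, D))

X_tilde = 2 * X - 1
U_tilde = 2 * U - 1

# a_{nd} = x~_{nd} * sum_l z_{nl} u~_{ld}; integer-valued in {-L, ..., L}
A = X_tilde * (Z @ U_tilde)

# log-likelihood  log L(U, Z, lambda) = sum_{n,d} log sigma(lambda * a_{nd})
log_lik = -np.logaddexp(0.0, -lam * A).sum()
print(A.min(), A.max(), log_lik)
```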

Gibbs sampling – full conditionals

  • Precomputed quantities $$ \gamma_{\lambda}(l) = \log(1+e^{-\lambda l}) $$
  • Full conditionals (a toy single-site update for \({z_{nl}}\) is sketched after this list) $$ p(u_{ld}=1|\text{rest}) = \sigma \left[-\tilde{u}_{ld} \sum\limits_n \left\{ \gamma_{\lambda}(a_{nd}) - \gamma_{\lambda}(a_{nd} - \tilde{u}_{ld} \,\tilde{x}_{nd} (\tilde{z}_{nl}+1)) \right\} \right] $$ $$ p(z_{nl}{=}1|\text{rest}) = \sigma\left[ -\tilde{z}_{nl} \sum\limits_d \left\{ \gamma_{\lambda}\left(a_{nd}\right) - \gamma_{\lambda}\left(a_{nd}-\tilde{z}_{nl}\,\tilde{x}_{nd}\,\tilde{u}_{ld}\right) \right\} \right] $$
  • Multilayer conditionals $$ p(z^{[k]}_{nl}{=}1|\text{rest}) = \sigma\left[-\tilde{z}_{nl} \sum\limits_{d} \left\{ \gamma_{\lambda}\left(a_{nd}\right) - \gamma_{\lambda}\left(a_{nd} - \tilde{z}_{nl}\, \tilde{x}_{nd}\, \tilde{u}_{ld}\right) \right\} + \lambda^{[k+1]} \sum\limits_{l^{[k+1]}} \tilde{u}^{[k+1]}_{l^{[k+1]}l}\; z^{[k+1]}_{n l^{[k+1]}} \right] $$
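
A toy single-site Gibbs update for \({z_{nl}}\) following the single-layer conditional above, with an incremental update of the count matrix \({a_{nd}}\) after a flip; this is a sketch under assumed shapes, not the original implementation:

```python
import numpy as np

def gamma(lam, a):
    # gamma_lambda(a) = log(1 + exp(-lam * a)), computed stably
    return np.logaddexp(0.0, -lam * a)

def gibbs_update_z(n, l, X, Z, U, A, lam, rng):
    """Resample z_{nl} from its full conditional and keep the count
    matrix A[n, :] = x~_{n,:} * (Z[n] @ U~) consistent afterwards."""
    x_t = 2 * X[n] - 1                  # x~_{nd}, shape (D,)
    u_t = 2 * U[l] - 1                  # u~_{ld}, shape (D,)
    z_t = 2 * Z[n, l] - 1               # current z~_{nl}

    # p(z_{nl}=1|rest) = sigma(-z~ * sum_d [gamma(a) - gamma(a - z~ x~ u~)])
    delta = gamma(lam, A[n]) - gamma(lam, A[n] - z_t * x_t * u_t)
    p_one = 1.0 / (1.0 + np.exp(z_t * delta.sum()))

    z_new = int(rng.random() < p_one)
    if z_new != Z[n, l]:                # flip happened: update A incrementally
        A[n] += (2 * z_new - 1) * x_t * u_t
        Z[n, l] = z_new

# Toy usage with random data (illustrative shapes only)
rng = np.random.default_rng(0)
N, D, L, lam = 6, 7, 4, 2.0
X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U = rng.integers(0, 2, (L, D))
A = (2 * X - 1) * (Z @ (2 * U - 1))
for n in range(N):
    for l in range(L):
        gibbs_update_z(n, l, X, Z, U, A, lam, rng)
print(Z)
```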

The modified metropolised Gibbs sampler

  • Instead of drawing from the full conditional, we always propose a value \({y'}\) that differs from the current value \({y}\), i.e. \({y' = 1-y}\).
  • For a binary variable the proposal distribution then takes the form $$ q(y'|y) = \frac{p(y'|\text{rest})}{1-p(y|\text{rest})} = 1 \quad \text{for}\; y' \neq y $$
  • The Metropolis-Hastings acceptance probability equals the mutation probability and is given by $$ p_{\text{mutation}}^{\text{modified}} = \min\left(1,\, \frac{p(y'|\text{rest})\,q(y|y')}{p(y|\text{rest})\,q(y'|y)}\right) = \min\left(1,\, \frac{p(y'|\text{rest})}{1-p(y'|\text{rest})}\right) $$
  • The Gibbs mutation probability is given by $$ p_{\text{mutation}}^{\text{Gibbs}} = p(y'|\text{rest}) $$
  • Therefore the modified sampler never mutates less often than the Gibbs sampler (see the sketch after this list): $$ p_{\text{mutation}}^{\text{modified}} \geq p_{\text{mutation}}^{\text{Gibbs}} $$
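
A two-line comparison of the two mutation probabilities for a single binary site (the values of \({p(y'|\text{rest})}\) are illustrative only):

```python
def mutation_probs(p_flip):
    """Probability of changing the current value of a binary variable,
    given p_flip = p(y' | rest) for the flipped value y' = 1 - y."""
    gibbs = p_flip                                   # standard Gibbs
    modified = min(1.0, p_flip / (1.0 - p_flip))     # metropolised Gibbs
    return gibbs, modified

for p in [0.1, 0.4, 0.5, 0.8]:
    g, m = mutation_probs(p)
    print(f"p(y'|rest)={p:.1f}  Gibbs={g:.2f}  modified={m:.2f}")
# The modified sampler never mutates less often than plain Gibbs.
```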

Alternative sampling schemes

Approaches

  1. Forward-filtering backward-sampling for joint updates across layers
    • Using conditional independence properties, as for hidden Markov models.
  2. Layer-wise training
    • Start from the layer closest to the data
    • Train until convergence
    • Turn on layer below
  3. Simulated reheating (see the schedule sketch after this list)
    • After convergence, reheat the system by means of the dispersion parameter \(\lambda\), which plays the role of an inverse temperature.
    • Sample at a fixed high temperature
    • Converge back to the equilibrium temperature
  4. Parallel tempering
    • Swapping states between chains is extremely unlikely
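
A hypothetical \({\lambda}\) schedule for the simulated-reheating idea in item 3; the schedule lengths and the hot factor are made-up values, not taken from the slides:

```python
import numpy as np

def reheating_schedule(lam_eq, n_converge, n_hot, n_anneal, hot_factor=0.25):
    """Lambda schedule for simulated reheating: sample at the equilibrium
    dispersion for a while, drop lambda to a 'hot' value (high temperature),
    hold it there, then anneal back to equilibrium."""
    hot = hot_factor * lam_eq
    return np.concatenate([
        np.full(n_converge, lam_eq),          # converge at equilibrium
        np.full(n_hot, hot),                  # fixed high temperature
        np.linspace(hot, lam_eq, n_anneal),   # converge back
    ])

lams = reheating_schedule(lam_eq=2.0, n_converge=100, n_hot=50, n_anneal=50)
# for lam in lams: run one Gibbs sweep with dispersion lam
print(lams[:3], lams[-3:])
```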

Results

comparsion_overall.png

Joint \({p(\mathbf{X},\mathbf{Z}_1,\mathbf{Z}_2|\mathbf{U},\lambda_0)}\)

comparsion_layer1.png

Data layer \({p(\mathbf{X}|\mathbf{Z}_1,\mathbf{U},\lambda_1)}\)

comparsion_layer2.png

First hidden layer \({p(\mathbf{Z}_1|\mathbf{Z}_2,\mathbf{U},\lambda_2)}\)

comparsion_layer3.png

Second hidden layer \({p(\mathbf{Z}_2|\mathbf{Z}_3,\mathbf{U},\lambda_3)}\)

Priors on the codes

Effect on the conditionals

  • A Bernoulli prior on a single code unit \({u_{ld}}\):

$$ p(u_{ld}=1|\text{rest}) = \sigma \left[\color{red}{ \log\left( \frac{ p(u_{ld}) }{ 1 - p(u_{ld}) } \right)} - \tilde{u}_{ld} \sum\limits_n \left\{ \gamma_{\lambda}(a_{nd}) - \gamma_{\lambda}(a_{nd} - \tilde{u}_{ld}\,\tilde{x}_{nd} (\tilde{z}_{nl} + 1))\right\} \right] $$

Types of priors

  1. Independent Bernoulli prior on every single code unit \({u_{ld}}\)
  2. Bernoulli prior controlling the number of 1s in every code, where \({q}\) is the ratio of the number of 1s in the code to the number of 1s in the data.

E.g. step-exp prior

prior.png $$ p(u = 1) = \tfrac{1}{2} \mathrm{H}( 1 - q ) + \tfrac{1}{2} \mathrm{H}(q-1)\, e^{-a(q-1)} $$ where \({\mathrm{H}}\) denotes the Heaviside step function.
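
A small sketch of the step-exp prior and of the log-odds term it contributes to the full conditional, implementing the step as a simple piecewise function (the value of \({a}\) is arbitrary):

```python
import numpy as np

def step_exp_prior(q, a=5.0):
    """p(u = 1) under the step-exp sparsity prior: flat at 1/2 while the code
    has no more 1s than the data (q <= 1), exponentially decaying beyond."""
    q = np.asarray(q, dtype=float)
    return np.where(q <= 1.0, 0.5, 0.5 * np.exp(-a * (q - 1.0)))

def prior_log_odds(q, a=5.0):
    """log[p(u=1) / (1 - p(u=1))]: the (red) term added to the conditional."""
    p = step_exp_prior(q, a)
    return np.log(p) - np.log1p(-p)

q_grid = np.array([0.5, 1.0, 1.2, 2.0])
print(step_exp_prior(q_grid))    # no penalty up to q = 1, then decay
print(prior_log_odds(q_grid))    # zero log-odds up to q = 1, then negative
```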

Effect of the prior for synthetic data - flat prior (old example)

a4_10_5.gif

Effect of the prior for synthetic data - step exp sparsity prior

a4_10_5_sparse.gif

Cancer mutational landscapes

First hidden layer

latent_hm0.png

latent_hm2.png

latent_hm1.png

latent_hm3.png

snas_0.png

Second hidden layer

latent_hm0_0.png

latent_hm2_0.png

latent_hm1_0.png

latent_hm3_0.png

snas_1.png

PCA

pca_0.png

pca_1.png

tSNE

tsne_0.png

tsne_1.png

Comparison to binary PCA

The sparse Hamming Machine

Motivation

  • The problem: every code has to vote on every feature. If a code asserts that certain features appear together, then it necessarily provides equally strong evidence that all other features do not appear.
  • This may not reflect the generative process that we wish to capture.
  • Proposed modification: $$ \tilde{u} \in \{-1,1\} \;\; \rightarrow \;\; \tilde{u} \in \{-1,0,1\} $$
  • This yields the full conditional, with \({\mathcal{S}}\) denoting a softmax over the three candidate values (a toy update is sketched below): $$ p(\tilde{u}_{ld}|\text{rest}) = \text{Cat} \left( \underset{\tilde{u}' \in \{-1,0,1\}}{\mathcal{S}} \left[ - \sum\limits_n \gamma_{\lambda}(a_{nd,\tilde{u}'}) \right] \right) $$
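
A toy implementation of this categorical update, reading \({\mathcal{S}}\) as a softmax over the candidates \({\tilde{u}' \in \{-1,0,1\}}\); the bookkeeping and shapes are my own illustrative choices, not the slides' sampler:

```python
import numpy as np

def gamma(lam, a):
    # gamma_lambda(a) = log(1 + exp(-lam * a))
    return np.logaddexp(0.0, -lam * a)

def sample_sparse_code_unit(l, d, X, Z, U_tilde, lam, rng):
    """Resample u~_{ld} in {-1, 0, 1} from its categorical full conditional,
    proportional to exp(-sum_n gamma_lambda(a_{nd, u~'}))."""
    x_t = 2 * X[:, d] - 1                       # x~_{nd}, shape (N,)
    # contribution of all codes except l to a_{nd}
    rest = x_t * (Z @ U_tilde[:, d] - Z[:, l] * U_tilde[l, d])
    candidates = np.array([-1, 0, 1])
    # a_{nd, u~'} for each candidate value of u~_{ld}
    a_cand = rest[:, None] + (x_t * Z[:, l])[:, None] * candidates[None, :]
    logits = -gamma(lam, a_cand).sum(axis=0)    # shape (3,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    U_tilde[l, d] = rng.choice(candidates, p=probs)

# Toy usage (illustrative shapes)
rng = np.random.default_rng(0)
N, D, L, lam = 6, 7, 4, 2.0
X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U_tilde = rng.choice([-1, 0, 1], size=(L, D))
sample_sparse_code_unit(0, 0, X, Z, U_tilde, lam, rng)
print(U_tilde[0, 0])
```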

Synthetic example

nonsparse_noisy_newdata.png compressed data

calc_digits_codesviva.png inferred codes

calc_digits_sparse_codes_reconviva.png reconstruction of codes

nonsparse_noisy_newZ0.png latent variables

calc_digits_sparse_codes_reconviva_3.png codes for \({L{=}3}\)

rgb_simplex.png color legend

  • Latent representations are sparser than for the traditional HM
  • The obvious single-bar representation takes L=14 codes.

Reconstruction error

distr.png

Future work

The minimal Hamming Machine

  • A very intuitive way of combining codes to generate observations: $$ p(x_{nd}=1|\mathbf{u}_d,\mathbf{z}_n) \propto \exp\left[-\lambda\, h\!\left(x_{nd},\min(\mathbf{u}_d^T\mathbf{z}_n,1)\right) \right] $$
  • The Bernoulli parameter for node \(x_{nd}\) takes one of only two possible values, \({\sigma(\pm \lambda)}\).
    • It equals \({\sigma(+\lambda)}\) if at least one pair of its parent variables is turned on, \((z_{nl},u_{ld}) = (1,1)\), independent of the values of all other parents.
    • It equals \({\sigma(-\lambda)}\) if all such pairs are turned off (see the sketch below).
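
A minimal sketch of this Boolean-OR behaviour (the vectors and \({\lambda}\) are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minimal_hm_prob_x1(z, u_d, lam):
    """p(x_{nd} = 1) in the minimal Hamming machine: sigma(+lam) if any
    parent pair (z_l, u_ld) = (1, 1) is on, sigma(-lam) otherwise."""
    on = min(int(z @ u_d), 1)          # Boolean OR over parent pairs
    return sigmoid(lam * (2 * on - 1))

z   = np.array([1, 0, 1, 0])
u_d = np.array([0, 1, 1, 0])
print(minimal_hm_prob_x1(z, u_d, lam=2.0))                  # sigma(+lam)
print(minimal_hm_prob_x1(z, np.zeros(4, int), lam=2.0))     # sigma(-lam)
```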

Scalable inference

Simple fixed point equations

  • Break the explaining-away dependencies between the latent variables: $$ p(\mathbf{x}|\mathbf{z},\mathbf{u}) \approx \prod\limits_{d,l} \sigma\left[ \lambda \tilde{x}_{d} z_l \tilde{u}_{ld} \right] $$
  • Iterate through all \(z_{nl}\) and \(u_{ld}\) and optimise each one in turn: $$ \sum\limits_d \tilde{x}_{nd} \tilde{u}_{ld} + \sum\limits_{l^{[2]}} z^{[2]}_{nl^{[2]}} \tilde{u}^{[2]}_{l^{[2]}l} > 0 \; \rightarrow z_{nl} = 1,\; \text{else}\; z_{nl}=0 $$ $$ \sum\limits_n \tilde{x}_{nd} z_{nl} > 0 \; \rightarrow u_{ld} = 1,\; \text{else}\; u_{ld}=0 $$
  • This converges very quickly but depends heavily on the initial conditions (a vectorised sketch follows below).
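
A vectorised sketch of these fixed-point sweeps for a single layer, dropping the layer-2 term and updating all entries of \({\mathbf{Z}}\) and \({\mathbf{U}}\) at once rather than element by element; a simplification of the scheme above, on random toy data:

```python
import numpy as np

def fixed_point_updates(X, Z, U, n_iter=20):
    """Single-layer fixed-point sweeps (no layer above):
    z_{nl} <- 1[ sum_d x~_{nd} u~_{ld} > 0 ],
    u_{ld} <- 1[ sum_n x~_{nd} z_{nl} > 0 ].
    Converges quickly but is sensitive to the initial Z, U."""
    X_tilde = 2 * X - 1
    for _ in range(n_iter):
        Z_new = (X_tilde @ (2 * U - 1).T > 0).astype(int)   # update all z_{nl}
        U_new = (Z_new.T @ X_tilde > 0).astype(int)         # update all u_{ld}
        if np.array_equal(Z_new, Z) and np.array_equal(U_new, U):
            break                                           # reached a fixed point
        Z, U = Z_new, U_new
    return Z, U

# Toy usage with random initial conditions (illustrative only)
rng = np.random.default_rng(0)
N, D, L = 20, 7, 3
X = rng.integers(0, 2, (N, D))
Z0 = rng.integers(0, 2, (N, L))
U0 = rng.integers(0, 2, (L, D))
Z, U = fixed_point_updates(X, Z0, U0)
print(U)
```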

Further applications

  • Networks
    • Adjacency matrices as input, e.g. brain networks from EEG/fMRI for different timepoints or different individuals.
    • Binarised single cell expression data.
    • Hannah/Garrett?

Appendix A: Further example – MNIST

  • 200 images of the digits 2, 7, 9
  • Two hidden layers, with 30 and 6 units respectively.

Sampling

mnist_sampler.png

Reconstructions

From the corresponding representations in layer 1 (left) and layer 2 (right)

recon_1.png recon_2.png

Codes

snb_small_1.png

Codes

snb_small_2.png