The Hamming Machine

Tammo Rukat

September 7, 2016

Introducing the Hamming Machine

Notation

  • Indices
    • \({d = 1\ldots D\; \text{- features}}\)
    • \({n = 1\ldots N\; \text{- observations/specimens}}\)
    • \({l = 1\ldots L\; \text{- latent variables}}\)
    • \({k = 1\ldots K\; \text{- layers}}\)
  • Variables
    • \({x_{nd}\; \text{- observations}}\)
    • \({u_{ld}\; \text{- parameters}}\)
    • \({z_{nl}\; \text{- latent variables}}\)

Model derivation

  • Construct a conditional probability distribution based on the Hamming distance between two binary vectors, \({h(\mathbf{x},\mathbf{u})}\), and a dispersion parameter \({\lambda}\): $$ p(\mathbf{x}|\mathbf{u}) \propto \exp\left[ -\lambda \, h(\mathbf{x},\mathbf{u}) \right] $$
  • Each observation \({\mathbf{x}}\) is generated from a subset of binary codes \({\mathbf{u}_{l{=}1\ldots L}}\), selected by a vector of binary latent variables \({\mathbf{z}}\): $$ p(\mathbf{x}|\mathbf{U},\mathbf{z},\lambda) \propto \prod\limits_l p(\mathbf{x}|\mathbf{u}_l,\lambda)^{z_l} = \prod\limits_d \exp\left[- \sum_l z_l \lambda h(x_d,u_{ld}) \right]$$
  • Normalising the likelihood for binary observations yields a logistic sigmoid: $$ p(x_d = 1|\mathbf{z}, \mathbf{u}_{1\ldots L}, \lambda) = \frac{1}{1+\exp\left[-\lambda \sum\limits_l z_l (2u_{ld} - 1) \right]} = \sigma\left[\lambda \sum_l z_l \tilde{u}_{ld} \right]$$
  • Here we define the mapping from \({\{0,1\}}\) to \({\{{-}1,1\}\,}\): \(\;\;{\tilde{u} = 2u{-}1}\) (see the numerical sketch after this list).
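
A minimal NumPy sketch (not part of the slides; sizes, seed and \({\lambda}\) are arbitrary) that evaluates \({p(x_d{=}1|\mathbf{z},\mathbf{U},\lambda) = \sigma[\lambda \sum_l z_l \tilde{u}_{ld}]}\) and checks it against normalising the Hamming-distance form directly:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 7          # number of codes and features (toy sizes)
lam = 2.0            # dispersion parameter lambda

U = rng.integers(0, 2, size=(L, D))   # binary codes u_{ld}
z = rng.integers(0, 2, size=L)        # binary latent selector z_l
U_tilde = 2 * U - 1                   # map {0,1} -> {-1,1}

# p(x_d = 1 | z, U, lambda) = sigma(lambda * sum_l z_l * u~_{ld})
p_x1 = 1.0 / (1.0 + np.exp(-lam * (z @ U_tilde)))

def unnorm(x):
    # exp(-lambda * sum_l z_l * h(x, u_{ld})) for a scalar x in {0, 1}
    return np.exp(-lam * ((U != x) * z[:, None]).sum(axis=0))

# normalising the Hamming-distance form gives the same probabilities
p_check = unnorm(1) / (unnorm(1) + unnorm(0))
assert np.allclose(p_x1, p_check)
print(p_x1)
```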

Graphical model representation

  • $$ p(\mathbf{x}_{n}|\mathbf{z}_n,\mathbf{U},\lambda) = \prod_d \text{Ber} \left( x_{nd} |\sigma \left[ \lambda \sum\limits_{l=1}^L z_{nl} \tilde{u}_{ld} \right] \right)$$

single_layer_network.png

Toy example

sampler_002.png

Toy example

animation.gif

Synthetic example: Calculator digits

digits.png

  • Each digit is composed of a subset of 7 distinct bars.

Noiseless calculator digits

nonsparse_noisefree_data.png compressed data

nonsparse_noisefreeU0.png inferred codes

codes_recon_new.png uncompressed inferred codes

nonsparse_noisefreeZ0.png inferred latent variables

  • What about 3, 8 and 9?
    • \({``7 + 2 + 5 = 3"}\)
    • \({``7 + 2 + 5 + 6 + 1 = 3"}\)
    • Here \({+}\) stands for combining the selected codes, which in the noiseless limit amounts to a majority vote over their \({\pm 1}\) entries (see the sketch below).
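
The following sketch illustrates why these combinations work, assuming the inferred codes coincide with the digit shapes and a standard 7-segment encoding (the segment order and the \({\lambda}\) value are my own choices, not taken from the slides):

```python
import numpy as np

# Hypothetical 7-segment encoding (segments ordered: top, top-right,
# bottom-right, bottom, bottom-left, top-left, middle); the bars in the
# slides may be ordered differently.
SEGMENTS = {
    1: [0, 1, 1, 0, 0, 0, 0],
    2: [1, 1, 0, 1, 1, 0, 1],
    3: [1, 1, 1, 1, 0, 0, 1],
    5: [1, 0, 1, 1, 0, 1, 1],
    6: [1, 0, 1, 1, 1, 1, 1],
    7: [1, 1, 1, 0, 0, 0, 0],
}

def combine(digits, lam=10.0):
    """Noiseless-limit reconstruction sigma(lam * sum_l z_l u~_{ld}) rounded,
    i.e. a majority vote over the selected codes' {-1, +1} entries."""
    U_tilde = 2 * np.array([SEGMENTS[d] for d in digits]) - 1
    p = 1.0 / (1.0 + np.exp(-lam * U_tilde.sum(axis=0)))
    return (p > 0.5).astype(int).tolist()

print(combine([7, 2, 5]))          # segments of the digit 3
print(combine([7, 2, 5, 6, 1]))    # also the segments of the digit 3
print(SEGMENTS[3])                 # reference pattern
```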

Other perfect solutions

snas714.png

snas715.png

cbar.png

Reconstruction Error

calc_dist.png

Denoising

  • Calculator digits with 10% noise.

recon_example32.png Denoised digits

recon_example42.png Denoised digits

recon_example_snas.png Corresponding codes

The multi-layer Hamming Machine

twolayer_hm.png

The joint density factorises into terms of the form p(layer | parents)

With \({\mathbf{z}^{[0]}_n = \mathbf{x}_n}\) and \({L^{[0]} = D}\), this reads $$ p(\mathbf{Z}^{[0:K]},\mathbf{U}^{[1:K]},\lambda) = p(\mathbf{Z}^{[K]}) \prod_{k=0}^{K-1} p(\mathbf{Z}^{[k]}|\mathbf{Z}^{[k{+}1]},\mathbf{U}^{[k{+}1]},\lambda^{[k{+}1]})\, p(\mathbf{U}^{[k{+}1]})\, p(\lambda^{[k{+}1]}) $$
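
A minimal sketch of this factorisation for two hidden layers, treating the priors on \({\mathbf{Z}^{[K]}}\), \({\mathbf{U}}\) and \({\lambda}\) as constant for simplicity; shapes and helper names are illustrative, not taken from the slides:

```python
import numpy as np

def log_sigmoid(a):
    # numerically stable log(sigma(a)) = -log(1 + exp(-a))
    return -np.logaddexp(0.0, -a)

def layer_loglik(Z_below, Z_above, U, lam):
    """log p(Z^[k] | Z^[k+1], U^[k+1], lambda^[k+1]) for one layer,
    with Bernoulli( sigma(lam * Z_above @ (2U - 1)) ) units."""
    logits = lam * Z_above @ (2 * U - 1)     # shape (N, L_k)
    Z_tilde = 2 * Z_below - 1                # {0,1} -> {-1,1}
    return log_sigmoid(Z_tilde * logits).sum()

# Toy two-layer stack (assumed sizes, not the slides' experiment)
rng = np.random.default_rng(0)
N, D, L1, L2 = 5, 7, 4, 3
X  = rng.integers(0, 2, (N, D))    # Z^[0] = data
Z1 = rng.integers(0, 2, (N, L1))
Z2 = rng.integers(0, 2, (N, L2))
U1 = rng.integers(0, 2, (L1, D))
U2 = rng.integers(0, 2, (L2, L1))
lam1, lam2 = 1.5, 1.5

# log-joint up to the (here constant) priors p(Z^[2]), p(U), p(lambda)
log_joint = layer_loglik(X, Z1, U1, lam1) + layer_loglik(Z1, Z2, U2, lam2)
print(log_joint)
```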

Inference and learning

Abbreviations

  • Observation count matrix (see the sketch after this list): $$ a_{nd} = \tilde{x}_{nd} \sum\limits_{l = 1}^{L} z_{nl} \tilde{u}_{ld} $$ $$ a_{nd} \in \{-L,\ldots,-1,0,1,\ldots,L \} $$
  • Enables us to write the likelihood: $$ \mathcal{L}(\mathbf{U},\mathbf{Z},\lambda) = \prod_{n,d} \sigma \left[ \lambda a_{nd} \right] $$
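
A small NumPy sketch of these two quantities on random toy data (sizes and \({\lambda}\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, L = 6, 7, 4
lam = 2.0

X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U = rng.integers(0, 2, (L, D))

X_tilde = 2 * X - 1
U_tilde = 2 * U - 1

# a_{nd} = x~_{nd} * sum_l z_{nl} u~_{ld}; integer-valued in {-L, ..., L}
A = X_tilde * (Z @ U_tilde)

# log-likelihood  log L(U, Z, lambda) = sum_{n,d} log sigma(lambda * a_{nd})
log_lik = -np.logaddexp(0.0, -lam * A).sum()
print(A.min(), A.max(), log_lik)
```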

Gibbs sampling – full conditionals

  • Precomputed quantities $$ \gamma_{\lambda}(l) = \log(1+e^{-\lambda l}) $$
  • Full conditionals (a toy single-site update for \({z_{nl}}\) is sketched after this list) $$ p(u_{ld}=1|\text{rest}) = \sigma \left[-\tilde{u}_{ld} \sum\limits_n \left\{ \gamma_{\lambda}(a_{nd}) - \gamma_{\lambda}(a_{nd} - \tilde{u}_{ld} \,\tilde{x}_{nd} (\tilde{z}_{nl}+1)) \right\} \right] $$ $$ p(z_{nl}{=}1|\text{rest}) = \sigma\left[ -\tilde{z}_{nl} \sum\limits_d \left\{ \gamma_{\lambda}\left(a_{nd}\right) - \gamma_{\lambda}\left(a_{nd}-\tilde{z}_{nl}\,\tilde{x}_{nd}\,\tilde{u}_{ld}\right) \right\} \right] $$
  • Multilayer conditionals $$ p(z^{[k]}_{nl}{=}1|\text{rest}) = \sigma\left[-\tilde{z}_{nl} \sum\limits_{d} \left\{ \gamma_{\lambda}\left(a_{nd}\right) - \gamma_{\lambda}\left(a_{nd} - \tilde{z}_{nl}\, \tilde{x}_{nd}\, \tilde{u}_{ld}\right) \right\} + \lambda^{[k+1]} \sum\limits_{l^{[k+1]}} \tilde{u}^{[k+1]}_{l^{[k+1]}l}\; z^{[k+1]}_{n l^{[k+1]}} \right] $$
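
A toy single-site Gibbs update for \({z_{nl}}\) following the single-layer conditional above, with an incremental update of the count matrix \({a_{nd}}\) after a flip; this is a sketch under assumed shapes, not the original implementation:

```python
import numpy as np

def gamma(lam, a):
    # gamma_lambda(a) = log(1 + exp(-lam * a)), computed stably
    return np.logaddexp(0.0, -lam * a)

def gibbs_update_z(n, l, X, Z, U, A, lam, rng):
    """Resample z_{nl} from its full conditional and keep the count
    matrix A[n, :] = x~_{n,:} * (Z[n] @ U~) consistent afterwards."""
    x_t = 2 * X[n] - 1                  # x~_{nd}, shape (D,)
    u_t = 2 * U[l] - 1                  # u~_{ld}, shape (D,)
    z_t = 2 * Z[n, l] - 1               # current z~_{nl}

    # p(z_{nl}=1|rest) = sigma(-z~ * sum_d [gamma(a) - gamma(a - z~ x~ u~)])
    delta = gamma(lam, A[n]) - gamma(lam, A[n] - z_t * x_t * u_t)
    p_one = 1.0 / (1.0 + np.exp(z_t * delta.sum()))

    z_new = int(rng.random() < p_one)
    if z_new != Z[n, l]:                # flip happened: update A incrementally
        A[n] += (2 * z_new - 1) * x_t * u_t
        Z[n, l] = z_new

# Toy usage with random data (illustrative shapes only)
rng = np.random.default_rng(0)
N, D, L, lam = 6, 7, 4, 2.0
X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U = rng.integers(0, 2, (L, D))
A = (2 * X - 1) * (Z @ (2 * U - 1))
for n in range(N):
    for l in range(L):
        gibbs_update_z(n, l, X, Z, U, A, lam, rng)
print(Z)
```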

The modified metropolised Gibbs sampler

  • Instead of drawing from the full conditional, we always propose a value \({y'}\) that differs from the current value \({y}\), i.e. \({y' = 1-y}\).
  • For a binary variable the proposal distribution then takes the form $$ q(y'|y) = \frac{p(y'|\text{rest})}{1-p(y|\text{rest})} = 1 \quad \text{for}\; y' \neq y $$
  • The Metropolis-Hastings acceptance probability equals the mutation probability and is given by $$ p_{\text{mutation}}^{\text{modified}} = \min\left(1,\, \frac{p(y'|\text{rest})\,q(y|y')}{p(y|\text{rest})\,q(y'|y)}\right) = \min\left(1,\, \frac{p(y'|\text{rest})}{1-p(y'|\text{rest})}\right) $$
  • The Gibbs mutation probability is given by $$ p_{\text{mutation}}^{\text{Gibbs}} = p(y'|\text{rest}) $$
  • Therefore the modified sampler never mutates less often than the Gibbs sampler (see the sketch after this list): $$ p_{\text{mutation}}^{\text{modified}} \geq p_{\text{mutation}}^{\text{Gibbs}} $$
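
A two-line comparison of the two mutation probabilities for a single binary site (the values of \({p(y'|\text{rest})}\) are illustrative only):

```python
def mutation_probs(p_flip):
    """Probability of changing the current value of a binary variable,
    given p_flip = p(y' | rest) for the flipped value y' = 1 - y."""
    gibbs = p_flip                                   # standard Gibbs
    modified = min(1.0, p_flip / (1.0 - p_flip))     # metropolised Gibbs
    return gibbs, modified

for p in [0.1, 0.4, 0.5, 0.8]:
    g, m = mutation_probs(p)
    print(f"p(y'|rest)={p:.1f}  Gibbs={g:.2f}  modified={m:.2f}")
# The modified sampler never mutates less often than plain Gibbs.
```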

Alternative sampling schemes

Approaches

  1. Forward-filtering backward-sampling for joint updates across layers
    • Using conditional independence properties, as for hidden Markov models.
  2. Layer-wise training
    • Start from the layer closest to the data
    • Train until convergence
    • Turn on layer below
  3. Simulated reheating (see the schedule sketch after this list)
    • After convergence, reheat the system by means of the dispersion parameter \(\lambda\), which plays the role of an inverse temperature.
    • Sample at a fixed high temperature
    • Converge back to the equilibrium temperature
  4. Parallel tempering
    • Swapping states between chains is extremely unlikely
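
A hypothetical \({\lambda}\) schedule for the simulated-reheating idea in item 3; the schedule lengths and the hot factor are made-up values, not taken from the slides:

```python
import numpy as np

def reheating_schedule(lam_eq, n_converge, n_hot, n_anneal, hot_factor=0.25):
    """Lambda schedule for simulated reheating: sample at the equilibrium
    dispersion for a while, drop lambda to a 'hot' value (high temperature),
    hold it there, then anneal back to equilibrium."""
    hot = hot_factor * lam_eq
    return np.concatenate([
        np.full(n_converge, lam_eq),          # converge at equilibrium
        np.full(n_hot, hot),                  # fixed high temperature
        np.linspace(hot, lam_eq, n_anneal),   # converge back
    ])

lams = reheating_schedule(lam_eq=2.0, n_converge=100, n_hot=50, n_anneal=50)
# for lam in lams: run one Gibbs sweep with dispersion lam
print(lams[:3], lams[-3:])
```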

Results

comparsion_overall.png

Joint \({p(\mathbf{X},\mathbf{Z}_1,\mathbf{Z}_2|\mathbf{U},\lambda_0)}\)

comparsion_layer1.png

Data layer \({p(\mathbf{X}|\mathbf{Z}_1,\mathbf{U},\lambda_1)}\)

comparsion_layer2.png

First hidden layer \({p(\mathbf{Z}_1|\mathbf{Z}_2,\mathbf{U},\lambda_2)}\)

comparsion_layer3.png

Second hidden layer \({p(\mathbf{Z}_2|\mathbf{Z}_3,\mathbf{U},\lambda_3)}\)

Priors on the codes

Effect on the conditionals

  • A Bernoulli prior on a single code unit \({u_{ld}}\):

$$ p(u_{ld}=1|\text{rest}) = \sigma \left[\color{red}{ \log\left( \frac{ p(u_{ld}) }{ 1 - p(u_{ld}) } \right)} - \tilde{u}_{ld} \sum\limits_n \left\{ \gamma_{\lambda}(a_{nd}) - \gamma_{\lambda}(a_{nd} - \tilde{u}_{ld}\,\tilde{x}_{nd} (\tilde{z}_{nl} + 1))\right\} \right] $$

Types of priors

  1. Independent Bernoulli prior on every single code unit \({u_{ld}}\)
  2. Bernoulli prior controlling the number of 1s in every code, where \({q}\) is the ratio of the number of 1s in the code to the number of 1s in the data.

E.g. step-exp prior

prior.png $$ p(u = 1) = \tfrac{1}{2} \mathrm{H}( 1 - q ) + \tfrac{1}{2} \mathrm{H}(q-1)\, e^{-a(q-1)} $$ where \({\mathrm{H}}\) denotes the Heaviside step function.
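
A small sketch of the step-exp prior and of the log-odds term it contributes to the full conditional, implementing the step as a simple piecewise function (the value of \({a}\) is arbitrary):

```python
import numpy as np

def step_exp_prior(q, a=5.0):
    """p(u = 1) under the step-exp sparsity prior: flat at 1/2 while the code
    has no more 1s than the data (q <= 1), exponentially decaying beyond."""
    q = np.asarray(q, dtype=float)
    return np.where(q <= 1.0, 0.5, 0.5 * np.exp(-a * (q - 1.0)))

def prior_log_odds(q, a=5.0):
    """log[p(u=1) / (1 - p(u=1))]: the (red) term added to the conditional."""
    p = step_exp_prior(q, a)
    return np.log(p) - np.log1p(-p)

q_grid = np.array([0.5, 1.0, 1.2, 2.0])
print(step_exp_prior(q_grid))    # no penalty up to q = 1, then decay
print(prior_log_odds(q_grid))    # zero log-odds up to q = 1, then negative
```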

Effect of the prior for synthetic data - flat prior (old example)

a4_10_5.gif

Effect of the prior for synthetic data - step exp sparsity prior

a4_10_5_sparse.gif

Cancer mutational landscapes

First hidden layer

latent_hm0.png

latent_hm2.png

latent_hm1.png

latent_hm3.png

snas_0.png

Second hidden layer

latent_hm0_0.png

latent_hm2_0.png

latent_hm1_0.png

latent_hm3_0.png

snas_1.png

PCA

pca_0.png

pca_1.png

tSNE

tsne_0.png

tsne_1.png

Comparison to binary PCA

The sparse Hamming Machine

Motivation

  • The problem: every code has to vote on every feature. If a code asserts that certain features appear together, then it necessarily provides equally strong evidence that all other features do not appear.
  • This may not reflect the generative process that we wish to capture.
  • Proposed modification: $$ \tilde{u} \in \{-1,1\} \;\; \rightarrow \;\; \tilde{u} \in \{-1,0,1\} $$
  • This yields the full conditional, with \({\mathcal{S}}\) denoting a softmax over the three candidate values (a toy update is sketched below): $$ p(\tilde{u}_{ld}|\text{rest}) = \text{Cat} \left( \underset{\tilde{u}' \in \{-1,0,1\}}{\mathcal{S}} \left[ - \sum\limits_n \gamma_{\lambda}(a_{nd,\tilde{u}'}) \right] \right) $$
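
A toy implementation of this categorical update, reading \({\mathcal{S}}\) as a softmax over the candidates \({\tilde{u}' \in \{-1,0,1\}}\); the bookkeeping and shapes are my own illustrative choices, not the slides' sampler:

```python
import numpy as np

def gamma(lam, a):
    # gamma_lambda(a) = log(1 + exp(-lam * a))
    return np.logaddexp(0.0, -lam * a)

def sample_sparse_code_unit(l, d, X, Z, U_tilde, lam, rng):
    """Resample u~_{ld} in {-1, 0, 1} from its categorical full conditional,
    proportional to exp(-sum_n gamma_lambda(a_{nd, u~'}))."""
    x_t = 2 * X[:, d] - 1                       # x~_{nd}, shape (N,)
    # contribution of all codes except l to a_{nd}
    rest = x_t * (Z @ U_tilde[:, d] - Z[:, l] * U_tilde[l, d])
    candidates = np.array([-1, 0, 1])
    # a_{nd, u~'} for each candidate value of u~_{ld}
    a_cand = rest[:, None] + (x_t * Z[:, l])[:, None] * candidates[None, :]
    logits = -gamma(lam, a_cand).sum(axis=0)    # shape (3,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    U_tilde[l, d] = rng.choice(candidates, p=probs)

# Toy usage (illustrative shapes)
rng = np.random.default_rng(0)
N, D, L, lam = 6, 7, 4, 2.0
X = rng.integers(0, 2, (N, D))
Z = rng.integers(0, 2, (N, L))
U_tilde = rng.choice([-1, 0, 1], size=(L, D))
sample_sparse_code_unit(0, 0, X, Z, U_tilde, lam, rng)
print(U_tilde[0, 0])
```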

Synthetic example

nonsparse_noisy_newdata.png compressed data

calc_digits_codesviva.png inferred codes

calc_digits_sparse_codes_reconviva.png reconstruction of codes

nonsparse_noisy_newZ0.png latent variables

calc_digits_sparse_codes_reconviva_3.png codes for \({L{=}3}\)

rgb_simplex.png color legend

  • Latent representations are sparser than for the traditional HM
  • The obvious single-bar representation takes L=14 codes.

Reconstruction error

distr.png

Future work

The minimal Hamming Machine

  • A very intuitive way of combining codes to generate observations: $$ p(x_{nd}=1|\mathbf{u}_d,\mathbf{z}_n) \propto \exp\left[-\lambda\, h\!\left(x_{nd},\min(\mathbf{u}_d^T\mathbf{z}_n,1)\right) \right] $$
  • The Bernoulli parameter for node \(x_{nd}\) takes one of only two possible values, \({\sigma(\pm \lambda)}\).
    • It equals \({\sigma(+\lambda)}\) if at least one pair of its parent variables is turned on, \((z_{nl},u_{ld}) = (1,1)\), independent of the values of all other parents.
    • It equals \({\sigma(-\lambda)}\) if all such pairs are turned off (see the sketch below).
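
A minimal sketch of this Boolean-OR behaviour (the vectors and \({\lambda}\) are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minimal_hm_prob_x1(z, u_d, lam):
    """p(x_{nd} = 1) in the minimal Hamming machine: sigma(+lam) if any
    parent pair (z_l, u_ld) = (1, 1) is on, sigma(-lam) otherwise."""
    on = min(int(z @ u_d), 1)          # Boolean OR over parent pairs
    return sigmoid(lam * (2 * on - 1))

z   = np.array([1, 0, 1, 0])
u_d = np.array([0, 1, 1, 0])
print(minimal_hm_prob_x1(z, u_d, lam=2.0))                  # sigma(+lam)
print(minimal_hm_prob_x1(z, np.zeros(4, int), lam=2.0))     # sigma(-lam)
```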

Scalable inference

Simple fixed point equations

  • Break the explaining-away dependencies between the latent variables: $$ p(\mathbf{x}|\mathbf{z},\mathbf{u}) \approx \prod\limits_{d,l} \sigma\left[ \lambda \tilde{x}_{d} z_l \tilde{u}_{ld} \right] $$
  • Iterate through all \(z_{nl}\) and \(u_{ld}\) and optimise each one in turn: $$ \sum\limits_d \tilde{x}_{nd} \tilde{u}_{ld} + \sum\limits_{l^{[2]}} z^{[2]}_{nl^{[2]}} \tilde{u}^{[2]}_{l^{[2]}l} > 0 \; \rightarrow z_{nl} = 1,\; \text{else}\; z_{nl}=0 $$ $$ \sum\limits_n \tilde{x}_{nd} z_{nl} > 0 \; \rightarrow u_{ld} = 1,\; \text{else}\; u_{ld}=0 $$
  • This converges very quickly but depends heavily on the initial conditions (a vectorised sketch follows below).
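
A vectorised sketch of these fixed-point sweeps for a single layer, dropping the layer-2 term and updating all entries of \({\mathbf{Z}}\) and \({\mathbf{U}}\) at once rather than element by element; a simplification of the scheme above, on random toy data:

```python
import numpy as np

def fixed_point_updates(X, Z, U, n_iter=20):
    """Single-layer fixed-point sweeps (no layer above):
    z_{nl} <- 1[ sum_d x~_{nd} u~_{ld} > 0 ],
    u_{ld} <- 1[ sum_n x~_{nd} z_{nl} > 0 ].
    Converges quickly but is sensitive to the initial Z, U."""
    X_tilde = 2 * X - 1
    for _ in range(n_iter):
        Z_new = (X_tilde @ (2 * U - 1).T > 0).astype(int)   # update all z_{nl}
        U_new = (Z_new.T @ X_tilde > 0).astype(int)         # update all u_{ld}
        if np.array_equal(Z_new, Z) and np.array_equal(U_new, U):
            break                                           # reached a fixed point
        Z, U = Z_new, U_new
    return Z, U

# Toy usage with random initial conditions (illustrative only)
rng = np.random.default_rng(0)
N, D, L = 20, 7, 3
X = rng.integers(0, 2, (N, D))
Z0 = rng.integers(0, 2, (N, L))
U0 = rng.integers(0, 2, (L, D))
Z, U = fixed_point_updates(X, Z0, U0)
print(U)
```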

Further applications

  • Networks
    • Adjacency matrices as input, e.g. brain networks from EEG/fMRI for different timepoints or different individuals.
    • Binarised single cell expression data.
    • Hannah/Garrett?

Appendix A: Further example – MNIST

  • 200 images of the digits 2, 7, 9
  • Two hidden layers, with 30 and 6 units respectively.

Sampling

mnist_sampler.png

Reconstructions

From the corresponding representations in layer 1 (left) and layer 2 (right)

recon_1.png recon_2.png

Codes

snb_small_1.png

Codes

snb_small_2.png