Variance-regularized SSL maps neatly onto classical learning rules

Consider a one-layer neural network with $N$ input units and $M$ output units. Let $x(t) \in \mathbb{R}^N$ be the input to the network at time $t$, $W \in \mathbb{R}^{M \times N}$ the feedforward matrix to be learned, $a(t) = W x(t) \in \mathbb{R}^M$ the pre-activations, and $z_i(t) = f(a_i(t))$ the activity of the $i$th output neuron at time $t$.
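As a running example, this setup can be sketched in a few lines of NumPy (the dimensions and the choice of $f$ are arbitrary assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 4                       # input and output dimensions (arbitrary)

W = rng.standard_normal((M, N))   # feedforward weights W, shape M x N
f = np.tanh                       # example nonlinearity

x = rng.standard_normal(N)        # input x(t)
a = W @ x                         # pre-activations a(t) = W x(t)
z = f(a)                          # output activities z_i(t) = f(a_i(t))
```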

Variance term

$$\mathcal{L}_{\mathrm{var}}(t) = \sum_{i=1}^{M} \mathrm{ReLU}\left(1 - \sigma_{z_i}\right) = \sum_{i=1}^{M} \mathrm{ReLU}\left(1 - \sqrt{\alpha\left[(z_i(t) - \bar{z}_i)^2 + \Phi\right] + \epsilon}\right)$$

where $\Phi$ represents the sum of the corresponding terms $(z_i(t') - \bar{z}_i)^2$ for all other samples $t'$ in a minibatch, and $\alpha$ the corresponding averaging factor ($\frac{1}{B-1}$ for a minibatch of size $B$). Assuming we can estimate the variance, and hence the standard deviation, online with a slow-moving filter, we set $\Phi = 0$ so that the term under the square root is now the one-sample contribution to an estimate of the variance of the unit's activity (assuming also that a reliable estimate of the mean activity $\bar{z}_i$ is available).
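One way to realize such a slow-moving filter is an exponential moving average. A minimal sketch (the function name and time constant `tau` are my assumptions, not from the text):

```python
import numpy as np

def ema_stats(z, z_mean, z_var, tau=0.01):
    """One step of slow online estimates of the per-unit mean and
    variance, standing in for the minibatch statistics."""
    z_mean = (1.0 - tau) * z_mean + tau * z
    z_var = (1.0 - tau) * z_var + tau * (z - z_mean) ** 2
    return z_mean, z_var

rng = np.random.default_rng(0)
z_mean, z_var = np.zeros(3), np.ones(3)
for _ in range(20000):
    z = 2.0 + 0.5 * rng.standard_normal(3)   # toy stationary activity
    z_mean, z_var = ema_stats(z, z_mean, z_var)
# z_mean approaches 2.0 and z_var approaches 0.25 per unit
```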

$$\frac{\partial \mathcal{L}_{\mathrm{var}}(t)}{\partial W_{ij}} = -\frac{\alpha\,\Theta(1 - \sigma_{z_i})}{\sigma_{z_i}}\,(z_i(t) - \bar{z}_i)\,f'(a_i(t))\,x_j(t)$$

where $\Theta$ is the Heaviside step function. Importantly, $\sigma_{z_i}$ should be a reliable estimate of the standard deviation of the unit's activity, calculated over a timescale long enough to reflect responses to several diverse inputs (this corresponds to estimating the mean and standard deviation over the set of inputs in a minibatch). The contribution of the current sample $z_i(t)$ to the gradient estimate $\partial \mathcal{L}_{\mathrm{var}}(t)/\partial W_{ij}$, however, is unchanged apart from the scaling factor $\alpha$.

With the understanding that $\bar{z}_i$ and $\sigma_{z_i}$ are long-term estimates of the mean and standard deviation of the output activities, we drop the explicit time argument, assuming all quantities correspond to the current time step unless specified otherwise.

$$\frac{\partial \mathcal{L}_{\mathrm{var}}}{\partial W_{ij}} = -\frac{\alpha\,\Theta(1 - \sigma_{z_i})}{\sigma_{z_i}}\,(z_i - \bar{z}_i)\,f'(a_i)\,x_j$$
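The gradient above translates directly into a per-sample update. A sketch (the function name is mine; `fprime` is the derivative of the nonlinearity, and `z_mean`, `z_std` are the slow online estimates):

```python
import numpy as np

def grad_var(x, a, z, z_mean, z_std, alpha, fprime):
    """Per-sample gradient of the hinge variance loss w.r.t. W.
    z_mean and z_std are slow online estimates treated as constants."""
    theta = (z_std < 1.0).astype(float)                    # Theta(1 - sigma_zi)
    g = -alpha * theta / z_std * (z - z_mean) * fprime(a)  # per-unit factor
    return np.outer(g, x)                                  # dL/dW_ij = g_i x_j

# toy check with a linear unit: descending this gradient pushes z
# away from its mean, i.e. it increases the unit's variance
x = np.array([1.0, -2.0])
a = np.array([0.5])
z, z_mean, z_std = np.array([0.5]), np.array([0.0]), np.array([0.4])
dW = grad_var(x, a, z, z_mean, z_std, alpha=1.0,
              fprime=lambda a: np.ones_like(a))
```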

Log variance loss is equivalent to Oja's rule

Consider a simpler alternative functional form for the variance regularization loss: the negative log variance of the output activity (so that minimizing the loss maximizes the variance).

$$\mathcal{L}_{\mathrm{var}}(t) = -\sum_{i=1}^{M} \log\left(\sigma_{z_i}^2\right) = -\sum_{i=1}^{M} \log\left(\alpha\left[(z_i(t) - \bar{z}_i)^2 + \Phi\right] + \epsilon\right)$$

which yields the gradient

$$\frac{\partial \mathcal{L}_{\mathrm{var}}(t)}{\partial W_{ij}} = -\frac{2\alpha}{\sigma_{z_i}^2}\,(z_i(t) - \bar{z}_i)\,f'(a_i(t))\,x_j(t)$$

We now consider the case of a single output neuron ($M = 1$) with a linear activation ($f(a) = a$, so $f'(a) = 1$), along with the assumption that the input is zero-centered ($\bar{x}_j = 0$). Consequently, $\bar{z} = \sum_j W_j \bar{x}_j = 0$ and $\sigma_z^2 = \langle (z - \bar{z})^2 \rangle = \langle z^2 \rangle$, which yields a very simple update rule for the variance term:

$$\Delta W_j = -\frac{\partial \mathcal{L}_{\mathrm{var}}}{\partial W_j} = \frac{2\alpha\, z\, x_j}{\langle z^2 \rangle}$$

This update rule combined with a weight decay (with coefficient $\eta$) yields a learning rule that, on average, is equivalent to Oja's rule up to a scaling factor, and in fact has exactly the same fixed points if $\frac{\eta}{2\alpha} = 1$.

$$\Delta W_j = \frac{2\alpha\, z\, x_j}{\langle z^2 \rangle} - \eta W_j = \frac{2\alpha}{\langle z^2 \rangle}\left(z\, x_j - \frac{\eta}{2\alpha}\, W_j \langle z^2 \rangle\right)$$

Oja's rule

$$\Delta W_j^{\mathrm{Oja}} = z\, x_j - W_j z^2$$
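The fixed-point equivalence is easy to verify numerically: at a unit-norm principal eigenvector of the input covariance, the average of both updates vanishes when $\eta = 2\alpha$. A sketch (the data and constants are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.5])
C = X.T @ X / len(X)                 # input covariance (inputs are zero-mean)

w = np.linalg.eigh(C)[1][:, -1]      # unit-norm principal eigenvector of C
alpha, eta = 0.1, 0.2                # eta = 2 * alpha

z2 = w @ C @ w                       # <z^2> for a linear unit with weights w
dW_avg = 2 * alpha * (C @ w) / z2 - eta * w   # avg log-variance + decay update
dW_oja = C @ w - w * z2                       # avg Oja update
# both average updates vanish at this fixed point
```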

Invariance term

The invariance term is simply the squared L2 distance between the output activities at two consecutive time steps, which can be expressed as the sum of unit-wise squared differences across time.

$$\mathcal{L}_{\mathrm{pull}}(t) = \frac{1}{2}\left\lVert z(t) - \mathrm{SG}(z(t-1))\right\rVert^2 = \frac{1}{2}\sum_{i=1}^{M}\left(z_i(t) - \mathrm{SG}(z_i(t-1))\right)^2$$

Here SG is the stop-gradient function, reflecting the fact that we do not evaluate the gradient with respect to quantities in the past. This gives us the gradient

$$\frac{\partial \mathcal{L}_{\mathrm{pull}}(t)}{\partial W_{ij}} = (z_i(t) - z_i(t-1))\,f'(a_i(t))\,x_j(t)$$

Dropping the time argument for the current time step $t$,

$$\frac{\partial \mathcal{L}_{\mathrm{pull}}}{\partial W_{ij}} = (z_i - z_i(t-1))\,f'(a_i)\,x_j$$
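In code, with $z(t-1)$ held as a constant (the stop-gradient), the pull gradient is just as direct. A sketch following the conventions above (the function name is mine):

```python
import numpy as np

def grad_pull(x, a, z, z_prev, fprime):
    """Per-sample gradient of the invariance (pull) term w.r.t. W.
    z_prev = z(t-1) carries no gradient (stop-gradient)."""
    g = (z - z_prev) * fprime(a)
    return np.outer(g, x)

x = np.array([2.0, 0.0, 1.0])
a = np.array([1.0, -1.0])
z = np.tanh(a)                       # current activities z(t)
z_prev = np.array([0.5, -0.5])       # previous activities z(t-1)
dW = grad_pull(x, a, z, z_prev, fprime=lambda a: 1.0 - np.tanh(a) ** 2)
```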

Covariance term

The covariance objective is the sum of the squared off-diagonal terms of the covariance matrix of the output units.

$$\mathcal{L}_{\mathrm{decorr}} = \frac{\beta}{4}\sum_{i=1}^{M}\sum_{k\neq i}(z_i - \bar{z}_i)^2 (z_k - \bar{z}_k)^2$$

$$\frac{\partial \mathcal{L}_{\mathrm{decorr}}}{\partial W_{ij}} = \beta\,(z_i - \bar{z}_i)\,f'(a_i)\,x_j\sum_{k\neq i}(z_k - \bar{z}_k)^2$$

Here, $\beta = \frac{1}{M-1}$ is a scaling factor that keeps the objective invariant to the number of units in the population. The sum is over all other units' variance estimates and represents a non-local, unit-specific quantity. However, we can make a useful approximation, $\sum_{k\neq i}(z_k - \bar{z}_k)^2 \approx \sum_{k=1}^{M}(z_k - \bar{z}_k)^2$, which turns this sum into a population-level measure that is common to all units in the population and can be seen as a global (within a given sub-population) third factor.
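The quality of this approximation improves with population size, since the error for unit $i$ is exactly its own term $(z_i - \bar{z}_i)^2$ relative to the population sum. A quick numerical check (sizes and statistics are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 256
sq = rng.standard_normal(M) ** 2   # (z_k - z_mean_k)^2 for each unit

exact = sq.sum() - sq              # per-unit sum over k != i (non-local)
approx = np.full(M, sq.sum())      # shared population-level third factor
rel_err = np.max(np.abs(exact - approx)) / sq.sum()
# rel_err = max_i sq_i / sum_k sq_k, small for large M
```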

Total Loss

Combining the three gradients, we can write the weight updates in a single-layer VICReg model as

$$\Delta W_{ij} = -\frac{\partial \mathcal{L}_{\mathrm{pull}}}{\partial W_{ij}} - \lambda_1\frac{\partial \mathcal{L}_{\mathrm{var}}}{\partial W_{ij}} - \lambda_2\frac{\partial \mathcal{L}_{\mathrm{decorr}}}{\partial W_{ij}}$$

where $\lambda_1$ and $\lambda_2$ are loss coefficients that here also absorb the scaling factors $\alpha$ and $\beta$.

$$\Delta W_{ij} = \left(-(z_i - z_i(t-1)) + \lambda_1\,\frac{\Theta(1 - \sigma_{z_i})}{\sigma_{z_i}}\,(z_i - \bar{z}_i) - \lambda_2\,(z_i - \bar{z}_i)\sum_{k\neq i}(z_k - \bar{z}_k)^2\right) f'(a_i)\,x_j$$
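Putting the pieces together, the combined update can be sketched as a single function (all names are mine; `z_mean` and `z_std` are the slow online estimates, with $\alpha$ and $\beta$ absorbed into `lam1` and `lam2`):

```python
import numpy as np

def delta_W(x, a, z, z_prev, z_mean, z_std, lam1, lam2, fprime):
    """Single-sample weight update for the one-layer model: pull toward
    z(t-1), push variance up where sigma < 1, and decorrelate units."""
    theta = (z_std < 1.0).astype(float)   # Theta(1 - sigma_zi)
    sq = (z - z_mean) ** 2
    third = sq.sum() - sq                 # sum over k != i
    g = (-(z - z_prev)
         + lam1 * theta / z_std * (z - z_mean)
         - lam2 * (z - z_mean) * third) * fprime(a)
    return np.outer(g, x)                 # shape (M, N)
```

With `lam1 = lam2 = 0` this reduces to the pure pull update, which gives a quick sanity check of the sign conventions.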