Consider a one-layer neural network with $N$ input units and $M$ output units. Let $\mathbf{x}^t$ be the input to the network at time $t$, $W$ be the feedforward matrix to be learned, $\mathbf{a}^t = W\mathbf{x}^t$ be the pre-activations, and $z_i^t = f(a_i^t)$ the activity of the $i$th output neuron at time $t$.
Variance term
For unit $i$, the variance term of the loss can be written as

$$\mathcal{L}^i_{\mathrm{var}} = \max\!\left(0,\; 1 - \sqrt{\kappa\left[\left(z_i^t - \mu_i\right)^2 + S_i\right]}\right),$$

where $S_i$ represents the sum of terms like $\left(z_i^{t'} - \mu_i\right)^2$ for all other samples $t'$ in a minibatch, and $\kappa$ the corresponding averaging factor ($\kappa = \frac{1}{B-1}$ for a minibatch size of $B$). Assuming we can estimate the variance, and hence the standard deviation $\hat{\sigma}_i$, online with a slow-moving filter, we set $\kappa = 1$ and $S_i = 0$, so that the term under the square root is now the one-sample contribution $\left(z_i^t - \hat{\mu}_i\right)^2$ to an estimate of the variance of the unit's activity (assuming also that a reliable estimate $\hat{\mu}_i$ of the mean activity is available).
The resulting update for weight $w_{ij}$ is

$$\Delta w_{ij} \propto \Theta\!\left(1 - \hat{\sigma}_i\right) \frac{z_i^t - \hat{\mu}_i}{\hat{\sigma}_i}\, f'(a_i^t)\, x_j^t,$$

where $\Theta$ is the Heaviside function. Importantly, $\hat{\sigma}_i$ should be a reliable estimate of the standard deviation of the unit activities, calculated over a sufficiently long timescale reflecting responses to several diverse inputs (this corresponds to estimating the mean and standard deviation over a set of inputs in a minibatch). However, the contribution of the current sample to the estimate of the gradient does not change except for the scaling factor $\kappa$.
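As an illustration, the online variance update above can be sketched with running estimates maintained by slow exponential filters. This is a toy sketch, not the paper's implementation; the filter constant `tau`, learning rate `eta`, single linear unit, and white-noise input are our own assumptions.

```python
import numpy as np

# Toy sketch (not the paper's code): a single linear unit whose long-term
# mean and variance are tracked online with slow exponential filters, as
# the text assumes.  eta, tau, and the input statistics are our choices.
rng = np.random.default_rng(0)
N = 5
w = rng.normal(size=N) * 0.1
mu_hat, var_hat = 0.0, 1.0     # slow online estimates of mean and variance
eta, tau = 0.001, 0.01

for step in range(8000):
    x = rng.normal(size=N)     # zero-mean input sample
    z = w @ x                  # linear unit: f(a) = a, so f'(a) = 1
    mu_hat += tau * (z - mu_hat)
    var_hat += tau * ((z - mu_hat) ** 2 - var_hat)
    sigma_hat = np.sqrt(var_hat) + 1e-8
    # variance-term update: push activity away from its mean while the
    # estimated std is below the hinge at 1 (Heaviside factor)
    heaviside = 1.0 if sigma_hat < 1.0 else 0.0
    w += eta * heaviside * ((z - mu_hat) / sigma_hat) * x

print(f"std estimate after training: {sigma_hat:.2f}")
```

Starting from small weights, the unit's output standard deviation is far below 1, so the Heaviside factor stays on and the weights grow until the slow estimate reaches the hinge.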
With the understanding that $\hat{\mu}_i$ and $\hat{\sigma}_i$ are long-term estimates of the mean and standard deviation of the output activities, we will drop the superscript $t$, assuming all quantities correspond to the current time step unless specified otherwise.
Log variance loss is equivalent to Oja's rule
Consider a simpler alternative functional form for the variance regularization loss, namely the log variance of the output activity,

$$\mathcal{L}^i_{\mathrm{var}} = -\tfrac{1}{2}\log \hat{\sigma}_i^2,$$

which yields the gradient-based update

$$\Delta w_{ij} \propto \frac{z_i - \hat{\mu}_i}{\hat{\sigma}_i^2}\, f'(a_i)\, x_j.$$
We now consider the case of a single output neuron ($M = 1$), with no nonlinearity ($f(a) = a$, so $f'(a) = 1$), along with the assumption that the input is zero-centered ($\langle \mathbf{x} \rangle = \mathbf{0}$). Consequently, $\hat{\mu} = 0$ and $z = \mathbf{w}^{\top}\mathbf{x}$, which yields a very simple update rule for the variance term as:

$$\Delta \mathbf{w} \propto \frac{z\,\mathbf{x}}{\hat{\sigma}^2}.$$
This update rule along with a weight decay (with coefficient $\lambda$) yields a learning rule that, on average, is equivalent to Oja's rule up to a scaling factor, and in fact has exactly the same fixed points if $\lambda = 1$.
$$\Delta \mathbf{w} = \eta\, z \left( \mathbf{x} - z\,\mathbf{w} \right) \qquad \text{(Oja's rule)}$$
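To spell out the equivalence (using the conventions above, with input covariance $C = \langle \mathbf{x}\mathbf{x}^{\top} \rangle$ and the slow-filter estimate at its fixed point, $\hat{\sigma}^2 = \langle z^2 \rangle = \mathbf{w}^{\top} C\, \mathbf{w}$), averaging the log-variance update with weight decay over inputs gives:

```latex
\langle \Delta \mathbf{w} \rangle
  \propto \frac{\langle z\,\mathbf{x} \rangle}{\hat{\sigma}^{2}} - \lambda\,\mathbf{w}
  = \frac{1}{\hat{\sigma}^{2}} \left( C\,\mathbf{w} - \lambda\,\langle z^{2} \rangle\,\mathbf{w} \right)
  \;\overset{\lambda = 1}{=}\;
  \frac{1}{\hat{\sigma}^{2}} \left\langle z\,\mathbf{x} - z^{2}\,\mathbf{w} \right\rangle
```

So for $\lambda = 1$ the average update is Oja's rule scaled by $1/\hat{\sigma}^2$; the scaling changes the learning speed but not the fixed points.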
Invariance term
The invariance term is simply the squared L2 distance between the output activities in two consecutive time steps, which can be expressed as the sum of unit-wise squared differences across time:

$$\mathcal{L}_{\mathrm{inv}} = \sum_i \left( z_i^t - \mathrm{sg}\!\left(z_i^{t-1}\right) \right)^2.$$
Here $\mathrm{sg}(\cdot)$ is the stopgrad function, reflecting the fact that we do not evaluate the gradient with respect to quantities in the past. This gives us the gradient-based weight update.
Dropping the superscript $t$ for the current time step,

$$\Delta w_{ij} \propto -\left( z_i - z_i^{t-1} \right) f'(a_i)\, x_j.$$
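As a toy illustration (our own construction, not from the paper): with a temporally correlated input stream, this update alone pulls consecutive outputs together by shrinking the weights toward the trivial collapsed solution, which is exactly what the variance term is there to prevent.

```python
import numpy as np

# Invariance term in isolation: a single linear unit driven by an AR(1)
# input stream (assumed here for illustration; not from the paper).
rng = np.random.default_rng(1)
N = 4
w = rng.normal(size=N)
w0_norm = np.linalg.norm(w)
eta = 0.05

x_prev = rng.normal(size=N)
for step in range(2000):
    # consecutive inputs are strongly correlated (slow features + noise)
    x = 0.9 * x_prev + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=N)
    z, z_prev = w @ x, w @ x_prev
    # stopgrad: z_prev is treated as a constant, only z is differentiated
    w += eta * (-(z - z_prev)) * x
    x_prev = x

print(np.linalg.norm(w) / w0_norm)   # weights collapse toward zero
```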
Covariance term
The covariance objective is the sum of the squared off-diagonal terms of the covariance matrix between units,

$$\mathcal{L}_{\mathrm{cov}} = \frac{1}{M} \sum_{i \neq j} C_{ij}^2, \qquad C_{ij} = \left\langle \left(z_i - \hat{\mu}_i\right)\left(z_j - \hat{\mu}_j\right) \right\rangle,$$

which yields the gradient-based update

$$\Delta w_{ij} \propto -\frac{1}{M} \sum_{k \neq i} C_{ik} \left( z_k - \hat{\mu}_k \right) f'(a_i)\, x_j.$$

Here, $\frac{1}{M}$ is a scaling factor that keeps the objective invariant to the number of units in the population. The sum $\sum_{k \neq i} C_{ik}\left(z_k - \hat{\mu}_k\right)$ runs over all other units' covariance estimates, and represents a non-local, unit-specific quantity. However, we could make a useful approximation which turns this sum into a population-level measure that is common to all units in the population, and could be seen as a global (within a given sub-population) third factor.
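A small numerical sketch (again our own toy, with assumed filter constants) of the exact, non-local form of this gradient: two linear units fed white noise become decorrelated, with the covariance matrix estimated by a slow running filter as the text assumes.

```python
import numpy as np

# Covariance term in isolation for two linear units (illustrative sketch).
rng = np.random.default_rng(2)
N, M = 4, 2
W = rng.normal(size=(M, N)) * 0.5
mu = np.zeros(M)
C = np.zeros((M, M))        # slow running estimate of the covariance matrix
eta, tau = 0.01, 0.005

for step in range(20000):
    x = rng.normal(size=N)
    z = W @ x
    mu += tau * (z - mu)
    zc = z - mu
    C += tau * (np.outer(zc, zc) - C)
    off = C - np.diag(np.diag(C))      # off-diagonal covariances only
    # per-unit covariance gradient: sum over the *other* units' activities,
    # weighted by the current covariance estimate (non-local quantity)
    W -= eta * np.outer(off @ zc, x)

print(round(float(abs(C[0, 1])), 3))   # off-diagonal covariance shrinks
```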
Total Loss
Combining the three gradients, we can write the weight updates in a single-layer VICReg model as

$$\Delta w_{ij} \propto \left[ \alpha\,\Theta\!\left(1 - \hat{\sigma}_i\right) \frac{z_i - \hat{\mu}_i}{\hat{\sigma}_i} \;-\; \left( z_i - z_i^{t-1} \right) \;-\; \beta \sum_{k \neq i} C_{ik} \left( z_k - \hat{\mu}_k \right) \right] f'(a_i)\, x_j,$$

where $\alpha$ and $\beta$ are loss coefficients, which have also here absorbed the scaling factors $\kappa$ and $\frac{1}{M}$.
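Putting the three terms together, a minimal simulation of the combined rule might look as follows. The values of the coefficients, the filter constant, and the AR(1) input stream are illustrative choices of ours, not values from the text.

```python
import numpy as np

# Combined single-layer update (illustrative sketch): variance, invariance,
# and covariance terms with slow online estimates of mu, sigma, and C.
# Linear units (f' = 1); alpha, beta, tau, eta are our own choices.
rng = np.random.default_rng(3)
N, M = 8, 3
W = rng.normal(size=(M, N)) * 0.1
mu = np.zeros(M)
var = np.ones(M)
C = np.zeros((M, M))
eta, tau, alpha, beta = 0.01, 0.02, 1.0, 0.1

x_prev = rng.normal(size=N)
z_prev = W @ x_prev
for step in range(30000):
    # temporally correlated input stream (consecutive samples are similar)
    x = 0.9 * x_prev + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=N)
    z = W @ x
    mu += tau * (z - mu)
    zc = z - mu
    var += tau * (zc ** 2 - var)
    sigma = np.sqrt(var) + 1e-8
    C += tau * (np.outer(zc, zc) - C)
    off = C - np.diag(np.diag(C))
    heaviside = (sigma < 1.0).astype(float)
    # three-term per-unit factor multiplying f'(a_i) x_j
    g = alpha * heaviside * zc / sigma - (z - z_prev) - beta * (off @ zc)
    W += eta * np.outer(g, x)
    x_prev, z_prev = x, z

print(np.round(np.sqrt(var), 2), round(float(C[0, 1]), 2))
```

In this sketch the variance term holds each unit's standard deviation near the hinge at 1, the invariance term keeps consecutive outputs similar, and the covariance term keeps the units decorrelated.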