Predictive V-information
The idea is to quantify the reduction in uncertainty about a target variable when the prediction function is constrained to a specific model family.
- Standard Shannon Mutual Information, $I(X; Y)$, assumes an observer with infinite computational power. This leads to counter-intuitive results in practice: for example, an encrypted message carries the same mutual information about the plaintext as the decrypted message does, even though the former is useless to a computationally bounded observer.
- The authors introduce a new metric that measures "usable" information relative to a specific family of predictive models, denoted $\mathcal{V}$ (e.g., linear models, neural networks).
- The Predictive $\mathcal{V}$-Information from $X$ to $Y$ is defined as the reduction in uncertainty about $Y$ when $X$ is known, strictly using models from the family $\mathcal{V}$:$$I_\mathcal{V}(X \to Y) = H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)$$where $H_\mathcal{V}(Y \mid \varnothing)$ is the uncertainty under the best possible prediction of $Y$ without knowing $X$ (the baseline), still constrained to functions in $\mathcal{V}$, and $H_\mathcal{V}(Y \mid X)$ is the uncertainty under the best possible prediction of $Y$ given $X$, constrained to functions in $\mathcal{V}$.
- The predictive family $\mathcal{V}$ must satisfy "Optional Ignorance": a model is always allowed to ignore the observation, which ensures that observing $X$ never increases the uncertainty, so $I_\mathcal{V}(X \to Y) \ge 0$.
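The definition above can be made concrete with a small numerical sketch. This is not code from the paper: it assumes $\mathcal{V}$ is the family of linear predictors under a Gaussian log-loss, in which case each $\mathcal{V}$-entropy reduces to $\tfrac{1}{2}\log(2\pi e \cdot \text{residual variance})$ and the variance terms are estimated from data. The helper name `v_information_linear` is illustrative.

```python
import numpy as np

def v_information_linear(x, y):
    """Estimate I_V(X -> Y) for V = linear predictors with Gaussian log-loss.

    H_V(Y | empty) comes from the best constant predictor (the mean of Y);
    H_V(Y | X) comes from the best least-squares linear predictor.
    Both V-entropies are 0.5 * log(2*pi*e * residual_variance), so their
    difference is 0.5 * log(var(Y) / residual_variance), in nats.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    var_baseline = y.var()                   # residual variance of the mean predictor
    slope, intercept = np.polyfit(x, y, 1)   # best linear fit of y on x
    resid = y - (slope * x + intercept)
    return 0.5 * np.log(var_baseline / resid.var())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 2.0 * x + rng.normal(size=5000)          # Y depends linearly on X
print(v_information_linear(x, y))            # clearly positive (in nats)
```

Because the least-squares fit can always fall back to the constant predictor, the estimate is non-negative, mirroring the Optional Ignorance property.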
- A central finding is that usable information violates the Data Processing Inequality (DPI). While Shannon information can only decrease or stay constant under processing ($I(f(X); Y) \le I(X; Y)$ for any function $f$), usable information can increase via computation: $I_\mathcal{V}(f(X) \to Y)$ can exceed $I_\mathcal{V}(X \to Y)$.
- This explains mathematically why deep learning works: the layers of a network process the data to "create" information that is usable by the final classification layer.
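The DPI violation can be demonstrated directly. In this sketch (my construction, not the paper's), $Y$ depends on $X$ only through $x^2$: a linear observer extracts almost nothing from the raw $X$, but after the deterministic preprocessing step $f(x) = x^2$ the same linear observer sees nearly everything, even though Shannon information cannot increase under $f$.

```python
import numpy as np

def linear_v_info(x, y):
    # I_V for the linear-Gaussian family: 0.5 * log(var(Y) / residual variance)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 0.5 * np.log(y.var() / resid.var())

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10_000)
y = x**2 + 0.01 * rng.normal(size=10_000)   # Y depends on X only through x**2

# A linear observer extracts almost nothing from X itself
# (cov(X, X**2) = 0 for a symmetric distribution)...
print(linear_v_info(x, y))       # near zero
# ...but after computing f(x) = x**2, usable information jumps.
print(linear_v_info(x**2, y))    # large and positive
```

No new Shannon information was created by squaring; the computation only made existing information accessible to the restricted family, which is exactly what intermediate network layers do for the final classifier.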
- Connection to Standard Theory
- If $\mathcal{V} = \Omega$ (the set of all measurable functions), then $I_\mathcal{V}(X \to Y) = I(X; Y)$, the Shannon Mutual Information.
- If $\mathcal{V}$ is the set of linear functions, then $I_\mathcal{V}(X \to Y)$ is equivalent to the coefficient of determination $R^2$ (a monotone transformation of it).
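The linear-family correspondence can be checked numerically. Assuming the Gaussian log-loss form of the $\mathcal{V}$-entropies used above, the identity is $I_\mathcal{V}(X \to Y) = -\tfrac{1}{2}\log(1 - R^2)$, which is a monotone function of $R^2$; this sketch verifies the algebra on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20_000)
y = 1.5 * x + rng.normal(size=20_000)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()             # coefficient of determination
i_v = 0.5 * np.log(y.var() / resid.var())    # linear V-information in nats

print(i_v, -0.5 * np.log(1.0 - r2))          # the two expressions agree
```

The agreement is exact up to floating-point error, since both expressions are $\tfrac{1}{2}\log(\operatorname{var}(Y)/\operatorname{var}(\text{resid}))$ rewritten.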