Predictive V-information
The idea is to quantify the reduction in uncertainty about a target variable when the prediction function is constrained to a specific model family.
- Standard Shannon Mutual Information, $I(X; Y)$, assumes an observer with infinite computational power. This leads to counter-intuitive results in practice: for example, an encrypted message carries the same mutual information about the plaintext as the decrypted message does, even though the former is useless to a computationally bounded observer.
- The authors introduce a new metric that measures "usable" information relative to a specific family of predictive models, denoted $\mathcal{V}$ (e.g., linear models, neural networks).
- The Predictive $\mathcal{V}$-Information from $X$ to $Y$ is defined as the reduction in uncertainty about $Y$ when $X$ is known, strictly using models from the family $\mathcal{V}$:$$I_\mathcal{V}(X \to Y) = H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X)$$where $H_\mathcal{V}(Y \mid \varnothing)$ is the uncertainty under the best possible prediction of $Y$ without knowing $X$ (the baseline), still constrained to functions in $\mathcal{V}$, and $H_\mathcal{V}(Y \mid X)$ is the uncertainty under the best possible prediction of $Y$ given $X$, constrained to functions in $\mathcal{V}$.
- The predictive family $\mathcal{V}$ must satisfy "Optional Ignorance": a model is always allowed to ignore the observation, which ensures that observing $X$ never increases the uncertainty, so $I_\mathcal{V}(X \to Y) \ge 0$.
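The definition above can be made concrete with a small numerical sketch. This is not code from the paper: it assumes $\mathcal{V}$ is the family of linear predictors under a Gaussian log-loss, in which case each $\mathcal{V}$-entropy reduces to $\tfrac{1}{2}\log(2\pi e \cdot \text{residual variance})$ and the variance terms are estimated from data. The helper name `v_information_linear` is illustrative.

```python
import numpy as np

def v_information_linear(x, y):
    """Estimate I_V(X -> Y) for V = linear predictors with Gaussian log-loss.

    H_V(Y | empty) comes from the best constant predictor (the mean of Y);
    H_V(Y | X) comes from the best least-squares linear predictor.
    Both V-entropies are 0.5 * log(2*pi*e * residual_variance), so their
    difference is 0.5 * log(var(Y) / residual_variance), in nats.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    var_baseline = y.var()                   # residual variance of the mean predictor
    slope, intercept = np.polyfit(x, y, 1)   # best linear fit of y on x
    resid = y - (slope * x + intercept)
    return 0.5 * np.log(var_baseline / resid.var())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 2.0 * x + rng.normal(size=5000)          # Y depends linearly on X
print(v_information_linear(x, y))            # clearly positive (in nats)
```

Because the least-squares fit can always fall back to the constant predictor, the estimate is non-negative, mirroring the Optional Ignorance property.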
- A central finding is that usable information violates the Data Processing Inequality (DPI). While Shannon information can only decrease or stay constant under processing ($I(f(X); Y) \le I(X; Y)$ for any function $f$), usable information can increase via computation: $I_\mathcal{V}(f(X) \to Y)$ can exceed $I_\mathcal{V}(X \to Y)$.
- This explains mathematically why deep learning works: the layers of a network process the data to "create" information that is usable by the final classification layer.
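The DPI violation can be demonstrated directly. In this sketch (my construction, not the paper's), $Y$ depends on $X$ only through $x^2$: a linear observer extracts almost nothing from the raw $X$, but after the deterministic preprocessing step $f(x) = x^2$ the same linear observer sees nearly everything, even though Shannon information cannot increase under $f$.

```python
import numpy as np

def linear_v_info(x, y):
    # I_V for the linear-Gaussian family: 0.5 * log(var(Y) / residual variance)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 0.5 * np.log(y.var() / resid.var())

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=10_000)
y = x**2 + 0.01 * rng.normal(size=10_000)   # Y depends on X only through x**2

# A linear observer extracts almost nothing from X itself
# (cov(X, X**2) = 0 for a symmetric distribution)...
print(linear_v_info(x, y))       # near zero
# ...but after computing f(x) = x**2, usable information jumps.
print(linear_v_info(x**2, y))    # large and positive
```

No new Shannon information was created by squaring; the computation only made existing information accessible to the restricted family, which is exactly what intermediate network layers do for the final classifier.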
- Connection to Standard Theory
- If $\mathcal{V} = \Omega$ (the set of all measurable functions), then $I_\mathcal{V}(X \to Y) = I(X; Y)$, the Shannon Mutual Information.
- If $\mathcal{V}$ is the set of linear functions, then $I_\mathcal{V}(X \to Y)$ is equivalent to the coefficient of determination $R^2$ (a monotone transformation of it).
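The linear-family correspondence can be checked numerically. Assuming the Gaussian log-loss form of the $\mathcal{V}$-entropies used above, the identity is $I_\mathcal{V}(X \to Y) = -\tfrac{1}{2}\log(1 - R^2)$, which is a monotone function of $R^2$; this sketch verifies the algebra on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=20_000)
y = 1.5 * x + rng.normal(size=20_000)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
r2 = 1.0 - resid.var() / y.var()             # coefficient of determination
i_v = 0.5 * np.log(y.var() / resid.var())    # linear V-information in nats

print(i_v, -0.5 * np.log(1.0 - r2))          # the two expressions agree
```

The agreement is exact up to floating-point error, since both expressions are $\tfrac{1}{2}\log(\operatorname{var}(Y)/\operatorname{var}(\text{resid}))$ rewritten.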