VC-dimension

Shattering

A hypothesis class $H$ is said to shatter a finite set $C \subset X$ if the restriction of $H$ to $C$, denoted $H_{C}$, is the set of all functions from $C$ to $\{0, 1\}$ (for binary classification). In other words, $| H_{C} | = 2^{| C |}$: every possible binary labeling of $C$ is realized by some hypothesis in $H$, so on $C$ the class $H$ places no restriction on the possible solutions.
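For a finite collection of hypotheses, the definition can be checked directly by enumerating the restriction $H_{C}$ and comparing it with all $2^{|C|}$ labelings. The sketch below is only an illustration: the helper names `restriction` and `shatters` and the four-function toy class are assumptions made for this example, not part of the source.

```python
from itertools import product

def restriction(hypotheses, points):
    """Return H_C: the set of label tuples the hypotheses induce on `points`."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(hypotheses, points):
    """True if every labeling of `points` with values in {0, 1} is realized."""
    all_labelings = set(product((0, 1), repeat=len(points)))
    return restriction(hypotheses, points) == all_labelings

# Hypothetical toy class of four functions over the domain {0, 1, 2}.
toy_class = [
    lambda x: 0,            # constant 0
    lambda x: 1,            # constant 1
    lambda x: int(x == 0),  # indicator of the point 0
    lambda x: int(x == 1),  # indicator of the point 1
]

print(shatters(toy_class, [0, 1]))     # True: all 4 labelings of {0, 1} appear
print(shatters(toy_class, [0, 1, 2]))  # False: 4 hypotheses cannot give 8 labelings
```

The second call illustrates a simple counting fact: a class with fewer than $2^{|C|}$ hypotheses can never shatter $C$.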

Whenever some set $C \subset X$ is shattered by $H$, an adversary is not restricted by $H$: they can construct a distribution over $C$ based on any target function from $C$ to $\{0, 1\}$ while still maintaining realizability (on $C$). This is the construction behind the No-Free-Lunch theorem.

Intuitively, if a set $C$ is shattered by $H$ and we receive a sample containing half the instances of $C$, the labels of these instances give us no information about the labels of the remaining instances in $C$: every possible labeling of the remaining instances can be explained by some hypothesis in $H$.

Vapnik-Chervonenkis-dimension

The VC-dimension of a hypothesis class $H$ is the maximal size of a set $C \subset X$ that can be shattered by $H$.
If $H$ can shatter sets of arbitrarily large size, then $H$ is said to have infinite VC-dimension and is consequently not PAC learnable. Conversely, the following result, known as the Fundamental Theorem of Statistical Learning, holds:

Let $H$ be a hypothesis class of functions from $X$ to $\{0, 1\}$ under the binary (0-1) loss function. Then the uniform convergence property, PAC learnability and finite VC-dimension are all equivalent. Quantitatively, if $\mathrm{VCdim}(H) = d < \infty$, then there are absolute constants $C_{1}, C_{2}$ such that:

$H$ has the uniform convergence property with sample complexity:

$$C_{1} \frac{d + \log (\frac{1}{\delta})}{\epsilon^{2}} \leq m_{H}^{UC} (\epsilon, \delta) \leq C_{2} \frac{d + \log (\frac{1}{\delta})}{\epsilon^{2}}$$

$H$ is (agnostic) PAC learnable with sample complexity:

$$C_{1} \frac{d + \log (\frac{1}{\delta})}{\epsilon^{2}} \leq m_{H} (\epsilon, \delta) \leq C_{2} \frac{d + \log (\frac{1}{\delta})}{\epsilon^{2}}$$

$H$ is PAC learnable (i.e., under strict realizability) with sample complexity:

$$C_{1} \frac{d + \log (\frac{1}{\delta})}{\epsilon} \leq m_{H} (\epsilon, \delta) \leq C_{2} \frac{d \log (\frac{1}{\epsilon}) + \log (\frac{1}{\delta})}{\epsilon}$$

This theorem holds for some other learning problems such as regression with the absolute or squared error loss, but not for all learning tasks.
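To get a feel for how the sample complexity scales, the rate $(d + \log(1/\delta)) / \epsilon^{2}$ shared by the uniform convergence and agnostic PAC bounds can be evaluated numerically. The sketch below is a rough illustration only: the theorem does not specify $C_{1}$ and $C_{2}$, so the function returns the rate without any constant, and its name `agnostic_rate` is an assumption made for this example.

```python
import math

def agnostic_rate(d, eps, delta):
    """Evaluate (d + log(1/delta)) / eps^2.

    This is the sample-complexity rate up to the unspecified absolute
    constants C_1 and C_2, so it shows scaling, not an actual sample size.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return (d + math.log(1 / delta)) / eps ** 2

base = agnostic_rate(d=10, eps=0.1, delta=0.05)

# Halving eps multiplies the rate by 4 (quadratic dependence on 1/eps).
print(agnostic_rate(d=10, eps=0.05, delta=0.05) / base)

# Shrinking delta by a factor of 10 only adds log(10) to the numerator.
print(agnostic_rate(d=10, eps=0.1, delta=0.005) / base)
```

The comparison shows why accuracy is the expensive parameter: the dependence on $\epsilon$ is polynomial, while the dependence on the confidence $\delta$ is only logarithmic.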

Example

Let $H = \{h_{a} : a \in \mathbb{R}\}$ be the class of threshold functions over $\mathbb{R}$, where $h_{a} : \mathbb{R} \to \{0, 1\}$ is defined by $h_{a} (x) = 1_{[x < a]}$ (i.e., $1$ if $x < a$ and $0$ otherwise). Although $H$ is of infinite size, it is PAC learnable, as its VC-dimension is finite:

$H$ shatters sets of size 1. Take $C = \{c_{1}\}$. Choosing $a = c_{1} + 1$ gives $h_{a} (c_{1}) = 1$, while $a = c_{1} - 1$ gives $h_{a} (c_{1}) = 0$. Hence $H_{C}$ is the set of all functions from $C$ to $\{0, 1\}$, and $H$ shatters $C$.
$H$ shatters no set of size 2. Take $C = \{c_{1}, c_{2}\}$ with $c_{1} < c_{2}$. No $h \in H$ can realize the labeling $(0, 1)$: if $h_{a} (c_{1}) = 0$ then $a \leq c_{1} < c_{2}$, so $h_{a} (c_{2}) = 0$ as well. Thus not all functions from $C$ to $\{0, 1\}$ are in $H_{C}$, so $C$ is not shattered.
It follows that $\mathrm{VCdim}(H) = 1$.
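The argument above can be sanity-checked by brute force on a small finite grid. The sketch below is an illustration under stated assumptions: the grid `domain`, the list of thresholds, and the helpers `threshold`, `shatters` and `vc_dimension` are all hypothetical names chosen for this example. A finite grid cannot prove the result over all of $\mathbb{R}$; it only confirms it on the chosen points.

```python
from itertools import combinations, product

def threshold(a):
    """Return h_a with h_a(x) = 1 if x < a and 0 otherwise."""
    return lambda x: int(x < a)

def shatters(hypotheses, points):
    """True if every {0, 1}-labeling of `points` is realized by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return realized == set(product((0, 1), repeat=len(points)))

def vc_dimension(hypotheses, domain):
    """Largest k such that some k-subset of `domain` is shattered (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, c) for c in combinations(domain, k)):
            d = k
        else:
            # Subsets of a shattered set are shattered, so no larger set can work.
            break
    return d

domain = [0, 1, 2, 3]
# One threshold per gap between grid points covers every distinct behaviour on the grid.
thresholds = [threshold(a) for a in (-0.5, 0.5, 1.5, 2.5, 3.5)]

print(shatters(thresholds, [1]))         # True: both labels are achievable
print(shatters(thresholds, [1, 2]))      # False: the labeling (0, 1) is missing
print(vc_dimension(thresholds, domain))  # 1
```

The search is exponential in the size of the domain, so it is only practical for small illustrations like this one.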

Sources

Understanding Machine Learning: From Theory to Algorithms (Shai Shalev-Shwartz and Shai Ben-David)