PAC Learnability

A hypothesis class $H$ is Probably Approximately Correct (PAC) learnable with respect to a set $Z := X \times Y$ and a loss function $l : H \times Z \to \mathbb{R}_{+}$ if there exist a function $m_{H} : (0, 1)^{2} \to \mathbb{N}$ and a learning algorithm with the following property: for every $ϵ, δ \in (0, 1)$ and for every distribution $D$ over $Z$, running the learning algorithm on $m \geq m_{H} (ϵ, δ)$ i.i.d. examples generated by $D$ returns a hypothesis $h \in H$ such that, with probability at least $1 - δ$ (over the choice of the $m$ training examples),

$$L_{D} (h) \leq \min_{h' \in H} L_{D} (h') + ϵ,$$

where $L_{D} (h) = \mathbb{E}_{z \sim D} [l (h, z)]$ is the true risk of $h$. The function $m_{H}$ determines the sample complexity of learning $H$.

Under a realizability assumption ($\exists h^{*} \in H \text{ s.t. } L_{D} (h^{*}) = 0$), we can guarantee an absolute level of error rather than one that is only relative to the best hypothesis in the class. For example, under the realizability assumption, every finite hypothesis class for a binary classification task is PAC learnable ($L_{D} (h) \leq ϵ$ with probability at least $1 - δ$) with sample complexity:

$$m_{H} (ϵ, δ) \leq \left\lceil \frac{\log (| H | / δ)}{ϵ} \right\rceil$$
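To get a feel for the size of this bound, here is a minimal Python sketch that evaluates it, taking $\log$ to be the natural logarithm. The function name `realizable_sample_complexity` and the example values are illustrative assumptions, not part of the source.

```python
import math


def realizable_sample_complexity(num_hypotheses: int, eps: float, delta: float) -> int:
    """Evaluate the bound ceil(log(|H| / delta) / eps) for a finite class.

    Hypothetical helper: it only computes the upper bound on m_H(eps, delta)
    stated above for the realizable case; it does not run any learner.
    """
    if num_hypotheses < 1:
        raise ValueError("the hypothesis class must contain at least one hypothesis")
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil(math.log(num_hypotheses / delta) / eps)


# Example values (assumed for illustration): |H| = 1000, eps = 0.1, delta = 0.05.
bound = realizable_sample_complexity(1000, eps=0.1, delta=0.05)
print(bound)  # 100
```

The bound grows only logarithmically in $| H |$ and $1 / δ$, but linearly in $1 / ϵ$.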

Uniform convergence

A sufficient condition for PAC learnability is the uniform convergence property, which essentially states that, for a large enough training sample, the empirical risks $L_{S} (h)$ of all hypotheses in $H$ are good approximations of their true risks $L_{D} (h)$. Formally, a training set $S$ is said to be $ϵ$-representative (w.r.t. domain $Z$, hypothesis class $H$, loss function $l$, and distribution $D$) if

$$\forall h \in H, \quad | L_{S} (h) - L_{D} (h) | \leq ϵ .$$

We say that a hypothesis class $H$ has the uniform convergence property (w.r.t. domain $Z$ and loss function $l$) if there exists a function $m_{H}^{UC} : (0, 1)^{2} \to \mathbb{N}$ such that for every $ϵ, δ \in (0, 1)$ and for every distribution $D$ over $Z$, if $S$ is a sample of $m \geq m_{H}^{UC} (ϵ, δ)$ i.i.d. examples from $D$, then, with probability at least $1 - δ$, $S$ is $ϵ$-representative.
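The definition of an $ϵ$-representative sample can be checked directly on a toy problem. The sketch below is an assumption-laden illustration: it takes $D$ to be uniform over a small finite list of labelled points (so the true risk is an exact average), uses the 0-1 loss, and uses a hypothetical class of threshold classifiers. None of these choices come from the source.

```python
def zero_one_loss(h, z):
    """0-1 loss of hypothesis h on a labelled example z = (x, y)."""
    x, y = z
    return 0.0 if h(x) == y else 1.0


def risk(h, examples):
    """Average loss of h over a list of examples.

    Applied to a sample S this is L_S(h); applied to the full (uniform,
    finite) population it is L_D(h) under the toy assumption above.
    """
    return sum(zero_one_loss(h, z) for z in examples) / len(examples)


def max_gap(hypotheses, sample, population):
    """Largest |L_S(h) - L_D(h)| over the whole hypothesis class."""
    return max(abs(risk(h, sample) - risk(h, population)) for h in hypotheses)


def is_eps_representative(hypotheses, sample, population, eps):
    """True if the sample is eps-representative for every h in the class."""
    return max_gap(hypotheses, sample, population) <= eps


# Toy setup (all assumed): points 0..9, label 1 iff x >= 5,
# and threshold classifiers h_t(x) = 1 iff x >= t for t = 0..10.
population = [(x, int(x >= 5)) for x in range(10)]
hypotheses = [lambda x, t=t: int(x >= t) for t in range(11)]

# A sample that only ever sees the point x = 0 misjudges h_0 badly.
skewed_sample = [(0, 0)] * 10
print(max_gap(hypotheses, skewed_sample, population))  # 0.5
```

The check takes a maximum over the whole class, which is what makes the property uniform: one poorly estimated hypothesis is enough to make the sample non-representative.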

If a training set $S$ is $\frac{ϵ}{2}$-representative, then any output $h_{S}$ of $\mathrm{ERM}_{H} (S)$ satisfies $L_{D} (h_{S}) \leq \min_{h \in H} L_{D} (h) + ϵ$, because for every $h \in H$ we have $L_{D} (h_{S}) \leq L_{S} (h_{S}) + \frac{ϵ}{2} \leq L_{S} (h) + \frac{ϵ}{2} \leq L_{D} (h) + ϵ$. This means that if a class $H$ has the uniform convergence property with a function $m_{H}^{UC}$, then the class is PAC learnable with sample complexity $m_{H} (ϵ, δ) \leq m_{H}^{UC} (\frac{ϵ}{2}, δ)$. This can be used to prove that any finite hypothesis class is PAC learnable using the ERM algorithm with sample complexity:

$$m_{H} (ϵ, δ) \leq m_{H}^{UC} \left( \frac{ϵ}{2}, δ \right) \leq \left\lceil \frac{2 \log (2 | H | / δ)}{ϵ^{2}} \right\rceil$$

dropping the realizability assumption from above (now for general loss functions, provided the loss takes values in $[0, 1]$).
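As a rough illustration of the ERM rule and of this bound, the following Python sketch evaluates the bound (with $\log$ taken as the natural logarithm) and runs ERM over a small finite class. The helper names, the threshold classifiers, and the noisy toy sample are all assumptions made for the example.

```python
import math


def agnostic_sample_complexity(num_hypotheses: int, eps: float, delta: float) -> int:
    """Evaluate the bound ceil(2 * log(2|H| / delta) / eps^2) stated above."""
    if num_hypotheses < 1:
        raise ValueError("the hypothesis class must contain at least one hypothesis")
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil(2 * math.log(2 * num_hypotheses / delta) / eps ** 2)


def zero_one_loss(h, z):
    """0-1 loss on a labelled example z = (x, y); its range is within [0, 1]."""
    x, y = z
    return 0.0 if h(x) == y else 1.0


def empirical_risk(h, sample, loss):
    """L_S(h): average loss of h over the training sample."""
    return sum(loss(h, z) for z in sample) / len(sample)


def erm(hypotheses, sample, loss):
    """Return a hypothesis in the finite class with minimal empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))


# Hypothetical finite class: thresholds h_t(x) = 1 iff x >= t, for t = 0..5.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(6)]

# Noisy toy sample (assumed): no threshold fits it perfectly, so the
# realizability assumption fails and ERM only finds the best available fit.
sample = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]

best = erm(hypotheses, sample, zero_one_loss)
print(empirical_risk(best, sample, zero_one_loss))  # 0.2
print(agnostic_sample_complexity(1000, eps=0.1, delta=0.05))  # 2120
```

For the same assumed values ($| H | = 1000$, $ϵ = 0.1$, $δ = 0.05$), the realizable bound $\lceil \log (| H | / δ) / ϵ \rceil$ evaluates to 100, so removing realizability costs roughly a factor of $1 / ϵ$ in this bound.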

Links

Sources

Understanding Machine Learning: From Theory to Algorithms