PAC Learnability

A hypothesis class H is Probably Approximately Correct (PAC) learnable with respect to a set Z:=X×Yand a loss function l:H×ZR+, if there exists a function mH:(0,1)2N and a learning algorithm with the following property: For every ϵ,δ(0,1) and for every distribution D over Z, running the learning algorithm on m>mH(ϵ,δ) i.i.d. examples generated by D returns a hypothesis hH such that, with probability of at least 1δ (over the choice of the m training examples),

LD(h)minhH LD(h)+ϵ,

where LD(h)=EzD[l(h,z)]. The function mH determines the sample complexity of learning H.

Under a realizability assumption (hH s.t. LD,f(h)=0), we can guarantee an absolute level of error instead of just a relative one w.r.t. some minimizer within the hypothesis class. For example, under the realizability assumption, every finite hypothesis class for a binary classification task is PAC learnable (LD(h)ϵ with probability 1δ) with sample complexity:

mH(ϵ,δ)log|H|δϵ

Uniform convergence

A sufficient condition for PAC learnability is the uniform convergence condition, which essentially states that a sample training set is large enough that the empirical risks for all hypotheses in H are good approximations of their true risks. Formally a training set S is said to be ϵrepresentative if

hH,|LS(h)LD(h)|ϵ.

We say that the hypothesis class has the uniform convergence property (w.r.t. domain Z, hypothesis class H, loss function l, and distribution D) if there exists a function mHUC:(0,1)2N such that for every ϵ,δ(0,1) and for every distribution D over Z, if S is a sample of m>mHUC(ϵ,δ) i.i.d. examples from D, then, with probability of at least 1δ, S is ϵrepresentative.

If a training set is ϵ2representative, then any output hS of ERMH(S) satisfies LD(hs)LD(h)hH+ϵ. This means that if a class H has the uniform convergence property with a function mHUC, then the class is PAC learnable with sample complexity mH(ϵ,δ)mHUC(ϵ2,δ). This can be used to prove that any finite hypothesis class is PAC learnable using the ERM algorithm with sample complexity:

mH(ϵ,δ)mHUC(ϵ2,δ)2log(2|H|δ)ϵ2

dropping the realizability assumption from above (now for general loss functions).


Links

Sources