Machine learning

Algorithms for converting experience (training data) into expertise/knowledge (prediction function or program). Can be categorized along different axes:

Formal Model - Statistical Learning Framework

Formally, given a distribution D over data space X, ground truth target function f:XY, training data S=((x1y1),(xm,ym)), a learning algorithm outputs a prediction function hS:XY.

The generalization error of h (for a classification problem) is the probability of randomly choosing an example x for which h(x)f(x), i.e.,

LD,f(h):=PxD[h(x)f(x)]

The goal of the algorithm is to find the predictor (which depends on the training set S) that minimizes the error with respect to the unknown D and f. The training error or empirical risk/error LS(h) is the error that the classifier incurs over the training samples.

Empirical risk minimization (ERM) is the simple learning paradigm that aims to minimize the training error, but one needs to watch out for overfitting, wherein the model performs excellently on the training data but poorly on the true distribution. A common remedy is to restrict the hypothesis class H of the predictor functions h choosing in advance the set of predictors we will optimize over based on some prior knowledge we have about the learning problem. This biases the algorithm to a particular set of predictors (inductive bias). Formally,

ERMH(S)argminhH LS(h)

LD,f(hS) depends on the training set S which arises from a random process (usually i.i.d sampling). This means there is randomness in the choice of the predictor hS and, consequently, in the risk LD,f(hS). The sample could be unrepresentative with some probability, and even if not, there could be errors due to the finiteness of the sample. This can be formalized by the notion of PAC Learnability, which characterizes learning problems (specifically hypothesis sets for arbitrary data-label distributions) that can be solved to arbitrarily low error rates ϵ with arbitrarily low probability of failure δ given a sufficient number of samples (sample complexity).

The limits of learnability

Although some guarantees on learning problems can be derived, it can also be proved that there is no panacea. This is called the No-Free-Lunch theorem, a formal statement that no learning algorithm is perfect for every possible dataset out there. This arises because the PAC learning framework makes absolutely no assumptions about the data distribution, so one can always construct an adversarial distribution for the outputs of any given learning algorithm, including an ERM learner.

We can decompose the error of an ERMH predictor into two components. If hS is an ERM hypothesis:

LD(hS)=ϵapp+ϵest,

where ϵapp=minhH LD(h) and ϵest=LD(hS)ϵapp. ϵapp approximation error is the bias of the learning algorithm, i.e., the minimum possible risk achievable in the hypothesis class. ϵest is the estimation error arising from not being guaranteed to find the true risk minimizer given a finite training sample.


Links

Sources