Machine learning

Algorithms for converting experience (training data) into expertise/knowledge (prediction function or program). Can be categorized along different axes:

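Supervised learning (training examples carry extra information, such as labels, that is missing from test examples and is the object of prediction) vs Unsupervised learning (no functional difference between training and test data).
Active learning (the learner interacts with the environment at training time, e.g. by posing queries) vs Passive learning (the learner only observes the data it is given)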

Formal Model - Statistical Learning Framework

Formally, given a distribution $D$ over data space $X$ , ground truth target function $f : X \to Y$ , and training data $S = ((x_{1}, y_{1}), \dots, (x_{m}, y_{m}))$ with each $x_{i}$ drawn from $D$ and $y_{i} = f (x_{i})$ , a learning algorithm outputs a prediction function $h_{S} : X \to Y$ .

The generalization error of $h$ (for a classification problem) is the probability of randomly choosing an example $x$ for which $h (x) \neq f (x)$ , i.e.,

$$L_{D, f} (h) := \underset{x \sim D}{\mathbb{P}} [h (x) \neq f (x)]$$

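The goal of the algorithm is to find a predictor $h_{S}$ (which depends on the training set $S$ ) that minimizes the error with respect to the unknown $D$ and $f$ . Because $D$ and $f$ are unknown, the learner cannot compute $L_{D, f}$ directly. What it can compute is the training error, or empirical risk/error, $L_{S} (h) := \frac{| \{ i : h (x_{i}) \neq y_{i} \} |}{m}$ , the fraction of training examples that the classifier gets wrong.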
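As a rough illustration of the two error notions, the sketch below computes the training error on a small sample and approximates the generalization error by Monte Carlo sampling. The setup is hypothetical and chosen only for the example: $D$ is uniform on $[0, 1)$, the target `f` thresholds at 0.5, and the predictor `h` thresholds at 0.6, so the true error is exactly 0.1.

```python
import random

def empirical_risk(h, sample):
    """Fraction of labelled examples (x, y) in `sample` that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Hypothetical toy problem (an assumption for this sketch, not from the notes):
# D is uniform on [0, 1) and the target f labels x >= 0.5 as 1.
def f(x):
    return int(x >= 0.5)

# A fixed predictor that disagrees with f exactly on [0.5, 0.6).
def h(x):
    return int(x >= 0.6)

rng = random.Random(0)  # seeded so the run is reproducible

# Training set S of m = 50 examples, labelled by f.
S = [(x, f(x)) for x in (rng.random() for _ in range(50))]
train_error = empirical_risk(h, S)

# The learner cannot compute L_{D,f}(h); here we can approximate it only
# because we chose D and f ourselves. The exact value is 0.1.
fresh = [(x, f(x)) for x in (rng.random() for _ in range(100_000))]
estimated_true_error = empirical_risk(h, fresh)

print(train_error, estimated_true_error)
```

With only 50 training points the training error can differ noticeably from 0.1, while the large fresh sample lands close to it.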

Empirical risk minimization (ERM) is the simple learning paradigm that aims to minimize the training error. One needs to watch out for overfitting, wherein the model performs excellently on the training data but poorly on the true distribution. A common remedy is to restrict the search to a hypothesis class $H$ of predictor functions $h$ , choosing in advance the set of predictors we will optimize over, based on some prior knowledge we have about the learning problem. This biases the algorithm towards a particular set of predictors (inductive bias). Formally,

$$\mathrm{ERM}_{H} (S) \in \underset{h \in H}{\operatorname{argmin}} \; L_{S} (h)$$
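A minimal sketch of ERM over a finite hypothesis class follows. The class (eleven threshold classifiers on a grid) and the six training points are made up for the example. Ties between equally good hypotheses are broken by taking the first one, which is one arbitrary choice among the minimizers that the $\in$ in the definition allows.

```python
def empirical_risk(h, sample):
    """Fraction of labelled examples (x, y) in `sample` that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def make_threshold(t):
    """Threshold classifier h_t(x) = 1 if x >= t else 0."""
    def h(x):
        return int(x >= t)
    return h

def erm(hypotheses, sample):
    """Return a hypothesis with minimal training error (first one on ties)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# Hypothetical finite class H: thresholds at 0.0, 0.1, ..., 1.0.
H = [make_threshold(i / 10) for i in range(11)]

# Hypothetical training set, consistent with some threshold in (0.45, 0.62].
S = [(0.1, 0), (0.35, 0), (0.45, 0), (0.62, 1), (0.8, 1), (0.95, 1)]

h_S = erm(H, S)
print(empirical_risk(h_S, S))
```

Restricting the search to `H` is the inductive bias: the learner can never return a predictor outside this list, however well it might fit the sample.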

$L_{D, f} (h_{S})$ depends on the training set $S$ , which arises from a random process (usually i.i.d. sampling from $D$ ). This means there is randomness in the choice of the predictor $h_{S}$ and, consequently, in the risk $L_{D, f} (h_{S})$ . The sample could be unrepresentative with some probability, and even a representative sample is finite, so it cannot pin down $D$ exactly. Both effects are formalized by the notion of PAC (probably approximately correct) learnability. A hypothesis class is PAC learnable if, for every data-label distribution, a learner given sufficiently many samples returns, with probability at least $1 - δ$ , a predictor whose error is within $ϵ$ of the best achievable in the class. The accuracy $ϵ$ and the failure probability $δ$ can both be made arbitrarily small, and the number of samples needed as a function of $ϵ$ and $δ$ is the sample complexity.
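Written out, the (agnostic) PAC condition reads roughly as follows. The names $A$ for the learner and $m_{H}$ for the sample complexity function are introduced here for the statement, and $L_{D}$ denotes the risk with respect to a distribution $D$ over $X \times Y$ . A class $H$ is agnostic PAC learnable if there exist $m_{H} : (0, 1)^{2} \to \mathbb{N}$ and a learner $A$ such that, for every $ϵ, δ \in (0, 1)$ and every distribution $D$ ,

$$
m \geq m_{H} (ϵ, δ) \;\Longrightarrow\; \underset{S \sim D^{m}}{\mathbb{P}} \left[ L_{D} (A (S)) \leq \min_{h \in H} L_{D} (h) + ϵ \right] \geq 1 - δ
$$

When some hypothesis in $H$ has zero error (the realizable case), the minimum on the right vanishes and the condition becomes $L_{D} (A (S)) \leq ϵ$ .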

The limits of learnability

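Although some guarantees on learning problems can be derived, it can also be proved that there is no panacea. This is the No-Free-Lunch theorem: no learning algorithm succeeds on every learning task. For any learner trained on a sample that covers only part of the domain, there is a distribution on which it fails (with non-negligible probability it returns a predictor with high error), even though some other learner would succeed on that same distribution. This arises because a learner with no prior knowledge makes no assumptions about the data distribution, so one can always construct an adversarial distribution against any given learning algorithm, including an ERM learner over the class of all functions. Some prior knowledge, such as a restricted hypothesis class, is therefore unavoidable.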
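The averaging argument behind the theorem can be checked by brute force on a tiny domain. The sketch below is an illustration, not the theorem itself. The four-point domain and the two learners are hypothetical. Each learner sees the labels of two points, and its error on the two unseen points is averaged over all 16 possible target functions.

```python
from itertools import product

X = [0, 1, 2, 3]      # tiny domain (an assumption for this sketch)
train_x = [0, 1]      # points whose labels the learner gets to see
unseen = [x for x in X if x not in train_x]

def memorize_then_zero(sample):
    """Learner 1: memorize the sample, predict 0 on unseen points."""
    table = dict(sample)
    return lambda x: table.get(x, 0)

def memorize_then_majority(sample):
    """Learner 2: memorize the sample, predict the majority training label elsewhere."""
    table = dict(sample)
    labels = list(table.values())
    default = int(2 * sum(labels) >= len(labels))
    return lambda x: table.get(x, default)

def average_unseen_error(learner):
    """Error on the unseen points, averaged over every target f: X -> {0, 1}."""
    errors = []
    for labels in product([0, 1], repeat=len(X)):
        f = dict(zip(X, labels))
        h = learner([(x, f[x]) for x in train_x])
        errors.append(sum(h(x) != f[x] for x in unseen) / len(unseen))
    return sum(errors) / len(errors)

avg_zero = average_unseen_error(memorize_then_zero)
avg_majority = average_unseen_error(memorize_then_majority)
print(avg_zero, avg_majority)
```

Both averages come out to 0.5: whatever rule a learner uses off the sample, it is right on exactly half of the possible targets. Since the average is 0.5, at least one target forces error of at least 0.5 on the unseen points, which is the adversarial choice the theorem relies on.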

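We can decompose the error of an $\mathrm{ERM}_{H}$ predictor into two components. Here $L_{D}$ denotes the risk with respect to a distribution $D$ over $X \times Y$ , so no target function $f$ appears. If $h_{S}$ is an ERM hypothesis: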

$$L_{D} (h_{S}) = ϵ_{app} + ϵ_{est},$$

where $ϵ_{app} = \min_{h \in H} L_{D} (h)$ and $ϵ_{est} = L_{D} (h_{S}) - ϵ_{app}$ . The approximation error $ϵ_{app}$ measures the inductive bias of the learning algorithm: it is the minimum possible risk achievable within the hypothesis class $H$ and does not depend on the sample. In other words, it reflects the limitation of the prior knowledge we impose through the choice of $H$ . The estimation error $ϵ_{est}$ arises because, given only a finite training sample, ERM is not guaranteed to find the risk minimizer within $H$ . Agnostic PAC learnability requires that the estimation error can be bounded uniformly over all distributions, i.e., made smaller than any $ϵ$ with probability at least $1 - δ$ given enough samples.
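To make the decomposition concrete, here is a small sketch with exact arithmetic. Everything in it is a made-up toy: $D$ is uniform on ten points, the target is an interval (which no threshold can represent, so $ϵ_{app} > 0$ ), and $H$ is a set of eleven thresholds.

```python
from fractions import Fraction

X = range(10)  # D is uniform on {0, ..., 9} (an assumption for this sketch)

def f(x):
    """Hypothetical target: an interval, which lies outside the threshold class."""
    return int(3 <= x <= 6)

def make_threshold(t):
    def h(x):
        return int(x >= t)
    return h

# Hypothesis class H: thresholds t = 0, ..., 10, keyed by t.
H = {t: make_threshold(t) for t in range(11)}

def true_risk(h):
    """Exact L_D(h), computable here only because D and f are known."""
    return Fraction(sum(h(x) != f(x) for x in X), len(X))

def empirical_risk(h, sample):
    return Fraction(sum(h(x) != y for x, y in sample), len(sample))

# A small, unrepresentative training sample labelled by f.
S = [(4, 1), (5, 1), (8, 0), (9, 0)]

# ERM over H; ties go to the smallest threshold (dict insertion order).
t_S = min(H, key=lambda t: empirical_risk(H[t], S))
h_S = H[t_S]

eps_app = min(true_risk(h) for h in H.values())  # best achievable in H
eps_est = true_risk(h_S) - eps_app               # price of learning from S

print(t_S, eps_app, eps_est)
```

Here the best threshold ( $t = 3$ ) still errs on three of the ten points, so $ϵ_{app} = 3/10$ regardless of the sample. The sample happens to favour $t = 0$ , whose true risk is $6/10$ , giving $ϵ_{est} = 3/10$ . A richer class would shrink the first term but typically inflate the second.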

Although finiteness of a hypothesis class guarantees (agnostic) PAC learnability, with sample complexity $\leq \left\lceil \frac{2 \log (2 | H | / δ)}{ϵ^{2}} \right\rceil$ , finiteness is not a necessary condition. Infinite hypothesis classes can be learnable too. For binary classification, the correct characterization is given by the VC-dimension: a hypothesis class is PAC learnable if and only if its VC-dimension is finite.
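Both statements can be explored numerically. The sketch below evaluates the finite-class bound (taking $\log$ as the natural logarithm) and brute-force checks shattering for the infinite class of thresholds $h_{t} (x) = 1$ if $x \geq t$ , else $0$ . The function names and the example numbers are chosen for illustration only.

```python
import math

def agnostic_sample_complexity_bound(h_size, epsilon, delta):
    """Upper bound ceil(2 * ln(2|H| / delta) / epsilon^2) for a finite class."""
    return math.ceil(2 * math.log(2 * h_size / delta) / epsilon ** 2)

def threshold_labelings(points):
    """All labelings of `points` realizable by thresholds h_t(x) = [x >= t].

    Only thresholds at the points themselves, plus one above the maximum,
    need to be tried: any other t labels the points like one of these.
    """
    candidates = sorted(points) + [max(points) + 1]
    return {tuple(int(x >= t) for x in points) for t in candidates}

def shattered_by_thresholds(points):
    """True if thresholds realize all 2^n labelings of the n given points."""
    return len(threshold_labelings(points)) == 2 ** len(points)

# Example numbers (arbitrary): |H| = 1000, epsilon = 0.1, delta = 0.05.
m = agnostic_sample_complexity_bound(1000, 0.1, 0.05)

one_point = shattered_by_thresholds([0.5])
two_points = shattered_by_thresholds([0.2, 0.7])

print(m, one_point, two_points)
```

The bound grows only logarithmically in $| H |$ and $1 / δ$ but quadratically in $1 / ϵ$ . For thresholds, a single point is shattered but the pair is not, since no threshold labels the smaller point 1 and the larger point 0. The same argument applies to any pair, so this infinite class has VC-dimension 1 and is still learnable.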

Links

Sources

Understanding Machine Learning: From Theory to Algorithms (Shai Shalev-Shwartz and Shai Ben-David)