Algorithms for converting experience (training data) into expertise/knowledge (prediction function or program). Can be categorized along different axes:
Supervised learning (privileged information present in training data that's the object of prediction in test data) vs Unsupervised learning (no functional difference between training and test data).
Formally, given a distribution over data space , ground truth target function , training data , a learning algorithm outputs a prediction function .
The generalization error of (for a classification problem) is the probability of randomly choosing an example for which , i.e.,
The goal of the algorithm is to find the predictor (which depends on the training set ) that minimizes the error with respect to the unknown and . The training error or empirical risk/error is the error that the classifier incurs over the training samples.
Empirical risk minimization (ERM) is the simple learning paradigm that aims to minimize the training error, but one needs to watch out for overfitting, wherein the model performs excellently on the training data but poorly on the true distribution. A common remedy is to restrict the hypothesis class of the predictor functions choosing in advance the set of predictors we will optimize over based on some prior knowledge we have about the learning problem. This biases the algorithm to a particular set of predictors (inductive bias). Formally,
depends on the training set which arises from a random process (usually i.i.d sampling). This means there is randomness in the choice of the predictor and, consequently, in the risk . The sample could be unrepresentative with some probability, and even if not, there could be errors due to the finiteness of the sample. This can be formalized by the notion of PAC Learnability, which characterizes learning problems (specifically hypothesis sets for arbitrary data-label distributions) that can be solved to arbitrarily low error rates with arbitrarily low probability of failure given a sufficient number of samples (sample complexity).
The limits of learnability
Although some guarantees on learning problems can be derived, it can also be proved that there is no panacea. This is called the No-Free-Lunch theorem, a formal statement that no learning algorithm is perfect for every possible dataset out there. This arises because the PAC learning framework makes absolutely no assumptions about the data distribution, so one can always construct an adversarial distribution for the outputs of any given learning algorithm, including an ERM learner.
We can decompose the error of an predictor into two components. If is an ERM hypothesis:
where and . approximation error is the bias of the learning algorithm, i.e., the minimum possible risk achievable in the hypothesis class. is the estimation error arising from not being guaranteed to find the true risk minimizer given a finite training sample.