Machine Learning 101 - What is regularization?

June 13, 2016

When do we use regularization?

In machine learning and statistics, a common task is to fit a model to a set of training data. The fitted model can then be used to make predictions or to classify new data points.

When the model fits the training data well but predicts poorly on new data, that is, when it generalizes badly, we have an overfitting problem.

Regularization is a technique used to avoid this overfitting problem.

The idea behind regularization is that models that overfit the data are complex models that have, for example, too many parameters. In the example below, three different models are fitted to the same dataset.

We used polynomials of different degrees: 1 (linear), 2 (quadratic) and 3 (cubic).

Notice how the cubic polynomial “sticks” to the data but does not describe the underlying relationship of the data points.
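To make the comparison concrete, here is a minimal sketch that fits polynomials of degree 1, 2 and 3 to the same points and reports the training error of each. The dataset, the random seed and the helper names `fit_polynomial` and `training_error` are assumptions made for illustration; they are not the data or code behind the article's figure.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical toy dataset (not the article's data): a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

def fit_polynomial(x, y, degree):
    """Least-squares fit; returns alpha_0..alpha_d (lowest power first)."""
    return P.polyfit(x, y, degree)

def training_error(alpha, x, y):
    """Sum of squared residuals on the training points."""
    predictions = P.polyval(x, alpha)
    return float(np.sum((predictions - y) ** 2))

# Fit the linear, quadratic and cubic models to the same data.
errors = {d: training_error(fit_polynomial(x, y, d), x, y) for d in (1, 2, 3)}
for degree, err in errors.items():
    print(f"degree {degree}: training error = {err:.5f}")
```

Because each higher degree contains the lower ones as a special case, the training error can only go down as the degree grows. A low training error on its own therefore says nothing about whether the model captures the underlying relationship.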

[Figure: Different fitting models]

[Interactive visualization: Cost function visualization, showing the training error and the complexity term as the regularization parameter λ is varied]

How does regularization work?

In order to find the best model, the common method in machine learning is to define a loss or cost function that describes how well the model fits the data. The goal is to find the model that minimizes this loss function. In the case of polynomials, we can define $L$ as follows:

$$L = \sum_{m}\left(\sum_{i=0}^{d}\alpha_{i}x_{m}^{i} - y_{m}\right)^2$$

$\sum_{i=0}^{d}\alpha_{i}x^{i}$ is the polynomial expression (our model).

$(x_{m}, y_{m})$ are our observations (the data points we are modeling), indexed by $m$.

Example for $x^2$: $d = 2$, $\alpha_{0} = 0$, $\alpha_{1} = 0$, $\alpha_{2} = 1$
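The loss can be evaluated directly from its definition. The sketch below is one possible way to write it with NumPy; the function name `polynomial_loss` and the four observations are assumptions chosen for illustration, with the points placed exactly on the curve y = x^2.

```python
import numpy as np

def polynomial_loss(alpha, x, y):
    """L = sum_m (sum_i alpha_i * x_m**i - y_m)**2, with alpha = [alpha_0, ..., alpha_d]."""
    alpha = np.asarray(alpha, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    powers = np.arange(alpha.size)           # exponents 0..d
    design = x[:, None] ** powers[None, :]   # design[m, i] = x_m**i
    predictions = design @ alpha             # polynomial evaluated at every x_m
    return float(np.sum((predictions - y) ** 2))

# Hypothetical observations lying exactly on y = x^2.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 4.0, 9.0]

loss_quadratic = polynomial_loss([0.0, 0.0, 1.0], x, y)  # the model x^2
loss_linear = polynomial_loss([0.0, 1.0], x, y)          # the model x
print(loss_quadratic, loss_linear)
```

The quadratic model reproduces every point, so its loss is zero. The linear model misses the last two points by 2 and 6, giving a loss of 4 + 36 = 40.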

The idea is to penalize this loss function by adding a complexity term that gives a bigger loss for more complex models. In our case, we can use the sum of the squared polynomial coefficients, leaving out the constant term $\alpha_{0}$.

$$L = \sum_{m}\left(\sum_{i=0}^{d}\alpha_{i}x_{m}^{i} - y_{m}\right)^2 + \lambda\sum_{i=1}^{d}\alpha_{i}^2$$
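For this particular penalty, the minimizer can be written in closed form, an approach usually called ridge regression (a name the article itself does not use). The sketch below assumes that setting and is only one way to do it: setting the gradient of the penalized loss to zero gives the linear system (XᵀX + λD)α = Xᵀy, where X holds the powers of x and D is the identity matrix with its first diagonal entry set to 0 so that α₀ is not penalized. The function names and the toy data are hypothetical.

```python
import numpy as np

def design_matrix(x, degree):
    """Matrix X with X[m, i] = x_m**i for i = 0..degree."""
    x = np.asarray(x, dtype=float)
    return x[:, None] ** np.arange(degree + 1)[None, :]

def regularized_loss(alpha, x, y, lam):
    """Training error plus lam times the sum of alpha_i**2 for i >= 1."""
    alpha = np.asarray(alpha, dtype=float)
    residuals = design_matrix(x, alpha.size - 1) @ alpha - np.asarray(y, dtype=float)
    training_error = float(np.sum(residuals ** 2))
    complexity = float(np.sum(alpha[1:] ** 2))  # alpha_0 is not penalized
    return training_error + lam * complexity

def fit_regularized(x, y, degree, lam):
    """Solve (X^T X + lam * D) alpha = X^T y, the zero-gradient condition."""
    X = design_matrix(x, degree)
    D = np.eye(degree + 1)
    D[0, 0] = 0.0                               # leave the constant term free
    return np.linalg.solve(X.T @ X + lam * D, X.T @ np.asarray(y, dtype=float))

# Hypothetical noisy samples of y = x^2, fitted with a cubic.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

alpha_plain = fit_regularized(x, y, degree=3, lam=0.0)  # no penalty
alpha_reg = fit_regularized(x, y, degree=3, lam=1.0)    # penalized fit
print("lam = 0:", np.round(alpha_plain, 3))
print("lam = 1:", np.round(alpha_reg, 3))
```

With λ = 0 this reduces to an ordinary least-squares fit; with λ > 0 the coefficients α₁ to α_d are pulled toward zero at the cost of a somewhat larger training error.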

In the visualization above, you can play around with the value of λ to penalize complex models more or less heavily.

This way, when λ is very large, models with high complexity are ruled out, and when λ is small, models with high training error are ruled out. The optimal solution lies somewhere in the middle.
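The trade-off can also be checked numerically. The following sketch refits a cubic for several values of λ and prints the two parts of the loss separately. The data, the chosen λ values and the closed-form solve are assumptions for illustration and are not taken from the article's visualization.

```python
import numpy as np

# Hypothetical noisy samples of y = x^2.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 10)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

degree = 3
X = x[:, None] ** np.arange(degree + 1)[None, :]  # X[m, i] = x_m**i
D = np.eye(degree + 1)
D[0, 0] = 0.0                                     # alpha_0 is not penalized

results = []
for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
    # Minimizer of the penalized loss for this value of lambda.
    alpha = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    training_error = float(np.sum((X @ alpha - y) ** 2))
    complexity = float(np.sum(alpha[1:] ** 2))
    results.append((lam, training_error, complexity))
    print(f"lambda = {lam:6.1f}  training error = {training_error:.4f}  "
          f"complexity term = {complexity:.4f}")
```

As λ grows, the complexity term shrinks while the training error rises, which is the tension described above. Choosing a good λ in practice usually means checking performance on data that was not used for fitting.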

Hi, I'm Hassen. I'm a Product engineer based in Paris 🇫🇷. I'm currently building data products at YOOI.