Ridge regression is a method of penalizing coefficients in a regression model to produce estimates that are more stable, with lower variance, than those of an ordinary least squares model. Unlike the lasso, ridge regression shrinks coefficients toward 0 but does not set any of them exactly to 0, so it does not actually produce a model with fewer predictors; it tames the coefficients rather than eliminating them. The term “ridge” was applied by Arthur Hoerl, who formalized the method with Robert Kennard in 1970 and who saw similarities to the ridges of quadratic response surfaces. In ordinary least squares, one minimizes the residual sum of squares (RSS) – the sum of the squared differences between predicted and actual values. In ridge regression, one minimizes RSS + [lambda(sum of squared coefficients)].
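The penalized objective above has a closed-form solution: the ridge coefficients are (XᵀX + λI)⁻¹Xᵀy. The following sketch illustrates this with NumPy on simulated data; the function name `ridge_fit` and the simulated data are illustrative, and for simplicity the example assumes centered data so that no (unpenalized) intercept is needed.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: beta = (X'X + lam*I)^{-1} X'y.
    # Assumes X and y are centered, so no intercept term is fitted
    # (by convention the intercept is not penalized).
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(100)

print(ridge_fit(X, y, 0.0))    # lambda = 0 recovers ordinary least squares
print(ridge_fit(X, y, 100.0))  # large lambda: coefficients shrunk toward 0
```

With lambda = 0 the penalty vanishes and the estimates coincide with ordinary least squares; as lambda grows, every coefficient is pulled toward 0.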
The second term as a whole – [lambda(sum of squared coefficients)] – is termed the shrinkage penalty, because it has the effect of shrinking the coefficient estimates toward 0. Lambda itself, the tuning parameter, controls the strength of the penalty and is chosen by the user: lambda = 0 recovers ordinary least squares, and larger values shrink the coefficients more. Cross-validation is the standard way to choose an optimal value for lambda. For more information see Section 3.4.1 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which is available online.
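Choosing lambda by cross-validation can be sketched as follows: fit the model on part of the data, measure prediction error on the held-out part, and pick the lambda with the lowest average error. The helper names (`ridge_fit`, `cv_mse`) and the candidate lambda grid here are illustrative, not a standard API.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (assumes centered X and y).
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # k-fold cross-validated mean squared error for a given lambda.
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errors = []
    for hold in folds:
        train = np.setdiff1d(np.arange(n), hold)
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[hold] - X[hold] @ beta) ** 2))
    return np.mean(errors)

# Simulated data for illustration.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Pick the lambda with the lowest cross-validated error
# from an illustrative grid of candidates.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
print(best_lam)
```

In practice one would use a library routine (for example, scikit-learn's RidgeCV) rather than hand-rolled folds, but the logic is the same.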