## #MonthOfJulia Day 31: Regression

Today we’ll be looking at two packages for regression analyses in Julia: GLM and GLMNet. Let’s get both of those loaded so that we can begin.

Next we’ll create a synthetic data set which we’ll use for illustration purposes.

By design there is a linear relationship between the `x`

and `y`

fields. We can extract that relationship from the data using `glm()`

.

The third and forth arguments to `glm()`

stipulate that we are applying simple linear regression where we expect the residuals to have a Normal distribution. The parameter estimates are close to what was expected, taking into account the additive noise introduced into the data. The call to `glm()`

seems rather verbose for something as simple as linear regression and, consequently, there is a shortcut `lm()`

which gets the same result with less fuss.

Using the result of `glm()`

we can directly access the estimated coefficients along with their standard errors and the associated covariance matrix.

The data along with the linear regression fit are shown below.

Moving on to the GLMNet package, which implements linear models with penalised maximum likelihood estimators. We’ll use the Boston housing data from R’s MASS package for illustration.

Running `glmnet()`

which will fit models for various values of the regularisation parameter, λ.

The result is a set of 76 different models. We’ll have a look at the intercepts and coefficients for the first ten models (which correspond to the largest values of λ). The coefficients are held in the `betas`

field, which is an array with a column for each model and a row for each coefficient. Since the first few models are strongly penalised, each has only a few non-zero coefficients.

Now that we’ve got a bundle of models, how do we choose among them? Cross-validation, of course!

We find that the best results (on the basis of loss) were achieved when λ had a value of 0.028, which is relatively weak regularisation. We’ll put the parameters of the corresponding model neatly in a data frame.

From the fit coefficients we can conclude, for example, that average house value increases with the number of rooms in the house (`Rm`

) but decreases with nitrogen oxides concentration (`NOx`

), which is a proxy for traffic intensity.

Whew! That was exhilarating but exhausting. As a footnote, please have a look at the thesis “RegTools: A Julia Package for Assisting Regression Analysis” by Muzhou Liang. The RegTools package is available here. As always, the full code for today is available on github. Next time we’ll look at classification models. Below are a couple of pertinent videos to keep you busy in the meantime.