## #MonthOfJulia Day 32: Classification

Yesterday we had a look at Julia’s regression model capabilities. A natural counterpart to these is the family of models which perform classification. We’ll be looking at the GLM and DecisionTree packages. But, before I move on to that, I should mention the MLBase package, which provides a load of functionality for data preprocessing, performance evaluation, cross-validation and model tuning.

## Logistic Regression

Logistic regression lies on the border between the regression techniques we considered yesterday and the classification techniques we’re looking at today. In effect though it’s really a classification technique. We’ll use some data generated in yesterday’s post to illustrate. Specifically we’ll look at the relationship between the Boolean field `valid` and the three numeric fields.

To further refresh your memory, the plot below shows the relationship between `valid` and the variables `x` and `y`. We’re going to attempt to capture this relationship in our model.

Logistic regression is also applied with the `glm()` function from the GLM package. The call looks very similar to the one used for linear regression except that the error family is now `Binomial()` and we’re using a logit link function.
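As a rough sketch of what such a call can look like with a current GLM release: the data frame `points`, its size and the coefficients used to generate `valid` are all assumptions made up for this illustration (the data from yesterday’s post is not reproduced here), and `valid` is stored as 0/1 so that the response is plainly numeric.

```julia
using DataFrames, GLM, Random

Random.seed!(42)

# Synthetic stand-in for yesterday's data (an assumption, not the original).
n = 500
points = DataFrame(x = 10 .* rand(n), y = 30 .* rand(n), z = 20 .* rand(n))

# Let `valid` depend on the three numeric fields through a logistic curve.
# The coefficients are arbitrary and chosen only for illustration.
eta = 0.1 .* points.x .+ 0.5 .* (points.y .- 15) .+ 0.25 .* (points.z .- 10)
points.valid = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-eta)))

# Same shape as a linear regression call, but with a Binomial error family
# and a logit link.
model = glm(@formula(valid ~ x + y + z), points, Binomial(), LogitLink())

# Coefficient estimates, standard errors and p-values.
println(coeftable(model))

# Fitted probabilities that `valid` is true for each row.
probs = predict(model)
```

The coefficient table is where you would read off which fields have a significant relationship with `valid`.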

According to the model there is a significant relationship between `valid` and both `y` and `z` but not `x`. Looking at the plot above we can see that `x` does have an influence on `valid` (there is a gradual transition from false to true with increasing `x`), but that this effect is rather “fuzzy”, hence the large p-value. By comparison there is a very clear and abrupt change in `valid` at `y` values of around 15. The effect of `y` is also about twice as strong as that of `z`. All of this makes sense in light of the way that the data were constructed.

## Decision Trees

Now we’ll look at another classification technique: decision trees. First load the required packages and then grab the iris data.

We’ll also define a Boolean variable to split the data into training and testing sets.
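A minimal sketch of such a Boolean mask, using only the standard library. The 70/30 proportion and the names `is_train`, `train_idx` and `test_idx` are assumptions for illustration; the post does not say which split was used.

```julia
using Random

Random.seed!(1)

n = 150                      # number of rows in the iris data

# true marks a training row, false a testing row (roughly 70/30, assumed).
is_train = rand(n) .< 0.7

train_idx = findall(is_train)
test_idx = findall(.!is_train)

println("training rows: ", length(train_idx), ", testing rows: ", length(test_idx))
```

Because the mask is random, the two sets will not be exactly 70% and 30% of the rows, but every row lands in exactly one of them.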

We split the data into features and labels and then feed those into `build_tree()`. In this case we are building a classifier to identify whether or not a particular iris is of the versicolor variety.
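Something along these lines should work with a recent DecisionTree release. It is a sketch under a few assumptions: the iris data is taken from the copy bundled with the package via `load_data("iris")` (where the species labels carry an `Iris-` prefix) rather than from wherever the post loaded it, and the split proportion and the variable names are invented for the example.

```julia
using DecisionTree, Random

Random.seed!(1)

# Iris data bundled with DecisionTree.jl (assumed source for this sketch).
features, labels = load_data("iris")
features = float.(features)
labels = string.(labels)

# Binary target: versicolor versus everything else.
is_versicolor = labels .== "Iris-versicolor"
target = ifelse.(is_versicolor, "versicolor", "other")

# Boolean mask for an assumed ~70/30 train/test split.
is_train = rand(length(target)) .< 0.7

# Note the argument order: labels first, then the feature matrix.
model = build_tree(target[is_train], features[is_train, :])

# Textual representation of the tree, down to a depth of 5.
print_tree(model, 5)

# Predicted labels for the held-out rows.
predicted = apply_tree(model, features[.!is_train, :])
```

The exact tree you get depends on the random split, so the thresholds printed will generally differ from run to run unless the seed is fixed.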

Let’s have a look at the product of our labours.

The textual representation of the tree above breaks the decision process down into a number of branches where the model decides whether to go to the left (L) or right (R) branch according to whether the value of a given feature is below or above a threshold value. So, for example, on the third line of the output we must decide whether to move to the left or right depending on whether feature 3 (PetalLength) is less than or greater than 4.8.
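To make the left/right logic concrete, here is a toy version of that traversal written from scratch. It is not how DecisionTree stores its trees: the `ToyLeaf` and `ToyNode` types, the `classify` function and the leaf labels are all hypothetical, and only the single split on feature 3 at 4.8 is taken from the text.

```julia
# A leaf simply holds the answer for that branch.
struct ToyLeaf
    label::String
end

# An internal node tests one feature against a threshold.
struct ToyNode
    feature::Int
    threshold::Float64
    left::Union{ToyNode, ToyLeaf}
    right::Union{ToyNode, ToyLeaf}
end

# Walk down the tree: go left when the feature value is below the
# threshold, otherwise go right, until a leaf is reached.
classify(leaf::ToyLeaf, x) = leaf.label
classify(node::ToyNode, x) =
    x[node.feature] < node.threshold ? classify(node.left, x) : classify(node.right, x)

# A single split on feature 3 (PetalLength) at 4.8.
toy_tree = ToyNode(3, 4.8, ToyLeaf("left branch"), ToyLeaf("right branch"))

short_petal = classify(toy_tree, [6.0, 2.9, 4.5, 1.5])   # PetalLength 4.5
long_petal = classify(toy_tree, [6.3, 2.9, 5.6, 1.8])    # PetalLength 5.6

println(short_petal)
println(long_petal)
```

A real tree just nests more of these nodes, each one narrowing down the region of feature space until a leaf assigns the class.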

We can then apply the decision tree model to the testing data and see how well it performs using standard metrics.

A true positive rate of 87.5% and true negative rate of 100% is not too bad at all!
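For reference, both rates can be computed directly from the actual and predicted labels. The helper `rates` below is a hypothetical name, and the vectors are toy data picked so the arithmetic is easy to follow; they are not the predictions from the model above.

```julia
# True positive rate: fraction of actual positives predicted as positive.
# True negative rate: fraction of actual negatives predicted as negative.
function rates(actual::AbstractVector{Bool}, predicted::AbstractVector{Bool})
    tp = count(actual .& predicted)
    fn = count(actual .& .!predicted)
    tn = count(.!actual .& .!predicted)
    fp = count(.!actual .& predicted)
    return (tpr = tp / (tp + fn), tnr = tn / (tn + fp))
end

# Toy example: 8 positives and 4 negatives, with one positive missed.
actual = [trues(8); falses(4)]
predicted = copy(actual)
predicted[1] = false

r = rates(actual, predicted)
println("TPR = ", r.tpr, ", TNR = ", r.tnr)
```

With one of eight positives missed and no false alarms, the true positive rate is 7/8 and the true negative rate is 4/4.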

You can find a more extensive introduction to using decision trees with Julia here. The DecisionTree package also implements random forest and boosting models. Other related packages are:

- SVM (support vector machines);
- kNN (k-nearest neighbours);
- GradientBoost (gradient boosting);
- XGBoost (extreme gradient boosting);
- Orchestra (ensemble learning).

Definitely worth checking out if you have the time. My time is up though. Come back soon to hear about what Julia provides for evolutionary programming.