#MonthOfJulia Day 32: Classification
Yesterday we had a look at Julia’s regression model capabilities. The natural counterparts to these are models which perform classification. We’ll be looking at the GLM and DecisionTree packages. But, before I move on to that, I should mention the MLBase package, which provides a load of functionality for data preprocessing, performance evaluation, cross-validation and model tuning.
Logistic regression lies on the border between the regression techniques we considered yesterday and the classification techniques we’re looking at today. In effect though it’s really a classification technique. We’ll use some data generated in yesterday’s post to illustrate. Specifically we’ll look at the relationship between the Boolean field valid and the three numeric fields.
To further refresh your memory, the plot below shows the relationship between valid and the variables x and y. We’re going to attempt to capture this relationship in our model.
Logistic regression is also applied with the glm() function from the GLM package. The call looks very similar to the one used for linear regression except that the error family is now Binomial() and we’re using a logit link function.
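The data frame from yesterday’s post is not reproduced here, so what follows is only a sketch of what such a call might look like. It assumes a recent GLM.jl (with the `@formula` macro) and DataFrames.jl. The `points` data frame, the seed and the coefficients used to generate it are all invented for illustration.

```julia
using DataFrames, GLM, Random

# Synthetic stand-in for the data from the previous post (all values are made up).
rng = MersenneTwister(32)
n = 500
x = 30 .* rand(rng, n)
y = 30 .* rand(rng, n)
z = 30 .* rand(rng, n)

# Assumed generating process: y matters most, z somewhat, x not at all.
eta = 0.4 .* (y .- 15) .+ 0.2 .* (z .- 15)
p = 1 ./ (1 .+ exp.(-eta))
valid = Int.(rand(rng, n) .< p)   # response stored as 0/1

points = DataFrame(x = x, y = y, z = z, valid = valid)

# Binomial error family plus a logit link gives logistic regression.
model = glm(@formula(valid ~ x + y + z), points, Binomial(), LogitLink())
println(coeftable(model))
```

The coefficient table printed at the end is where the estimates and p-values for each field would be read off.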
According to the model there is a significant relationship between valid and both y and z but not x. Looking at the plot above we can see that x does have an influence on valid (there is a gradual transition from false to true with increasing x), but that this effect is rather “fuzzy”, hence the large p-value. By comparison there is a very clear and abrupt change in valid at y values of around 15. The effect of y is also about twice as strong as that of z. All of this makes sense in light of the way that the data were constructed.
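For reference, the model being fitted can be written out explicitly. The symbols $\beta_0$, $\beta_x$, $\beta_y$ and $\beta_z$ are notation introduced here for the intercept and the three coefficients; they are not values taken from the fit.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_x x + \beta_y y + \beta_z z,
\qquad
p = \Pr(\text{valid} = \text{true})
  = \frac{1}{1 + e^{-(\beta_0 + \beta_x x + \beta_y y + \beta_z z)}}
```

A one-unit increase in y multiplies the odds $p/(1-p)$ by $e^{\beta_y}$. If “twice as strong” is read as a comparison of coefficients, it would correspond to $\beta_y \approx 2\beta_z$ on the log-odds scale.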
Next up are decision trees, for which we’ll use the DecisionTree package and the iris data. We’ll also define a Boolean variable to split the data into training and testing sets.
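A Boolean split of this kind might look something like the sketch below, which uses only the standard library. The 75% training fraction, the seed and the name `train` are assumptions made for illustration.

```julia
using Random

rng = MersenneTwister(1)   # fixed seed so the split is reproducible
n = 150                    # the iris data has 150 rows

# true marks a training row, false a testing row; roughly 75% go to training.
train = rand(rng, n) .< 0.75

println("training rows: ", count(train), ", testing rows: ", count(.!train))
```

Indexing with `train` then selects the training rows, and indexing with `.!train` selects the testing rows.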
We split the data into features and labels and then feed those into build_tree(). In this case we are building a classifier to identify whether or not a particular iris is of the versicolor variety.
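As a rough sketch of that step, assuming the DecisionTree.jl package and the iris data it bundles via `load_data("iris")`: the name `target` and the "versicolor"/"other" labels are made up here, and for brevity the tree is fitted on all rows rather than on a training subset.

```julia
using DecisionTree

# Bundled iris data: a 150×4 feature matrix and a vector of species names.
features, labels = load_data("iris")
features = float.(features)

# Binary label: is this iris a versicolor or not?
target = [occursin("versicolor", string(l)) ? "versicolor" : "other" for l in labels]

# Note the argument order: labels first, then features.
model = build_tree(target, features)

# Sanity check: classify the same rows the tree was built on.
predicted = apply_tree(model, features)
println("agreement on the fitting data: ", sum(predicted .== target), " of ", length(target))
```

Agreement on the fitting data says little about how the tree generalises, which is why a separate testing set is needed.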
Let’s have a look at the product of our labours.
The textual representation of the tree above breaks the decision process down into a number of branches where the model decides whether to go to the left (L) or right (R) branch according to whether the value of a given feature is below or above a threshold value. So, for example, on the third line of the output we must decide whether to move to the left or right depending on whether feature 3 (PetalLength) is less than or greater than 4.8.
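The branching logic described above can be mimicked with a tiny hand-rolled tree. This is not how DecisionTree represents its models internally; the `Leaf` and `Node` types, the `classify` function and the leaf labels are hypothetical, and only the feature index and threshold are taken from the text. It also assumes that the left branch is taken when the value is below the threshold.

```julia
# A leaf carries a prediction; an internal node carries a test on one feature.
struct Leaf
    label::Bool
end

struct Node
    feature::Int               # index of the feature to test
    threshold::Float64         # split point
    left::Union{Node, Leaf}    # taken when the value is below the threshold
    right::Union{Node, Leaf}   # taken otherwise
end

classify(leaf::Leaf, x) = leaf.label
classify(node::Node, x) =
    x[node.feature] < node.threshold ? classify(node.left, x) : classify(node.right, x)

# Hypothetical one-split tree: feature 3 (PetalLength) against 4.8.
tree = Node(3, 4.8, Leaf(true), Leaf(false))

println(classify(tree, [6.0, 2.9, 4.5, 1.5]))  # PetalLength 4.5 is below 4.8, so go left
println(classify(tree, [6.3, 3.3, 6.0, 2.5]))  # PetalLength 6.0 is above 4.8, so go right
```

A real tree simply nests more `Node` values, so the same recursive walk applies at every level.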
We can then apply the decision tree model to the testing data and see how well it performs using standard metrics.
A true positive rate of 87.5% and true negative rate of 100% is not too bad at all!
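For anyone wanting to compute those two rates by hand, here is a minimal sketch in plain Julia. The `rates` function is a hypothetical helper rather than part of MLBase or DecisionTree, and the label vectors are made up, so the numbers it prints are not the ones quoted above.

```julia
# True positive rate: share of actual positives that were predicted positive.
# True negative rate: share of actual negatives that were predicted negative.
function rates(predicted::AbstractVector{Bool}, actual::AbstractVector{Bool})
    tp = count(predicted .& actual)
    tn = count(.!predicted .& .!actual)
    tpr = tp / count(actual)
    tnr = tn / count(.!actual)
    return (tpr = tpr, tnr = tnr)
end

# Made-up labels for illustration only.
actual    = [true, true, true, true, false, false, false, false]
predicted = [true, true, true, false, false, false, false, true]

r = rates(predicted, actual)
println("true positive rate: ", r.tpr)
println("true negative rate: ", r.tnr)
```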
A few other packages are also relevant to classification:
- SVM (support vector machines);
- kNN (k-nearest neighbours);
- GradientBoost (gradient boosting);
- XGBoost (extreme gradient boosting);
- Orchestra (ensemble learning).
Definitely worth checking out if you have the time. My time is up though. Come back soon to hear about what Julia provides for evolutionary programming.