#MonthOfJulia Day 29: Distances
Today we’ll be looking at the Distances package, which implements a range of distance metrics. This might seem a rather obscure topic, but distance calculation is at the core of all clustering techniques (which are next on the agenda), so it’s prudent to know a little about how they work.
Note that there is a Distance package as well (singular!), which was deprecated in favour of the Distances package. So please install and load the latter.
We’ll start by finding the distance between a pair of vectors.
A simple application of Pythagoras' Theorem will tell you that the Euclidean distance between the tips of those vectors is 3. We can confirm our maths with Julia though. The general form of a distance calculation uses evaluate(), where the first argument is a distance type. Common distance metrics (like Euclidean distance) also come with convenience functions.
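To make that concrete, here is a minimal sketch. The vectors x and y are assumed values, chosen only because the Euclidean distance between them works out to 3; they are not necessarily the ones used originally.

```julia
using Distances

# Illustrative vectors (assumed values): the differences are (2, -1, -2),
# so the Euclidean distance is sqrt(4 + 1 + 4) = 3.
x = [1.0, 2.0, 3.0]
y = [-1.0, 3.0, 5.0]

# General form: the first argument is an instance of a distance type.
d_general = evaluate(Euclidean(), x, y)

# Convenience function for the same metric.
d_short = euclidean(x, y)

println(d_general)   # 3.0
println(d_short)     # 3.0
```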
We can just as easily calculate other metrics like the city block (or Manhattan), cosine or Chebyshev distances.
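A short sketch of those other metrics, again on an assumed pair of vectors. The calls below use the convenience functions; each has an equivalent evaluate() form with the matching distance type (Cityblock(), CosineDist(), Chebyshev()).

```julia
using Distances

# Same illustrative vectors as before (assumed values).
x = [1.0, 2.0, 3.0]
y = [-1.0, 3.0, 5.0]

# City block (Manhattan): sum of absolute differences, 2 + 1 + 2.
d_city = cityblock(x, y)

# Cosine distance: one minus the cosine of the angle between x and y.
d_cos = cosine_dist(x, y)

# Chebyshev: largest absolute difference across components.
d_cheb = chebyshev(x, y)

println(d_city)   # 5.0
println(d_cheb)   # 2.0
println(d_cos)
```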
Moving on to distances between the columns of matrices. Again we’ll define a pair of matrices for illustration.
With colwise(), distances are calculated between corresponding columns in the two matrices. If one of the matrices has only a single column (see the example with Chebyshev() below) then the distance is calculated between that column and all columns in the other matrix.
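Here is a small sketch of both cases. The matrices X and Y are made-up values for illustration, and the single column is passed as a plain vector, which is one way (assumed here) of expressing the "single column" case.

```julia
using Distances

# Two illustrative 2x3 matrices (assumed values); each column is a point.
X = [1.0 2.0 3.0;
     4.0 5.0 6.0]
Y = [1.0 2.0 5.0;
     4.0 9.0 6.0]

# Distance between column i of X and column i of Y, for each i.
d_cols = colwise(Euclidean(), X, Y)

# A single column against every column of X.
v = [1.0, 4.0]
d_one = colwise(Chebyshev(), X, v)

println(d_cols)   # [0.0, 4.0, 2.0]
println(d_one)    # [0.0, 1.0, 2.0]
```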
We also have the option of using pairwise(), which gives the distances between all pairs of columns from the two matrices. This is precisely the distance matrix that we would use for a cluster analysis.
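A sketch of pairwise() on the same kind of made-up matrices. Recent versions of Distances expect the dims keyword to say whether points are stored in rows or columns, so dims=2 is passed explicitly here; older versions assumed columns. The weights in the last call are hypothetical.

```julia
using Distances

# Illustrative 2x3 matrices (assumed values); each column is a point.
X = [1.0 2.0 3.0;
     4.0 5.0 6.0]
Y = [1.0 2.0 5.0;
     4.0 9.0 6.0]

# D[i, j] is the distance between column i of X and column j of Y.
D = pairwise(Euclidean(), X, Y, dims=2)

# The same idea with a weighted metric: one (hypothetical) weight per row.
W = pairwise(WeightedCityblock([1.0, 2.0]), X, Y, dims=2)

println(size(D))   # (3, 3)
println(D)
println(W)
```

The diagonal of D matches what colwise() returns for the same inputs, since those are the distances between corresponding columns.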
As you might have observed from the last example above, it’s also possible to calculate weighted versions of some of the metrics.
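For a pair of vectors, a weighted metric might be sketched as follows. The weights w are hypothetical, and the manual calculation is included only to show what the weighting does: each squared difference is scaled by its weight before summing.

```julia
using Distances

# Illustrative vectors and hypothetical weights (one weight per component).
x = [1.0, 2.0, 3.0]
y = [-1.0, 3.0, 5.0]
w = [0.5, 1.0, 2.0]

# Weighted Euclidean distance via the distance type ...
d_type = evaluate(WeightedEuclidean(w), x, y)

# ... and via the convenience function.
d_fun = weuclidean(x, y, w)

# The same quantity written out by hand.
d_manual = sqrt(sum(w .* (x .- y) .^ 2))

println(d_type)
println(d_manual)
```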
Finally a less contrived example. We’ll look at the distances between observations in the iris data set. We first need to extract only the numeric component of each record and then transpose the resulting matrix so that observations become columns (rather than rows).
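One way those steps might look is sketched below. It assumes the iris data is loaded through the RDatasets package, which is not named above, and that the four measurements sit in the first four columns of the resulting data frame.

```julia
using Distances
using RDatasets   # assumption: iris is loaded from RDatasets

iris = dataset("datasets", "iris")

# Keep only the four numeric columns, dropping the species label,
# then transpose so that each observation becomes a column.
features = permutedims(Matrix(iris[:, 1:4]))

# Distance between every pair of observations.
D = pairwise(Euclidean(), features, dims=2)

println(size(features))   # (4, 150)
println(size(D))          # (150, 150)
```

Calling pairwise() with a single matrix gives the distances among its own columns, so D is symmetric with zeros on the diagonal.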
The full distance matrix is illustrated below as a heatmap using Plotly. Note the clearly defined blocks for each of the iris species: setosa, versicolor, and virginica.
Tomorrow we’ll be back to look at clustering in Julia.