# satRday Cape Town: Call for Submissions

satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa.

## Keynotes and Workshops

We have a trio of fantastic keynote speakers: Hilary Parker, Jennifer Bryan and Julia Silge, who’ll be dazzling you on the day as well as presenting workshops on the two days prior to the satRday.

## Call for Submissions

We’re accepting submissions in four categories:

• Workshop [90 min],
• Lightning Talk [5 min],
• Standard Talk [20 min] and
• Poster.

Submit your proposals here. The deadline is 16 December, but there’s no reason to wait for the last minute: send us your ideas right now so that we can add you to the killer programme.

## Registration

Register for the conference and workshops here. The tickets are affordable and you’re going to get extraordinary value for money.

# Fitting a Statistical Distribution to Sampled Data

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest.

Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. So I had a look at the tools available in R for addressing this problem. The fitdistrplus package seemed like a good option. Here’s a sample workflow.

## Create some Data

To have something to work with, generate 1000 samples from a log-normal distribution.

> N <- 1000
>
> set.seed(37)
> #
> x <- rlnorm(N, meanlog = 0, sdlog = 0.5)


## Skewness-Kurtosis Plot

Load up the package and generate a skewness-kurtosis plot.

> library(fitdistrplus)
>
> descdist(x)
summary statistics
------
min:  0.2391517   max:  6.735326
median:  0.9831923
mean:  1.128276
estimated sd:  0.6239416
estimated skewness:  2.137708
estimated kurtosis:  12.91741


There’s nothing magical in those summary statistics, but the plot is most revealing. The data are represented by the blue point. Various distributions are represented by symbols, lines and shaded areas.

We can see that our data point is close to the log-normal curve (no surprises there!), which indicates that it is the most likely distribution.

We don’t need to take this at face value though because we can fit a few distributions and compare the results.

## Fitting Distributions

We’ll start out by fitting a log-normal distribution using fitdist().

> fit.lnorm = fitdist(x, "lnorm")
> fit.lnorm
Fitting of the distribution ' lnorm ' by maximum likelihood
Parameters:
estimate Std. Error
meanlog -0.009199794 0.01606564
sdlog    0.508040297 0.01135993
> plot(fit.lnorm)


The quantile-quantile plot indicates that, as expected, a log-normal distribution gives a pretty good representation of our data. We can compare this to the results of fitting a normal distribution, where we see that there is significant divergence of the tails of the quantile-quantile plot.

## Comparing Distributions

If we fit a selection of plausible distributions then we can objectively evaluate the quality of those fits.

> fit.metrics <- lapply(ls(pattern = "fit\\."), function(variable) {
+   fit = get(variable, envir = .GlobalEnv)
+   with(fit, data.frame(name = variable, aic, loglik))
+ })
> do.call(rbind, fit.metrics)
name      aic     loglik
1   fit.exp 2243.382 -1120.6909
2 fit.gamma 1517.887  -756.9436
3 fit.lnorm 1469.088  -732.5442
4 fit.logis 1737.104  -866.5520
5  fit.norm 1897.480  -946.7398


According to these data the log-normal distribution is the optimal fit: smallest AIC and largest log-likelihood.

Of course, with real (as opposed to simulated) data, the situation will probably not be as clear cut. But with these tools it’s generally possible to select an appropriate distribution and derive appropriate parameters.

# Rafal Lukawiecki – Putting Science into the Business of Data Science

A talk by Rafal Lukawiecki at the Microsoft Machine Learning & Data Science Summit (2016).

Data science relies on the scientific method of reasoning to help make business decisions based on analytics. Let Rafal explain how his customers apply the trusted processes and the principles of hypothesis testing with machine learning and statistics towards solving their day-to-day, practical business problems. Rafal will speak from his 10 years of experience in data mining and statistics, using the Microsoft data platform for high-value customer identification, recommendation and gap analysis, customer paths and acquisition modelling, price optimization and other forms of advanced analytics. He will share how this work leads to measurable results that not only make his customers more successful but also keep the data scientist true to his principles. This session will also help you understand how to start and justify data science projects in ways that business decision makers find attractive, and how to do it all with the help of the Microsoft platform.

# International Open Data Day

As part of International Open Data Day we spent the morning with a bunch of like minded people poring over some open Census South Africa data. Excellent initiative, @opendatadurban, I’m very excited to see where this is all going and look forward to contributing to the journey!

The data above show the distribution of ages in a segment of the South African population who have either no schooling (blue) or have completed Grade 12 (orange). Click the image to access the interactive plot. Being a subset of the complete data set, this does not tell the full story, but it’s still a rather interesting view on the state of education in South Africa.

# Durban Data Science Meetup

The next Durban Data Science Meetup will be taking place on 4 February. Check out the programme and come along if you’re in the area. Send enquiries to Andrew.

# Kaggle: Santa’s Stolen Sleigh

This morning I read Wendy Kan’s interesting post on Creating Santa’s Stolen Sleigh. I hadn’t really thought too much about the process of constructing an optimisation competition, but Wendy gave some interesting insights on the considerations involved in designing a competition which was both fun and challenging but still computationally feasible without military grade hardware.

This seems like an opportune time to jot down some of my personal notes and also take a look at the results. I know that this sort of discussion is normally the prerogative of the winners and I trust that my ramblings won’t be viewed as presumptuous.

Santa’s Stolen Sleigh closed on 8 January 2016. At the top of the leaderboard:

Well done, what a phenomenal effort! Check out the interview with Woshialex & Weezy.

## Problem Summary

The problem could be summarised as follows: 100 000 Christmas gifts have to be delivered from the North Pole to various locations across the Earth while minimising the Weighted Reindeer Weariness, a metric depending on weight carried and distance travelled. Distances were to be calculated using the Haversine formula which takes into account the curvature of the Earth’s quasi-spherical surface. Each gift had a weight between 1 and 50 (units unspecified). In addition the sleigh itself had a weight of 10 (same unspecified and obviously rather magical units) and a weight capacity of 1000.

… when the sleigh weight was too small, there’s no incentive to merge any trips.Wendy Kan

The problem could be decomposed into two parts:

1. partition the gifts into trips, where the total weight of gifts to be delivered on any particular trip must satisfy the weight constraint of the sleigh; and
2. choose the order in which the gifts should be delivered on each trip.

A solution would jointly solve both parts in such a way as to minimise Weighted Reindeer Weariness. It’s a variation of the Travelling Salesman Problem. In the course of doing some preliminary research I had a look at a couple of videos (video one and video two), which were definitely worth watching for some background information and ideas.

The competition was sponsored by FICO, who suggested a Mixed Integer Programming formulation for the problem. FICO made their Xpress Optimization Suite available to participants.

## Exploratory Analysis

The first thing I did was to take a look at the distribution of gifts across the planet. I found that they were more or less evenly distributed across all the continents. Who would have guessed that so many good children lived in Antarctica? Perhaps there’s limited scope to misbehave down there?

You better watch out
You better not cry
Better not pout
I’m telling you why
Santa Claus is coming to town.
Santa Claus Is Coming To Town

An obvious approach to the problem was to cluster the gifts according to their proximity and then deliver these clusters of gifts on a single trip. But this approach posed a problem: generally the k-means algorithm is implemented using the Euclidean distance, which makes sense on a plane but not the surface of a sphere. Of course, a distance matrix could be calculated using the Haversine formula but this would be a pretty enormous matrix, certainly too large for my hardware.

I could, however, readily calculate the distance between each gift and its nearest neighbour, the distribution of which is displayed below. The majority of gifts are located less than 20 km from their nearest neighbour. Some are much closer, within a few hundred metres of each other. Others are significantly more distant. The summary data below indicates that the most isolated location is 2783 km from its nearest neighbour. Obviously the horizontal axis of the plot has been aggressively truncated!

> summary(gifts\$dist_nearest)
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.1424   11.6500   18.1300   19.6200   25.9400 2783.0000


Let’s look at the most isolated locations. The red points on the map below indicate gift locations which are at least 150 km distant from any other locations. Some of these are on extremely isolated pieces of land like Saint Helena, Coronation Island and the French Southern and Antarctic Lands. In terms of clustering, these points were going to be a little tricky and probably best delivered en route to somewhere else.

## Clustering and Sorting

My first approach to solving the problem involved a simple clustering scheme. My gut feel was that in a tight cluster of gifts it would be important to deliver the heaviest ones first. I tried applying this simple heuristic to my clusters and the results were not particularly good. I then used a Genetic Algorithm (GA) to optimise the order of delivery within each cluster. The table below shows gifts allocated to two distinct trips (identified by TripId) with the order of delivery determined by the GA. These data confirmed that the best strategy was not simply to deliver the heaviest gifts first. Useful (and obvious in retrospect!).

   GiftId TripId Latitude  Longitude    Weight
1   72834      1 52.08236  -99.84075 19.450129
2   31544      1 51.99433  -99.66718 16.401965
3   13514      1 51.54432 -100.80105 17.188096
4    3711      1 51.42292  -99.61607  5.959066
5   69035      1 51.42310  -99.21229  1.000000
6   50071      2 52.14381 -100.39226 10.185356
7   75708      2 52.05063 -100.09734 11.833779
8   90582      2 51.57074 -100.11467 25.335999
9   29591      2 50.94006  -98.51085 50.000000
10  94799      2 51.40380 -100.10319 10.719852
11  64255      2 51.39730  -99.99052  1.000000
12  31418      2 50.98414 -100.38163  1.000000
13   6254      2 51.37832 -100.42426  1.000000
14  40247      2 51.18464 -100.77533  9.352351
15  77396      2 51.27686 -100.77209  2.605262
16   2600      2 50.82170  -99.46544  1.000000


My submission based on simple clustering with GA optimisation was fairly decent but nothing to tweet about.

## Great Circle Solution

Thinking further about the problem it occurred to me that it might be efficient to deliver the gifts along paths that are as straight as possible, with minimal deviations. How about attempting to deliver the gifts along a Great Circle arc? For paths originating at the pole this would be equivalent to moving out and back close to a single line of longitude. It would certainly improve one aspect of Weighted Reindeer Weariness: the distance covered.

The minor arc of a great circle between two points is the shortest surface-path between them.Great Circle, Wikipedia

Putting this idea into practice and dividing the gifts up into longitudinal slices lead to an appreciable bump in my score.

My final innovation was to apply the same longitudinal slicing strategy but treat Antarctica separately. This again lead to an immediate improvement. I tried splitting off other areas of the globe, like Australia, but this didn’t yield positive results.

So in the end my clustering had very little to do with gifts that were close together but instead grouped gifts which lay close to a meridional path originating at the North Pole. Definitely not the ideal approach, but one based on good physical principles and yielding a solution which was both robust and quick to compute.

I then shifted my attention to optimising the order of delivery within each quasi-meridional group. I started off with a GA, which worked pretty well but seemed a bit pedestrian in producing innovation. I then tried 2-opt as an alternative but found that it didn’t make too much difference. Perhaps this was due to the linear arrangement of gifts? In the end I returned to the GA and just let that grind away until the competition closed, in the process trimming my score down to 12482212142. Progress was slow…

Below are some of the delivery routes designed using this approach. The size of the dots is proportional to weight and the dashed lines indicate the order of delivery. The first panel is an example in which it was optimal to simply deliver the gifts according to descending latitude. This worked well because all of the gifts lay very close to a straight line and no major lateral deviations were required. This ordering was, in fact, the starting point for my GA optimisation. The remaining panels show examples where it was more efficient to delay delivery of some of the gifts until the return journey. Notably, these were gifts with lower weight, where a relatively small penalty was incurred by delaying their delivery, but much was gained by delivering other gifts first.

Here are some examples of delivery schedules for gifts going to Antarctica. The range of longitudes spanned by each of these trips is larger than for the trips only going to higher latitudes, but this makes sense: once you have transported the gifts over the Equator and across the Southern Ocean, most of the travelling has already been done, so a bit of latitudinal distance does not have too much impact.

Okay, that’s enough about my attempts, let’s take a quick look at the leaderboard.

It’s apparent that the winner’s final solution was a significant improvement on his initial submissions. In fact, it’s interesting to see that his first three scores were 29121011015, 29120957549 and 29120751717 (all of which were a long way from the money at that stage of the competition) before plummeting to 13442292734. A series of smaller improvements whittled his score down to the final winning value of 12384507107.

On average competitors made around 7 submissions. Of the 1127 competitors, 275 made only a single submission (this was also the most common submission count). The relationship between submission count and best score is plotted below. Note that the submission counts here can differ from those reflected on the leaderboard: only submissions which resulted in an improved score are included below, whereas the leaderboard reflects the total number of submissions made, irrespective of whether they resulted in an improvement. It’s evident that those who made more submissions generally ended up with better final scores. There’s a couple of anomalies at 26 and 39 submissions caused by a pair of competitors who showed remarkable persistence.

Finally, looking at the number of submissions per day, we see that the rate is more or less constant (averaging around 200 per day) up until the final day of the competition when things spiked appreciably up to 510 submissions. Of those final day entries, 128 were submitted in the last hour and 5 in the final 30 seconds. The winner made his final submission around 8 minutes before the final bell.

Well, that’s about it. It was a really great competition. Fortunately there’ll be another similar one later on this year: yet another reason to look forward to Christmas.

# Kaggle: Walmart Trip Type Classification

Walmart Trip Type Classification was my first real foray into the world of Kaggle and I’m hooked. I previously dabbled in What’s Cooking but that was as part of a team and the team didn’t work out particularly well. As a learning experience the competition was second to none. My final entry put me at position 155 out of 1061 entries which, although not a stellar performance by any means, is just inside the top 15% and I’m pretty happy with that. Below are a few notes on the competition.

Before I get started, congratulations to the competitors at the top of the leaderboard! You guys killed it.

## Data Preparation

Getting to my final model was quite a process, with many moments of frustration, disappointment, enlightenment and exhilaration along the way.

The first step was to clean up the data. Generally it appeared to be in pretty good shape, with nothing extraordinary jumping out at me. There were some minor issues though, for example, department labels for both “MENSWEAR” and “MENS WEAR”, which needed to be consolidated into a single category.

My initial submission was a simple decision tree. This was an important part of the process for me because it established that I had a working analysis pipeline, a submission file in valid format, and a local value of the evaluation metric which was consistent with that on the public leaderboard. Submissions were gauged according to log loss and my first model scored 2.79291, quite a bit better than the Random Forest & Department Description benchmark at 5.77216, but still nowhere near competitive.

I then did a few very elementary things with the data and applied an XGBoost model, which resulted in a significant bump in model performance. I hadn’t worked with XGBoost before and I discovered what some of the hype’s about: even with the default parameters it produces an excellent model. It’s also blazing fast.

That result was a lot more respectable. Time to dig deeper.

## Feature Engineering

I realised that I would only be able to go so far with the existing features. Time to engineer some new ones. After taking a cursory look at the data, a few obvious options emerged:

• the number of departments visited;
• the total number of items bought (PositiveSum) or returned (NegativeSum);
• the net number of items bought (WholeSum, being the difference between PositiveSum and NegativeSum);
• the number of departments from which items were bought (DepartmentCount); and
• groupings of various departments, giving new categories like Grooming (a combination of BEAUTY and PERSONAL_CARE), Baby (INFANT_APPAREL and INFANT_CONSUMABLE_HARDLINES) and Clothes (everything relating to clothing from BOYS_WEAR to PLUS_AND_MATERNITY).

Throwing those into the mix provided another, smaller improvement.

To see why these new features were effective, take a look at the data below which illustrates the clear distinction in the distribution of PositiveSum for trip types 39 and 40.

Below are the relative feature importances generated by one of my models. It’s evident that both the WholeSum and PositiveSum (or its logarithm) were important. Clothes and financial services also featured highly.

I discovered something interesting while trolling the bottom end of the leaderboard page: you can download statistics for all competition entries. The data are presented as a CSV file. Here’s the head.

TeamId,TeamName,SubmissionDate,Score
230879,HulkBulk,"2015-10-26 18:58:32",34.53878
230879,HulkBulk,"2015-10-26 19:49:31",10.42797
230879,HulkBulk,"2015-10-26 20:03:20",7.90711
230907,"Bojan Tunguz","2015-10-26 20:12:06",34.53878
230940,"Paul H","2015-10-26 21:56:17",34.53878
230942,NxGTR,"2015-10-26 22:06:44",34.53878
230945,Chippy,"2015-10-26 22:14:40",3.44965
230940,"Paul H","2015-10-26 22:16:57",32.29692


Let’s first look at the distribution of best and worst scores per competitor. The histogram below shows a peak in both best and worst scores around the “All Zeros Benchmark” at 34.53878. The majority of the field ultimately achieved best scores below 5.

Scrutinising the distribution of best scores reveals a peak between 0.6 and 0.7. Only a small fraction of the competitors (6.3%) managed to push below the 0.6 boundary, leaving the elite few (0.6%) with final scores below 0.5.

        group count percent
(fctr) (int)   (dbl)
1  (0.45,0.5]     6  0.5655
2  (0.5,0.55]    21  1.9793
3  (0.55,0.6]    40  3.7700
4  (0.6,0.65]    93  8.7653
5  (0.65,0.7]    75  7.0688
6  (0.7,0.75]    39  3.6758
7  (0.75,0.8]    32  3.0160
8  (0.8,0.85]    31  2.9218
9  (0.85,0.9]    46  4.3355
10 (0.9,0.95]    21  1.9793


The scatter plot below shows the relationship between best and worst scores broken down by competitor.

Overplotting kind of kills that. Obviously a scatterplot is not the optimal way to visualise those data. A contour map offers a better view, yielding three distinct clusters: competitors who started off close to the “All Zeros Benchmark” and stayed there; ones who debuted near to the “All Zeros Benchmark” and subsequently improved dramatically and, finally, those whose initial entries were already substantially better than the “All Zeros Benchmark”.

Next I looked for a pattern in the best or worst submissions as a function of first submission date. There’s certainly evidence to suggest that many of the more competitive best scores were achieved by people who jumped onto this competition within the first week or so. Later in the competition there were more days when people joined the competition who would ultimately achieve poorer scores.

There’s less information to be gleaned from looking at the same data against the date of last submission. Throughout the competition there were final entries from competitors that were close to the “All Zeros Benchmark”. What happened to them? Were they discouraged or did they team up with other competitors?

The number of submissions per day remained consistently between 50 and 100 until the middle of December when it ramped up significantly, reaching a peak of 378 submissions on the last day of the competition. Of the entries on the final day, almost 20% were made during the last hour before the competition closed.

The box plot below indicates that there’s a relationship between the number of submissions and best score, with competitors who made more submissions generally ending up with a better final score. There are, of course, exceptions to this general rule. The winner (indicated by the orange square) made only 6 submissions, while the rest of the top ten competitors made between 19 and 83 entries.

Some of the competitors have posted their work on a source code thread in the Forums. There’ll be a lot to learn by browsing through that.

I’ve just finished the Santa’s Stolen Sleigh competition and I’ll be reporting on that in a few days time. Also working on a solution for the Homesite Quote Conversion competition, which is providing a new range of challenges.

# Making Sense of Logarithmic Loss

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted.

Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is basically equivalent to maximising the accuracy of the classifier, but there is a subtle twist which we’ll get to in a moment.

In order to calculate Log Loss the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as

$- \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{ij} \log \, p_{ij}$

where N is the number of samples or instances, M is the number of possible labels, $y_{ij}$ is a binary indicator of whether or not label j is the correct classification for instance i, and $p_{ij}$ is the model probability of assigning label j to instance i. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss. If there are only two classes then the expression above simplifies to

$- \frac{1}{N} \sum_{i=1}^N [y_{i} \log \, p_{i} + (1 - y_{i}) \log \, (1 - p_{i})].$

Note that for each instance only the term for the correct class actually contributes to the sum.

## Log Loss Function

Let’s consider a simple implementation of a Log Loss function:

> LogLossBinary = function(actual, predicted, eps = 1e-15) {
+   predicted = pmin(pmax(predicted, eps), 1-eps)
+   - (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))) / length(actual)
+ }


Suppose that we are training a binary classifier and consider an instance which is known to belong to the target class. We’ll have a look at the effect of various predictions for class membership probability.

> LogLossBinary(1, c(0.5))
[1] 0.69315
> LogLossBinary(1, c(0.9))
[1] 0.10536
> LogLossBinary(1, c(0.1))
[1] 2.3026


In the first case the classification is neutral: it assigns equal probability to both classes, resulting in a Log Loss of 0.69315. In the second case the classifier is relatively confident in the first class. Since this is the correct classification the Log Loss is reduced to 0.10536. The third case is an equally confident classification, but this time for the wrong class. The resulting Log Loss escalates to 2.3026. Relative to the neutral classification, being confident in the wrong class resulted in a far greater change in Log Loss. Obviously the amount by which Log Loss can decrease is constrained, while increases are unbounded.

## Looking Closer

Let’s take a closer look at this relationship. The plot below shows the Log Loss contribution from a single positive instance where the predicted probability ranges from 0 (the completely wrong prediction) to 1 (the correct prediction). It’s apparent from the gentle downward slope towards the right that the Log Loss gradually declines as the predicted probability improves. Moving in the opposite direction though, the Log Loss ramps up very rapidly as the predicted probability approaches 0. That’s the twist I mentioned earlier.

Log Loss heavily penalises classifiers that are confident about an incorrect classification. For example, if for a particular observation, the classifier assigns a very small probability to the correct class then the corresponding contribution to the Log Loss will be very large indeed. Naturally this is going to have a significant impact on the overall Log Loss for the classifier. The bottom line is that it’s better to be somewhat wrong than emphatically wrong. Of course it’s always better to be completely right, but that is seldom achievable in practice! There are at least two approaches to dealing with poor classifications:

1. Examine the problematic observations relative to the full data set. Are they simply outliers? In this case, remove them from the data and re-train the classifier.
2. Consider smoothing the predicted probabilities using, for example, Laplace Smoothing. This will result in a less “certain” classifier and might improve the overall Log Loss.

## Code Support for Log Loss

Using Log Loss in your models is relatively simple. XGBoost has logloss and mlogloss options for the eval_metric parameter, which allow you to optimise your model with respect to binary and multiclass Log Loss respectively. Both metrics are available in caret‘s train() function as well. The Metrics package also implements a number of Machine Learning metrics including Log Loss.

# 2015 Data Science Salary Survey

The recently published 2015 Data Science Salary Survey conducted by O’Reilly takes a look at the salaries received, tools used and other interesting facts about Data Scientists around the World. It’s based on a survey of over 600 respondents from a variety of industries. The entire report is well worth a read, but I’ve picked out some highlights below.

The majority (67%) of the respondents in the survey were from the United States. They also demanded the highest median salaries across the globe. At the other end of the spectrum (and relevant to me personally), only 1% of the respondents were from Africa. These represented only one country: South Africa. The lowest salaries overall were recorded in Africa, while the lowest median salaries were found in Latin America.

In aggregate there’s a fairly broad distribution of Data Science salaries, with some earning more than 240k USD per annum. The majority are taking home around 80k USD.

Partitioning the data according to industry they found that there’s a reasonable degree of variability, with the highest salaries being paid in the Search and Social Networking space and the lowest in the Cloud Services, Hosting and Content Delivery Network industries. Most industries offer median salaries between 60k USD and 120k USD, with the Non-Profit and Trade Association sector being the only exception, falling just short of this range. The highest median salaries are found in the Hardware sector.

The majority (24%) of respondents were nominally Data Scientists, with median salary of around 100k USD. The highest paid were the 7% of respondents in Upper Management, while the lowest pay went to the 3% who were Students. Don’t worry guys, your time will come!

There’s a clear relationship between remuneration and the number of hours worked per week. As one would expect, those who work more get paid more. This relationship holds all the way up to those who work more than 60 hours per week, where there is a dip in median salary. Seems to me that there might be a hint of exploitation there!

The most commonly used tools, in order of decreasing importance, are SQL, Excel, Python and R. These four tools have been on top of the heap for three years running. I’m honestly a little surprised that Excel still has such a large market share. Scala is down in position 26, yet of these commonly used tools it’s associated with the highest median salary. Interestingly (and not too surprisingly) some of the more niche tools command higher salaries.

With respect to age, those in the 46-50 years group are getting paid the most, although they represent only 6% of the respondents. These guys are obviously at the peak of their profession. The most populous age group is 31-35 years. They’re earning significantly less than their more senior colleagues, but the escalation in salaries with age should give them hope for a prosperous future.

Finally, in terms of gender, roughly 80% of the respondents were male and are being paid significantly more than their female counterparts (when all other variables are accounted for). Not cool, especially given that some of the most talented Data Scientists I know are female!

For more details, have a look at the complete survey: it’s well worth a read!