## Comrades Marathon: Negative Splits and Cheating

With this year’s Comrades Marathon just less than a month away, I was reminded of a story from earlier in the year. Mark Dowdeswell, a statistician at Wits University, found evidence of cheating by some middle and back of the pack Comrades runners. He identified a group of 20 athletes who had suspicious negative splits: they ran much faster in the second half of the race. There was one runner in particular whose splits were just too good to be true. When the story was publicised, this particular runner claimed that it was a conspiracy.

This story emerged in February this year.

There was quite a fuss.

And then everything went quiet. The suspected runners were instructed to attend disciplinary hearings, but the outcomes of these hearings have not been publicised nor have the names of the suspected runners been released.

I have done some previous analyses using on Comrades Marathon data. Here I am going to use the same data set to explore these suspicious negative splits.

## Data Preparation

I started off by extracting a subset of the columns from my splits data.

The resulting records have fields for the year, athlete’s race number, a unique key identifying the runner, and time taken (in minutes) to reach the little town of Drummond (the half way point at around the marathon distance) and the finish. We will only keep the complete records (valid entries for both half way and the full distance) and then add a new field.

The ratio field is a number between -1 and 1 which quantifies the time difference between first and second halves of the race. So, for example, if a runner took 4.5 hours for the first half and then 5.0 hours for the second half, his ratio would be 0.11111, indicating that he ran around 11% slower in the second half of the race.

Conversely, if a runner took 5.0 hours for the first half and then finished the second half in 4.5 hours, his ratio would be -0.1, indicating that he ran about 10% faster in the second half.

Negative values of this ratio then indicate negative splits, while positive values are for positive splits and a value of exactly zero would be for even splits (same time for both halves of the race). Let’s look at the two extremes.

Large (positive) values of the split ratio mean that a runner ran the second half much slower than the first half. Unless the time for the first half is unrealistic, then these are not suspicious: it is quite reasonable that a runner should go out really hard in the first half, get to half way in good time but then find that the wheels fall off in the second half of the race. Take, for example, the runner with key 2c5ad823, whose time for the first half was blisteringly fast (just less than three hours) but who slowed down a lot in the second half, only finishing the race in around 11 hours.

At the other end of the spectrum we have runners with very low values of the split ratio, meaning that they ran the second half much faster than the first half. Take, for example, the runner with key 1a605ce5: she ran the first half in around five and a half hours but whipped through the second half in less than three hours. Seems a little odd, right?

Note that one runner (key 3c0ea3bc) crops up twice in the top 6 negative split ratios above. More about him later.

## Some Plots

Let’s have a look at the empirical distribution of split ratios.

We can see that only a very small fraction of the field achieves a negative split. And that these runners generally only shave a few percent off their first half times. The dashed lines on the plot indicate the extreme values of the split ratio. Both of these are a long way from the body of the distribution. In statistical terms, either of these extremes is highly improbable.

If we categorise the runners broadly by the number of hours required to finish the race then we get a slightly different view of the data.

Runners who finish the race in less than 6 hours (in the “5 hour” bin above, which includes the race winner) have split ratios between -0.061526 and 0.24595. The 8 hour bin has ratios which range from -0.563642 to 1.43530. So there was a runner in this group who was twice as fast in the second half… The 9 and 10 hour bins also have some inordinately large negative splits.

What about the distribution of splits in each of these categories?

Now that paints an interesting picture. We can clearly see that in the 5 hour bin quite a significant proportion of the elite runners manage to achieve negative splits. The proportion in all the other bins is appreciably smaller, yet the extreme negative splits are very much larger!

Note that the density curve for the 5 hour bin extends slightly beyond the dashed line indicating the smallest value in this group. This is an artifact of the kernel density method used to create these curves, for which there is a trade off between the smoothness of the curve and the fidelity of the curve to the data. With a smoother curve the data are effectively smeared out more.

We can quantify those proportions.

So, 14.3% of the runners in the 5 hour bin shave off some time in the second half of the race. In the other bins only around 2% to 3% of runners manage to achieve this feat.

Finally, before we dig into the details of some individual runners, let’s see how things vary from year to year.

These data are more or less consistent between years. The median of the ratio is around 10% to 20%; the maximum is always roughly 100% or more; the minimum fluctuates rather wildly, extending from the credible -9.7% all the way down to the incredible -56.4%

## Individual Runners

We are going to focus our attention on those runners with suspiciously large negative splits. These have been identified on the plot below as those with ratios less than -15% (that is, to the left of the dotted line). The threshold at -15% is somewhat arbitrary, but is certainly conservative.

We extract only those records with ratios less than -15% and discard fields (like race number) to enforce a degree of anonymity. We will also add in a field to indicate how many times a runner appears in the list.

That’s interesting, only one runner (the same guy with key 3c0ea3bc) appears twice.

We can take a look at the recent race history for these runners.

For a number of these runners there are only splits data for a few years, so it’s quite difficult to say anything conclusive. The negative split achieved by 1a605ce5 in 2001 looks pretty extreme though… Others runners, like 4d5a86d7, 9f83c1a5 and fce308d5 have a high degree of variability in both their first and second half times, so again it is difficult to spot an anomaly with certainty.

Let’s have a good look at 3c0ea3bc though. He has run the race consistently from 1991 to 2013. He did not finish in 1991 or 1997, but in the other years has managed to rack up 11 Bronze medals and 9 Vic Clapham medals, and in the process earned a double green number. The plot shows that his time to half way has been gradually increasing over the years. Not surprising since we all slow down with age. His finish time has mostly followed the same trend. Except for two major hiccups in 2009 and 2013. It’s hard to say for certain that these unusual negative splits were the result of cheating. But, equally, it’s hard to imagine how else they might have happened.

Here are the splits data for 3c0ea3bc:

So he was not recorded by either of the timing mats at Camperdown or Polly Shortts. It is well known that these mats are not perfect and sometimes they do miss runners. However, the missing splits at these mats plus the extraordinary time for the second half of the race are rather condemning.

I wonder what happened with those disciplinary hearings?