Tag Archives: Comrades Marathon

Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an "average runner" myself, I will not deny that I am personally getting older. But what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you're interested in the technical details of the analysis, these are included at the end; otherwise read on for the results.


The histograms below show graphically how the distribution of runners' ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.


Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% prediction interval.

Prior to 1990 the average age for both genders was around 35 and varied somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year... Since 1990, though, there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.



Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?


As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

> head(results)
       key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

> library(plyr)
> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729
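For readers without plyr, the same kind of grouped summary can be sketched in base R. This is only an illustration on a made-up data frame (`toy` and its values are hypothetical, not the race data), not the code used for the analysis.

```r
# Toy stand-in for the results data frame (values are made up for illustration).
toy <- data.frame(
  year   = c(1980, 1980, 1980, 1980, 1981, 1981),
  gender = c("Female", "Female", "Male", "Male", "Female", "Male"),
  age    = c(30, 40, 25, NA, 33, 36)
)

# One row per (year, gender) group with the mean age of that group.
# The formula interface drops rows with a missing age by default,
# which plays the same role as na.rm = TRUE in the ddply() call.
toy.statistics <- aggregate(age ~ year + gender, data = toy, FUN = mean)

# Sort by year and then gender, like the ddply() output.
toy.statistics <- toy.statistics[order(toy.statistics$year, toy.statistics$gender), ]
print(toy.statistics)
```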

The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)
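It may help to see which terms each formula actually generates. The sketch below uses a tiny hypothetical data frame (`toy`) and `model.matrix()` to list the columns of the design matrix; the column names line up with the coefficient names in the model summaries.

```r
# Hypothetical miniature data set: only the structure matters here.
toy <- data.frame(year   = c(1980, 1980, 2013, 2013),
                  gender = factor(c("Female", "Male", "Female", "Male")))

# Columns of the design matrix for the three formulas that involve gender.
cols.2 <- colnames(model.matrix(~ year + year:gender, data = toy))
cols.3 <- colnames(model.matrix(~ year + gender, data = toy))
cols.4 <- colnames(model.matrix(~ year + year * gender, data = toy))

print(cols.2)  # slope differs by gender, common intercept
print(cols.3)  # intercept differs by gender, common slope
print(cols.4)  # year * gender expands to year + gender + year:gender
```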

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

> summary(fit.1)

lm(formula = mean.age ~ year, data = year.statistics)

    Min      1Q  Median      3Q     Max 
-1.3181 -0.5322 -0.0118  0.4971  1.9897 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924,	Adjusted R-squared:  0.923 
F-statistic:  801 on 1 and 66 DF,  p-value: <2e-16

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly larger for males than for females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

> summary(fit.2)

lm(formula = mean.age ~ year + year:gender, data = year.statistics)

   Min     1Q Median     3Q    Max 
-1.103 -0.522  0.024  0.388  2.287 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  481 on 2 and 65 DF,  p-value: <2e-16
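The figure of roughly 0.6 extra years for men follows from multiplying the interaction coefficient by the year. A quick check, using the rounded estimate printed in the summary above:

```r
# year:genderMale estimate as printed in summary(fit.2) (rounded).
interaction.coef <- 3.00e-04

# The contribution to the average age for men is coefficient * year.
years <- c(1980, 2000, 2013)
extra.age.men <- interaction.coef * years
print(data.frame(year = years, extra.age.men = extra.age.men))
```

Across the range of the data the contribution barely changes, which is why this model is so hard to distinguish from one with a constant offset.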

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

> summary(fit.3)

lm(formula = mean.age ~ year + gender, data = year.statistics)

    Min      1Q  Median      3Q     Max 
-1.1038 -0.5225  0.0259  0.3866  2.2885 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  480 on 2 and 65 DF,  p-value: <2e-16

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data does not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

> summary(fit.4)

lm(formula = mean.age ~ year + year * gender, data = year.statistics)

    Min      1Q  Median      3Q     Max 
-1.0730 -0.5127 -0.0492  0.4225  2.1273 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26    
year:genderMale    0.0195     0.0168    1.17     0.25    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938,	Adjusted R-squared:  0.935 
F-statistic:  322 on 3 and 64 DF,  p-value: <2e-16

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
  Res.Df  RSS Df Sum of Sq     F  Pr(>F)    
1     66 36.2                               
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02                  
4     64 29.5  1      0.62  1.36 0.24833    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
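To make the F statistic in the table less of a black box, here is a sketch of how it arises when two nested models are compared. The data are simulated (the coefficients and noise level are assumptions chosen to loosely resemble the fits above), so the numbers will not match the table.

```r
set.seed(1)

# Simulated annual averages, one row per year and gender (not the real data).
toy <- data.frame(year   = rep(1980:2013, each = 2),
                  gender = rep(c("Female", "Male"), times = 34))
toy$mean.age <- -480 + 0.259 * toy$year + 0.6 * (toy$gender == "Male") +
  rnorm(nrow(toy), sd = 0.7)

m1 <- lm(mean.age ~ year, data = toy)
m2 <- lm(mean.age ~ year + gender, data = toy)
tab <- anova(m1, m2)

# F = (drop in residual sum of squares per extra parameter) /
#     (residual mean square of the larger model)
rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)
f.manual <- ((rss1 - rss2) / (df.residual(m1) - df.residual(m2))) /
  (rss2 / df.residual(m2))

print(c(manual = f.manual, anova = tab[2, "F"]))
```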

Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% prediction interval. This was used to generate the second set of plots.

fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))
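Note that `interval = "prediction"` is not the same thing as `interval = "confidence"`. The sketch below, on simulated data with assumed coefficients, shows that both give the same fitted values but that the prediction interval is wider, since it also allows for the scatter of individual observations around the line.

```r
set.seed(42)

# Simulated annual averages (not the real data).
toy <- data.frame(year = 1980:2013)
toy$mean.age <- -480 + 0.259 * toy$year + rnorm(nrow(toy), sd = 0.7)
toy.fit <- lm(mean.age ~ year, data = toy)

new.years <- data.frame(year = c(2013, 2020))
conf <- predict(toy.fit, new.years, level = 0.95, interval = "confidence")
pred <- predict(toy.fit, new.years, level = 0.95, interval = "prediction")

# Width of each interval at the two years.
conf.width <- conf[, "upr"] - conf[, "lwr"]
pred.width <- pred[, "upr"] - pred[, "lwr"]
print(cbind(year = new.years$year, conf.width, pred.width))
```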

Comrades Marathon Negative Splits: Cheat Strikes Again

It looks like one of the suspect runners from my previous posts cheated again in this year's Comrades Marathon. Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year's Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. So I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

My data for the 2013 up run gives times with one second precision, so these questions could be answered if I relaxed the constraints from "exactly the same time" to "within one second of each other". We'll call such simultaneous pairs of runners "twins" and simultaneous threesomes will be known as "tripods". How many twins are there? How many tripods? The answers are somewhat surprising. What's even more surprising is another category: "phantoms".

If you are not interested in the details of the analysis (and I'm guessing that you probably aren't), please skip forward to the pictures and analysis.

Looking at the Data

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

> simultaneous = subset(splits,
+                      year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
> simultaneous = simultaneous[complete.cases(simultaneous),]
> #
> rownames(simultaneous) = simultaneous$key
> simultaneous$key <- NULL
> head(simultaneous)
         drummond.time race.time
4bdcb291        320.15    712.42
4e488aab        294.65    656.90
ab59fc97        304.62    643.67
89d3e09b        270.32    646.78
fc728816        211.27    492.95
7b761740        274.60    584.37

Next we calculate the "distance" (this is a distance in time and not in space) between runners, which is the Euclidean distance between each pair of runners: the square root of the sum of the squared differences in their halfway and finish times. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.

> simultaneous = dist(simultaneous)
> library(reshape2)
> simultaneous = melt(as.matrix(simultaneous))
> head(simultaneous)
      Var1     Var2   value
1 4bdcb291 4bdcb291   0.000
2 4e488aab 4bdcb291  61.093
3 ab59fc97 4bdcb291  70.483
4 89d3e09b 4bdcb291  82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910
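As a sanity check on what `dist()` returns, the distance between the first two runners can be worked out by hand from the (rounded) times printed earlier. This is just an illustration; the small difference from the 61.093 in the output above is presumably down to rounding in the displayed times.

```r
# Halfway and finish times (minutes) for the first two runners shown above.
runners <- data.frame(
  drummond.time = c(320.15, 294.65),
  race.time     = c(712.42, 656.90),
  row.names     = c("4bdcb291", "4e488aab")
)

# Distance according to dist() ...
by.dist <- as.matrix(dist(runners))["4e488aab", "4bdcb291"]

# ... and by hand: square root of the sum of squared differences.
by.hand <- sqrt(sum((unlist(runners["4bdcb291", ]) - unlist(runners["4e488aab", ]))^2))

print(c(dist = by.dist, by.hand = by.hand))
```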

We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

> simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))

Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

> simultaneous = subset(simultaneous, value == 0)
> head(simultaneous)
            Var1     Var2 value
623174  5217dfc9 75a78d04     0
971958  d8c9c403 e6e0d6e3     0
2024105 2e8f7778 9acc46ee     0
2464116 5f18d86f 9a1697ff     0
2467712 63033429 9a1697ff     0
3538608 54a92b96 f574be97     0
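An alternative that avoids building the full pairwise matrix is to group runners directly on identical pairs of times. The sketch below is a hypothetical variation on the approach above, using made-up keys and times: groups of size two correspond to twins and groups of size three to tripods.

```r
# Made-up runners: a1/b2 share both times, d4/e5/f6 share both times.
toy <- data.frame(
  key           = c("a1", "b2", "c3", "d4", "e5", "f6"),
  drummond.time = c(300.10, 300.10, 280.50, 310.00, 310.00, 310.00),
  race.time     = c(650.20, 650.20, 600.00, 700.30, 700.30, 700.30)
)

# For every runner, count how many runners share exactly the same two times.
toy$group.size <- ave(seq_len(nrow(toy)), toy$drummond.time, toy$race.time,
                      FUN = length)

n.twins   <- sum(toy$group.size == 2) / 2   # number of pairs
n.tripods <- sum(toy$group.size == 3) / 3   # number of threesomes
print(toy)
```

This scales with the number of runners rather than the number of pairs, which matters when there are many thousands of finishers.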

We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:

> simultaneous = simultaneous[order(simultaneous$race.time),]
> head(simultaneous)[, c(4, 6, 8)]
    race.number.x race.number.y race.time
133         59235         56915  07:54:21
9           26132         23470  08:06:55
62          44008         31833  08:25:58
61          25035         36706  08:35:42
54          28868         25910  08:46:42
26          47703         31424  08:47:08
> tail(simultaneous)[, c(4, 6, 8)]
   race.number.x race.number.y race.time
71         54689         16554  11:55:59
60          8846         23003  11:56:26
44          9235         49251  11:56:47
38         53354         53352  11:56:56
28         19268         59916  11:57:49
20         22499         40754  11:58:26


As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.


Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club. Two of them are in the 30-39 age category and the third is in the 60+ group. There's a clear family resemblance, so I'm guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!


The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.


And now things get more interesting...

Phantom #1

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below.



Not only did they pass through halfway and the finish at the same time, but they crossed every mat along the route at precisely the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs...




You might notice that there is another runner with 26132 in all three of the images above. That's not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.


If you search the race photographs for the phantom's race number (23470), you will find that there are no pictures of him at all! That's right, nineteen photographs of 26132 and not a single photograph of 23470.

Phantom #2

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at precisely the same time.



Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.





I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.


Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.

Personalised Comrades Marathon Pacing Chart

Although I have been thinking vaguely about my Plan A goal of a Bill Rowan medal at the Comrades Marathon this year, I have not really put a rigorous pacing plan in place. I know from previous experience that I am likely to be quite a bit slower towards the end of the race. I also know that I am going to lose a few minutes at the start. So how fast does this mean I need to run in order to get from Pietermaritzburg to Durban in under 9 hours?

Well, suppose that it takes me 3 minutes from the gun to get across the starting line. And, furthermore, assume that I will be running around 5% slower towards the end of the race. To still get to Durban under 9 hours I would need to run at roughly 5:52 per km at the beginning and gradually ease back to about 6:11 per km towards the end.

I arrived at these figures using a pacing spreadsheet. To get an idea of your pace requirements you will need to specify your goal time, the number of minutes you anticipate losing before crossing the start line and an estimate of how much you think you will slow down during the course of the race. This is done by editing the blue fields indicated in the image below. The rest of the spreadsheet will update on the basis of your selections.


The spreadsheet uses a simple linear model which assumes that your pace will gradually decline at the rate you have specified. If you give 0% for your slowing down percentage then the calculations are performed on the basis of a uniform pace throughout the race. Of course, neither the linear model nor a uniform pace is truly realistic. We all know that our pace will vary continuously throughout the race as a function of congestion, topography, hydration, fatigue, motivation and all of the other factors which come into play. However, as noted by the eminent statistician George Box, "all models are wrong, but some are useful". In this case the linear model is a useful way to account for the effects of fatigue.
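For those who prefer code to spreadsheets, the idea can be sketched in a few lines of R. This is not the spreadsheet's actual formula: it assumes a race distance of 89 km and a pace (in minutes per km) that increases linearly with distance, and the function name `pace.plan` is made up for the illustration. Under those assumptions the results land close to the figures quoted above.

```r
# Required paces (minutes per km) for a goal time, assuming the pace
# increases linearly with distance from start.pace to
# start.pace * (1 + slowdown). The 89 km distance is an assumption.
pace.plan <- function(goal.minutes, start.delay, slowdown, distance.km = 89) {
  running.minutes <- goal.minutes - start.delay
  average.pace <- running.minutes / distance.km
  # With a linear change in pace, the average is the midpoint of start and end.
  start.pace <- average.pace / (1 + slowdown / 2)
  end.pace <- start.pace * (1 + slowdown)
  c(start = start.pace, average = average.pace, end = end.pace)
}

plan <- pace.plan(goal.minutes = 9 * 60, start.delay = 3, slowdown = 0.05)
print(round(plan, 2))
```

Setting `slowdown = 0` reduces this to a uniform pace, as in the spreadsheet.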

The spreadsheet will give you an indication of the splits (both relative to the start of the race as well as time of day) and pace (instantaneous and average) required to achieve your goal time. There is also a pair of charts which will be updated with your projected timing and pace information.


My plan on race day is to run according to my average pace. This works well because it smooths out all the perturbations associated with tables and walking breaks.

Losing Time at the Start

One interesting thing to play around with on the spreadsheet is the effect of losing time at the start. If you vary this number you should see that it really does not have a massive influence on your pacing requirements for the rest of the race. For example, if I change my estimate from 3 minutes to 10 minutes then my required average pace decreases from 6:02 per km to 5:57 per km. Sure, this amounts to 5 seconds shaved off every km, but it is not unmanageable: the delay at the start gets averaged out over the rest of the race.
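The arithmetic behind this is simple enough to check by hand. The sketch below assumes a race distance of 89 km (my assumption for the illustration; the helper names are made up), and spreads the available running time evenly over that distance.

```r
# Format a pace in decimal minutes per km as m:ss.
pace.label <- function(pace) {
  total.seconds <- round(pace * 60)
  sprintf("%d:%02d", as.integer(total.seconds %/% 60), as.integer(total.seconds %% 60))
}

# Average pace needed once the delay at the start has been subtracted.
# The default distance of 89 km is an assumption.
average.pace <- function(goal.minutes, start.delay, distance.km = 89) {
  (goal.minutes - start.delay) / distance.km
}

pace.3  <- pace.label(average.pace(9 * 60, 3))
pace.10 <- pace.label(average.pace(9 * 60, 10))
print(c("3 min delay" = pace.3, "10 min delay" = pace.10))
```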

Naturally, the faster you are hoping to finish the race, the more significant a delay at the start is going to become. However, if you are aiming for a really fast time then presumably you are in a good seeding batch. For the majority of runners it is probably not going to make an enormous difference and so it is not worth stressing about.

The important thing is to make sure that you just keep on moving forward. Don't stop. Just keep on putting one foot in front of the other.

Other Pacing Charts

The pacing chart by Dirk Cloete is based on the profile of the route. It breaks the route down into undulating, up and down sections and takes this into account when calculating splits. Don Oliver also has some static pacing charts.

Race Statistics for Comrades Marathon Novice Runners: Corrigendum

There was some significant bias in the histogram from my previous post: the data from all years were lumped together. This is important because as of 2003 (when the Vic Clapham medal was introduced) the final cutoff for the Comrades Marathon was extended from 11:00 to 12:00. In 2000 they also applied an extended cutoff.

I have consequently partitioned the data according to "strict" and "extended" cutoffs.

> novices$extended = factor(novices$year == 2000 | novices$year >= 2003,
+                           labels = c("Strict Cutoff", "Extended Cutoff"))
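The labelling relies on the fact that the levels of a logical vector are ordered FALSE then TRUE, so the first label goes to the years that fail the condition. A small check on a handful of years:

```r
years <- c(1999, 2000, 2001, 2003, 2013)

# FALSE sorts before TRUE, so FALSE maps to the first label and TRUE to the second.
extended <- factor(years == 2000 | years >= 2003,
                   labels = c("Strict Cutoff", "Extended Cutoff"))
print(data.frame(years, extended))
```

One caveat: this only works if both FALSE and TRUE actually occur in the data; otherwise `factor()` complains about the number of labels.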


This paints a much more representative picture of the distribution of finish times now that the race has been extended to 12 hours.

The allocation of medals is complicated by the fact that new medals have been introduced at different times over recent years. Specifically, the Bill Rowan medal was first awarded in 2000, then the Vic Clapham medal was introduced in 2003 and, finally, 2007 saw the first Wally Hayward medals.

> novices$period = cut(novices$year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
+                      labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
> novice.medals = table(novices$medal, novices$period)
> novice.medals = scale(novice.medals, scale = colSums(novice.medals), center = FALSE) * 100
> options(digits = 1)
> (novice.medals = t(novice.medals))
                Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
  before 2000   0.07          0.00   4.80       0.00  95.13        0.00
  2000 to 2002  0.09          0.00   2.66      12.76  84.49        0.00
  2003 to 2006  0.15          0.00   4.05      17.51  47.63       30.66
  after 2007    0.08          0.03   2.60      12.28  46.40       38.62
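With `right = FALSE` the intervals are closed on the left, so each medal's first year falls into the period that starts with it. In particular the last period includes 2007 itself. A quick check on the boundary years:

```r
years <- c(1999, 2000, 2002, 2003, 2006, 2007, 2013)

# Intervals are [1900, 2000), [2000, 2003), [2003, 2007) and [2007, 3000).
period <- cut(years, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
              labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
print(data.frame(years, period))
```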

So, currently, around 46% of novices get a Bronze medal while slightly fewer, about 39%, get a Vic Clapham medal. A significant fraction, just over 12%, achieve a Bill Rowan, while only 2.6% get a Silver medal. The number of Wally Hayward and Gold medals among novices is very small indeed.


Thanks to Daniel for pointing out this issue!

Race Statistics for Comrades Marathon Novice Runners

Most novice Comrades Marathon runners finish the race on their first attempt and the majority of them walk (shuffle, crawl?) away with Bronze medals.

What is a Novice?

To paraphrase the dictionary, a novice is "a person who is new to or inexperienced in the circumstances in which he or she is placed; a beginner". In the context of the Comrades Marathon this definition can be interpreted in a few ways:

  1. a runner who has never run the Comrades Marathon (has never started the race);
  2. a runner who has never completed the Comrades Marathon (has never finished the race); or
  3. a runner who has not completed both an "up" and a "down" Comrades Marathon.

For the purposes of this article I will be adopting the first definition. This is probably the one of most interest to runners who are embarking on their first Comrades journey.

Identifying a Novice

I'll be using the same data sets that I have discussed in previous articles. Before we focus on the data for the novices we'll start by just retaining the fields of interest.

> novices = results[, c("key", "year", "category", "gender", "medal", "medal.count", "status", "ftime")]
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
3 100030f4 2013 Ages 20 - 29 Female        <NA>           1      DNS      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344

To satisfy our definition of novice we'll need to exclude the "did not start" (DNS) records.

> novices = subset(novices, status != "DNS")
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344
7 10007cb6 2008 Ages 30 - 39   Male  Bill Rowan           4 Finished  8.8514

Some runners do not finish the race on their first attempt but they bravely come back to run the race again. We will retain only the first record for each runner, because the second time they attempt the race they are (according to our definition) no longer novices since they already have some race experience.

> novices = novices[order(novices$year),]
> novices <- novices[which(!duplicated(novices$key)),]
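The combination of sorting by year and then dropping duplicated keys is what picks out each runner's first attempt. Here is how it behaves on a small made-up data frame (keys and values are hypothetical):

```r
toy <- data.frame(
  key    = c("aa11", "bb22", "aa11", "bb22", "cc33"),
  year   = c(2009, 2005, 2008, 2006, 2013),
  status = c("Finished", "Finished", "DNF", "Finished", "Finished")
)

# Sort chronologically so that the first occurrence of each key is the earliest race.
toy <- toy[order(toy$year), ]

# duplicated() flags every occurrence of a key after the first one.
first.attempts <- toy[!duplicated(toy$key), ]
print(first.attempts)
```

Runner aa11 is retained with the 2008 DNF rather than the 2009 finish, which is exactly what the first definition of a novice requires.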

Percentage of Novice Finishers

I suppose that the foremost question going through the minds of many Comrades novices is "Will I finish?".

> table(novices$status) / nrow(novices) * 100

Finished      DNF 
  80.035   19.965 

Well, there's some good news: around 80% of all novices finish the race. Those are quite compelling odds. Of course, a number of factors can influence the success of each individual, but if you have done the training and you run sensibly, then the odds are in your favour.

Medal Distribution for Novice Finishers

What medal is a novice most likely to receive?

> table(novices$medal) / nrow(subset(novices, !is.na(medal))) * 100

         Gold Wally Hayward        Silver    Bill Rowan        Bronze   Vic Clapham 
    0.0829671     0.0051854     4.0264976     5.6469490    79.4708254    10.7675754 
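Since `table()` ignores missing values by default, the same percentages can also be obtained with `prop.table()`. A sketch on made-up medals, where NA stands for a runner who did not finish:

```r
# Made-up medals; NA means no medal (did not finish).
medal <- c("Bronze", "Bronze", "Bronze", "Vic Clapham", NA, "Silver", NA, "Bronze")

# Dividing by the number of non-missing medals, as above ...
by.hand <- table(medal) / sum(!is.na(medal)) * 100

# ... is equivalent to letting prop.table() normalise the counts.
by.prop <- prop.table(table(medal)) * 100
print(round(by.prop, 1))
```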

The vast majority (again around 80%) claim a Bronze medal. There is also a significant proportion (just over 10%) who miss the eleven hour cutoff and get a Vic Clapham medal. Around 6% of novices achieve a Bill Rowan medal and a surprisingly large fraction, just over 4%, manage to finish in a Silver medal time of under seven and a half hours. There are very few Wally Hayward and Gold medals won by novices. The odds for a novice Gold medal are around one in 1200, all else being equal (which it very definitely isn't!).

Distribution of Novice Finishing Times


As one would expect, the chart slopes up towards the right: progressively more runners come in later in the day. There is very clear evidence of clustering of runners just before the medal cutoffs at 07:30, 09:00, 11:00 and 12:00. There is also a peak before the psychological cutoff at 10:00.

Take Away Message

The data for previous years indicates that the outlook for novices is rather good. 80% of them will finish the race and, of those, around 80% will receive Bronze medals.

How can you help ensure that you have a successful race? Here are some of the things I would think about:

  1. Start slowly. It's going to be a long day.
  2. Take regular walking breaks and start doing this early on. A few minutes' recovery will power you up for a number of kms.
  3. Stay hydrated. Take something at every water table. Just don't overdo it.
  4. Be inspired by the other runners: they all have the guts to indulge in this madness with you and every one of them is fighting their own battle.
  5. Enjoy the support: the hordes of people beside the road have come out to see YOU run by. And they all want you to finish.
  6. Enjoy the day: as far as entertainment is concerned, the Comrades Marathon is about the best value for money that you can get.

See you in Pietermaritzburg at 05:30 on 1 June!


Thanks to Daniel for suggesting this article.

Comrades Marathon Negative Splits: The Plot Thickens

I have been thinking a little more about those mysterious negative splits. Not too surprisingly, this thinking happened while I was out running along the Durban beachfront this morning.

Let's have a look at the ten most extreme negative splits from Comrades Marathon 2013:

> split.ratio.2013 = subset(split.ratio, year == 2013)
> #
> split.ratio.2013 = head(split.ratio.2013[order(split.ratio.2013$ratio),], 10)
> #
> rownames(split.ratio.2013) <- 1:nrow(split.ratio.2013)
> split.ratio.2013[, c(-2, -7)]
   year      key drummond.time race.time     ratio
1  2013 3c0ea3bc        368.12    636.50 -0.270929
2  2013 e22d8c74        359.00    633.17 -0.236305
3  2013 5cd624eb        354.87    640.05 -0.196365
4  2013 4d5a86d7        359.45    659.88 -0.164186
5  2013  61fa6b5        345.33    644.38 -0.134025
6  2013 e5d6fa0e        344.33    649.83 -0.112778
7  2013 63a33c8d        368.88    696.88 -0.110830
8  2013 e445f2d1        340.15    647.20 -0.097310
9  2013 fed967de        338.67    647.77 -0.087303
10 2013 553aeb62        364.02    697.90 -0.082780
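To put these ratios in perspective, the ratio used here is race.time / drummond.time - 2 (the definition from my earlier post on negative splits), so a negative value means the second half was run faster than the first. Working through the first row of the table with its rounded times:

```r
# Times (minutes) for the first runner in the table above.
drummond.time <- 368.12
race.time     <- 636.50

# Time taken for the second half of the race.
second.half <- race.time - drummond.time

# Relative difference between the second and first halves.
ratio <- race.time / drummond.time - 2
print(c(first.half = drummond.time, second.half = second.half, ratio = ratio))
```

In other words, this runner apparently covered the second half roughly 100 minutes faster than the first.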

Below are the splits data for these runners (in the same order as the table above).


The top one you have seen before (it was presented in my previous post). And, as previously noted, this runner's time was not captured by the mat at either Camperdown or Polly Shortts. But if we look at the runner with the next most extreme negative split (e22d8c74) we see that the same thing happened: mysteriously he too was missed by those timing mats. The mats must have been having a bad day. The next two major negative splits (5cd624eb and 4d5a86d7): same story, no times at either of those mats. The next runner (61fa6b5) was captured on all five timing mats. And if we look at his splits, he is getting progressively faster during the course of the race. I suspect that this guy actually just had a very well planned and executed race. But the following runner on the list (e5d6fa0e) has also managed to elude both the mats in the second half of the race. Very strange indeed. The final four runners all have splits registered for every timing mat. And, again, if you look at their pace for each of the legs, it is not too hard to believe that these runners were playing by the rules and just had a very good day on the road.

So, of the top ten runners with extreme negative splits, five of them (yes, that's 50%) inexplicably missed both timing mats in the second half of the race. Coincidence? I think not.

Comrades Marathon: Negative Splits and Cheating

With this year's Comrades Marathon just less than a month away, I was reminded of a story from earlier in the year. Mark Dowdeswell, a statistician at Wits University, found evidence of cheating by some middle and back of the pack Comrades runners. He identified a group of 20 athletes who had suspicious negative splits: they ran much faster in the second half of the race. There was one runner in particular whose splits were just too good to be true. When the story was publicised, this particular runner claimed that it was a conspiracy.

This story emerged in February this year.

There was quite a fuss.

And then everything went quiet. The suspected runners were instructed to attend disciplinary hearings, but the outcomes of these hearings have not been publicised nor have the names of the suspected runners been released.

I have done some previous analyses using Comrades Marathon data. Here I am going to use the same data set to explore these suspicious negative splits.

Data Preparation

I started off by extracting a subset of the columns from my splits data.

> split.ratio = splits[, c("year", "race.number", "key", "drummond.time", "race.time")]
> tail(split.ratio)
          year race.number      key drummond.time race.time
2013-9911 2013        9911 eb4b3b0c        303.40    686.68
2013-9912 2013        9912 c8d6cfdd        218.73    484.00
2013-9940 2013        9940 f46204ad        249.87    582.03
2013-9954 2013        9954 4bd1ca76        307.62    669.23
2013-9955 2013        9955 b2b9ed60        286.85    651.87
2013-9964 2013        9964 6f14470d        242.20    573.78

The resulting records have fields for the year, athlete's race number, a unique key identifying the runner, and time taken (in minutes) to reach the little town of Drummond (the half way point at around the marathon distance) and the finish. We will only keep the complete records (valid entries for both half way and the full distance) and then add a new field.
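
The filtering step is not shown in the original transcript. One way it might be done is with complete.cases(), sketched here on a toy data frame (the values and keys are made up for illustration).

```r
# Toy records (made-up values): one is missing the half way time,
# another is missing the finish time.
toy = data.frame(
  key = c("a", "b", "c", "d"),
  drummond.time = c(243.65, NA, 273.38, 295.72),
  race.time = c(532.33, 599.95, NA, 633.80),
  stringsAsFactors = FALSE
)
# Keep only the rows where both times are present.
complete = toy[complete.cases(toy[, c("drummond.time", "race.time")]), ]
print(complete)
```

Only records "a" and "d" survive, since each of the other two lacks one of the times.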

> # This is derived from (race.time - drummond.time) / drummond.time - 1
> #
> split.ratio = transform(split.ratio,
+                         ratio = race.time / drummond.time - 2
+                         )
> head(split.ratio)
           year race.number      key drummond.time race.time   ratio
2000-10003 2000       10003 f1dffb06        243.65    532.33 0.18483
2000-10009 2000       10009 b06cab7f        274.47    599.95 0.18588
2000-10010 2000       10010 929fd7ee        273.38    620.35 0.26916
2000-10013 2000       10013 5d7aa79c        295.72    633.80 0.14327
2000-10014 2000       10014 c0578dad        247.18    533.80 0.15953
2000-10016 2000       10016 d64e4b42        257.60    657.65 0.55299
> summary(split.ratio$ratio)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5640  0.0934  0.1630  0.1870  0.2540  1.7400

The ratio field is a number greater than -1 (in these data it ranges from about -0.56 to 1.74) which quantifies the time difference between first and second halves of the race. So, for example, if a runner took 4.5 hours for the first half and then 5.0 hours for the second half, his ratio would be 0.11111, indicating that he ran around 11% slower in the second half of the race.

> 9.5 / 4.5 - 2
[1] 0.11111

Conversely, if a runner took 5.0 hours for the first half and then finished the second half in 4.5 hours, his ratio would be -0.1, indicating that he ran about 10% faster in the second half.

> 9.5 / 5.0 - 2
[1] -0.1
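
The two worked examples can be wrapped up in a pair of small helper functions. This is only a sketch: the names ratio.from.halves and ratio.from.finish are hypothetical and do not appear in the original analysis. It shows that the ratio is the second half time relative to the first half time, less one, and that this is the same as the finish time divided by the half way time, less two.

```r
# Split ratio from the two half times: second half relative to first half.
ratio.from.halves = function(first, second) second / first - 1
# The same quantity from the half way and finish times, as used above.
ratio.from.finish = function(drummond.time, race.time) race.time / drummond.time - 2

ratio.from.halves(4.5, 5.0)    # positive split
ratio.from.halves(5.0, 4.5)    # negative split
# The two forms agree because race.time = first + second.
all.equal(ratio.from.halves(4.5, 5.0), ratio.from.finish(4.5, 9.5))
```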

Negative values of this ratio then indicate negative splits, while positive values are for positive splits and a value of exactly zero would be for even splits (same time for both halves of the race). Let's look at the two extremes.

> head(split.ratio[order(split.ratio$ratio, decreasing = TRUE),])[, -2]
           year      key drummond.time race.time  ratio
2009-37874 2009 2c5ad823        178.72    668.70 1.7417
2008-36570 2008 d4033ea2        189.98    710.13 1.7379
2005-30155 2005 5a961d21        175.13    643.78 1.6760
2009-33945 2009 a1e79747        183.08    671.57 1.6681
2009-57185 2009 fdc6a261        186.92    653.70 1.4973
2011-56513 2011 df77e8bb        172.38    598.12 1.4697

Large (positive) values of the split ratio mean that a runner ran the second half much slower than the first half. Unless the time for the first half is unrealistic, these are not suspicious: it is quite reasonable that a runner should go out really hard in the first half, get to half way in good time but then find that the wheels fall off in the second half of the race. Take, for example, the runner with key 2c5ad823, whose time for the first half was blisteringly fast (just less than three hours) but who slowed down a lot in the second half, only finishing the race in around 11 hours.

> head(split.ratio[order(split.ratio$ratio),])[, -2]
           year      key drummond.time race.time    ratio
2001-45410 2001 1a605ce5        340.32    488.82 -0.56364
2009-25058 2009 3c0ea3bc        359.08    591.63 -0.35238
2000-2187  2000 ef35f2e6        337.08    569.48 -0.31056
2000-8152  2000 18e59575        324.03    557.25 -0.28027
2013-25058 2013 3c0ea3bc        368.12    636.50 -0.27093
2012-48382 2012 7889f60a        336.85    592.57 -0.24086

At the other end of the spectrum we have runners with very low values of the split ratio, meaning that they ran the second half much faster than the first half. Take, for example, the runner with key 1a605ce5: she ran the first half in around five hours and 40 minutes but whipped through the second half in under two and a half hours. Seems a little odd, right?
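
The times for this runner follow directly from the table. Here is a small sketch of the arithmetic, where as.hmm is a hypothetical helper (not part of the original analysis) which formats minutes as hours and minutes.

```r
# Times in minutes for runner 1a605ce5, copied from the table above.
drummond.time = 340.32
race.time = 488.82
# The second half is the finish time less the half way time.
second.half = race.time - drummond.time
# Format a time in minutes as "h:mm", dropping the fraction of a minute.
as.hmm = function(minutes) {
  sprintf("%d:%02d", as.integer(minutes %/% 60), as.integer(floor(minutes %% 60)))
}
as.hmm(drummond.time)
as.hmm(second.half)
```

That gives 5:40 for the first half and 2:28 for the second half.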

Note that one runner (key 3c0ea3bc) crops up twice in the top 6 negative split ratios above. More about him later.

Some Plots

Let's have a look at the empirical distribution of split ratios.


We can see that only a very small fraction of the field achieves a negative split. And that these runners generally only shave a few percent off their first half times. The dashed lines on the plot indicate the extreme values of the split ratio. Both of these are a long way from the body of the distribution. In statistical terms, either of these extremes is highly improbable.

If we categorise the runners broadly by the number of hours required to finish the race then we get a slightly different view of the data.

> library(plyr)
> split.ratio = transform(split.ratio,
+                         ihour = factor(floor(race.time / 60)))
> levels(split.ratio$ihour) = sprintf("%s hour", levels(split.ratio$ihour))
> #
> (split.ratio.range = ddply(split.ratio, .(ihour), summarize, min = min(ratio), max = max(ratio)))
    ihour       min     max
1  5 hour -0.061526 0.24595
2  6 hour -0.130848 0.57918
3  7 hour -0.172256 0.84996
4  8 hour -0.563642 1.43530
5  9 hour -0.352379 1.46969
6 10 hour -0.270929 1.67596
7 11 hour -0.115299 1.74168

Runners who finish the race in less than 6 hours (in the "5 hour" bin above, which includes the race winner) have split ratios between -0.061526 and 0.24595. The 8 hour bin has ratios which range from -0.563642 to 1.43530. So there was a runner in this group who was more than twice as fast in the second half... The 9 and 10 hour bins also have some inordinately large negative splits.
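
For anyone without plyr installed, the same kind of summary can be sketched in base R. The toy data below are the six records from the head(split.ratio) output earlier, so the result only covers the 8, 9 and 10 hour bins.

```r
# Six records copied from the head(split.ratio) output above.
toy = data.frame(
  race.time = c(532.33, 599.95, 620.35, 633.80, 533.80, 657.65),
  ratio = c(0.18483, 0.18588, 0.26916, 0.14327, 0.15953, 0.55299)
)
# Bin by whole hours taken to finish, labelled as in the post.
toy$ihour = factor(floor(toy$race.time / 60))
levels(toy$ihour) = sprintf("%s hour", levels(toy$ihour))
# Minimum and maximum ratio per bin using base R only (no plyr).
ratio.range = do.call(rbind, lapply(split(toy$ratio, toy$ihour), range))
colnames(ratio.range) = c("min", "max")
print(ratio.range)
```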

What about the distribution of splits in each of these categories?


Now that paints an interesting picture. We can clearly see that in the 5 hour bin quite a significant proportion of the elite runners manage to achieve negative splits. The proportion in all the other bins is appreciably smaller, yet the extreme negative splits are very much larger!

Note that the density curve for the 5 hour bin extends slightly beyond the dashed line indicating the smallest value in this group. This is an artifact of the kernel density method used to create these curves, for which there is a trade off between the smoothness of the curve and the fidelity of the curve to the data. With a smoother curve the data are effectively smeared out more.
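
The effect is easy to reproduce. The sketch below uses base R's density() on synthetic data (not the Comrades data, and not necessarily the method used for the plot above) and compares a rougher with a smoother estimate via the adjust argument.

```r
set.seed(1)
# Synthetic stand-in for split ratios (not the Comrades data).
x = rnorm(200, mean = 0.1, sd = 0.08)
# Two kernel density estimates: adjust scales the default bandwidth.
d.rough = density(x, adjust = 0.5)
d.smooth = density(x, adjust = 2)
# How far each curve reaches below the smallest observation.
min(x) - min(d.rough$x)
min(x) - min(d.smooth$x)
```

Both curves extend below the smallest observation, and the smoother one extends further, because by default density() evaluates the curve out to three bandwidths beyond the extremes of the data.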

We can quantify those proportions.

> negsplit.ihour = with(split.ratio, table(ihour, ratio < 0))
> negsplit.ihour = negsplit.ihour / rowSums(negsplit.ihour)
> #
> negsplit.ihour[,2] * 100
 5 hour  6 hour  7 hour  8 hour  9 hour 10 hour 11 hour 
14.2857  2.8740  2.2335  3.0653  3.1862  3.8505  1.9485 

So, 14.3% of the runners in the 5 hour bin shave off some time in the second half of the race. In the other bins only around 2% to 4% of runners manage to achieve this feat.
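
The division by row sums can also be written with prop.table(). Here is a sketch on a made-up toy data set (two bins, six runners), just to show that the two approaches give the same row proportions.

```r
# Made-up toy data: two runners in one bin, four in another.
toy = data.frame(
  ihour = factor(c("5 hour", "5 hour", "6 hour", "6 hour", "6 hour", "6 hour")),
  ratio = c(-0.02, 0.05, 0.10, -0.01, 0.20, 0.15)
)
counts = with(toy, table(ihour, negative = ratio < 0))
# prop.table() with margin = 1 divides each row by its row total.
props = prop.table(counts, margin = 1)
# Percentage of negative splits per bin, selecting the column by name.
props[, "TRUE"] * 100
```

Selecting the column by name rather than by position makes it explicit that these are the proportions for which ratio < 0 is true.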

Finally, before we dig into the details of some individual runners, let's see how things vary from year to year.


These data are more or less consistent between years. The median of the ratio is around 10% to 20%; the maximum is always roughly 100% or more; the minimum fluctuates rather wildly, extending from the credible -9.7% all the way down to the incredible -56.4%.

> ddply(split.ratio, .(year), summarize, median = median(ratio), min = min(ratio), max = max(ratio))
   year   median       min    max
1  2000 0.163330 -0.310556 1.3282
2  2001 0.168550 -0.563642 1.0321
3  2002 0.211599 -0.175799 1.2257
4  2003 0.171931 -0.151615 1.3793
5  2004 0.201743 -0.172256 1.2693
6  2005 0.151614 -0.183591 1.6760
7  2006 0.179430 -0.131274 1.0500
8  2007 0.153102 -0.129477 1.3033
9  2008 0.208643 -0.096563 1.7379
10 2009 0.163242 -0.352379 1.7417
11 2010 0.093532 -0.206322 1.3878
12 2011 0.150365 -0.125141 1.4697
13 2012 0.118362 -0.240859 1.1876
14 2013 0.204870 -0.270929 1.1596

Individual Runners

We are going to focus our attention on those runners with suspiciously large negative splits. These have been identified on the plot below as those with ratios less than -15% (that is, to the left of the dotted line). The threshold at -15% is somewhat arbitrary, but is certainly conservative.


We extract only those records with ratios less than -15% and discard fields (like race number) to enforce a degree of anonymity. We will also add in a field to indicate how many times a runner appears in the list.

> RMIN = -0.15
> suspect = subset(split.ratio, ratio < RMIN)[, c("year", "key", "race.time", "ratio")]
> (suspect = ddply(suspect, .(key), mutate, entries = length(ratio)))
   year      key race.time    ratio entries
1  2000 12bade96    545.20 -0.15863       1
2  2000 18e59575    557.25 -0.28027       1
3  2001 1a605ce5    488.82 -0.56364       1
4  2002 2edeb04e    556.53 -0.17580       1
5  2009 3c0ea3bc    591.63 -0.35238       2
6  2013 3c0ea3bc    636.50 -0.27093       2
7  2001   4abfd3    526.87 -0.19741       1
8  2013 4d5a86d7    659.88 -0.16419       1
9  2013 5cd624eb    640.05 -0.19636       1
10 2004 5ec9a72b    445.12 -0.17226       1
11 2012 7889f60a    592.57 -0.24086       1
12 2005  81f2015    538.72 -0.18359       1
13 2010 9f83c1a5    639.75 -0.15252       1
14 2003 a229ca86    544.75 -0.15161       1
15 2012 a59982c4    633.30 -0.18235       1
16 2010 a962c295    644.05 -0.20632       1
17 2010 ab59fc97    626.78 -0.15986       1
18 2001 c293e8f5    618.82 -0.17395       1
19 2013 e22d8c74    633.17 -0.23630       1
20 2000 ef35f2e6    569.48 -0.31056       1
21 2000 efdaf288    611.33 -0.22502       1
22 2005 fce308d5    638.98 -0.18083       1

That's interesting, only one runner (the same guy with key 3c0ea3bc) appears twice.
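
This is simple to confirm independently. The sketch below just tabulates the keys copied from the table above and picks out any that occur more than once.

```r
# Keys copied from the suspect table above.
keys = c("12bade96", "18e59575", "1a605ce5", "2edeb04e", "3c0ea3bc", "3c0ea3bc",
         "4abfd3", "4d5a86d7", "5cd624eb", "5ec9a72b", "7889f60a", "81f2015",
         "9f83c1a5", "a229ca86", "a59982c4", "a962c295", "ab59fc97", "c293e8f5",
         "e22d8c74", "ef35f2e6", "efdaf288", "fce308d5")
# Number of entries per key.
entries = table(keys)
# Keys that occur more than once.
repeated = names(entries)[as.vector(entries > 1)]
print(repeated)
```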

We can take a look at the recent race history for these runners.


For a number of these runners there are only splits data for a few years, so it's quite difficult to say anything conclusive. The negative split achieved by 1a605ce5 in 2001 looks pretty extreme though... Other runners, like 4d5a86d7, 9f83c1a5 and fce308d5, have a high degree of variability in both their first and second half times, so again it is difficult to spot an anomaly with certainty.

Let's have a good look at 3c0ea3bc though. He has run the race consistently from 1991 to 2013. He did not finish in 1991 or 1997, but in the other years has managed to rack up 11 Bronze medals and 9 Vic Clapham medals, and in the process earned a double green number. The plot shows that his time to half way has been gradually increasing over the years. Not surprising since we all slow down with age. His finish time has mostly followed the same trend. Except for two major hiccups in 2009 and 2013. It's hard to say for certain that these unusual negative splits were the result of cheating. But, equally, it's hard to imagine how else they might have happened.

Here are the splits data for 3c0ea3bc:


So he was not recorded by either of the timing mats at Camperdown or Polly Shortts. It is well known that these mats are not perfect and sometimes they do miss runners. However, the missing splits at these mats plus the extraordinary time for the second half of the race are rather damning.

I wonder what happened with those disciplinary hearings?

Other Links to This Story

Mark Dowdeswell had something to say about this in an interview on Run Talk SA.

A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men's and ladies' races since 1980. Click on the image below to see a larger version.


The analysis started off with the same data set that I was working with before, from which I extracted only the records for the winners.

> winners = subset(results, gender.position == 1, select = c(year, name, gender, race.time))
> head(winners)
     year               name gender race.time
1    1980          Alan Robb   Male  05:38:25
428  1980 Isavel Roche-Kelly Female  07:18:00
3981 1981      Bruce Fordyce   Male  05:37:28
4055 1981 Isavel Roche-Kelly Female  06:44:35
7643 1982      Bruce Fordyce   Male  05:34:22
7873 1982        Cheryl Winn Female  07:04:59

I then added in a field which gives a count of the number of times each person won the race.

> library(plyr)
> winners = ddply(winners, .(name), function(df) {
+     df = df[order(df$year),]
+     df$count = 1:nrow(df)
+     return(df)
+ })
> subset(winners, name == "Bruce Fordyce")
   year          name gender race.time count
7  1981 Bruce Fordyce   Male  05:37:28     1
8  1982 Bruce Fordyce   Male  05:34:22     2
9  1983 Bruce Fordyce   Male  05:30:12     3
10 1984 Bruce Fordyce   Male  05:27:18     4
11 1985 Bruce Fordyce   Male  05:37:01     5
12 1986 Bruce Fordyce   Male  05:24:07     6
13 1987 Bruce Fordyce   Male  05:37:01     7
14 1988 Bruce Fordyce   Male  05:27:42     8
15 1990 Bruce Fordyce   Male  05:40:25     9
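
The running count can also be computed without plyr. The sketch below is an alternative to the ddply() call, not the code used for the chart: it applies base R's ave() to the six winners shown in the head(winners) output above.

```r
# Winners copied from the head(winners) output above.
toy = data.frame(
  year = c(1980, 1980, 1981, 1981, 1982, 1982),
  name = c("Alan Robb", "Isavel Roche-Kelly", "Bruce Fordyce",
           "Isavel Roche-Kelly", "Bruce Fordyce", "Cheryl Winn"),
  stringsAsFactors = FALSE
)
# Sort by year so that the count runs in chronological order.
toy = toy[order(toy$year), ]
# Running count of wins per name: seq_along() is applied within each name.
toy$count = ave(seq_along(toy$name), toy$name, FUN = seq_along)
print(toy)
```

Unlike ddply(), this keeps the rows in year order rather than regrouping them by name.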

The chart was generated as a scatter plot using ggplot2. The size of the points relates to the number of times each person won the race. The colour scale is as you might imagine: pink for the ladies and blue for the men.

> library(ggplot2)
> ggplot(winners, aes(x = year, y = name, color = gender)) +
+     geom_point(aes(size = count), shape = 19, alpha = 0.75) +
+     scale_size_continuous(range = c(5, 15)) +
+     ylab("") + xlab("") +
+     scale_x_discrete(expand = c(0, 1)) +
+     theme(
+         axis.text.x = element_text(angle = 45, hjust = 1, colour = "black"),
+         axis.text.y = element_text(colour = "black"),
+         legend.position = "none",
+         panel.background = element_blank(),
+         panel.grid.major = element_line(linetype = "dotted", colour = "grey"),
+         panel.grid.major.x = element_blank()
+         )

Two of the key aspects of getting this to look just right were:

  • the call to scale_size_continuous() which ensured that a reasonable range of point sizes was used and
  • the call to scale_x_discrete() which expanded the plot very slightly so that the points near the borders were not cropped.