Tag Archives: Comrades Marathon

Comrades Marathon Medal Predictions

IMG_5239

With only a few days to go until race day, most Comrades Marathon athletes will focusing on resting, getting enough sleep, hydrating, eating and giving a wide berth to anybody who looks even remotely ill.

They will probably also be thinking a lot about Sunday's race. What will the weather be like? Will it be cold at the start? (Unlikely since it's been so warm in Durban.) How will they feel on the day? Will they manage to find their seconds along the route?

For the more performance oriented among them (and, let's face it, that's most runners!), there will also be thoughts of what time they will do on the day and what medal they'll walk away with. I've considered ways for projecting finish times in a previous article. Today I'm going to focus on a somewhat simpler goal: making a Comrades Marathon medal prediction.

In the process I have put together a small application which will make medal predictions based on recent race times.

medal-prediction-interface

I'm not going to delve too deeply into the details, but if you really don't have the patience, feel free to skip forward to the results or click on the image above, which will take you to the application. If you have trouble accessing the application it's probable that you are sitting behind a firewall that is blocking it. Try again from home.

Raw Data

The data for this analysis were compiled from a variety of sources. I scraped the medal results off the Comrades Marathon Results Archive. Times for other distances were cobbled together from Two Oceans Marathon Results, RaceTec Results and the home pages of some well organised running clubs.

The distribution of the data is broken down below as a function of gender, Comrades Marathon medal and other distances for which I have data. For instance, I have data for 45 female runners who got a Bronze medal and for whom a 32 km race time was available.

medal-race-distance-count

Unfortunately the data are pretty sparse for Gold, Wally Hayward and Silver medalists, especially for females. I'll be collecting more data over the coming months and the coverage in these areas should improve. Athletes that are contenders for these medals should have a pretty good idea of what their likely prospects are anyway, so the model is not likely to be awfully interesting for them. This model is intended more for runners who are aiming at a Bill Rowan, Bronze or Vic Clapham medal.

Decision Tree

The first step in the modelling process was to build a decision tree. Primarily this was to check whether it was feasible to predict a medal class based on race times for other distances (I'm happy to say that it was!). The secondary motivation was to assess what the most important variables were. The resulting tree is plotted below. Open this plot in a new window so that you can zoom in on the details. As far as the labels on the tree are concerned, "min" stands for "minimum" time over the corresponding distance and times (labels on the branches) are given in decimal hours.

medal-ctree

The first thing to observe is that the most important predictor is 56 km race time. This dominates the first few levels in the tree hierachy. Of slightly lesser importance is 42.2 km race time, followed by 25 km race time. It's interesting to note that 32 km and 10 km results does no feature at all in the tree, probably due to the relative scarcity of results over these distances in the data.

Some specific observations from the tree are:

  • Male runners who can do 56 km in less than 03:30 have around 20% chance of getting a Gold medal.
  • Female runners who can do 56 km in less than 04:06 have about 80% chance of getting a Gold medal.
  • Runners who can do 42.2 km in less than about 02:50 are very likely to get a Silver medal.
  • Somewhat more specifically, runners who do 56 km in less than 05:53 and 42.2 km in more than 04:49 are probably in line for a Vic Clapham.

Note that the first three observations above should be taken with a pinch of salt since, due to a lack of data, the model is not well trained for Gold, Wally Hayward and Silver medals.

You'd readily be forgiven for thinking that this decision tree is an awfully complex piece of apparatus for calculating something as simple as the colour of your medal.

Well, yes, it is. And I am going to make it simpler for you. But before I make it simpler, I am going to make it slightly more complicated.

A Forest of Decision Trees

Instead of just using a single decision tree, I built a Random Forest consisting of numerous trees, each of which was trained on a subset of the data. Unfortunately the resulting model is not as easy to visualise as a single decision tree, but the results are far more robust.

Medal Prediction Application

To make this a little more accessible I bundled the model up in a Shiny application which I deployed here on Amazon Web Services. Give it a try. You'll need to enter the times you achieved over one or more race distances during the last few months. Note that these are race times, not training run times. The latter are not good predictors for your Comrades medal.

Let's have a quick look at some sample predictions. Suppose that you are a male athlete who has recent times of 00:45, 01:45, 04:00 and 05:00 for 10, 21.1, 42.2 and 56 km races respectively, then according to the model you have a 77% probability of getting a Bronze medal and around 11% chance of getting either a Bill Rowan or Vic Clapham medal. There's a small chance (less than 1%) that you might be in the running for a Silver medal.

medal-predictions-bog-standard

What about a male runner who recently ran 03:20 for 56 km? There is around 20% chance that he would get a Gold medal. Failing that he would most likely (60% chance) get a Silver.

medal-predictions-gold-male

If you happen to have race results for the last few years that I could incorporate into the model, please get in touch. I'm keen to collaborate on improving this tool.

Comrades Marathon Finish Predictions

* If you see a bunch of [Math Processing Error] errors, you might want to try opening the page in a different browser. I have had some trouble with MathJax and Internet Explorer. Yet another reason to never use Windows.

There are various approaches to predicting Comrades Marathon finishing times. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Let's give them a test. I finished the 2013 Comrades Marathon in 09:41. Based on my marathon time from February that year, which was 03:38, Parry's formula suggests that I should have finished at around 09:07. Throwing in my Two Oceans time for that year, 04:59, and a 21.1 km time of 01:58 a few weeks before Comrades, the Sports Digest calculator gives a projected finish time of 08:59. Clearly, relative to both of those predictions, I under-performed that year! Either that or the predictions were way off the mark.

It seems to me that, given the volume of data we gather on our runs, we should be able to generate better predictions. If the thought of maths or data makes you want to doze off, feel free to jump ahead, otherwise read on.

Riegel's Formula

In 1977 Peter Riegel published a formula for predicting running times, which became popular due to its simplicity. The formula itself looks like this:

 \Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^{1.06}


which allows you to predict \Delta t_2 the time it will take you to run distance d_2, given that you know it takes you time \Delta t_1 to run distance d_1. Riegel called this his "endurance equation".

Reverse-Engineering Riegel's Model

Riegel's formula is an empirical model: it's based on data. In order to reverse engineer the model we are going to need some data too. Unfortunately I do not have access to data for a cohort of elite runners. However, I do have ample data for one particular runner: me. Since I come from the diametrically opposite end of the running spectrum (I believe the technical term would be "bog standard runner"), I think these data are probably more relevant to most runners anyway.

I compiled my data for the last three years based on the records kept by my trusty Garmin 910XT. A plot of time versus distance is given below.

training-data-linear

At first glance it looks like you could fit a straight line through those points. And you can, indeed, make a pretty decent linear fit.

> fit <- lm(TimeHours ~ Distance, data = training)
> 
> summary(fit)

Call:
lm(formula = TimeHours ~ Distance, data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64254 -0.04592 -0.00618  0.02361  1.24900 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1029964  0.0107648  -9.568   <2e-16 ***
Distance     0.1012847  0.0008664 116.902   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1394 on 442 degrees of freedom
Multiple R-squared:  0.9687,	Adjusted R-squared:  0.9686 
F-statistic: 1.367e+04 on 1 and 442 DF,  p-value: < 2.2e-16

However, applying a logarithmic transform to both axes gives a more uniform distribution of the data, which also now looks more linear.

training-data-log

Riegel observed that data for a variety of disciplines (running, swimming, cycling and race walking) conformed to the same pattern. Figure 1 from this paper is included below.

Figure 1 from Riegel's paper "Athletic Records and Human Endurance".

Figure 1 from Riegel's paper "Athletic Records and Human Endurance".

If we were to fit a straight line to the data on logarithmic axes then the relationship we'd be contemplating would have the form

\log \Delta t = m \log d + c


or, equivalently,

\Delta t = k d^m


which is a power law relating elapsed time to distance. It's pretty easy to get Riegel's formula from this. Taking two particular points on the power law, \Delta t_1 = k d_1^m and \Delta t_2 = k d_2^m, and eliminating k gives

\Delta t_2 = \Delta t_2 \left( \frac{d_2}{d_1} \right)^m


which is Riegel's formula with an unspecified value for the exponent. We'll call the exponent the "fatigue factor" since it determines the degree to which a runner slows down as distance increases.

How do we get a value for the fatigue factor? Well, by fitting the data, of course!

> fit <- lm(log(TimeHours) ~ log(Distance), data = training)
> 
> summary(fit)

Call:
lm(formula = log(TimeHours) ~ log(Distance), data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27095 -0.04809 -0.01843  0.01552  0.80351 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -2.522669   0.018111  -139.3   <2e-16 ***
log(Distance)  1.045468   0.008307   125.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09424 on 442 degrees of freedom
Multiple R-squared:  0.9729,	Adjusted R-squared:  0.9728 
F-statistic: 1.584e+04 on 1 and 442 DF,  p-value: < 2.2e-16

The fitted value for the exponent is 1.05 (rounding up), which is pretty close to the value in Riegel's formula. The fitted model is included in the logarithmic plot above as the solid line, with the 95% prediction confidence interval indicated by the coloured ribbon. The linear plot below shows the data points used for training the model, the fit and confidence interval as well as dotted lines for constant paces ranging from 04:00 per km to 07:00 per km.

training-data-linear-model

Our model also provides us with an indication of the uncertainty in the fitted exponent: the 95% confidence interval extends from 1.029 to 1.062.

> confint(fit)
                  2.5 %    97.5 %
(Intercept)   -2.558264 -2.487074
log(Distance)  1.029142  1.061794

Estimating the Fatigue Factor

A fatigue factor of 1 would correspond to a straight line, which implies a constant pace regardless of distance. A value less than 1 implies faster pace at larger distances (rather unlikely in practice). Finally, a value larger than 1 implies progressively slower pace at larger distances.

The problem with the fatigue factor estimate above, which was based on a single model fit to all of the data, is that it's probably biased by the fact that most of the data are for runs of around 10 km. In this regime the relationship between time and distance is approximately linear, so that the resulting estimate of the fatigue factor is probably too small.

To get around this problem, I employed a bootstrap technique, creating a large number of subsets from the data. In each subset I weighted the samples to ensure that there was a more even distribution of distances in the mix. I calculated the fatigue factor for each subset, resulting in a range of estimates. Their distribution is plotted below.

fatigue-factor-bootstrap

According to this analysis, my personal fatigue factor is around 1.07 (the median value indicated by the dashed line in the plot above). The Shapiro-Wilk test suggests that the data is sufficiently non-Normal to justify a non-parameteric estimate of the 95% confidence interval for the fatigue factor, which runs from 1.03 to 1.11.

> shapiro.test(fatigue.factor)

	Shapiro-Wilk normality test

data:  fatigue.factor
W = 0.9824, p-value = 1.435e-12

> median(fatigue.factor)
[1] 1.072006
> quantile(fatigue.factor, c(0.025, 0.975))
    2.5%    97.5% 
1.030617 1.107044 

Riegel's analysis also lead to a range of values for the fatigue factor. As can be seen from the table below (extracted from his paper), the values range from 1.01 for Nordic skiing to 1.14 for roller skating. Values for running range from 1.05 to 1.08 depending on age group and gender.

Table 1 from Riegel's paper "Athletic Records and Human Endurance".

Table 1 from Riegel's paper "Athletic Records and Human Endurance".

Generating Predictions

The rules mentioned above for predicting finishing times are generally applied to data for a single race (or a selection of three races). But, again, given that we have so much data on hand, would it not make sense to generate a larger set of predictions?

Predicted finishing times for the Comrades Marathon

The distributions above indicate the predictions for this year's Comrades Marathon (which is apparently going to be 89 km) based on all of my training data this year and using both the default (1.06) and personalised (1.07) values for the fatigue factor. The distributions are interesting, but what we are really interested in is the expected finish times, which are 08:59 and 09:18 depending on what value you use for the fatigue factor. I have a little more confidence in my personalised value, so I am going to be aiming for 09:18 this year.

Comrades is a long day and a variety of factors can affect your finish time. It's good to have a ball-park idea though. If you would like me to generate a set of personalised predictions for you, just get in touch via the replies below.

Postscript

I repeated the analysis for one of my friends and colleagues. His fatigue factor also comes out as 1.07 although, interestingly, the distribution is bi-modal. I think I understand the reason for this though: his runs are divided clearly into two groups: training runs and short runs back and forth between work and the gym.

fatigue-factor-bootstrap-karl-wilhelm

Comrades Runners Disqualified: I'm Not Convinced

Apparently 14 runners have been disqualified for failing to complete the full distance at the 2012 and 2013 Comrades Marathons.

I'm not convinced.

Although 20 runners were charged with misconduct, six of them had a valid story. These runners had retired from the race but the bailers' bus had dropped them back on the course and they were "forced" to cross the finish line. I find this story a hard to digest. My understanding is that race numbers are either confiscated, destroyed or permanently marked when entering the bus. If this story is true then it should have thus been immediately obvious to officials that the runners in question had dropped out of the race. Their times should never have been recorded and they certainly should not have received medals (which presumably at least some of them did, since they have been instructed to return them!).

So that leaves the 14 runners who were disqualified. Were any of them among the group of mysterious negative splits identified previously? Unless KZNA or the CMA releases the names or race numbers, I guess we'll never know.

Comrades Marathon Pacing Chart: Up Run

I've updated my Comrades Marathon pacing chart to include both the Up and Down runs. You can grab it here. The data for this year's race are not yet finalised (I think we will be running 87 km), but you can make changes when it's all confirmed.

The use of the chart is explained in a previous post. Any feedback on how this can be improved would be appreciated.

comrades-pacing-2015

Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an "average runner" myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you're interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

Results

The histograms below show graphically how the distribution of runners' ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

age-decade-histogram

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varies somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year... Since 1990 though there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

mean-age-year-regression

Conclusion

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

Analysis

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

> head(results)
       key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729

The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

> summary(fit.1)

Call:
lm(formula = mean.age ~ year, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3181 -0.5322 -0.0118  0.4971  1.9897 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924,	Adjusted R-squared:  0.923 
F-statistic:  801 on 1 and 66 DF,  p-value: <2e-16

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly large for males than females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

> summary(fit.2)

Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)

Residuals:
   Min     1Q Median     3Q    Max 
-1.103 -0.522  0.024  0.388  2.287 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  481 on 2 and 65 DF,  p-value: <2e-16

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

> summary(fit.3)

Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1038 -0.5225  0.0259  0.3866  2.2885 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  480 on 2 and 65 DF,  p-value: <2e-16

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data does not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

> summary(fit.4)

Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0730 -0.5127 -0.0492  0.4225  2.1273 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26    
year:genderMale    0.0195     0.0168    1.17     0.25    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938,	Adjusted R-squared:  0.935 
F-statistic:  322 on 3 and 64 DF,  p-value: <2e-16

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
  Res.Df  RSS Df Sum of Sq     F  Pr(>F)    
1     66 36.2                               
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02                  
4     64 29.5  1      0.62  1.36 0.24833    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% uncertainty interval. This was used to generate the second set of plots.

fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))

Comrades Marathon Negative Splits: Cheat Strikes Again

It looks likes one of the suspect runners from my previous posts cheated again in this year's Comrades Marathon. Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year's Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. So I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

My data for the 2013 up run gives times with one second precision, so these questions could be answered if I relaxed the constraints from "exactly the same time" to "within one second of each other". We'll call such simultaneous pairs of runners "twins" and simultaneous threesomes will be known as "tripods". How many twins are there? How many tripods? The answers are somewhat surprising. What's even more surprising is another category: "phantoms".

If you are not interested in the details of the analysis (and I'm guessing that you probably aren't), please skip forward to the pictures and analysis.

Looking at the Data

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

> simultaneous = subset(splits,
+                      year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
> simultaneous = simultaneous[complete.cases(simultaneous),]
> #
> rownames(simultaneous) = simultaneous$key
> simultaneous$key <- NULL
> head(simultaneous)
         drummond.time race.time
4bdcb291        320.15    712.42
4e488aab        294.65    656.90
ab59fc97        304.62    643.67
89d3e09b        270.32    646.78
fc728816        211.27    492.95
7b761740        274.60    584.37

Next we calculate the "distance" (this is a distance in time and not in space) between runners, which is effectively the squared difference between the halfway and finish times for each pair of runners. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.

> simultaneous = dist(simultaneous)
> library(reshape2)
> simultaneous = melt(as.matrix(simultaneous))
> head(simultaneous)
      Var1     Var2   value
1 4bdcb291 4bdcb291   0.000
2 4e488aab 4bdcb291  61.093
3 ab59fc97 4bdcb291  70.483
4 89d3e09b 4bdcb291  82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910

We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

> simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))

Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

> simultaneous = subset(simultaneous, value == 0)
> head(simultaneous)
            Var1     Var2 value
623174  5217dfc9 75a78d04     0
971958  d8c9c403 e6e0d6e3     0
2024105 2e8f7778 9acc46ee     0
2464116 5f18d86f 9a1697ff     0
2467712 63033429 9a1697ff     0
3538608 54a92b96 f574be97     0

We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:

> simultaneous = simultaneous[order(simultaneous$race.time),]
> head(simultaneous)[, c(4, 6, 8)]
    race.number.x race.number.y race.time
133         59235         56915  07:54:21
9           26132         23470  08:06:55
62          44008         31833  08:25:58
61          25035         36706  08:35:42
54          28868         25910  08:46:42
26          47703         31424  08:47:08
> tail(simultaneous)[, c(4, 6, 8)]
   race.number.x race.number.y race.time
71         54689         16554  11:55:59
60          8846         23003  11:56:26
44          9235         49251  11:56:47
38         53354         53352  11:56:56
28         19268         59916  11:57:49
20         22499         40754  11:58:26

Twins

As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.

Tripods

Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club, two of them are in the 30-39 age category and the third is in the 60+ group. There's a clear family resemblance, so I'm guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!

tripod-11832

The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.

tripod-53713

And now things get more interesting...

Phantom #1

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below.

splits-26132

splits-23470

Not only did they pass through halfway and the finish at the same time, but they crossed every mat along the route at precisely the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs...

phantom-26132-A

phantom-26132-B

phantom-26132-C

You might notice that there is another runner with 26132 in all three of the images above. That's not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.

splits-28151

If you search the race photographs for the phantom's race number (23470), you will find that there are no pictures of him at all! That's right, nineteen photographs of 26132 and not a single photograph of 23470.

Phantom #2

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at precisely the same time.

splits-53367

splits-27587

Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.

phantom-53367-B

phantom-53367-C

phantom-53367-D

phantom-53367-A

I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.

Conclusion

Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.

Comrades Marathon Pacing Chart: Down Run

Although I have been thinking vaguely about my Plan A goal of a Bill Rowan medal at the Comrades Marathon this year, I have not really put a rigorous pacing plan in place. I know from previous experience that I am likely to be quite a bit slower towards the end of the race. I also know that I am going to lose a few minutes at the start. So how fast does this mean I need to run in order to get from Pietermaritzburg to Durban in under 9 hours?

Well, suppose that it takes me 3 minutes from the gun to get across the starting line. And, furthermore, assume that I will be running around 5% slower towards the end of the race. To still get to Durban under 9 hours I would need to run at roughly 5:52 per km at the beginning and gradually ease back to about 6:11 per km towards the end.

I arrived at these figures using a pacing spreadsheet. To get an idea of your pace requirements you will need to specify your goal time, the number of minutes you anticipate losing before crossing the start line and an estimate of how much you think you will slow down during the course of the race. This is done by editing the blue fields indicated in the image below. The rest of the spreadsheet will update on the basis of your selections.

pace-calc-1

The spreadsheet uses a simple linear model which assumes that your pace will gradually decline at the rate you have specified. If you give 0% for your slowing down percentage then the calculations are performed on the basis of a uniform pace throughout the race. Of course, neither the linear model nor a uniform pace are truly realistic. We all know that our pace will vary continuously throughout the race as a function of congestion, topography, hydration, fatigue, motivation and all of the other factors which come into play. However, as noted by the eminent statistician George Box "all models are wrong, but some are useful". In this case the linear model is a useful way to account for the effects of fatigue.

The spreadsheet will give you an indication of the splits (both relative to the start of the race as well as time of day) and pace (instantaneous and average) required to achieve your goal time. There are also a pair of charts which will be updated with your projected timing and pace information.

pace-calc-2

My plan on race day is to run according to my average pace. This works well because it smooths out all the perturbations associated with tables and walking breaks.

Losing Time at the Start

One interesting thing to play around with on the spreadsheet is the effect of losing time at the start. If you vary this number you should see that it really does not have a massive influence on your pacing requirements for the rest of the race. For example, if I change my estimate from 3 minutes to 10 minutes then my required average pace decreases from 6:02 per km to 5:57 per km. Sure, this amounts to 5 seconds shaved off every km, but it is not unmananagable: the delay at the start gets averaged out over the rest of the race.

Naturally, the faster you are hoping to finish the race, the more significant a delay at the start is going to become. However, if you are aiming for a really fast time then presumably you are in a good seeding batch. For the majority of runners it is probably not going to make an enormous difference and so it is not worth stressing about.

The important thing is to make sure that you just keep on moving forward. Don't stop. Just keep on putting one foot in front of the other.

Other Pacing Charts

The pacing chart by Dirk Cloete is based on the profile of the route. It breaks the route down into undulating, up and down sections and takes this into account when calculating splits. Don Oliver also has some static pacing charts.

Race Statistics for Comrades Marathon Novice Runners: Corrigendum

There was some significant bias in the histogram from my previous post: the data from all years were lumped together. This is important because as of 2003 (when the Vic Clapham medal was introduced) the final cutoff for the Comrades Marathon was extended from 11:00 to 12:00. In 2000 they also applied an extended cutoff.

I have consequently partitioned the data according to "strict" and "extended" cutoffs.

> novices$extended = factor(novices$year == 2000 | novices$year >= 2003,
+                           labels = c("Strict Cutoff", "Extended Cutoff"))

novice-finish-times-hist

This paints a much more representative picture of the distribution of finish times now that the race has been extended to 12 hours.

The allocation of medals is complicated by the fact that new medals have been introduced at different times over recent years. Specifically, the Bill Rowan medal was first awarded in 2000, then the Vic Clapham medal was introduced in 2003 and, finally, 2007 saw the first Wally Hayward medals.

> novices$period = cut(novices$year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
+                      labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
>
> novice.medals = table(novices$medal, novices$period)
> novice.medals = scale(novice.medals, scale = colSums(novice.medals), center = FALSE) * 100
> options(digits = 1)
> (novice.medals = t(novice.medals))
              
                Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
  before 2000   0.07          0.00   4.80       0.00  95.13        0.00
  2000 to 2002  0.09          0.00   2.66      12.76  84.49        0.00
  2003 to 2006  0.15          0.00   4.05      17.51  47.63       30.66
  after 2007    0.08          0.03   2.60      12.28  46.40       38.62

So, currently, around 46% of novices get a Bronze medal while slight fewer, about 37%, get a Vic Clapham medal. A significant fraction, just over 12%, achieve a Bill Rowan, while only 2.6% get a Silver medal. The number of Wally Hayward and Gold medals among novices is very small indeed.

Acknowledgement

Thanks to Daniel for pointing out this issue!