# Comrades Marathon Finish Predictions

* If you see a bunch of [Math Processing Error] errors, you might want to try opening the page in a different browser. I have had some trouble with MathJax and Internet Explorer. Yet another reason to never use Windows.

There are various approaches to predicting Comrades Marathon finishing times. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Let's give them a test. I finished the 2013 Comrades Marathon in 09:41. Based on my marathon time from February that year, which was 03:38, Parry's formula suggests that I should have finished at around 09:07. Throwing in my Two Oceans time for that year, 04:59, and a 21.1 km time of 01:58 a few weeks before Comrades, the Sports Digest calculator gives a projected finish time of 08:59. Clearly, relative to both of those predictions, I under-performed that year! Either that or the predictions were way off the mark.

It seems to me that, given the volume of data we gather on our runs, we should be able to generate better predictions. If the thought of maths or data makes you want to doze off, feel free to jump ahead, otherwise read on.

## Riegel's Formula

In 1977 Peter Riegel published a formula for predicting running times, which became popular due to its simplicity. The formula itself looks like this:

$$\Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^{1.06}$$

which allows you to predict $\Delta t_2$, the time it will take you to run distance $d_2$, given that you know it takes you time $\Delta t_1$ to run distance $d_1$. Riegel called this his "endurance equation".
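
As a rough illustration, here is a minimal R sketch of the endurance equation. It assumes Riegel's published exponent of 1.06; the function name `riegel` and the variable names are mine, purely for illustration. The example projects the 21.1 km time of 01:58 quoted earlier onto the marathon distance.

```r
# Riegel's endurance equation: t2 = t1 * (d2 / d1)^exponent.
# The default exponent of 1.06 is Riegel's published value.
riegel <- function(t1, d1, d2, exponent = 1.06) {
  t1 * (d2 / d1)^exponent
}

# A 21.1 km run in 01:58 (118 minutes), projected to 42.2 km.
marathon.minutes <- riegel(t1 = 118, d1 = 21.1, d2 = 42.2)

# Split into whole hours and (rounded) minutes for display.
hours <- as.integer(marathon.minutes %/% 60)
minutes <- as.integer(round(marathon.minutes %% 60))
sprintf("%02d:%02d", hours, minutes)
```

Any consistent units will do for time and distance, since only the ratio of distances enters the formula.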

## Reverse-Engineering Riegel's Model

Riegel's formula is an empirical model: it's based on data. In order to reverse engineer the model we are going to need some data too. Unfortunately I do not have access to data for a cohort of elite runners. However, I do have ample data for one particular runner: me. Since I come from the diametrically opposite end of the running spectrum (I believe the technical term would be "bog standard runner"), I think these data are probably more relevant to most runners anyway.

I compiled my data for the last three years based on the records kept by my trusty Garmin 910XT. A plot of time versus distance is given below.

At first glance it looks like you could fit a straight line through those points. And you can, indeed, make a pretty decent linear fit.

> fit <- lm(TimeHours ~ Distance, data = training)
>
> summary(fit)

Call:
lm(formula = TimeHours ~ Distance, data = training)

Residuals:
Min       1Q   Median       3Q      Max
-0.64254 -0.04592 -0.00618  0.02361  1.24900

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1029964  0.0107648  -9.568   <2e-16 ***
Distance     0.1012847  0.0008664 116.902   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1394 on 442 degrees of freedom
Multiple R-squared:  0.9687,	Adjusted R-squared:  0.9686
F-statistic: 1.367e+04 on 1 and 442 DF,  p-value: < 2.2e-16


However, applying a logarithmic transform to both axes gives a more uniform distribution of the data, which also now looks more linear.

Riegel observed that data for a variety of disciplines (running, swimming, cycling and race walking) conformed to the same pattern. Figure 1 from his paper is included below.

Figure 1 from Riegel's paper "Athletic Records and Human Endurance".

If we were to fit a straight line to the data on logarithmic axes then the relationship we'd be contemplating would have the form

$$\log \Delta t = m \log d + \log k$$

or, equivalently,

$$\Delta t = k d^m$$

which is a power law relating elapsed time to distance. It's pretty easy to get Riegel's formula from this. Taking two particular points on the power law, $\Delta t_1 = k d_1^m$ and $\Delta t_2 = k d_2^m$, and eliminating $k$ gives

$$\Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^m$$

which is Riegel's formula with an unspecified value for the exponent. We'll call the exponent the "fatigue factor" since it determines the degree to which a runner slows down as distance increases.

How do we get a value for the fatigue factor? Well, by fitting the data, of course!

> fit <- lm(log(TimeHours) ~ log(Distance), data = training)
>
> summary(fit)

Call:
lm(formula = log(TimeHours) ~ log(Distance), data = training)

Residuals:
Min       1Q   Median       3Q      Max
-0.27095 -0.04809 -0.01843  0.01552  0.80351

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.522669   0.018111  -139.3   <2e-16 ***
log(Distance)  1.045468   0.008307   125.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09424 on 442 degrees of freedom
Multiple R-squared:  0.9729,	Adjusted R-squared:  0.9728
F-statistic: 1.584e+04 on 1 and 442 DF,  p-value: < 2.2e-16


The fitted value for the exponent is 1.05 (rounding up), which is pretty close to the value in Riegel's formula. The fitted model is included in the logarithmic plot above as the solid line, with the 95% prediction confidence interval indicated by the coloured ribbon. The linear plot below shows the data points used for training the model, the fit and confidence interval as well as dotted lines for constant paces ranging from 04:00 per km to 07:00 per km.

Our model also provides us with an indication of the uncertainty in the fitted exponent: the 95% confidence interval extends from 1.029 to 1.062.

> confint(fit)
2.5 %    97.5 %
(Intercept)   -2.558264 -2.487074
log(Distance)  1.029142  1.061794


## Estimating the Fatigue Factor

A fatigue factor of 1 would correspond to a straight line, which implies a constant pace regardless of distance. A value less than 1 implies faster pace at larger distances (rather unlikely in practice). Finally, a value larger than 1 implies progressively slower pace at larger distances.
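
To make the link between fatigue factor and pace explicit, here is a short worked derivation. It assumes the same power law $\Delta t = k d^m$ used above, and writes $p$ for pace (time per unit distance), a symbol introduced here only for illustration.

$$p(d) = \frac{\Delta t}{d} = k d^{m-1}, \qquad \frac{p(d_2)}{p(d_1)} = \left( \frac{d_2}{d_1} \right)^{m-1}$$

With $m = 1$ the ratio is exactly 1 (constant pace). With $m = 1.07$, doubling the distance gives $2^{0.07} \approx 1.05$, so the model expects pace to be roughly 5% slower each time the distance doubles.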

The problem with the fatigue factor estimate above, which was based on a single model fit to all of the data, is that it's probably biased by the fact that most of the data are for runs of around 10 km. In this regime the relationship between time and distance is approximately linear, so that the resulting estimate of the fatigue factor is probably too small.

To get around this problem, I employed a bootstrap technique, creating a large number of subsets from the data. In each subset I weighted the samples to ensure that there was a more even distribution of distances in the mix. I calculated the fatigue factor for each subset, resulting in a range of estimates. Their distribution is plotted below.
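
The sketch below shows one way such a weighted bootstrap might look. It is not the code behind the plot: the training data are synthetic stand-ins, and the weighting scheme (each run weighted by the inverse of the number of runs in its distance bin) is an assumption of mine, as are the bin boundaries and the number of resamples.

```r
set.seed(1)

# Synthetic stand-in for the training log (the real data are not reproduced here):
# mostly runs of around 10 km plus a few long ones, with times following a power law.
n <- 400
training <- data.frame(Distance = c(rnorm(n - 40, mean = 10, sd = 2), runif(40, 20, 60)))
training$Distance <- pmax(training$Distance, 2)
training$TimeHours <- exp(-2.6) * training$Distance^1.07 * exp(rnorm(n, sd = 0.05))

# Assumed weighting scheme: bin the distances and weight each run by
# 1 / (number of runs in its bin), so sparse long-run bins are sampled
# about as often as the crowded short-run bins.
bins <- cut(training$Distance, breaks = c(0, 5, 10, 15, 20, 30, 45, Inf))
weights <- 1 / as.numeric(table(bins)[as.integer(bins)])

# Fit the log-log model to each weighted resample and keep the slope.
fatigue.factor <- replicate(500, {
  rows <- sample(nrow(training), replace = TRUE, prob = weights)
  coef(lm(log(TimeHours) ~ log(Distance), data = training[rows, ]))[["log(Distance)"]]
})

median(fatigue.factor)
quantile(fatigue.factor, c(0.025, 0.975))
```

Because the synthetic times are generated with an exponent of 1.07, the resampled estimates should cluster around that value. With real data the spread of the estimates is what matters.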

According to this analysis, my personal fatigue factor is around 1.07 (the median value indicated by the dashed line in the plot above). The Shapiro-Wilk test suggests that the data are sufficiently non-Normal to justify a non-parametric estimate of the 95% confidence interval for the fatigue factor, which runs from 1.03 to 1.11.

> shapiro.test(fatigue.factor)

Shapiro-Wilk normality test

data:  fatigue.factor
W = 0.9824, p-value = 1.435e-12

> median(fatigue.factor)
[1] 1.072006
> quantile(fatigue.factor, c(0.025, 0.975))
2.5%    97.5%
1.030617 1.107044


Riegel's analysis also led to a range of values for the fatigue factor. As can be seen from the table below (extracted from his paper), the values range from 1.01 for Nordic skiing to 1.14 for roller skating. Values for running range from 1.05 to 1.08 depending on age group and gender.

Table 1 from Riegel's paper "Athletic Records and Human Endurance".

## Generating Predictions

The rules mentioned above for predicting finishing times are generally applied to data for a single race (or a selection of three races). But, again, given that we have so much data on hand, would it not make sense to generate a larger set of predictions?

The distributions above indicate the predictions for this year's Comrades Marathon (which is apparently going to be 89 km) based on all of my training data this year and using both the default (1.06) and personalised (1.07) values for the fatigue factor. The distributions are interesting, but what we are really interested in is the expected finish times, which are 08:59 and 09:18 depending on what value you use for the fatigue factor. I have a little more confidence in my personalised value, so I am going to be aiming for 09:18 this year.
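
In outline, each training run yields its own prediction, and the set of predictions forms a distribution. The sketch below illustrates the idea under stated assumptions: the runs are made-up numbers rather than my training log, the names are illustrative, and the median is used as the summary even though other choices (such as the mean) are equally defensible.

```r
# Hypothetical training runs: distance (km) and time (hours).
runs <- data.frame(Distance = c(8, 10, 15, 21.1, 32, 42.2),
                   TimeHours = c(0.72, 0.90, 1.40, 2.00, 3.20, 3.80))

# Race distance quoted in the text.
comrades.km <- 89

# One Riegel-style prediction per training run for a given fatigue factor.
predict.finish <- function(exponent) {
  runs$TimeHours * (comrades.km / runs$Distance)^exponent
}

predictions <- data.frame(default = predict.finish(1.06),
                          personalised = predict.finish(1.07))

# Summarise each set of predictions (in hours).
sapply(predictions, median)
```

Since every training run here is shorter than the race, the larger exponent always gives the slower prediction.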

Comrades is a long day and a variety of factors can affect your finish time. It's good to have a ball-park idea though. If you would like me to generate a set of personalised predictions for you, just get in touch via the replies below.

## Postscript

I repeated the analysis for one of my friends and colleagues. His fatigue factor also comes out as 1.07 although, interestingly, the distribution is bi-modal. I think I understand the reason for this, though: his runs fall clearly into two groups, training runs and short runs back and forth between work and the gym.

# Comrades Runners Disqualified: I'm Not Convinced

Apparently 14 runners have been disqualified for failing to complete the full distance at the 2012 and 2013 Comrades Marathons.

I'm not convinced.

Although 20 runners were charged with misconduct, six of them had a valid story. These runners had retired from the race but the bailers' bus had dropped them back on the course and they were "forced" to cross the finish line. I find this story a little hard to digest. My understanding is that race numbers are either confiscated, destroyed or permanently marked when entering the bus. If this story is true then it should thus have been immediately obvious to officials that the runners in question had dropped out of the race. Their times should never have been recorded and they certainly should not have received medals (which presumably at least some of them did, since they have been instructed to return them!).

So that leaves the 14 runners who were disqualified. Were any of them among the group of mysterious negative splits identified previously? Unless KZNA or the CMA releases the names or race numbers, I guess we'll never know.

# Comrades Marathon Pacing Chart: Up Run

I've updated my Comrades Marathon pacing chart to include both the Up and Down runs. You can grab it here. The data for this year's race are not yet finalised (I think we will be running 87 km), but you can make changes when it's all confirmed.

The use of the chart is explained in a previous post. Any feedback on how this can be improved would be appreciated.

# Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an "average runner" myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you're interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

## Results

The histograms below show graphically how the distribution of runners' ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varied somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year... Since 1990 though there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

## Conclusion

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

## Analysis

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

> head(results)
key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980


The first step in the analysis was to compile decadal and annual summary statistics using plyr.

> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729


The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)
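
The four formulas differ only in which gender terms they include. One way to see this, sketched below, is to ask R how it expands each formula. Note in particular that `year * gender` is shorthand for `year + gender + year:gender`. The helper name `term.labels` is mine, for illustration only.

```r
# List the terms that R derives from a model formula.
term.labels <- function(f) attr(terms(f), "term.labels")

formulas <- list(fit.1 = mean.age ~ year,
                 fit.2 = mean.age ~ year + year:gender,
                 fit.3 = mean.age ~ year + gender,
                 fit.4 = mean.age ~ year + year * gender)

expanded <- lapply(formulas, term.labels)
expanded
```

So the second model lets the slope depend on gender, the third lets the intercept depend on gender, and the fourth allows both.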


The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

> summary(fit.1)

Call:
lm(formula = mean.age ~ year, data = year.statistics)

Residuals:
Min      1Q  Median      3Q     Max
-1.3181 -0.5322 -0.0118  0.4971  1.9897

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924,	Adjusted R-squared:  0.923
F-statistic:  801 on 1 and 66 DF,  p-value: <2e-16


The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly larger for males than females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

> summary(fit.2)

Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)

Residuals:
Min     1Q Median     3Q    Max
-1.103 -0.522  0.024  0.388  2.287

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935
F-statistic:  481 on 2 and 65 DF,  p-value: <2e-16


The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

> summary(fit.3)

Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)

Residuals:
Min      1Q  Median      3Q     Max
-1.1038 -0.5225  0.0259  0.3866  2.2885

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935
F-statistic:  480 on 2 and 65 DF,  p-value: <2e-16


The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data do not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

> summary(fit.4)

Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)

Residuals:
Min      1Q  Median      3Q     Max
-1.0730 -0.5127 -0.0492  0.4225  2.1273

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26
year:genderMale    0.0195     0.0168    1.17     0.25
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938,	Adjusted R-squared:  0.935
F-statistic:  322 on 3 and 64 DF,  p-value: <2e-16


On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
Res.Df  RSS Df Sum of Sq     F  Pr(>F)
1     66 36.2
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02
4     64 29.5  1      0.62  1.36 0.24833
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% uncertainty interval. This was used to generate the second set of plots.

fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))


# Comrades Marathon Negative Splits: Cheat Strikes Again

It looks like one of the suspect runners from my previous posts cheated again in this year's Comrades Marathon. Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

# Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year's Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. So I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

My data for the 2013 up run gives times with one second precision, so these questions could be answered if I relaxed the constraints from "exactly the same time" to "within one second of each other". We'll call such simultaneous pairs of runners "twins" and simultaneous threesomes will be known as "tripods". How many twins are there? How many tripods? The answers are somewhat surprising. What's even more surprising is another category: "phantoms".

If you are not interested in the details of the analysis (and I'm guessing that you probably aren't), please skip forward to the pictures and analysis.

## Looking at the Data

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

> simultaneous = subset(splits,
+                      year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
> simultaneous = simultaneous[complete.cases(simultaneous),]
> #
> rownames(simultaneous) = simultaneous$key
> simultaneous$key <- NULL
> head(simultaneous)
         drummond.time race.time
4bdcb291        320.15    712.42
4e488aab        294.65    656.90
ab59fc97        304.62    643.67
89d3e09b        270.32    646.78
fc728816        211.27    492.95
7b761740        274.60    584.37


Next we calculate the "distance" (this is a distance in time and not in space) between runners. Since the dist() function defaults to the Euclidean metric, this is the square root of the sum of the squared differences in halfway time and finish time for each pair of runners, so it is zero only when both times match. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.

> simultaneous = dist(simultaneous)
> library(reshape2)
> simultaneous = melt(as.matrix(simultaneous))
> head(simultaneous)
      Var1     Var2   value
1 4bdcb291 4bdcb291   0.000
2 4e488aab 4bdcb291  61.093
3 ab59fc97 4bdcb291  70.483
4 89d3e09b 4bdcb291  82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910
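
As a quick check of what dist() returns, the sketch below recomputes one entry by hand. It uses the first two runners from the table of halfway and finish times shown earlier; the variable names are illustrative.

```r
# Two runners from the table above: (halfway, finish) times in minutes.
times <- rbind("4bdcb291" = c(320.15, 712.42),
               "4e488aab" = c(294.65, 656.90))

# dist() defaults to the Euclidean distance between rows.
d <- as.numeric(dist(times))

# The same quantity by hand: square root of the summed squared differences.
by.hand <- sqrt(sum((times[1, ] - times[2, ])^2))

all.equal(d, by.hand)
```

The result is about 61.1 minutes, in line with the 61.093 in the second row of the output above. The small discrepancy presumably comes from the displayed times being rounded.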


We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

> simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))


Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

> simultaneous = subset(simultaneous, value == 0)
> head(simultaneous)
            Var1     Var2 value
623174  5217dfc9 75a78d04     0
971958  d8c9c403 e6e0d6e3     0
2024105 2e8f7778 9acc46ee     0
2464116 5f18d86f 9a1697ff     0
2467712 63033429 9a1697ff     0
3538608 54a92b96 f574be97     0
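
With hindsight, the matching could be done without building the full distance matrix at all. The sketch below is one possible alternative, not the approach used above: it flags rows whose pair of times occurs more than once, using toy data and illustrative names.

```r
# Toy version of the halfway/finish table: runners "a" and "c" share both times.
times <- data.frame(drummond.time = c(300.10, 295.00, 300.10, 310.50),
                    race.time     = c(650.25, 640.00, 650.25, 700.00),
                    row.names     = c("a", "b", "c", "d"))

# Flag every row whose (halfway, finish) combination occurs more than once.
# Scanning from both ends catches the first occurrence as well as the repeats.
shared <- duplicated(times) | duplicated(times, fromLast = TRUE)

twins <- times[shared, ]
twins
```

This scales with the number of runners rather than the number of pairs. A group of three rows sharing the same times would show up here as a tripod.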


We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:

> simultaneous = simultaneous[order(simultaneous$race.time),]
> head(simultaneous)[, c(4, 6, 8)]
    race.number.x race.number.y race.time
133         59235         56915  07:54:21
9           26132         23470  08:06:55
62          44008         31833  08:25:58
61          25035         36706  08:35:42
54          28868         25910  08:46:42
26          47703         31424  08:47:08
> tail(simultaneous)[, c(4, 6, 8)]
   race.number.x race.number.y race.time
71         54689         16554  11:55:59
60          8846         23003  11:56:26
44          9235         49251  11:56:47
38         53354         53352  11:56:56
28         19268         59916  11:57:49
20         22499         40754  11:58:26


## Twins

As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.

## Tripods

Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club, two of them are in the 30-39 age category and the third is in the 60+ group. There's a clear family resemblance, so I'm guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!

The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.

And now things get more interesting...

## Phantom #1

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below. Not only did they pass through halfway and the finish at the same time, but they crossed every mat along the route at precisely the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs...

You might notice that there is another runner with 26132 in all three of the images above. That's not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.

If you search the race photographs for the phantom's race number (23470), you will find that there are no pictures of him at all! That's right, nineteen photographs of 26132 and not a single photograph of 23470.

## Phantom #2

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at precisely the same time. Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.

I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.

## Conclusion

Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.

# Comrades Marathon Pacing Chart: Down Run

Although I have been thinking vaguely about my Plan A goal of a Bill Rowan medal at the Comrades Marathon this year, I have not really put a rigorous pacing plan in place. I know from previous experience that I am likely to be quite a bit slower towards the end of the race. I also know that I am going to lose a few minutes at the start. So how fast does this mean I need to run in order to get from Pietermaritzburg to Durban in under 9 hours?

Well, suppose that it takes me 3 minutes from the gun to get across the starting line. And, furthermore, assume that I will be running around 5% slower towards the end of the race. To still get to Durban under 9 hours I would need to run at roughly 5:52 per km at the beginning and gradually ease back to about 6:11 per km towards the end.

I arrived at these figures using a pacing spreadsheet. To get an idea of your pace requirements you will need to specify your goal time, the number of minutes you anticipate losing before crossing the start line and an estimate of how much you think you will slow down during the course of the race. This is done by editing the blue fields indicated in the image below. The rest of the spreadsheet will update on the basis of your selections.

The spreadsheet uses a simple linear model which assumes that your pace will gradually decline at the rate you have specified. If you give 0% for your slowing down percentage then the calculations are performed on the basis of a uniform pace throughout the race. Of course, neither the linear model nor a uniform pace is truly realistic. We all know that our pace will vary continuously throughout the race as a function of congestion, topography, hydration, fatigue, motivation and all of the other factors which come into play. However, as noted by the eminent statistician George Box, "all models are wrong, but some are useful". In this case the linear model is a useful way to account for the effects of fatigue.

The spreadsheet will give you an indication of the splits (both relative to the start of the race as well as time of day) and pace (instantaneous and average) required to achieve your goal time. There is also a pair of charts which will be updated with your projected timing and pace information.

My plan on race day is to run according to my average pace. This works well because it smooths out all the perturbations associated with tables and walking breaks.

## Losing Time at the Start

One interesting thing to play around with on the spreadsheet is the effect of losing time at the start. If you vary this number you should see that it really does not have a massive influence on your pacing requirements for the rest of the race. For example, if I change my estimate from 3 minutes to 10 minutes then my required average pace decreases from 6:02 per km to 5:57 per km. Sure, this amounts to 5 seconds shaved off every km, but it is not unmanageable: the delay at the start gets averaged out over the rest of the race.

Naturally, the faster you are hoping to finish the race, the more significant a delay at the start is going to become. However, if you are aiming for a really fast time then presumably you are in a good seeding batch. For the majority of runners it is probably not going to make an enormous difference and so it is not worth stressing about.

The important thing is to make sure that you just keep on moving forward. Don't stop. Just keep on putting one foot in front of the other.

## Other Pacing Charts

The pacing chart by Dirk Cloete is based on the profile of the route. It breaks the route down into undulating, up and down sections and takes this into account when calculating splits. Don Oliver also has some static pacing charts.

# Race Statistics for Comrades Marathon Novice Runners: Corrigendum

There was some significant bias in the histogram from my previous post: the data from all years were lumped together. This is important because as of 2003 (when the Vic Clapham medal was introduced) the final cutoff for the Comrades Marathon was extended from 11:00 to 12:00. In 2000 they also applied an extended cutoff. I have consequently partitioned the data according to "strict" and "extended" cutoffs.

> novices$extended = factor(novices$year == 2000 | novices$year >= 2003,
+                           labels = c("Strict Cutoff", "Extended Cutoff"))


This paints a much more representative picture of the distribution of finish times now that the race has been extended to 12 hours.

The allocation of medals is complicated by the fact that new medals have been introduced at different times over recent years. Specifically, the Bill Rowan medal was first awarded in 2000, then the Vic Clapham medal was introduced in 2003 and, finally, 2007 saw the first Wally Hayward medals.

> novices$period = cut(novices$year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
+                      labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
>
> novice.medals = table(novices$medal, novices$period)
> novice.medals = scale(novice.medals, scale = colSums(novice.medals), center = FALSE) * 100
> options(digits = 1)
> (novice.medals = t(novice.medals))

              Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
before 2000   0.07          0.00   4.80       0.00  95.13        0.00
2000 to 2002  0.09          0.00   2.66      12.76  84.49        0.00
2003 to 2006  0.15          0.00   4.05      17.51  47.63       30.66
after 2007    0.08          0.03   2.60      12.28  46.40       38.62


So, currently, around 46% of novices get a Bronze medal while slightly fewer, about 39%, get a Vic Clapham medal. A significant fraction, just over 12%, achieve a Bill Rowan, while only 2.6% get a Silver medal. The number of Wally Hayward and Gold medals among novices is very small indeed.
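Two details of the period table are worth checking on a toy example: where the boundary years land, and whether `scale()` really gives column percentages. The sketch below uses made-up records rather than the race data, and compares the `scale()` approach with `prop.table()`, which is a swapped-in alternative and not what was used above.

```r
# Made-up records that sit on the interval boundaries (not the race data).
year  <- c(1999, 2000, 2002, 2003, 2006, 2007, 2013)
medal <- c("Bronze", "Bronze", "Silver", "Bronze", "Vic Clapham", "Bronze", "Vic Clapham")

# right = FALSE makes each interval closed on the left: [2000, 2003), [2003, 2007), ...
period <- cut(year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
              labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
data.frame(year, period)

# Column percentages two ways: the scale() approach and prop.table().
counts   <- table(medal, period)
by_scale <- scale(counts, scale = colSums(counts), center = FALSE) * 100
by_prop  <- prop.table(counts, margin = 2) * 100
all.equal(as.vector(by_scale), as.vector(by_prop))   # TRUE
```

Because the intervals are closed on the left, 2007 itself falls in the last period, so "after 2007" should be read as "2007 onwards".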

### Acknowledgement

Thanks to Daniel for pointing out this issue!

# Race Statistics for Comrades Marathon Novice Runners

Most novice Comrades Marathon runners finish the race on their first attempt and the majority of them walk (shuffle, crawl?) away with Bronze medals.

## What is a Novice?

To paraphrase the dictionary, a novice is "a person who is new to or inexperienced in the circumstances in which he or she is placed; a beginner". In the context of the Comrades Marathon this definition can be interpreted in a few ways:

1. a runner who has never run the Comrades Marathon (has never started the race);
2. a runner who has never completed the Comrades Marathon (has never finished the race); or
3. a runner who has not completed both an "up" and a "down" Comrades Marathon.

For the purposes of this article I will be adopting the first definition. This is probably the one of most interest to runners who are embarking on their first Comrades journey.

## Identifying a Novice

I'll be using the same data sets that I have discussed in previous articles. Before we focus on the data for the novices we'll start by just retaining the fields of interest.

> novices = results[, c("key", "year", "category", "gender", "medal", "medal.count", "status", "ftime")]
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
3 100030f4 2013 Ages 20 - 29 Female        <NA>           1      DNS      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344


To satisfy our definition of novice we'll need to exclude the "did not start" (DNS) records.

> novices = subset(novices, status != "DNS")
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344
7 10007cb6 2008 Ages 30 - 39   Male  Bill Rowan           4 Finished  8.8514


Some runners do not finish the race on their first attempt but they bravely come back to run the race again. We will retain only the first record for each runner, because the second time they attempt the race they are (according to our definition) no longer novices since they already have some race experience.

> novices = novices[order(novices$year),]
> novices <- novices[which(!duplicated(novices$key)),]
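To see why sorting by year and then dropping duplicated keys leaves each runner's first attempt, here is the same idiom on a hypothetical four-row data set (the keys and years are invented):

```r
# Hypothetical mini data set: runner "a" fails in 2009 and returns in 2010.
toy <- data.frame(key    = c("a", "b", "a", "c"),
                  year   = c(2010, 2008, 2009, 2011),
                  status = c("Finished", "Finished", "DNF", "DNF"))

toy   <- toy[order(toy$year), ]        # oldest race first
first <- toy[!duplicated(toy$key), ]   # duplicated() flags every repeat after the first
first
```

Runner "a" is kept with the 2009 DNF rather than the 2010 finish, which is what the first definition of a novice requires. The `which()` wrapper in the original is harmless but not needed, since a logical index works directly.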


## Percentage of Novice Finishers

I suppose that the foremost question going through the minds of many Comrades novices is "Will I finish?".

> table(novices$status) / nrow(novices) * 100

Finished    DNF
  80.035 19.965

Well, there's some good news: around 80% of all novices finish the race. Those are quite compelling odds. Of course, a number of factors can influence the success of each individual, but if you have done the training and you run sensibly, then the odds are in your favour.

## Medal Distribution for Novice Finishers

What medal is a novice most likely to receive?

> table(novices$medal) / nrow(subset(novices, !is.na(medal))) * 100

     Gold Wally Hayward        Silver    Bill Rowan        Bronze   Vic Clapham
0.0829671     0.0051854     4.0264976     5.6469490    79.4708254    10.7675754


The vast majority (again around 80%) claim a Bronze medal. There is also a significant proportion (just over 10%) who miss the eleven hour cutoff and get a Vic Clapham medal. Around 6% of novices achieve a Bill Rowan medal and a surprisingly large fraction, just over 4%, manage to finish in a Silver medal time of under seven and a half hours. There are very few Wally Hayward and Gold medals won by novices. The odds for a novice Gold medal are around one in 1200, all else being equal (which it very definitely isn't!).
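Note that the two percentage tables use different denominators: the finishing rate is out of all novice starters, while the medal shares are out of medallists only. A sketch with five made-up records shows the distinction, and that `prop.table()` (an alternative, not what was used above) arrives at the medallist base on its own:

```r
# Made-up records: two of the five starters did not finish and have no medal.
status <- c("Finished", "Finished", "DNF", "Finished", "DNF")
medal  <- c("Bronze", "Vic Clapham", NA, "Bronze", NA)

table(status) / length(status) * 100      # share of all starters
table(medal) / sum(!is.na(medal)) * 100   # share of medallists only
prop.table(table(medal)) * 100            # same result: table() drops NA by default
```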

## Distribution of Novice Finishing Times

As one would expect, the chart slopes up towards the right: progressively more runners come in later in the day. There is very clear evidence of clustering of runners just before the medal cutoffs at 07:30, 09:00, 11:00 and 12:00. There is also a peak before the psychological cutoff at 10:00.
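The time-based cutoffs can be turned into medal bands with `cut()`. This is only a sketch: it assumes `ftime` is in decimal hours and that a runner must finish strictly inside each cutoff, and it ignores Gold and Wally Hayward medals, which are not determined by these four cutoffs alone. The finish times are the five that appear in the data extracts earlier in this article.

```r
# Finish times (decimal hours) taken from the extracts shown earlier.
ftime <- c(11.3728, 9.1589, 8.2564, 8.0344, 8.8514)

# Bands: under 7.5 h, 7.5 to 9 h, 9 to 11 h, 11 to 12 h (closed on the left).
band <- cut(ftime, breaks = c(0, 7.5, 9, 11, 12), right = FALSE,
            labels = c("Silver", "Bill Rowan", "Bronze", "Vic Clapham"))
data.frame(ftime, band)
```

The bands agree with the medals recorded for those five results: one Vic Clapham, one Bronze and three Bill Rowans.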

## Take Away Message

The data for previous years indicate that the outlook for novices is rather good. 80% of them will finish the race and, of those, around 80% will receive Bronze medals.

How can you help ensure that you have a successful race? Here are some of the things I would think about:

1. Start slowly. It's going to be a long day.
2. Take regular walking breaks and start doing this early on. A few minutes' recovery will power you up for a number of kms.
3. Stay hydrated. Take something at every water table. Just don't overdo it.
4. Be inspired by the other runners: they all have the guts to indulge in this madness with you and every one of them is fighting their own battle.
5. Enjoy the support: the hordes of people beside the road have come out to see YOU run by. And they all want you to finish.
6. Enjoy the day: as far as entertainment is concerned, the Comrades Marathon is about the best value for money that you can get.

See you in Pietermaritzburg at 05:30 on 1 June!

### Acknowledgement

Thanks to Daniel for suggesting this article.