On Friday I received my copy of The Official Results Brochure for the 2013 Comrades Marathon. Always makes for a diverting half an hour’s reading. And the tables at the front provide some very interesting statistics. Seemed like a good opportunity to update my Chart of Comrades Winners.

# Tag Archives: Comrades Marathon

# A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men's and ladies' races since 1980. Click on the image below to see a larger version.

The analysis started off with the same data set that I was working with before, from which I extracted only the records for the winners.

    > winners = subset(results, gender.position == 1, select = c(year, name, gender, race.time))
    > head(winners)
         year               name gender race.time
    1    1980          Alan Robb   Male  05:38:25
    428  1980 Isavel Roche-Kelly Female  07:18:00
    3981 1981      Bruce Fordyce   Male  05:37:28
    4055 1981 Isavel Roche-Kelly Female  06:44:35
    7643 1982      Bruce Fordyce   Male  05:34:22
    7873 1982        Cheryl Winn Female  07:04:59

I then added in a field which gives a count of the number of times each person won the race.

    > library(plyr)
    > winners = ddply(winners, .(name), function(df) {
    +     df = df[order(df$year),]
    +     df$count = 1:nrow(df)
    +     return(df)
    + })
    > subset(winners, name == "Bruce Fordyce")
       year          name gender race.time count
    7  1981 Bruce Fordyce   Male  05:37:28     1
    8  1982 Bruce Fordyce   Male  05:34:22     2
    9  1983 Bruce Fordyce   Male  05:30:12     3
    10 1984 Bruce Fordyce   Male  05:27:18     4
    11 1985 Bruce Fordyce   Male  05:37:01     5
    12 1986 Bruce Fordyce   Male  05:24:07     6
    13 1987 Bruce Fordyce   Male  05:37:01     7
    14 1988 Bruce Fordyce   Male  05:27:42     8
    15 1990 Bruce Fordyce   Male  05:40:25     9
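The same running count can also be sketched in base R, for readers who would rather not load plyr. This is only an illustration, not the code behind the chart: the small `winners` data frame is typed in from the `head(winners)` output above, and `ave()` stands in for `ddply()`.

```r
# First six winner records, copied from the head(winners) output above.
winners <- data.frame(
  year = c(1980, 1980, 1981, 1981, 1982, 1982),
  name = c("Alan Robb", "Isavel Roche-Kelly", "Bruce Fordyce",
           "Isavel Roche-Kelly", "Bruce Fordyce", "Cheryl Winn"),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Sort by year so that the running count follows chronological order.
winners <- winners[order(winners$year), ]

# Within each name, number the wins 1, 2, 3, ...
winners$count <- ave(seq_along(winners$name), winners$name, FUN = seq_along)

print(winners)
```

With only these six rows, Bruce Fordyce and Isavel Roche-Kelly each reach a count of 2 and the other two winners stay at 1.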

The chart was generated as a scatter plot using ggplot2. The size of the points relates to the number of times each person won the race. The colour scale is as you might imagine: pink for the ladies and blue for the men.

    > library(ggplot2)
    > ggplot(winners, aes(x = year, y = name, color = gender)) +
    +     geom_point(aes(size = count), shape = 19, alpha = 0.75) +
    +     scale_size_continuous(range = c(5, 15)) +
    +     ylab("") + xlab("") +
    +     scale_x_discrete(expand = c(0, 1)) +
    +     theme(
    +         axis.text.x = element_text(angle = 45, hjust = 1, colour = "black"),
    +         axis.text.y = element_text(colour = "black"),
    +         legend.position = "none",
    +         panel.background = element_blank(),
    +         panel.grid.major = element_line(linetype = "dotted", colour = "grey"),
    +         panel.grid.major.x = element_blank()
    +     )

Two of the key aspects of getting this to look just right were:

- the call to scale_size_continuous() which ensured that a reasonable range of point sizes was used and
- the call to scale_x_discrete() which expanded the plot very slightly so that the points near the borders were not cropped.

# Comrades Marathon Inference Trees

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

    > head(splits.2013)
               gender age.category drummond.time race.time   status       medal
    2013-10014   Male        50-59      5.510833        NA      DNF        <NA>
    2013-10016   Male        60-69      6.070833        NA      DNF        <NA>
    2013-10019   Male        20-29      5.335833  11.87361 Finished Vic Clapham
    2013-10031   Male        20-29      4.910833  10.94833 Finished      Bronze
    2013-10047   Male        50-59      5.076944  10.72778 Finished      Bronze
    2013-10049   Male        50-59      5.729444        NA      DNF        <NA>

Here the drummond.time and race.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).
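As a reminder of what "decimal hours" means, here is a minimal sketch of the conversion from an `HH:MM:SS` string. The helper name `to_decimal_hours` is made up for this example and is not part of the original analysis.

```r
# Convert "HH:MM:SS" strings to decimal hours.
# to_decimal_hours is a hypothetical helper name used only for illustration.
to_decimal_hours <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

# The Drummond cutoff (06:10:00) and Alan Robb's 1980 winning time (05:38:25).
print(to_decimal_hours(c("06:10:00", "05:38:25")))
```

The first value comes out at roughly 6.167, which is the Drummond cutoff time quoted in decimal hours further down.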

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

    > library(party)
    > tree.status = ctree(status ~ gender + age.category + drummond.time, data = splits.2013,
    +                     control = ctree_control(minsplit = 750))
    > tree.status

         Conditional inference tree with 17 terminal nodes

    Response:  status
    Inputs:  gender, age.category, drummond.time
    Number of observations:  13917

    1) drummond.time <= 5.669167; criterion = 1, statistic = 2985.908
      2) drummond.time <= 5.4825; criterion = 1, statistic = 494.826
        3) age.category <= 40-49; criterion = 1, statistic = 191.12
          4) drummond.time <= 5.078611; criterion = 1, statistic = 76.962
            5) gender == {Male}; criterion = 1, statistic = 73.4
              6)* weights = 5419
            5) gender == {Female}
              7)* weights = 836
          4) drummond.time > 5.078611
            8) gender == {Male}; criterion = 1, statistic = 63.347
              9) drummond.time <= 5.379722; criterion = 1, statistic = 15.55
                10)* weights = 1123
              9) drummond.time > 5.379722
                11)* weights = 447
            8) gender == {Female}
              12)* weights = 634
        3) age.category > 40-49
          13) drummond.time <= 5.038056; criterion = 1, statistic = 68.556
            14) age.category <= 50-59; criterion = 1, statistic = 40.471
              15) gender == {Female}; criterion = 1, statistic = 32.419
                16)* weights = 118
              15) gender == {Male}
                17)* weights = 886
            14) age.category > 50-59
              18)* weights = 170
          13) drummond.time > 5.038056
            19)* weights = 701
      2) drummond.time > 5.4825
        20) gender == {Male}; criterion = 1, statistic = 56.149
          21) age.category <= 40-49; criterion = 0.995, statistic = 9.826
            22)* weights = 636
          21) age.category > 40-49
            23)* weights = 259
        20) gender == {Female}
          24)* weights = 352
    1) drummond.time > 5.669167
      25) drummond.time <= 5.811389; criterion = 1, statistic = 301.482
        26) age.category <= 30-39; criterion = 1, statistic = 37.006
          27)* weights = 315
        26) age.category > 30-39
          28)* weights = 553
      25) drummond.time > 5.811389
        29) drummond.time <= 5.940556; criterion = 1, statistic = 75.164
          30) age.category <= 30-39; criterion = 1, statistic = 25.519
            31)* weights = 299
          30) age.category > 30-39
            32)* weights = 475
        29) drummond.time > 5.940556
          33)* weights = 694

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

    > plot(tree.status)

The image below is a little small. You will want to click on it to bring up a larger version.

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged 49 or younger and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hours: roughly 90% of the 5419 runners in this group finished the race.

Constructing a model for medal allocation is done in a similar fashion.

    > splits.2013.finishers = subset(splits.2013, status == "Finished" & !is.na(medal))
    > #
    > levels(splits.2013.finishers$medal) <- c("G", "WH", "S", "BR", "B", "VC")

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.
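One thing worth keeping in mind with this kind of relabelling is that assigning to `levels()` matches by position, not by name, so the short labels have to be listed in the same order as the existing levels. The sketch below uses a toy factor; the level order shown (Gold, Wally Hayward, Silver, Bill Rowan, Bronze, Vic Clapham) is an assumption, since the actual order in `splits.2013` is not printed here.

```r
# Toy factor; the level order is an assumption made for this illustration.
full.names <- c("Gold", "Wally Hayward", "Silver", "Bill Rowan", "Bronze", "Vic Clapham")
medal <- factor(c("Bronze", "Vic Clapham", "Silver", "Bronze"), levels = full.names)

# Check the existing order first: levels<- relabels by position, not by name.
print(levels(medal))

# Replace each level with its compact label, in the same order.
levels(medal) <- c("G", "WH", "S", "BR", "B", "VC")
print(as.character(medal))
```

If the existing levels were in a different order (alphabetical, say), the same assignment would silently attach the wrong short label to each medal, so printing `levels()` before relabelling is a cheap safeguard.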

    > tree.medal = ctree(medal ~ gender + age.category + drummond.time, data = splits.2013.finishers,
    +                    control = ctree_control(minsplit = 750))
    > tree.medal

         Conditional inference tree with 19 terminal nodes

    Response:  medal
    Inputs:  gender, age.category, drummond.time
    Number of observations:  10221

    1) drummond.time <= 4.124167; criterion = 1, statistic = 7452.85
      2) drummond.time <= 3.438889; criterion = 1, statistic = 1031.778
        3)* weights = 571
      2) drummond.time > 3.438889
        4) drummond.time <= 3.812222; criterion = 1, statistic = 342.628
          5) drummond.time <= 3.708056; criterion = 1, statistic = 53.658
            6)* weights = 549
          5) drummond.time > 3.708056
            7)* weights = 250
        4) drummond.time > 3.812222
          8) drummond.time <= 3.976111; criterion = 1, statistic = 37.853
            9)* weights = 386
          8) drummond.time > 3.976111
            10)* weights = 431
    1) drummond.time > 4.124167
      11) drummond.time <= 5.043611; criterion = 1, statistic = 4144.845
        12) drummond.time <= 4.55; criterion = 1, statistic = 596.673
          13) drummond.time <= 4.288333; criterion = 1, statistic = 81.996
            14)* weights = 603
          13) drummond.time > 4.288333
            15) gender == {Male}; criterion = 0.996, statistic = 10.468
              16)* weights = 993
            15) gender == {Female}
              17)* weights = 148
        12) drummond.time > 4.55
          18) drummond.time <= 4.862778; criterion = 1, statistic = 77.052
            19) gender == {Male}; criterion = 1, statistic = 34.077
              20) drummond.time <= 4.653611; criterion = 0.994, statistic = 9.583
                21)* weights = 353
              20) drummond.time > 4.653611
                22)* weights = 762
            19) gender == {Female}
              23)* weights = 237
          18) drummond.time > 4.862778
            24) gender == {Male}; criterion = 1, statistic = 45.95
              25)* weights = 756
            24) gender == {Female}
              26)* weights = 193
      11) drummond.time > 5.043611
        27) drummond.time <= 5.265833; criterion = 1, statistic = 544.833
          28) gender == {Male}; criterion = 1, statistic = 54.559
            29) drummond.time <= 5.174444; criterion = 1, statistic = 26.917
              30)* weights = 545
            29) drummond.time > 5.174444
              31)* weights = 402
          28) gender == {Female}
            32)* weights = 327
        27) drummond.time > 5.265833
          33) drummond.time <= 5.409722; criterion = 1, statistic = 88.926
            34) gender == {Male}; criterion = 1, statistic = 40.693
              35)* weights = 675
            34) gender == {Female}
              36)* weights = 277
          33) drummond.time > 5.409722
            37)* weights = 1763

Apologies for the bit of information overload. A plot brings out the salient information though.

    > plot(tree.medal)

Again you will want to click on the image below to make it legible.

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in less than 3.708 hours.

Shifting across to the other end of the plot, consider the runners who reached half-way in more than 5.266 hours. Those whose half-way time was more than 5.41 hours almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first half pace. Ladies who get to half-way at between 05:00 and 05:30 seem to have hit the sweet spot though with close to 100% success rate. Nice!

# Are Green Number Runners More Likely to Bail?

Comrades Marathon runners are awarded a permanent green race number once they have completed 10 journeys between Durban and Pietermaritzburg. For many runners, once they have completed the race a few times, achieving a green number becomes a possibility. And once the idea takes hold, it can become something of a compulsion. I can testify to this: I am thoroughly compelled! For runners with this goal in mind, every finish is one step closer to a green number. They are slowly chipping away, year after year and the idea of bailing is anathema. However, once the green number is in the bag, does the imperative to complete the race fade?

I am going to explore the hypothesis that runners with green numbers are more likely to bail.

Let’s start by looking at the proportions of runners who finish the race as opposed to those who do not finish (DNF) and those who enter but do not start (DNS). As can be seen from the plot below, the proportion of runners who finish the race seems to increase with the number of medals that the runners in question have. So, for example, of the runners with one medal, 68.6% finished while only 21.7% were DNF. For runners with ten medals, 87.1% finished and only 9.5% were DNF.

On the face of it, this seems to make sense: there is a natural selection effect. Runners who have more medals are probably a little more hard core and thus less likely to bail. Less experienced runners might be more likely to jump on the bus when the going gets really tough.

But, unfortunately, it is not quite that simple.

The analysis above has a serious problem: consider those runners with one medal. We are comparing the number of finishers (those that have just received that medal) to non-finishers (who already have a medal!). So we are not really comparing apples with apples! What we really should be working with are the number of finishers who had *i-1* medals before the race and the number of non-finishers who had *i* medals.

Compiling these data takes a little work, but nothing too taxing. Let’s consider an anonymous (but real) runner whose Comrades Marathon history looks like this:

    year   status medal.count
    1985 Finished           1
    1986 Finished           2
    1987 Finished           3
    1988 Finished           4
    1989 Finished           5
    1990 Finished           6
    1991 Finished           7
    1992      DNF           7
    1993      DNF           7
    1998      DNF           7
    1999 Finished           8
    2000 Finished           9
    2001      DNF           9
    2002      DNF           9
    2003      DNF           9
    2009      DNS           9
    2010      DNS           9
    2011      DNS           9
    2012      DNS           9
    2013      DNS           9

What we want is a table that shows how many times he ran with a given number of medals. So, for our anonymous hero, this would be:

             0 1 2 3 4 5 6 7 8 9
    Finished 1 1 1 1 1 1 1 1 1 0
    DNF      0 0 0 0 0 0 0 3 0 3
    DNS      0 0 0 0 0 0 0 0 0 5

Things went well for the first seven years. In the first year he had no medal (column 0) but he finished (so there is a 1 in the first row). The same applies for columns 1 to 5. Then in year 7 he finished, gaining his seventh medal (hence the 1 in the first row of column 6: he already had 6 medals when he ran this time!). However, for the next three years (when he already had 7 medals) he got a DNF (hence the 3 in the second row of column 7). On his fourth attempt he got medal number 8 (giving the 1 in the first row of column 7: he already had 7 medals when he ran this time!). The following year he got medal number 9 (the 1 in the first row of column 8). Then he suffered a string of 3 DNFs (the 3 in the second row of column 9), followed by a series of 5 DNSs (the 5 in the third row of column 9). To illustrate the proportions, when he had 7 medals he got DNS 0% (0/4) of the time, DNF 75% (3/4) of the time and finished 25% (1/4) of the time.
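The bookkeeping for this one runner can be sketched in a few lines of R. This is a reconstruction for illustration rather than the code used for the full analysis: the data frame is typed in from the history above, and the column name `medals.before` is made up. The key step is that a finisher's medal count before the race is one less than the count recorded after it.

```r
# The anonymous runner's history, copied from the listing above.
history <- data.frame(
  year = c(1985:1993, 1998:2003, 2009:2013),
  status = c(rep("Finished", 7), rep("DNF", 3), rep("Finished", 2),
             rep("DNF", 3), rep("DNS", 5)),
  medal.count = c(1:7, 7, 7, 7, 8, 9, 9, 9, 9, rep(9, 5))
)
history$status <- factor(history$status, levels = c("Finished", "DNF", "DNS"))

# Medals held BEFORE each race: a finish adds one medal, so subtract it again.
# (medals.before is a hypothetical column name.)
history$medals.before <- history$medal.count - (history$status == "Finished")

# Cross-tabulate outcome against medals already in the bag.
tab <- table(history$status, factor(history$medals.before, levels = 0:9))
print(tab)

# Outcome proportions for the four races run with 7 medals.
print(prop.table(tab[, "7"]))
```

The printed table reproduces the one above, and the proportions for column 7 come out at 25% finished, 75% DNF and 0% DNS.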

Those are the data for a single athlete. To make a compelling case it is necessary to compile the same statistics for many, many runners. So I generated the analogous table for all athletes who ran the race between 1984 and 2013. A melted and abridged version of the resulting data looks like this:

         status medal.count number proportion
    1  Finished           0  78051 0.83386039
    2       DNF           0  11102 0.11860858
    3       DNS           0   4449 0.04753104
    4  Finished           1  52186 0.83512298
    5       DNF           1   7336 0.11739666
    6       DNS           1   2967 0.04748036
    7  Finished           2  37478 0.83605863
    8       DNF           2   5332 0.11894617
    9       DNS           2   2017 0.04499520
    10 Finished           3  28506 0.83472914
    11      DNF           3   4072 0.11923865
    12      DNS           3   1572 0.04603221
    13 Finished           4  22814 0.83326637
    14      DNF           4   3256 0.11892326
    15      DNS           4   1309 0.04781037
    16 Finished           5  18576 0.83630470
    17      DNF           5   2585 0.11637853
    18      DNS           5   1051 0.04731677
    19 Finished           6  15538 0.83794424
    20      DNF           6   2156 0.11627029
    21      DNS           6    849 0.04578547
    22 Finished           7  13300 0.84503463
    23      DNF           7   1706 0.10839316
    24      DNS           7    733 0.04657221
    25 Finished           8  11809 0.86165633
    26      DNF           8   1339 0.09770157
    27      DNS           8    557 0.04064210
    28 Finished           9  10852 0.81215387
    29      DNF           9   1463 0.10948960
    30      DNS           9   1047 0.07835653
    31 Finished          10   7381 0.82047577
    32      DNF          10    974 0.10827034
    33      DNS          10    641 0.07125389
    61 Finished          20    784 0.80575540
    62      DNF          20     98 0.10071942
    63      DNS          20     91 0.09352518
    91 Finished          30     59 0.83098592
    92      DNF          30      9 0.12676056
    93      DNS          30      3 0.04225352
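The proportion column is simply each count divided by the total for that medal count. As a quick check, here is a minimal sketch using the first three medal counts from the table above. The name `medal.table` is borrowed from the test output further down; the row order (Finished, DNF, DNS) and the short name `props` are assumptions for this example.

```r
# Counts for runners starting with 0, 1 and 2 medals, copied from the table above.
# matrix() fills column by column, so each line below is one medal count.
medal.table <- matrix(
  c(78051, 11102, 4449,
    52186,  7336, 2967,
    37478,  5332, 2017),
  nrow = 3,
  dimnames = list(c("Finished", "DNF", "DNS"), c("0", "1", "2"))
)

# Divide each column by its total to get within-column proportions.
props <- prop.table(medal.table, margin = 2)
print(round(props, 4))
```

Each column of `props` sums to 1, and the DNF entry for column 0 agrees with the 0.11860858 listed in the table.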

The important information here is the proportion of DNF entries for each medal count. We can see that 11.9% (0.11860858) of runners were DNF the first time that they ran. Similarly, of those runners who had already completed the race once (so they had one medal in the bag), 11.7% (0.11739666) did not finish. Of those who ran again after just achieving a green number, 10.8% (0.10827034) were DNF. It will be easier to make sense of all this in a plot.

Wow! Now that is interesting. Just to be sure that everything is clear about this plot: every column reflects the proportions of finishers, DNFs and DNSs who **already had** a given number of medals. There are a number of intriguing things about these data:

- all three proportions remain almost identical for runners who already had between 0 and 6 medals;
- the proportion of finishers then starts to ramp up for those with 7 and 8 medals (the DNS proportion remains unchanged, the DNFs decrease);
- there is a decrease in the proportion of finishers who already have 9 medals and a corresponding increase in the proportion of DNSs, while the DNFs remain unchanged;
- the proportion of finishers then increases slightly for those who already have 10 medals.

What conclusions can we draw from this? The second point seems to indicate a growing level of determination: these athletes are really close to their green number and they are less likely to sacrifice their medal. The third point is interesting too: the proportion of DNFs stays roughly the same but the DNS percentage grows from 4.1% for those with 8 medals to 7.8% for those with 9 medals. Why would this be? Well, I am really not sure and I would welcome suggestions. One possibility is that these runners are determined to have a good race so they might overtrain and end up injured or ill.

Are the differences in the proportion of DNFs statistically significant?

        31-sample test for equality of proportions without continuity correction

    data:  medal.table[2, 1:31] out of colSums(medal.table[, 1:31])
    X-squared = 139.4798, df = 30, p-value = 4.744e-16
    alternative hypothesis: two.sided
    sample estimates:
        prop 1     prop 2     prop 3     prop 4     prop 5     prop 6     prop 7     prop 8     prop 9    prop 10
    0.11860858 0.11739666 0.11894617 0.11923865 0.11892326 0.11637853 0.11627029 0.10839316 0.09770157 0.10948960
       prop 11    prop 12    prop 13    prop 14    prop 15    prop 16    prop 17    prop 18    prop 19    prop 20
    0.10827034 0.10204696 0.10013936 0.10500000 0.11237335 0.10784314 0.11079137 0.10659026 0.09327846 0.11298606
       prop 21    prop 22    prop 23    prop 24    prop 25    prop 26    prop 27    prop 28    prop 29    prop 30
    0.10071942 0.10404624 0.09890110 0.09684685 0.14473684 0.10833333 0.14358974 0.07284768 0.14285714 0.16379310
       prop 31
    0.12676056
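The `data:` line in this output points to a call of the form `prop.test(medal.table[2, 1:31], colSums(medal.table[, 1:31]))`. The sketch below rebuilds only the first eleven columns (0 to 10 medals) from the abridged table earlier, so its test statistic will not match the 31-sample result above. The row order (Finished, DNF, DNS) is assumed from that table.

```r
# Counts for runners with 0 to 10 medals, copied from the abridged table.
medal.table <- rbind(
  Finished = c(78051, 52186, 37478, 28506, 22814, 18576, 15538, 13300, 11809, 10852, 7381),
  DNF      = c(11102,  7336,  5332,  4072,  3256,  2585,  2156,  1706,  1339,  1463,  974),
  DNS      = c( 4449,  2967,  2017,  1572,  1309,  1051,   849,   733,   557,  1047,  641)
)
colnames(medal.table) <- 0:10

# Test whether the DNF proportion is the same for every medal count.
# x = number of DNFs, n = number of entries, one element per medal count.
result <- prop.test(medal.table["DNF", ], colSums(medal.table))
print(result)
```

Even on this reduced table the test has 10 degrees of freedom and a very small p-value, driven largely by the low DNF proportion among runners with 8 medals.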

The minuscule p-value from the proportion test indicates that there definitely is a significant difference in the proportion of DNFs across the entire data set (for those with between 0 and 30 medals). But it does not tell us anything about which of the proportions are responsible for this difference. We can get some information about this from a pairwise proportion test. Here is the abridged output.

        Pairwise comparisons using Pairwise comparison of proportions

    data:  medal.table[2, 1:31] out of colSums(medal.table[, 1:31])

       0       1       2       3       4       5       6       7       8       9       10      11      12      13      14      15
    1  1.000   -       -       -       -       -       -       -       -       -       -       -       -       -       -       -
    2  1.000   1.000   -       -       -       -       -       -       -       -       -       -       -       -       -       -
    3  1.000   1.000   1.000   -       -       -       -       -       -       -       -       -       -       -       -       -
    4  1.000   1.000   1.000   1.000   -       -       -       -       -       -       -       -       -       -       -       -
    5  1.000   1.000   1.000   1.000   1.000   -       -       -       -       -       -       -       -       -       -       -
    6  1.000   1.000   1.000   1.000   1.000   1.000   -       -       -       -       -       -       -       -       -       -
    7  0.107   0.734   0.179   0.205   0.457   1.000   1.000   -       -       -       -       -       -       -       -       -
    8  4.8e-10 2.5e-08 3.8e-09 9.0e-09 6.4e-08 1.8e-05 5.8e-05 1.000   -       -       -       -       -       -       -       -
    9  1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   0.689   -       -       -       -       -       -       -
    10 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   -       -       -       -       -       -
    11 0.025   0.099   0.031   0.032   0.056   0.579   0.780   1.000   1.000   1.000   1.000   -       -       -       -       -
    12 0.038   0.117   0.042   0.042   0.066   0.506   0.651   1.000   1.000   1.000   1.000   1.000   -       -       -       -
    13 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   -       -       -
    14 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   -       -
    15 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   -
    16 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000

For between 0 and 6 medals there is no significant difference (p-value is roughly 1). The DNF proportion for those with 7 medals does start to differ from those with 4 medals or fewer, but the p-values are not significant. When we get to athletes who have 8 medals there is a significant difference in the proportion of DNFs all the way from those with 0 medals to those with 6 medals. However, the proportion of DNFs for those with 9 medals is not significantly different from any of the other categories. Finally, the DNF proportion for those athletes who already have 10 medals does not differ significantly from the athletes with any number of fewer medals.
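A table of this kind comes from `pairwise.prop.test()` in base R. The sketch below runs it on the 0 to 10 medal columns of the abridged table, using the function's default Holm adjustment for multiple comparisons. Which adjustment was used for the output above is not stated, so treat that choice as an assumption, and expect the adjusted p-values to differ because fewer pairs are being compared here.

```r
# Counts for runners with 0 to 10 medals, copied from the abridged table.
medal.table <- rbind(
  Finished = c(78051, 52186, 37478, 28506, 22814, 18576, 15538, 13300, 11809, 10852, 7381),
  DNF      = c(11102,  7336,  5332,  4072,  3256,  2585,  2156,  1706,  1339,  1463,  974),
  DNS      = c( 4449,  2967,  2017,  1572,  1309,  1051,   849,   733,   557,  1047,  641)
)
colnames(medal.table) <- 0:10

# Compare the DNF proportion for every pair of medal counts.
# The default p.adjust.method is "holm".
pairwise <- pairwise.prop.test(medal.table["DNF", ], colSums(medal.table))
print(pairwise)

# Individual adjusted p-values can be pulled out by name: row = later group,
# column = earlier group.
print(pairwise$p.value["8", "0"])
```

The pattern matches the discussion above: the comparison of 8 medals against 0 medals is clearly significant, while neighbouring low medal counts such as 1 versus 0 are not.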

So, no, it does not seem that runners with green numbers are more likely to bail (a conclusion that makes me personally very happy!). And good luck to the anonymous runner: I hope that you will be back in 2014 and that you will crack your green number!

Oh, and one last thing: as I mentioned before, the analysis above is based on the period 1984 to 2013. There are some serious issues with the data in the earlier years. Here is a breakdown of the number of runners in each of the categories across the years:

         Finished  DNF  DNS
    1984     7105    2    0
    1985     8192 1907    1
    1986     9654 1793    0
    1987     8376 2458    0
    1988    10363 1934    0
    1989    10505 3065    2
    1990    10272 1351    2
    1991    12082 2936    1
    1992    10695 2533    5
    1993    11322 2270    2
    1994    10274 2428    3
    1995    10541 2990    1
    1996    11269 2277    2
    1997    11365 2467    3
    1998    10496 2874    5
    1999    11291 2835    3
    2000    20030 4508    7
    2001    11090 4270    1
    2002     9027 2276  863
    2003    11416 1065  892
    2004    10123 1925    9
    2005    11729 2163    7
    2006     9846 1194 1025
    2007    10052 1084  868
    2008     8631 1745  813
    2009    10008 1501 1441
    2010    14339 2226 7000
    2011    11058 2023 6506
    2012    11889 1739 5916
    2013    10278 3643 5986

Certainly something is deeply wrong in 1984! In the early years it does not make any sense to discriminate between DNF and DNS since there were no independent records kept: we simply know whether or not an athlete finished. The introduction of the ChampionChip timing devices improved the quality of the data dramatically. These chips have been used by all Comrades Marathon runners since 1997 although there is a delayed effect on the quality of the data.

Despite these issues, the conclusions of the analysis above remain essentially unchanged if you simply lump the DNF and DNS data together (because we cannot always make a meaningful divide between them!).

# The Green Number Effect

Following up on a suggestion from my previous post, here are the statistics for medal count versus age. Every point on the plot is the number (see colour legend on right) of athletes who have achieved a given number of medals by a particular age.

There is clear evidence of a Green Number Effect: many people hang on for ten medals and then pack it in. There is also weaker evidence of a Double Green Number Effect. But evidently there are far fewer people with that kind of commitment or level of craziness.

What about the influence of the Back-2-Back medals introduced in 2005? If you look carefully at the plot above then you can see some evidence. However, a simple histogram of medal counts makes the effect irrefutable.

Thanks for the idea, Tilda.

# Age Distribution of Comrades Marathon Athletes

I can clearly remember watching the end of the 1989 Comrades Marathon on television and seeing Wally Hayward coming in just before the final gun, completing the epic race at the age of 80! I was in awe.

Since I have been delving into the Comrades Marathon data, this got me thinking about the typical age distribution of athletes taking part. The plot below indicates the ages of athletes who finished the race, going all the way back to 1984. You can clearly spot the two years when Wally Hayward ran (1988 and 1989). My data indicates that he was only 79 on the day of the 1989 Comrades Marathon, but I am not going to quibble over a year and I am more than happy to accept that he was 80!

It is interesting to see that there is a consistent increase in the ages of both male and female finishers, as reflected by both the median and interquartile range (IQR).

The detailed distribution of ages across the period 1984 to 2013 is shown below. The median age of finishers is 37 years. Although in recent years the minimum age has been set at 20, in earlier times younger athletes were allowed to run the race. There are a significant number of runners in their 60s, but far fewer in their 70s. Only 163 runners older than 70 have finished the race since 1984.

What about the effect of age on individual finish times? Men appear to perform best between 20 and 30 years of age, with a gradual but consistent decrease in performance with advancing years. Things are not quite as clear cut with the female runners, where those in the 30 to 40 age bracket appear to perform fractionally better than those between the ages of 20 and 30.

Naturally these race times translate into medal allocations. The mosaic plot below shows both the distribution of runners across the various age categories as well as the medal allocations within those categories. The majority of runners are between 30 and 40 years of age and the most commonly awarded medal is the Bronze.

Finally, a breakdown of the gross number of medals awarded between 1984 and 2013. This includes data for the last 30 years and so is an extension of my previous analysis. Here it must be borne in mind that the Bill Rowan medal was only introduced in 2000, the Vic Clapham medal in 2003 and the Wally Hayward medal in 2007.

# Medal Allocations at the Comrades Marathon

Following up on my previous post regarding attrition rates at Comrades Marathon 2013, here are the statistics I have gathered for medal allocations. For reference, the medals are allocated as follows:

- Gold medals to the first ten finishers in the men’s race and the ladies’ race;
- Wally Hayward medals to finishers in under 06:00;
- Silver medals to finishers under 07:30;
- Bill Rowan medals to finishers under 09:00;
- Bronze medals to finishers under 11:00; and finally
- Vic Clapham medals to finishers before the final gun at 12:00.

This will be followed in a couple of days by an analysis of the relationship between running a negative split and finishing time.

# Comrades Marathon Attrition Rate

It is a bit of a mission to get the complete data set for this year’s Comrades Marathon. The full results are easily accessible, but come as an HTML file. Embedded in this file are links to the splits for individual athletes. So with a bit of scripting wizardry it is also possible to download the HTML files for each of the individual athletes. Parsing all of these yields the complete result set, which is the starting point for this analysis.

The first interesting thing that I found was that according to the main results page there were 19907 entrants (this is also the number quoted in the 2013 Comrades Marathon Highlights). However, there were only detailed data for 19903 individual athletes. This immediately aroused my suspicions, so I had a look for duplicate race numbers and, guess what? Yup! There were four: 57234, 54243, 16266 and 25315. If you don’t believe me, check out the results for yourself. Here are the relevant data:

| Position | Race Number | Name | Time |
|---|---|---|---|
| 1980 | 57234 | Izelle Pretorius | 09:16:02 |
| 1981 | 57234 | Justin Powrie | 09:16:02 |
| 3179 | 54243 | Daniel Matseme | 09:56:55 |
| 3180 | 54243 | Headman Magadeni | 09:56:55 |
| 3786 | 16266 | Doctor Masina | 10:17:56 |
| 3787 | 16266 | Doctor Patrick Masina | 10:17:56 |
|  | 25315 | Paulus Mpho | DNF |
|  | 25315 | Ludwe Tsoliwe | DNF |

That’s interesting: for each duplicated race number there are two names, both of which have the same finishing time and are assigned independent positions in the field. I don’t know what has happened here, but there is clearly a glitch in the data being provided by the CMA. Logic suggests that in each case there was in fact just one runner and so the overall position data are not correct. Not a big issue, but if you came in after position 1980, then your real position may be out by a few places.
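Finding the duplicates takes only a couple of lines in R. In this minimal sketch the eight duplicated rows are copied from the table above, and one extra row (race number 99999, "Example Runner") is made up purely so that there is something that should not be flagged. The column names are assumptions too.

```r
# Duplicated entries from the table above, plus one invented unique entry.
results <- data.frame(
  race.number = c(57234, 57234, 54243, 54243, 16266, 16266, 25315, 25315,
                  99999),   # 99999 is a made-up race number
  name = c("Izelle Pretorius", "Justin Powrie", "Daniel Matseme",
           "Headman Magadeni", "Doctor Masina", "Doctor Patrick Masina",
           "Paulus Mpho", "Ludwe Tsoliwe",
           "Example Runner"),  # hypothetical runner
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each value.
dup.numbers <- unique(results$race.number[duplicated(results$race.number)])
print(dup.numbers)

# Pull out every record that shares one of those race numbers.
print(results[results$race.number %in% dup.numbers, ])
```

Because `duplicated()` only marks repeat occurrences, the `%in%` step is what brings back both records for each race number.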

Moving on to something more relevant: attrition. Of the 19903 *independent* entrants, I find that only 10185 finished. Again this number differs from the official number by 3 (this is because of the duplication issue mentioned above!). But many of those entrants didn’t even start the race. There were 6008 entrants who did not make it to the City Hall in Durban on Sunday morning. Of the 13895 athletes who were there when the start gun went off, only 10183 made it to the finish line before the 12 hour cutoff. This means that the total attrition rate was 26.7%: just over one quarter of the field didn’t make it! In view of the carnage that I witnessed on Sunday, I would have expected this number to be a lot higher!
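For anyone who wants to check the arithmetic, the attrition rate is the fraction of starters who did not reach the finish. A minimal sketch using the numbers quoted above (the variable names are just for illustration):

```r
# Figures quoted in the text above.
entrants      <- 19903   # independent entrants
did.not.start <- 6008    # entrants who never made it to the start
finishers     <- 10183   # starters who finished inside 12 hours

# Starters are the entrants who actually lined up.
starters <- entrants - did.not.start

# Attrition: share of starters who did not finish.
attrition <- 1 - finishers / starters

cat(sprintf("Starters: %d, attrition: %.1f%%\n", starters, 100 * attrition))
```

This prints 13895 starters and an attrition rate of 26.7%.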

Let’s break this down by gender. The figure below shows the proportion of athletes who did not start (DNS), did not finish (DNF), and who did actually finish the race as a function of gender. The DNS data are the categories “not yet started”, “pre-race withdrawal” and “substituted”. The DNF data also include “disqualified” and “started and running”.

So what can we take away from this plot? Here are the main points:

- men made up 78.0% of the entrants;
- women accounted for 20.3% of those that crossed the start line;
- men made up 80.8% of the finishers.

The proportions are rather consistent! But this is only one way of looking at the data. What about if we consider the proportions within each gender? Then the picture is slightly different:

- 28.7% of the male entrants did not start the race (compared with 35.6% of the females);
- 74.3% of the males who started also reached the finish line before the gun (as opposed to 69.4% for females).

I am not going to interpret these results any further. I know which side my bread is buttered. Draw your own conclusions.

Next we look at the same data but broken down according to age category. Here the 40-49 age group was the best for getting to the starting line. Obviously they (and I include myself here) have learned that if you don’t start, then you certainly can’t finish! Ahem. Moving on. Of those that did start, runners in the 20-29 age group fared the best with 81.3% finishing. Things got progressively worse from there with the percentage of finishers dropping from 79.6% in the 30-39 group, to 73.5% in the 40-49 group, 61.6% in the 50-59 group and only 45.1% in the 60 and older group. Still damn impressive for the senior runners, but the youngsters appear to have fared best on the day. Perhaps they are more tolerant to warm weather?

Now, let’s put all of this together, looking at gender, age group and finishing status. There is a lot more information and it is a little difficult to make sense of all of it at once. But here are the salient points:

- men in all three of the 30-39, 40-49 and 50-59 age groups were equally likely to start;
- men in the 30-39 age group were most likely to finish;
- among the women, those in the 40-49 age group were the most likely to start;
- of the women that did start, the 30-39 age group was most likely to finish.

Looks like 30-39 is the prime time to be running the Comrades. That’s not to say that I am past my prime. Hell no! Not at all.

Over the next few days I will look at the following questions:

- what is the effect of running a negative split on overall time? and
- how does the finishing rate vary with time? Is there evidence of a “diamond carat” effect?