Tag Archives: Comrades Marathon

A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men’s and ladies’ races since 1980. Click on the image below to see a larger version.

winners-scatterchart

The analysis started off with the same data set that I was working with before, from which I extracted only the records for the winners.

> winners = subset(results, gender.position == 1, select = c(year, name, gender, race.time))
> head(winners)
     year               name gender race.time
1    1980          Alan Robb   Male  05:38:25
428  1980 Isavel Roche-Kelly Female  07:18:00
3981 1981      Bruce Fordyce   Male  05:37:28
4055 1981 Isavel Roche-Kelly Female  06:44:35
7643 1982      Bruce Fordyce   Male  05:34:22
7873 1982        Cheryl Winn Female  07:04:59

I then added in a field which gives a count of the number of times each person won the race.

> library(plyr)
> winners = ddply(winners, .(name), function(df) {
+     df = df[order(df$year),]
+     df$count = 1:nrow(df)
+     return(df)
+ })
> subset(winners, name == "Bruce Fordyce")
   year          name gender race.time count
7  1981 Bruce Fordyce   Male  05:37:28     1
8  1982 Bruce Fordyce   Male  05:34:22     2
9  1983 Bruce Fordyce   Male  05:30:12     3
10 1984 Bruce Fordyce   Male  05:27:18     4
11 1985 Bruce Fordyce   Male  05:37:01     5
12 1986 Bruce Fordyce   Male  05:24:07     6
13 1987 Bruce Fordyce   Male  05:37:01     7
14 1988 Bruce Fordyce   Male  05:27:42     8
15 1990 Bruce Fordyce   Male  05:40:25     9
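The same running count can also be built without plyr. What follows is only a minimal base R sketch, using the six rows from the head(winners) output above as toy data; it is an alternative to the ddply() call, not the code that produced the chart.

```r
# Toy data: the six rows shown in the head(winners) output above.
winners <- data.frame(
  year   = c(1980, 1980, 1981, 1981, 1982, 1982),
  name   = c("Alan Robb", "Isavel Roche-Kelly", "Bruce Fordyce",
             "Isavel Roche-Kelly", "Bruce Fordyce", "Cheryl Winn"),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Sort by year so that the running count follows chronological order.
winners <- winners[order(winners$year), ]

# ave() applies seq_along() within each name: 1 for a first win, 2 for a second, ...
winners$count <- ave(winners$year, winners$name, FUN = seq_along)

winners
```

With these rows Isavel Roche-Kelly and Bruce Fordyce each reach a count of 2, while the others stay at 1.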

The chart was generated as a scatter plot using ggplot2. The size of the points relates to the number of times each person won the race. The colour scale is as you might imagine: pink for the ladies and blue for the men.

> library(ggplot2)
> ggplot(winners, aes(x = year, y = name, color = gender)) +
+     geom_point(aes(size = count), shape = 19, alpha = 0.75) +
+     scale_size_continuous(range = c(5, 15)) +
+     ylab("") + xlab("") +
+     scale_x_discrete(expand = c(0, 1)) +
+     theme(
+         axis.text.x = element_text(angle = 45, hjust = 1, colour = "black"),
+         axis.text.y = element_text(colour = "black"),
+         legend.position = "none",
+         panel.background = element_blank(),
+         panel.grid.major = element_line(linetype = "dotted", colour = "grey"),
+         panel.grid.major.x = element_blank()
+         )

Two of the key aspects of getting this to look just right were:

  • the call to scale_size_continuous() which ensured that a reasonable range of point sizes was used and
  • the call to scale_x_discrete() which expanded the plot very slightly so that the points near the borders were not cropped.

Comrades Marathon Inference Trees

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict the likelihood of finishing and the probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

> head(splits.2013)
           gender age.category drummond.time race.time   status       medal
2013-10014   Male        50-59      5.510833        NA      DNF        <NA>
2013-10016   Male        60-69      6.070833        NA      DNF        <NA>
2013-10019   Male        20-29      5.335833  11.87361 Finished Vic Clapham
2013-10031   Male        20-29      4.910833  10.94833 Finished      Bronze
2013-10047   Male        50-59      5.076944  10.72778 Finished      Bronze
2013-10049   Male        50-59      5.729444        NA      DNF        <NA>

Here the drummond.time and race.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).
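For anybody wanting to reproduce the decimal hours, a conversion from HH:MM:SS strings might look something like the sketch below. The function name to_decimal_hours is made up for illustration and is not part of the original analysis.

```r
# Hypothetical helper: convert "HH:MM:SS" strings to decimal hours.
to_decimal_hours <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  sapply(parts, function(p) {
    p <- as.numeric(p)
    # hours + minutes / 60 + seconds / 3600
    p[1] + p[2] / 60 + p[3] / 3600
  })
}

# The Drummond cutoff of 06:10:00 comes out at roughly 6.167 hours.
to_decimal_hours(c("06:10:00", "05:38:25"))
```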

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

> library(party)
> tree.status = ctree(status ~ gender + age.category + drummond.time, data = splits.2013,
+                     control = ctree_control(minsplit = 750))
> tree.status

	 Conditional inference tree with 17 terminal nodes

Response:  status 
Inputs:  gender, age.category, drummond.time 
Number of observations:  13917 

1) drummond.time <= 5.669167; criterion = 1, statistic = 2985.908
  2) drummond.time <= 5.4825; criterion = 1, statistic = 494.826
    3) age.category <= 40-49; criterion = 1, statistic = 191.12
      4) drummond.time <= 5.078611; criterion = 1, statistic = 76.962
        5) gender == {Male}; criterion = 1, statistic = 73.4
          6)*  weights = 5419 
        5) gender == {Female}
          7)*  weights = 836 
      4) drummond.time > 5.078611
        8) gender == {Male}; criterion = 1, statistic = 63.347
          9) drummond.time <= 5.379722; criterion = 1, statistic = 15.55
            10)*  weights = 1123 
          9) drummond.time > 5.379722
            11)*  weights = 447 
        8) gender == {Female}
          12)*  weights = 634 
    3) age.category > 40-49
      13) drummond.time <= 5.038056; criterion = 1, statistic = 68.556
        14) age.category <= 50-59; criterion = 1, statistic = 40.471
          15) gender == {Female}; criterion = 1, statistic = 32.419
            16)*  weights = 118 
          15) gender == {Male}
            17)*  weights = 886 
        14) age.category > 50-59
          18)*  weights = 170 
      13) drummond.time > 5.038056
        19)*  weights = 701 
  2) drummond.time > 5.4825
    20) gender == {Male}; criterion = 1, statistic = 56.149
      21) age.category <= 40-49; criterion = 0.995, statistic = 9.826
        22)*  weights = 636 
      21) age.category > 40-49
        23)*  weights = 259 
    20) gender == {Female}
      24)*  weights = 352 
1) drummond.time > 5.669167
  25) drummond.time <= 5.811389; criterion = 1, statistic = 301.482
    26) age.category <= 30-39; criterion = 1, statistic = 37.006
      27)*  weights = 315 
    26) age.category > 30-39
      28)*  weights = 553 
  25) drummond.time > 5.811389
    29) drummond.time <= 5.940556; criterion = 1, statistic = 75.164
      30) age.category <= 30-39; criterion = 1, statistic = 25.519
        31)*  weights = 299 
      30) age.category > 30-39
        32)*  weights = 475 
    29) drummond.time > 5.940556
      33)*  weights = 694 

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

> plot(tree.status)

The image below is a little small. You will want to click on it to bring up a larger version.

ctree-status-2013

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged 49 or younger and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hours: roughly 90% of the 5419 runners in this group finished the race.
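Following a branch of the tree amounts to filtering the data on each split along the way. The sketch below retraces the path to Node 28 using the split values printed above. The six runners in it are invented purely for illustration, and the full set of age.category levels is an assumption, so the resulting numbers say nothing about the real race.

```r
# Assumed ordering of the age categories (only some appear in the output above).
age.levels <- c("20-29", "30-39", "40-49", "50-59", "60-69")

# Invented toy runners, NOT real race data.
toy <- data.frame(
  age.category  = factor(c("20-29", "40-49", "50-59", "40-49", "30-39", "60-69"),
                         levels = age.levels, ordered = TRUE),
  drummond.time = c(5.70, 5.75, 5.80, 5.68, 5.72, 5.90),
  status        = c("Finished", "DNF", "Finished", "DNF", "Finished", "DNF")
)

# Path to Node 28: drummond.time > 5.669167, then <= 5.811389, then age.category > 30-39.
node28 <- subset(toy, drummond.time > 5.669167 &
                      drummond.time <= 5.811389 &
                      age.category > "30-39")

nrow(node28)                       # number of toy runners landing in this node
mean(node28$status == "Finished")  # proportion of them who finished
```

Applied to the real splits.2013 data, the same filter should pick out the 553 runners reported for that node.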

Constructing a model for medal allocation is done in a similar fashion.

> splits.2013.finishers = subset(splits.2013, status == "Finished" & !is.na(medal))
> # Abbreviate the medal names (Gold, Wally Hayward, Silver, Bill Rowan, Bronze, Vic Clapham).
> levels(splits.2013.finishers$medal) <- c("G", "WH", "S", "BR", "B", "VC")

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.

> tree.medal = ctree(medal ~ gender + age.category + drummond.time, data = splits.2013.finishers,
+                    control = ctree_control(minsplit = 750))
> tree.medal

	 Conditional inference tree with 19 terminal nodes

Response:  medal 
Inputs:  gender, age.category, drummond.time 
Number of observations:  10221 

1) drummond.time <= 4.124167; criterion = 1, statistic = 7452.85
  2) drummond.time <= 3.438889; criterion = 1, statistic = 1031.778
    3)*  weights = 571 
  2) drummond.time > 3.438889
    4) drummond.time <= 3.812222; criterion = 1, statistic = 342.628
      5) drummond.time <= 3.708056; criterion = 1, statistic = 53.658
        6)*  weights = 549 
      5) drummond.time > 3.708056
        7)*  weights = 250 
    4) drummond.time > 3.812222
      8) drummond.time <= 3.976111; criterion = 1, statistic = 37.853
        9)*  weights = 386 
      8) drummond.time > 3.976111
        10)*  weights = 431 
1) drummond.time > 4.124167
  11) drummond.time <= 5.043611; criterion = 1, statistic = 4144.845
    12) drummond.time <= 4.55; criterion = 1, statistic = 596.673
      13) drummond.time <= 4.288333; criterion = 1, statistic = 81.996
        14)*  weights = 603 
      13) drummond.time > 4.288333
        15) gender == {Male}; criterion = 0.996, statistic = 10.468
          16)*  weights = 993 
        15) gender == {Female}
          17)*  weights = 148 
    12) drummond.time > 4.55
      18) drummond.time <= 4.862778; criterion = 1, statistic = 77.052
        19) gender == {Male}; criterion = 1, statistic = 34.077
          20) drummond.time <= 4.653611; criterion = 0.994, statistic = 9.583
            21)*  weights = 353 
          20) drummond.time > 4.653611
            22)*  weights = 762 
        19) gender == {Female}
          23)*  weights = 237 
      18) drummond.time > 4.862778
        24) gender == {Male}; criterion = 1, statistic = 45.95
          25)*  weights = 756 
        24) gender == {Female}
          26)*  weights = 193 
  11) drummond.time > 5.043611
    27) drummond.time <= 5.265833; criterion = 1, statistic = 544.833
      28) gender == {Male}; criterion = 1, statistic = 54.559
        29) drummond.time <= 5.174444; criterion = 1, statistic = 26.917
          30)*  weights = 545 
        29) drummond.time > 5.174444
          31)*  weights = 402 
      28) gender == {Female}
        32)*  weights = 327 
    27) drummond.time > 5.265833
      33) drummond.time <= 5.409722; criterion = 1, statistic = 88.926
        34) gender == {Male}; criterion = 1, statistic = 40.693
          35)*  weights = 675 
        34) gender == {Female}
          36)*  weights = 277 
      33) drummond.time > 5.409722
        37)*  weights = 1763 

Apologies for the bit of information overload. A plot brings out the salient information though.

> plot(tree.medal)

Again you will want to click on the image below to make it legible.

ctree-medal-2013

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in between 3.439 and 3.708 hours.

Shifting across to the other end of the plot, consider the runners who reached half-way in more than 5.266 hours. These are further divided at 5.41 hours: those whose half-way time was more than 5.41 hours almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take-home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first-half pace. Ladies who get to half-way between 05:00 and 05:30 seem to have hit the sweet spot though, with a success rate close to 100%. Nice!

Are Green Number Runners More Likely to Bail?

Comrades Marathon runners are awarded a permanent green race number once they have completed 10 journeys between Durban and Pietermaritzburg. For many runners, once they have completed the race a few times, achieving a green number becomes a possibility. And once the idea takes hold, it can become something of a compulsion. I can testify to this: I am thoroughly compelled! For runners with this goal in mind, every finish is one step closer to a green number. They are slowly chipping away, year after year and the idea of bailing is anathema. However, once the green number is in the bag, does the imperative to complete the race fade?

I am going to explore the hypothesis that runners with green numbers are more likely to bail.

Let’s start by looking at the proportions of runners who finish the race as opposed to those who do not finish (DNF) and those who enter but do not start (DNS). As can be seen from the plot below, the proportion of runners who finish the race seems to increase with the number of medals that the runners in question have. So, for example, of the runners with one medal, 68.6% finished while only 21.7% were DNF. For runners with ten medals, 87.1% finished and only 9.5% were DNF.

On the face of it, this seems to make sense: there is a natural selection effect. Runners who have more medals are probably a little more hard core and thus less likely to bail. Less experienced runners might be more likely to jump on the bus when the going gets really tough.

But, unfortunately, it is not quite that simple.

Proportion of runners who finished, did not finish and did not start as a function of number of medals.

The analysis above has a serious problem: consider those runners with one medal. We are comparing the number of finishers (those that have just received that medal) to non-finishers (who already have a medal!). So we are not really comparing apples with apples! What we really should be comparing is the number of finishers who already had i medals before the race with the number of non-finishers who had i medals.

Compiling these data takes a little work, but nothing too taxing. Let’s consider an anonymous (but real) runner whose Comrades Marathon history looks like this:

year   status medal.count
1985 Finished           1
1986 Finished           2
1987 Finished           3
1988 Finished           4
1989 Finished           5
1990 Finished           6
1991 Finished           7
1992      DNF           7
1993      DNF           7
1998      DNF           7
1999 Finished           8
2000 Finished           9
2001      DNF           9
2002      DNF           9
2003      DNF           9
2009      DNS           9
2010      DNS           9
2011      DNS           9
2012      DNS           9
2013      DNS           9

What we want is a table that shows how many times he ran with a given number of medals. So, for our anonymous hero, this would be:

           0 1 2 3 4 5 6 7 8 9
  Finished 1 1 1 1 1 1 1 1 1 0
  DNF      0 0 0 0 0 0 0 3 0 3
  DNS      0 0 0 0 0 0 0 0 0 5

Things went well for the first seven years. In the first year he had no medal (column 0) but he finished (so there is a 1 in the first row). The same applies for columns 1 to 6. Then in year 7 he finished, gaining his seventh medal (hence the 1 in the first row of column 6: he already had 6 medals when he ran this time!). However, for the next three years (when he already had 7 medals) he got a DNF (hence the 3 in the second row of column 7). On his fourth attempt he got medal number 8 (giving the 1 in the first row of column 7: he already had 7 medals when he ran this time!). And the following year he got medal number 9. Then he suffered a string of 3 DNFs (the 3 in the second row of column 9), followed by a series of 5 DNSs (the 5 in the third row of column 9). To illustrate the proportions, when he had 7 medals he got DNS 0% (0/4) of the time, DNF 75% (3/4) of the time and finished 25% (1/4) of the time.
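The table for our anonymous runner can be rebuilt from his history in a few lines of base R. This is just a sketch of one way to do it; the column name medals.before is my own invention. The trick is that a finisher's medal count after the race is one more than the count he started the race with.

```r
# The anonymous runner's history, copied from the listing above.
history <- data.frame(
  year        = c(1985:1993, 1998:2003, 2009:2013),
  status      = c(rep("Finished", 7), rep("DNF", 3), rep("Finished", 2),
                  rep("DNF", 3), rep("DNS", 5)),
  medal.count = c(1:7, 7, 7, 7, 8, 9, 9, 9, 9, rep(9, 5))
)
history$status <- factor(history$status, levels = c("Finished", "DNF", "DNS"))

# Medals held BEFORE each race: a finish means the count was one lower beforehand.
history$medals.before <- history$medal.count - (history$status == "Finished")

# Cross-tabulate status against medals held before the race (columns 0 to 9).
medal.table <- table(history$status, factor(history$medals.before, levels = 0:9))
medal.table

# Column-wise proportions; column "7" gives Finished 0.25, DNF 0.75, DNS 0.
prop.table(medal.table, margin = 2)[, "7"]
```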

Those are the data for a single athlete. To make a compelling case it is necessary to compile the same statistics for many, many runners. So I generated the analogous table for all athletes who ran the race between 1984 and 2013. A melted and abridged version of the resulting data looks like this:

     status medal.count number proportion
1  Finished           0  78051 0.83386039
2       DNF           0  11102 0.11860858
3       DNS           0   4449 0.04753104
4  Finished           1  52186 0.83512298
5       DNF           1   7336 0.11739666
6       DNS           1   2967 0.04748036
7  Finished           2  37478 0.83605863
8       DNF           2   5332 0.11894617
9       DNS           2   2017 0.04499520
10 Finished           3  28506 0.83472914
11      DNF           3   4072 0.11923865
12      DNS           3   1572 0.04603221
13 Finished           4  22814 0.83326637
14      DNF           4   3256 0.11892326
15      DNS           4   1309 0.04781037
16 Finished           5  18576 0.83630470
17      DNF           5   2585 0.11637853
18      DNS           5   1051 0.04731677
19 Finished           6  15538 0.83794424
20      DNF           6   2156 0.11627029
21      DNS           6    849 0.04578547
22 Finished           7  13300 0.84503463
23      DNF           7   1706 0.10839316
24      DNS           7    733 0.04657221
25 Finished           8  11809 0.86165633
26      DNF           8   1339 0.09770157
27      DNS           8    557 0.04064210
28 Finished           9  10852 0.81215387
29      DNF           9   1463 0.10948960
30      DNS           9   1047 0.07835653
31 Finished          10   7381 0.82047577
32      DNF          10    974 0.10827034
33      DNS          10    641 0.07125389

61 Finished          20    784 0.80575540
62      DNF          20     98 0.10071942
63      DNS          20     91 0.09352518

91 Finished          30     59 0.83098592
92      DNF          30      9 0.12676056
93      DNS          30      3 0.04225352

The important information here is the proportion of DNF entries for each medal count. We can see that 11.9% (0.11860858) of runners DNF on the first time that they ran. Similarly, of those runners who had already completed the race once (so they had one medal in the bag), 11.7% (0.11739666) did not finish. Of those who ran again after just achieving a green number, 10.8% (0.10827034) were DNF. It will be easier to make sense of all this in a plot.

status-proportion-medal-count-corrected

Wow! Now that is interesting. Just to be sure that everything is clear about this plot: every column reflects the proportions of finishers, DNFs and DNSs who already had a given number of medals. There are a number of intriguing things about these data:

  1. all three proportions remain almost identical for runners who already had between 0 and 6 medals;
  2. the proportion of finishers then starts to ramp up for those with 7 and 8 medals (the DNS proportion remains unchanged, the DNFs decrease);
  3. there is a decrease in the proportion of finishers who already have 9 medals and a corresponding increase in the proportion of DNSs, while the DNFs remain unchanged;
  4. the proportion of finishers then increases slightly for those who already have 10 medals.

What conclusions can we draw from this? The second point seems to indicate a growing level of determination: these athletes are really close to their green number and they are less likely to sacrifice their medal. The third point is interesting too: the proportion of DNFs stays roughly the same but the DNS percentage grows from 4.1% for those with 8 medals to 7.8% for those with 9 medals. Why would this be? Well, I am really not sure and I would welcome suggestions. One possibility is that these runners are determined to have a good race so they might overtrain and end up injured or ill.

Are the differences in the proportion of DNFs statistically significant?

	31-sample test for equality of proportions without continuity correction

data:  medal.table[2, 1:31] out of colSums(medal.table[, 1:31])
X-squared = 139.4798, df = 30, p-value = 4.744e-16
alternative hypothesis: two.sided
sample estimates:
    prop 1     prop 2     prop 3     prop 4     prop 5     prop 6     prop 7     prop 8     prop 9    prop 10
0.11860858 0.11739666 0.11894617 0.11923865 0.11892326 0.11637853 0.11627029 0.10839316 0.09770157 0.10948960
   prop 11    prop 12    prop 13    prop 14    prop 15    prop 16    prop 17    prop 18    prop 19    prop 20
0.10827034 0.10204696 0.10013936 0.10500000 0.11237335 0.10784314 0.11079137 0.10659026 0.09327846 0.11298606
   prop 21    prop 22    prop 23    prop 24    prop 25    prop 26    prop 27    prop 28    prop 29    prop 30
0.10071942 0.10404624 0.09890110 0.09684685 0.14473684 0.10833333 0.14358974 0.07284768 0.14285714 0.16379310
   prop 31
0.12676056

The minuscule p-value from the proportion test indicates that there definitely is a significant difference in the proportion of DNFs across the entire data set (for those with between 0 and 30 medals). But it does not tell us anything about which of the proportions are responsible for this difference. We can get some information about this from a pairwise proportion test. Here is the abridged output.
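For reference, a test of this kind can be run with prop.test() from base R. The sketch below uses only the counts for 0 to 10 medals from the abridged table above, so its test statistic and p-value will not match the 31-group output; it is meant to show the shape of the call rather than to reproduce the result.

```r
# Counts for runners who already had 0 to 10 medals, from the abridged table above.
medal.table <- rbind(
  Finished = c(78051, 52186, 37478, 28506, 22814, 18576, 15538, 13300, 11809, 10852, 7381),
  DNF      = c(11102,  7336,  5332,  4072,  3256,  2585,  2156,  1706,  1339,  1463,  974),
  DNS      = c( 4449,  2967,  2017,  1572,  1309,  1051,   849,   733,   557,  1047,  641)
)
colnames(medal.table) <- 0:10

# Test whether the DNF proportion is the same across all medal counts.
result <- prop.test(x = medal.table["DNF", ], n = colSums(medal.table))

result$estimate   # DNF proportion per medal count
result$parameter  # degrees of freedom: number of groups minus one
result$p.value
```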

	Pairwise comparisons using Pairwise comparison of proportions

data:  medal.table[2, 1:31] out of colSums(medal.table[, 1:31])

   0       1       2       3       4       5       6       7     8     9     10    11    12    13    14    15
1  1.000   -       -       -       -       -       -       -     -     -     -     -     -     -     -     -
2  1.000   1.000   -       -       -       -       -       -     -     -     -     -     -     -     -     -
3  1.000   1.000   1.000   -       -       -       -       -     -     -     -     -     -     -     -     -
4  1.000   1.000   1.000   1.000   -       -       -       -     -     -     -     -     -     -     -     -
5  1.000   1.000   1.000   1.000   1.000   -       -       -     -     -     -     -     -     -     -     -
6  1.000   1.000   1.000   1.000   1.000   1.000   -       -     -     -     -     -     -     -     -     -
7  0.107   0.734   0.179   0.205   0.457   1.000   1.000   -     -     -     -     -     -     -     -     -
8  4.8e-10 2.5e-08 3.8e-09 9.0e-09 6.4e-08 1.8e-05 5.8e-05 1.000 -     -     -     -     -     -     -     -
9  1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 0.689 -     -     -     -     -     -     -
10 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 1.000 1.000 -     -     -     -     -     -
11 0.025   0.099   0.031   0.032   0.056   0.579   0.780   1.000 1.000 1.000 1.000 -     -     -     -     -
12 0.038   0.117   0.042   0.042   0.066   0.506   0.651   1.000 1.000 1.000 1.000 1.000 -     -     -     -
13 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 1.000 1.000 1.000 1.000 1.000 -     -     -
14 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 1.000 1.000 1.000 1.000 1.000 1.000 -     -
15 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 -
16 1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

For between 0 and 6 medals there is no significant difference (p-value is roughly 1). The DNF proportion for those with 7 medals does start to differ from those with 4 medals or fewer, but the p-values are not significant. When we get to athletes who have 8 medals there is a significant difference in the proportion of DNFs all the way from those with 0 medals to those with 6 medals. However, the proportion of DNFs for those with 9 medals is not significantly different from any of the other categories. Finally, the DNF proportion for those athletes who already have 10 medals does not differ significantly from the athletes with any number of fewer medals.
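The pairwise comparison can be sketched with pairwise.prop.test(), again from base R. Here I use only five of the medal counts from the abridged table (0, 6, 8, 9 and 10) and the function's default Holm adjustment. Both choices are assumptions made for brevity, so the adjusted p-values will differ from those in the full table above.

```r
# DNF counts and totals (Finished + DNF + DNS) for a handful of medal counts,
# taken from the abridged table above.
dnf    <- c("0" = 11102, "6" = 2156,  "8" = 1339,  "9" = 1463,  "10" = 974)
totals <- c("0" = 93602, "6" = 18543, "8" = 13705, "9" = 13362, "10" = 8996)

# Compare every pair of groups, adjusting the p-values for multiple testing.
pw <- pairwise.prop.test(x = dnf, n = totals, p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values (rows and columns are medal counts).
pw$p.value
```

Even in this reduced comparison the runners with 8 medals still stand out clearly from those with none.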

So, no, it does not seem that runners with green numbers are more likely to bail (a conclusion that makes me personally very happy!). And good luck to the anonymous runner: I hope that you will be back in 2014 and that you will crack your green number!

Oh, and one last thing: as I mentioned before, the analysis above is based on the period 1984 to 2013. There are some serious issues with the data in the earlier years. Here is a breakdown of the number of runners in each of the categories across the years:

       Finished   DNF   DNS
  1984     7105     2     0
  1985     8192  1907     1
  1986     9654  1793     0
  1987     8376  2458     0
  1988    10363  1934     0
  1989    10505  3065     2
  1990    10272  1351     2
  1991    12082  2936     1
  1992    10695  2533     5
  1993    11322  2270     2
  1994    10274  2428     3
  1995    10541  2990     1
  1996    11269  2277     2
  1997    11365  2467     3
  1998    10496  2874     5
  1999    11291  2835     3
  2000    20030  4508     7
  2001    11090  4270     1
  2002     9027  2276   863
  2003    11416  1065   892
  2004    10123  1925     9
  2005    11729  2163     7
  2006     9846  1194  1025
  2007    10052  1084   868
  2008     8631  1745   813
  2009    10008  1501  1441
  2010    14339  2226  7000
  2011    11058  2023  6506
  2012    11889  1739  5916
  2013    10278  3643  5986

Certainly something is deeply wrong in 1984! In the early years it does not make any sense to discriminate between DNF and DNS since there were no independent records kept: we simply know whether or not an athlete finished. The introduction of the ChampionChip timing devices improved the quality of the data dramatically. These chips have been used by all Comrades Marathon runners since 1997 although there is a delayed effect on the quality of the data.

Despite these issues, the conclusions of the analysis above remain essentially unchanged if you simply lump the DNF and DNS data together (because we cannot always make a meaningful divide between them!).
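Lumping the two categories together is simple enough. As a small sketch, here are four of the years from the breakdown above with DNF and DNS combined into a single count; the column names not.finished and finish.rate are just illustrative.

```r
# Four years copied from the breakdown above.
by.year <- data.frame(
  year     = c(1984, 1985, 2010, 2013),
  Finished = c(7105, 8192, 14339, 10278),
  DNF      = c(2, 1907, 2226, 3643),
  DNS      = c(0, 1, 7000, 5986)
)

# Combine DNF and DNS, since the two cannot always be told apart.
by.year$not.finished <- by.year$DNF + by.year$DNS

# Proportion of all entrants on record who finished.
by.year$finish.rate <- by.year$Finished / (by.year$Finished + by.year$not.finished)

by.year
```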

The Green Number Effect

Following up on a suggestion from my previous post, here are the statistics for medal count versus age. Every point on the plot is the number (see colour legend on right) of athletes who have achieved a given number of medals by a particular age.

medal-count-age

There is clear evidence of a Green Number Effect: many people hang on for ten medals and then pack it in. There is also weaker evidence of a Double Green Number Effect. But evidently there are far fewer people with that kind of commitment or level of craziness.

What about the influence of the Back-2-Back medals introduced in 2005? If you look carefully at the plot above then you can see some evidence. However, a simple histogram of medal counts makes the effect irrefutable.

medal-count-histogram

Thanks for the idea, Tilda.

Age Distribution of Comrades Marathon Athletes

I can clearly remember watching the end of the 1989 Comrades Marathon on television and seeing Wally Hayward coming in just before the final gun, completing the epic race at the age of 80! I was in awe.

Since I have been delving into the Comrades Marathon data, this got me thinking about the typical age distribution of athletes taking part. The plot below indicates the ages of athletes who finished the race, going all the way back to 1984. You can clearly spot the two years when Wally Hayward ran (1988 and 1989). My data indicate that he was only 79 on the day of the 1989 Comrades Marathon, but I am not going to quibble over a year and I am more than happy to accept that he was 80!

age-year-boxplot

It is interesting to see that there is a consistent increase in the ages of both male and female finishers, as reflected by both the median and interquartile range (IQR).

The detailed distribution of ages across the period 1984 to 2013 is shown below. The median age of finishers is 37 years. Although in recent years the minimum age has been set at 20, in earlier times younger athletes were allowed to run the race. There are a significant number of runners in their 60s, but far fewer in their 70s. Only 163 runners older than 70 have finished the race since 1984.

age-histogram

What about the effect of age on individual finish times? Men appear to perform best between 20 and 30 years of age, with a gradual but consistent decrease in performance with advancing years. Things are not quite as clear cut with the female runners, where those in the 30 to 40 age bracket appear to perform fractionally better than those between the ages of 20 and 30.

gender-age-time-boxplot

Naturally these race times translate into medal allocations. The mosaic plot below shows both the distribution of runners across the various age categories as well as the medal allocations within those categories. The majority of runners are between 30 and 40 years of age and the most commonly awarded medal is the Bronze.

status-medal-age-mosaicplot

Finally, a breakdown of the gross number of medals awarded between 1984 and 2013. This includes data for the last 30 years and so is an extension of my previous analysis. Here it must be borne in mind that the Bill Rowan medal was only introduced in 2000, the Vic Clapham medal in 2003 and the Wally Hayward medal in 2007.

medal-allocations-age-gender

Medal Allocations at the Comrades Marathon

Following up on my previous post regarding attrition rates at Comrades Marathon 2013, here are the statistics I have gathered for medal allocations. For reference, the medals are allocated as follows:

  • Gold medals to the first ten finishers in the men’s race and the ladies’ race;
  • Wally Hayward medals to finishers in under 06:00;
  • Silver medals to finishers under 07:30;
  • Bill Rowan medals to finishers under 09:00;
  • Bronze medals to finishers under 11:00; and finally
  • Vic Clapham medals to finishers before the final gun at 12:00.
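The rules above translate fairly directly into code. The function below is only a sketch: the name assign_medal is hypothetical, times are taken to be in decimal hours, and I am assuming that a Gold medal takes precedence over the time-based medals and that "under" means strictly less than the cutoff.

```r
# Hypothetical helper: allocate a medal from finishing time (decimal hours)
# and position within the runner's own gender.
assign_medal <- function(race.time, gender.position) {
  # Time-based medals: [0, 6) Wally Hayward, [6, 7.5) Silver, [7.5, 9) Bill Rowan,
  # [9, 11) Bronze, [11, 12) Vic Clapham. Anything else (including NA) gets no medal.
  medal <- as.character(cut(race.time,
                            breaks = c(0, 6, 7.5, 9, 11, 12),
                            labels = c("Wally Hayward", "Silver", "Bill Rowan",
                                       "Bronze", "Vic Clapham"),
                            right = FALSE))
  # Assumption: the first ten finishers in each race get Gold regardless of time.
  gold <- which(!is.na(medal) & gender.position <= 10)
  medal[gold] <- "Gold"
  medal
}

# One runner per medal, plus a DNF with no time or position.
assign_medal(c(5.54, 5.90, 7.20, 8.50, 10.95, 11.87, NA),
             c(1, 40, 300, 2000, 6000, 9000, NA))
```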

Comrades Marathon 2013 Medal Allocations

This will be followed in a couple of days by an analysis of the relationship between running a negative split and finishing time.

Comrades Marathon Attrition Rate

It is a bit of a mission to get the complete data set for this year’s Comrades Marathon. The full results are easily accessible, but come as an HTML file. Embedded in this file are links to the splits for individual athletes. So with a bit of scripting wizardry it is also possible to download the HTML files for each of the individual athletes. Parsing all of these yields the complete result set, which is the starting point for this analysis.

The first interesting thing that I found was that according to the main results page there were 19907 entrants (this is also the number quoted in the 2013 Comrades Marathon Highlights). However, there were only detailed data for 19903 individual athletes. This immediately aroused my suspicions, so I had a look for duplicate race numbers and, guess what? Yup! There were four: 57234, 54243, 16266 and 25315. If you don’t believe me, check out the results for yourself. Here are the relevant data:

Position  Race Number  Name                   Time
1980      57234        Izelle Pretorius       09:16:02
1981      57234        Justin Powrie          09:16:02
3179      54243        Daniel Matseme         09:56:55
3180      54243        Headman Magadeni       09:56:55
3786      16266        Doctor Masina          10:17:56
3787      16266        Doctor Patrick Masina  10:17:56
          25315        Paulus Mpho            DNF
          25315        Ludwe Tsoliwe          DNF

That’s interesting: for each duplicated race number there are two names, both of which have the same finishing time and are assigned independent positions in the field. I don’t know what has happened here, but there is clearly a glitch in the data being provided by the CMA. Logic suggests that in each case there was in fact just one runner and so the overall position data are not correct. Not a big issue, but if you came in after position 1980, then your real position may be out by a few places.
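For anyone wanting to repeat the check, duplicated race numbers are easy to pick out with duplicated(). The sketch below is not the original script: it rebuilds a tiny data frame from the rows listed above, plus one invented non-duplicate row so that the filter has something to exclude. The column names race.number and name are assumptions.

```r
results <- data.frame(
  race.number = c(57234, 57234, 54243, 54243, 16266, 16266, 25315, 25315, 99999),
  name = c("Izelle Pretorius", "Justin Powrie", "Daniel Matseme", "Headman Magadeni",
           "Doctor Masina", "Doctor Patrick Masina", "Paulus Mpho", "Ludwe Tsoliwe",
           "Hypothetical Runner"),  # last row is invented, for illustration only
  stringsAsFactors = FALSE
)

# Race numbers that occur more than once.
dup.numbers <- unique(results$race.number[duplicated(results$race.number)])

# All records sharing one of those numbers.
dups <- subset(results, race.number %in% dup.numbers)
dups[order(dups$race.number), ]
```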

Moving on to something more relevant: attrition. Of the 19903 independent entrants, I find that only 10185 finished. Again this number differs from the official number by 3 (this is because of the duplication issue mentioned above!). But many of those entrants didn’t even start the race. There were 6008 entrants who did not make it to the City Hall in Durban on Sunday morning. Of the 13895 athletes who were there when the start gun went off, only 10183 made it to the finish line before the 12-hour cutoff. This means that the attrition rate among starters was 26.7%: just over one quarter of the field didn’t make it! In view of the carnage that I witnessed on Sunday, I would have expected this number to be a lot higher!
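The arithmetic behind that figure is short enough to spell out. This sketch uses only the counts quoted above, taking the 10183 finishers given for those who actually started; the variable names are arbitrary.

```r
entrants  <- 19903
dns       <- 6008    # entrants who did not start
finishers <- 10183   # finished before the cutoff

starters  <- entrants - dns            # 13895
attrition <- 1 - finishers / starters  # fraction of starters who did not finish

round(100 * attrition, 1)  # 26.7
```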

Let’s break this down by gender. The figure below shows the proportion of athletes who did not start (DNS), did not finish (DNF), and who did actually finish the race as a function of gender. The DNS data are the categories “not yet started”, “pre-race withdrawal” and “substituted”. The DNF data also include “disqualified” and “started and running”.
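As a sketch of that grouping, the function below collapses the detailed status labels into three levels. The DNS labels and the two extra DNF labels are the ones named above; the exact spellings of the plain "dnf" and "finished" labels are assumptions about how the raw data are coded, and collapse_status is a made-up name.

```r
# Hypothetical helper: collapse detailed status labels into DNS / DNF / Finished.
collapse_status <- function(status) {
  status <- tolower(as.character(status))
  dns <- c("not yet started", "pre-race withdrawal", "substituted")
  dnf <- c("dnf", "disqualified", "started and running")  # "dnf" spelling assumed
  out <- rep(NA_character_, length(status))
  out[status %in% dns] <- "DNS"
  out[status %in% dnf] <- "DNF"
  out[status %in% "finished"] <- "Finished"               # spelling assumed
  # Anything unrecognised stays NA rather than being silently miscounted.
  factor(out, levels = c("DNS", "DNF", "Finished"))
}

table(collapse_status(c("Finished", "DNF", "Substituted", "Disqualified")))
```

With a gender column alongside, prop.table(table(gender, collapse_status(status)), margin = 1) would give the proportions within each gender.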

status-gender-spineplot

So what can we take away from this plot? Here are the main points:

  • men made up 78.0% of the entrants;
  • women accounted for 20.3% of those that crossed the start line;
  • men made up 80.8% of the finishers.

The proportions are rather consistent! But this is only one way of looking at the data. What if we consider the proportions within each gender? Then the picture is slightly different:

  • 28.7% of the male entrants did not start the race (compared with 35.6% of the females);
  • 74.3% of the males who started also reached the finish line before the gun (as opposed to 69.4% for females).
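The two views are tied together by simple arithmetic, and the within-gender rates reproduce the shares quoted earlier. A quick check in R, using only the percentages above; the 22.0% female share of entrants is inferred as the remainder after the 78.0% for men, which assumes every entrant is recorded as either male or female.

```r
entrant.share <- c(Male = 0.780, Female = 0.220)  # female share inferred as 1 - 0.780
dns.rate      <- c(Male = 0.287, Female = 0.356)  # did not start, within gender
finish.rate   <- c(Male = 0.743, Female = 0.694)  # finished, among starters

starters  <- entrant.share * (1 - dns.rate)
finishers <- starters * finish.rate

starter.share  <- round(100 * starters / sum(starters), 1)
finisher.share <- round(100 * finishers / sum(finishers), 1)

starter.share   # Female comes out at 20.3
finisher.share  # Male comes out at 80.8
```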

I am not going to interpret these results any further. I know which side my bread is buttered. Draw your own conclusions.

Next we look at the same data but broken down according to age category. Here the 40-49 age group was the best for getting to the starting line. Obviously they (and I include myself here) have learned that if you don’t start, then you certainly can’t finish! Ahem. Moving on. Of those that did start, runners in the 20-29 age group fared the best with 81.3% finishing. Things got progressively worse from there with the percentage of finishers dropping from 79.6% in the 30-39 group, to 73.5% in the 40-49 group, 61.6% in the 50-59 group and only 45.1% in the 60 and older group. Still damn impressive for the senior runners, but the youngsters appear to have fared best on the day. Perhaps they are more tolerant to warm weather?
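For reference, here is one way the age categories could be built and the finishing rate tabulated per category. This is a sketch only: age_group is a made-up helper, and the ages and outcomes below are invented toy values, not race data.

```r
# Hypothetical helper: bin ages into the categories used above.
age_group <- function(age) {
  cut(age,
      breaks = c(20, 30, 40, 50, 60, Inf),
      labels = c("20-29", "30-39", "40-49", "50-59", "60+"),
      right = FALSE)  # left-closed, so 30 falls in "30-39"
}

age      <- c(24, 31, 38, 45, 52, 63)                  # invented ages
finished <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)   # invented outcomes

# Proportion of finishers within each age category.
tapply(finished, age_group(age), mean)
```

Ages below 20 fall outside the bins and come out as NA.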

status-category-spineplot

Now, let’s put all of this together, looking at gender, age group and finishing status. There is a lot more information and it is a little difficult to make sense of all of it at once. But here are the salient points:

  • men in all three of the 30-39, 40-49 and 50-59 age groups were equally likely to start;
  • men in the 30-39 age group were most likely to finish;
  • among the women, those in the 40-49 age group were the most likely to start;
  • of the women that did start, the 30-39 age group was most likely to finish.

Looks like 30-39 is the prime time to be running the Comrades. That’s not to say that I am past my prime. Hell no! Not at all.

status-category-gender-mosaicplot

Over the next few days I will look at the following questions:

  • what is the effect of running a negative split on overall time? and
  • how does the finishing rate vary with time? Is there evidence of a “diamond carat” effect?