## Are Green Number Runners More Likely to Bail?

Comrades Marathon runners are awarded a permanent green race number once they have completed 10 journeys between Durban and Pietermaritzburg. For many runners, once they have completed the race a few times, achieving a green number becomes a possibility. And once the idea takes hold, it can become something of a compulsion. I can testify to this: I am thoroughly compelled! For runners with this goal in mind, every finish is one step closer to a green number. They are slowly chipping away, year after year, and the idea of bailing is anathema. However, once the green number is in the bag, does the imperative to complete the race fade?

I am going to explore the hypothesis that runners with green numbers are more likely to bail.

Let’s start by looking at the proportions of runners who finish the race as opposed to those who do not finish (DNF) and those who enter but do not start (DNS). As can be seen from the plot below, the proportion of runners who finish the race seems to increase with the number of medals that the runners in question have. So, for example, of the runners with one medal, 68.6% finished while only 21.7% were DNF. For runners with ten medals, 87.1% finished and only 9.5% were DNF.

On the face of it, this seems to make sense: there is a natural selection effect. Runners who have more medals are probably a little more hard core and thus less likely to bail. Less experienced runners might be more likely to jump on the bus when the going gets really tough.

But, unfortunately, it is not quite that simple.

The analysis above has a serious problem: consider those runners with one medal. We are comparing the number of finishers (those that have just received that medal) to non-finishers (who already have a medal!). So we are not really comparing apples with apples! The finishers counted under *i* medals only had *i-1* medals when they lined up at the start, while the non-finishers counted under *i* medals already had all *i*. What we really should be doing is grouping every runner, finisher or not, by the number of medals they held before the race.

Compiling these data takes a little work, but nothing too taxing. Let’s consider an anonymous (but real) runner whose Comrades Marathon history looks like this:

What we want is a table that shows how many times he ran with a given number of medals. So, for our anonymous hero, this would be:

Things went well for the first seven years. In the first year he had no medal (column 0) but he finished (so there is a 1 in the first row). The same applies for columns 1 to 5. Then in year 7 he finished, gaining his seventh medal (hence the 1 in the first row of column 6: he already had 6 medals when he ran this time!). However, for the next three years (when he already had 7 medals) he got a DNF (hence the 3 in the second row of column 7). On his fourth attempt he got medal number 8 (giving the 1 in the first row of column 7: he already had 7 medals when he ran this time!). And the following year he got medal number 9 (the 1 in the first row of column 8). Then he suffered a string of 3 DNFs (the 3 in the second row of column 9), followed by a series of 5 DNSs (the 5 in the third row of column 9). To illustrate the proportions, when he had 7 medals he got DNS 0% (0/4) of the time, DNF 75% (3/4) of the time and finished 25% (1/4) of the time.
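The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names and the outcome labels are my own assumptions, and the history is reconstructed from the description in the previous paragraph.

```python
from collections import Counter

OUTCOMES = ("Finished", "DNF", "DNS")  # assumed labels for the three rows

def tally_by_prior_medals(history):
    """Count outcomes keyed by (medals held BEFORE the race, outcome)."""
    counts = Counter()
    medals = 0
    for outcome in history:
        counts[(medals, outcome)] += 1
        if outcome == "Finished":
            medals += 1  # the new medal only counts from the next race onwards
    return counts

def proportions(counts, medals):
    """Share of each outcome among all entries made with `medals` medals in hand."""
    total = sum(counts[(medals, o)] for o in OUTCOMES)
    return {o: counts[(medals, o)] / total for o in OUTCOMES}

# History as described in the text:
# 7 finishes, 3 DNFs, 2 finishes, 3 DNFs, 5 DNSs.
history = (["Finished"] * 7 + ["DNF"] * 3 + ["Finished"] * 2
           + ["DNF"] * 3 + ["DNS"] * 5)

counts = tally_by_prior_medals(history)
print(proportions(counts, 7))
```

For the column with 7 medals this gives 25% finished, 75% DNF and 0% DNS, matching the figures quoted above.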

Those are the data for a single athlete. To make a compelling case it is necessary to compile the same statistics for many, many runners. So I generated the analogous table for all athletes who ran the race between 1984 and 2013. A melted and abridged version of the resulting data looks like this:

The important information here is the proportion of DNF entries for each medal count. We can see that 11.9% (0.11860858) of runners did not finish the first time that they ran. Similarly, of those runners who had already completed the race once (so they had one medal in the bag), 11.7% (0.11739666) did not finish. Of those who ran again after just achieving a green number, 10.8% (0.10827034) were DNF. It will be easier to make sense of all this in a plot.
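For anyone wanting to reproduce the pooled ("melted") table of proportions, here is a rough Python sketch of the aggregation step. The function name is hypothetical and the histories are invented purely for illustration; they are not real race data.

```python
from collections import Counter

def melted_proportions(histories):
    """Return rows (medals, outcome, count, proportion) pooled over all runners."""
    counts = Counter()
    for history in histories:
        medals = 0  # medals held before the current race
        for outcome in history:
            counts[(medals, outcome)] += 1
            if outcome == "Finished":
                medals += 1

    # Total number of entries made with each medal count in hand.
    totals = Counter()
    for (medals, _), n in counts.items():
        totals[medals] += n

    return [
        (medals, outcome, n, n / totals[medals])
        for (medals, outcome), n in sorted(counts.items())
    ]

# Hypothetical histories, one list per runner (NOT real data).
histories = [
    ["Finished", "Finished", "DNF"],
    ["DNF", "Finished"],
    ["Finished", "DNS", "Finished"],
    ["DNS"],
]

rows = melted_proportions(histories)
for row in rows:
    print(row)
```

Each row holds a medal count, an outcome, the number of entries and the proportion within that medal count, so the proportions for any one medal count sum to 1.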

Wow! Now that is interesting. Just to be sure that everything is clear about this plot: every column reflects the proportions of finishers, DNFs and DNSs who **already had** a given number of medals. There are a number of intriguing things about these data:

- all three proportions remain almost identical for runners who already had between 0 and 6 medals;
- the proportion of finishers then starts to ramp up for those with 7 and 8 medals (the DNS proportion remains unchanged, the DNFs decrease);
- there is a decrease in the proportion of finishers who already have 9 medals and a corresponding increase in the proportion of DNSs, while the DNFs remain unchanged;
- the proportion of finishers then increases slightly for those who already have 10 medals.

What conclusions can we draw from this? The second point seems to indicate a growing level of determination: these athletes are really close to their green number and they are less likely to sacrifice their medal. The third point is interesting too: the proportion of DNFs stays roughly the same but the DNS percentage grows from 4.1% for those with 8 medals to 7.8% for those with 9 medals. Why would this be? Well, I am really not sure and I would welcome suggestions. One possibility is that these runners are determined to have a good race so they might overtrain and end up injured or ill.

Are the differences in the proportion of DNFs statistically significant?

The minuscule p-value from the proportion test indicates that there is a significant difference in the proportion of DNFs across the entire data set (for those with between 0 and 30 medals). But it does not tell us anything about which of the proportions are responsible for this difference. We can get some information about this from a pairwise proportion test. Here is the abridged output.

For between 0 and 6 medals there is no significant difference (p-value is roughly 1). The DNF proportion for those with 7 medals does start to differ from those with 4 medals or fewer, but the p-values are not significant. When we get to athletes who have 8 medals there is a significant difference in the proportion of DNFs all the way from those with 0 medals to those with 6 medals. However, the proportion of DNFs for those with 9 medals is not significantly different from any of the other categories. Finally, the DNF proportion for those athletes who already have 10 medals does not differ significantly from the athletes with any number of fewer medals.
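To make the idea of a pairwise proportion test concrete, here is a minimal Python sketch. It is not the procedure that produced the output discussed above. It assumes a plain two-proportion test (chi-squared with one degree of freedom, no continuity correction) with a Holm adjustment for multiple comparisons, and it does not guard against a pooled proportion of exactly 0 or 1. The counts are hypothetical, not the real data.

```python
from itertools import combinations
from math import erfc, sqrt

def two_proportion_p(x1, n1, x2, n2):
    """Two-sided p-value for H0: p1 == p2 (1 df, no continuity correction)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))  # same as P(chi-squared with 1 df > z**2)

def holm_adjust(pvalues):
    """Holm step-down adjustment; result is in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical (DNF count, number of entries) per prior medal count.
groups = {0: (1186, 10000), 1: (1174, 10000), 8: (450, 5000)}

pairs = list(combinations(groups, 2))
raw = [two_proportion_p(*groups[a], *groups[b]) for a, b in pairs]
for (a, b), p in zip(pairs, holm_adjust(raw)):
    print(f"{a} vs {b} medals: adjusted p = {p:.4g}")
```

The adjustment matters because every medal count is compared with every other one: the more comparisons there are, the more likely it is that one of them looks significant by chance alone.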

So, no, it does not seem that runners with green numbers are more likely to bail (a conclusion that makes me personally very happy!). And good luck to the anonymous runner: I hope that you will be back in 2014 and that you will crack your green number!

Oh, and one last thing: as I mentioned before, the analysis above is based on the period 1984 to 2013. There are some serious issues with the data in the earlier years. Here is a breakdown of the number of runners in each of the categories across the years:

Certainly something is deeply wrong in 1984! In the early years it does not make any sense to discriminate between DNF and DNS since there were no independent records kept: we simply know whether or not an athlete finished. The introduction of the ChampionChip timing devices improved the quality of the data dramatically. These chips have been used by all Comrades Marathon runners since 1997, although there is a delayed effect on the quality of the data.

Despite these issues, the conclusions of the analysis above remain essentially unchanged if you simply lump the DNF and DNS data together (because we cannot always make a meaningful divide between them!).
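Lumping the two categories together is a simple relabelling. A small sketch, assuming the counts are stored as a mapping from (medals, outcome) to a count; both the layout and the "Not finished" label are my own choices for illustration:

```python
from collections import Counter

def lump_non_finishers(counts):
    """Merge DNF and DNS into a single 'Not finished' category."""
    lumped = Counter()
    for (medals, outcome), n in counts.items():
        label = "Finished" if outcome == "Finished" else "Not finished"
        lumped[(medals, label)] += n
    return lumped

# The anonymous runner's entries with 7 and 9 medals, as described earlier.
counts = {
    (7, "Finished"): 1,
    (7, "DNF"): 3,
    (9, "DNF"): 3,
    (9, "DNS"): 5,
}

lumped = lump_non_finishers(counts)
print(lumped[(9, "Not finished")])
```

With 9 medals in hand the runner's 3 DNFs and 5 DNSs collapse into 8 entries that did not end in a finish.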