Comrades Marathon Inference Trees

R

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

> head(splits.2013)
           gender age.category drummond.time race.time   status       medal
2013-10014   Male        50-59      5.510833        NA      DNF        <NA>
2013-10016   Male        60-69      6.070833        NA      DNF        <NA>
2013-10019   Male        20-29      5.335833  11.87361 Finished Vic Clapham
2013-10031   Male        20-29      4.910833  10.94833 Finished      Bronze
2013-10047   Male        50-59      5.076944  10.72778 Finished      Bronze
2013-10049   Male        50-59      5.729444        NA      DNF        <NA>

Here the drummond.time and finish.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

> library(party)
> tree.status = ctree(status ~ gender + age.category + drummond.time, data = splits.2013,
+                     control = ctree_control(minsplit = 750))
> tree.status

	 Conditional inference tree with 17 terminal nodes

Response:  status 
Inputs:  gender, age.category, drummond.time 
Number of observations:  13917 

1) drummond.time <= 5.669167; criterion = 1, statistic = 2985.908
  2) drummond.time <= 5.4825; criterion = 1, statistic = 494.826
    3) age.category <= 40-49; criterion = 1, statistic = 191.12
      4) drummond.time <= 5.078611; criterion = 1, statistic = 76.962
        5) gender == {Male}; criterion = 1, statistic = 73.4
          6)*  weights = 5419 
        5) gender == {Female}
          7)*  weights = 836 
      4) drummond.time > 5.078611
        8) gender == {Male}; criterion = 1, statistic = 63.347
          9) drummond.time <= 5.379722; criterion = 1, statistic = 15.55
            10)*  weights = 1123 
          9) drummond.time > 5.379722
            11)*  weights = 447 
        8) gender == {Female}
          12)*  weights = 634 
    3) age.category > 40-49
      13) drummond.time <= 5.038056; criterion = 1, statistic = 68.556
        14) age.category <= 50-59; criterion = 1, statistic = 40.471
          15) gender == {Female}; criterion = 1, statistic = 32.419
            16)*  weights = 118 
          15) gender == {Male}
            17)*  weights = 886 
        14) age.category > 50-59
          18)*  weights = 170 
      13) drummond.time > 5.038056
        19)*  weights = 701 
  2) drummond.time > 5.4825
    20) gender == {Male}; criterion = 1, statistic = 56.149
      21) age.category <= 40-49; criterion = 0.995, statistic = 9.826
        22)*  weights = 636 
      21) age.category > 40-49
        23)*  weights = 259 
    20) gender == {Female}
      24)*  weights = 352 
1) drummond.time > 5.669167
  25) drummond.time <= 5.811389; criterion = 1, statistic = 301.482
    26) age.category <= 30-39; criterion = 1, statistic = 37.006
      27)*  weights = 315 
    26) age.category > 30-39
      28)*  weights = 553 
  25) drummond.time > 5.811389
    29) drummond.time <= 5.940556; criterion = 1, statistic = 75.164
      30) age.category <= 30-39; criterion = 1, statistic = 25.519
        31)*  weights = 299 
      30) age.category > 30-39
        32)*  weights = 475 
    29) drummond.time > 5.940556
      33)*  weights = 694 

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

> plot(tree.status)

The image below is a little small. You will want to click on it to bring up a larger version.

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged less than 49 years and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hour: roughly 90% of the 5419 runners in this group finished the race.

Constructing a model for medal allocation is done in a similar fashion.

> splits.2013.finishers = subset(splits.2013, status == "Finished" & !is.na(medal))
> #
> levels(splits.2013.finishers$medal) <- c("G", "WH", "S", "BR", "B", "VC")

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.

> tree.medal = ctree(medal ~ gender + age.category + drummond.time, data = splits.2013.finishers,
+                    control = ctree_control(minsplit = 750))
> tree.medal

	 Conditional inference tree with 19 terminal nodes

Response:  medal 
Inputs:  gender, age.category, drummond.time 
Number of observations:  10221 

1) drummond.time <= 4.124167; criterion = 1, statistic = 7452.85
  2) drummond.time <= 3.438889; criterion = 1, statistic = 1031.778
    3)*  weights = 571 
  2) drummond.time > 3.438889
    4) drummond.time <= 3.812222; criterion = 1, statistic = 342.628
      5) drummond.time <= 3.708056; criterion = 1, statistic = 53.658
        6)*  weights = 549 
      5) drummond.time > 3.708056
        7)*  weights = 250 
    4) drummond.time > 3.812222
      8) drummond.time <= 3.976111; criterion = 1, statistic = 37.853
        9)*  weights = 386 
      8) drummond.time > 3.976111
        10)*  weights = 431 
1) drummond.time > 4.124167
  11) drummond.time <= 5.043611; criterion = 1, statistic = 4144.845
    12) drummond.time <= 4.55; criterion = 1, statistic = 596.673
      13) drummond.time <= 4.288333; criterion = 1, statistic = 81.996
        14)*  weights = 603 
      13) drummond.time > 4.288333
        15) gender == {Male}; criterion = 0.996, statistic = 10.468
          16)*  weights = 993 
        15) gender == {Female}
          17)*  weights = 148 
    12) drummond.time > 4.55
      18) drummond.time <= 4.862778; criterion = 1, statistic = 77.052
        19) gender == {Male}; criterion = 1, statistic = 34.077
          20) drummond.time <= 4.653611; criterion = 0.994, statistic = 9.583
            21)*  weights = 353 
          20) drummond.time > 4.653611
            22)*  weights = 762 
        19) gender == {Female}
          23)*  weights = 237 
      18) drummond.time > 4.862778
        24) gender == {Male}; criterion = 1, statistic = 45.95
          25)*  weights = 756 
        24) gender == {Female}
          26)*  weights = 193 
  11) drummond.time > 5.043611
    27) drummond.time <= 5.265833; criterion = 1, statistic = 544.833
      28) gender == {Male}; criterion = 1, statistic = 54.559
        29) drummond.time <= 5.174444; criterion = 1, statistic = 26.917
          30)*  weights = 545 
        29) drummond.time > 5.174444
          31)*  weights = 402 
      28) gender == {Female}
        32)*  weights = 327 
    27) drummond.time > 5.265833
      33) drummond.time <= 5.409722; criterion = 1, statistic = 88.926
        34) gender == {Male}; criterion = 1, statistic = 40.693
          35)*  weights = 675 
        34) gender == {Female}
          36)*  weights = 277 
      33) drummond.time > 5.409722
        37)*  weights = 1763 

Apologies for the bit of information overload. A plot brings out the salient information though.

> plot(tree.medal)

Again you will want to click on the image below to make it legible.

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in less than 3.708 hours.

Shifting across to the other end of the plot and looking at runners who reached half-way in more than 5.266 hours. These are further divided into a group whose half-way time was more than 5.41 hours: these almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first half pace. Ladies who get to half-way at between 05:00 and 05:30 seem to have hit the sweet spot though with close to 100% success rate. Nice!

Categorically Variable