Comrades Marathon Medal Predictions

IMG_5239

With only a few days to go until race day, most Comrades Marathon athletes will be focusing on resting, getting enough sleep, hydrating, eating and giving a wide berth to anybody who looks even remotely ill.

They will probably also be thinking a lot about Sunday's race. What will the weather be like? Will it be cold at the start? (Unlikely since it's been so warm in Durban.) How will they feel on the day? Will they manage to find their seconds along the route?

For the more performance-oriented among them (and, let's face it, that's most runners!), there will also be thoughts of what time they will do on the day and what medal they'll walk away with. I've considered ways of projecting finish times in a previous article. Today I'm going to focus on a somewhat simpler goal: making a Comrades Marathon medal prediction.

In the process I have put together a small application which will make medal predictions based on recent race times.

medal-prediction-interface

I'm not going to delve too deeply into the details, but if you really don't have the patience, feel free to skip forward to the results or click on the image above, which will take you to the application. If you have trouble accessing the application it's probable that you are sitting behind a firewall that is blocking it. Try again from home.

Raw Data

The data for this analysis were compiled from a variety of sources. I scraped the medal results off the Comrades Marathon Results Archive. Times for other distances were cobbled together from Two Oceans Marathon Results, RaceTec Results and the home pages of some well organised running clubs.

The distribution of the data is broken down below as a function of gender, Comrades Marathon medal and other distances for which I have data. For instance, I have data for 45 female runners who got a Bronze medal and for whom a 32 km race time was available.

medal-race-distance-count

Unfortunately the data are pretty sparse for Gold, Wally Hayward and Silver medalists, especially for females. I'll be collecting more data over the coming months and the coverage in these areas should improve. Athletes who are contenders for these medals should have a pretty good idea of their likely prospects anyway, so the model is not likely to be awfully interesting for them. This model is intended more for runners who are aiming at a Bill Rowan, Bronze or Vic Clapham medal.

Decision Tree

The first step in the modelling process was to build a decision tree. Primarily this was to check whether it was feasible to predict a medal class based on race times for other distances (I'm happy to say that it was!). The secondary motivation was to assess which variables were most important. The resulting tree is plotted below. Open this plot in a new window so that you can zoom in on the details. As far as the labels on the tree are concerned, "min" stands for the minimum time over the corresponding distance, and times (labels on the branches) are given in decimal hours.
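
For the record, here's a minimal sketch of how a conditional inference tree like the one below might be fitted. It assumes a hypothetical data frame medals with a factor column medal and numeric columns giving each athlete's minimum recent time (in decimal hours) over the other distances; the actual data preparation was more involved.

library(partykit)

# Fit a conditional inference tree predicting medal class from race times.
tree <- ctree(medal ~ ., data = medals)

plot(tree)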

medal-ctree

The first thing to observe is that the most important predictor is 56 km race time. This dominates the first few levels of the tree hierarchy. Of slightly lesser importance is 42.2 km race time, followed by 25 km race time. It's interesting to note that 32 km and 10 km results do not feature at all in the tree, probably due to the relative scarcity of results over these distances in the data.

Some specific observations from the tree are:

  • Male runners who can do 56 km in less than 03:30 have around a 20% chance of getting a Gold medal.
  • Female runners who can do 56 km in less than 04:06 have about an 80% chance of getting a Gold medal.
  • Runners who can do 42.2 km in less than about 02:50 are very likely to get a Silver medal.
  • Somewhat more specifically, runners who do 56 km in less than 05:53 and 42.2 km in more than 04:49 are probably in line for a Vic Clapham.

Note that the first three observations above should be taken with a pinch of salt since, due to a lack of data, the model is not well trained for Gold, Wally Hayward and Silver medals.

You'd readily be forgiven for thinking that this decision tree is an awfully complex piece of apparatus for calculating something as simple as the colour of your medal.

Well, yes, it is. And I am going to make it simpler for you. But before I make it simpler, I am going to make it slightly more complicated.

A Forest of Decision Trees

Instead of just using a single decision tree, I built a Random Forest consisting of numerous trees, each of which was trained on a subset of the data. Unfortunately the resulting model is not as easy to visualise as a single decision tree, but the results are far more robust.
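
For the curious, here's a minimal sketch of how such a forest might be grown, again using the hypothetical medals data frame described above (the real model and its tuning will differ):

library(randomForest)

set.seed(13)

# Grow a forest of 500 trees; rows with missing race times are dropped.
forest <- randomForest(medal ~ ., data = medals, ntree = 500, na.action = na.omit)

# Class membership probabilities for a few athletes.
predict(forest, newdata = head(medals), type = "prob")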

Medal Prediction Application

To make this a little more accessible I bundled the model up in a Shiny application which I deployed here on Amazon Web Services. Give it a try. You'll need to enter the times you achieved over one or more race distances during the last few months. Note that these are race times, not training run times. The latter are not good predictors for your Comrades medal.
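
To give an idea of how little plumbing is required, here's a bare-bones skeleton of such an application. This is emphatically not the deployed app, just an illustration, and it assumes a fitted random forest called forest trained on the two hypothetical predictors used below.

library(shiny)

ui <- fluidPage(
  titlePanel("Comrades Marathon Medal Prediction"),
  numericInput("t42", "Recent 42.2 km race time [decimal hours]", value = 4.0),
  numericInput("t56", "Recent 56 km race time [decimal hours]", value = 5.5),
  tableOutput("medal")
)

server <- function(input, output) {
  output$medal <- renderTable({
    athlete <- data.frame(min.42.2 = input$t42, min.56 = input$t56)
    predict(forest, newdata = athlete, type = "prob")
  })
}

shinyApp(ui, server)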

Let's have a quick look at some sample predictions. Suppose that you are a male athlete with recent times of 00:45, 01:45, 04:00 and 05:00 for 10, 21.1, 42.2 and 56 km races respectively. According to the model you have a 77% probability of getting a Bronze medal and around an 11% chance of getting either a Bill Rowan or Vic Clapham medal. There's a small chance (less than 1%) that you might be in the running for a Silver medal.

medal-predictions-bog-standard

What about a male runner who recently ran 03:20 for 56 km? There is around a 20% chance that he would get a Gold medal. Failing that, he would most likely (60% chance) get a Silver.

medal-predictions-gold-male

If you happen to have race results for the last few years that I could incorporate into the model, please get in touch. I'm keen to collaborate on improving this tool.

R Recipe: Aligning Axes in ggplot2

Faceted plots in ggplot2 are phenomenal. They give you a simple way to break down an array of plots according to the values of one or more categorical variables. But what if you want to stack plots of different variables? Not quite so simple. But certainly possible. I gathered together this solution from a variety of sources on stackoverflow, notably this one and this other one. A similar issue for vertical alignment is addressed here.

To illustrate, first let's generate some sample data.

> data <- data.frame(x = seq(0, 100, 1))
> #
> data = transform(data,
+                  y1 = sin(x * pi / 10),
+                  y2 = x**2
+                  )

Then we'll import a couple of libraries: ggplot2, of course, and gridExtra which will give us the stacking functionality.

> library(ggplot2)
> library(gridExtra)

We'll generate the two plots.

> p1 <- ggplot(data, aes(x = x)) + geom_line(aes(y = y1)) + theme_classic()
> p2 <- ggplot(data, aes(x = x)) + geom_bar(aes(y = y2), stat = "identity") + theme_classic()

We could just slap these plots onto a grid, which would produce an acceptable plot. But my inner perfectionist is not happy with the fact that the two vertical axes do not line up and, consequently, the horizontal scales are slightly different.

aligned-plot-naive

The first step towards fixing this small issue is to take the plots and convert them into gtables.

> p1 <- ggplot_gtable(ggplot_build(p1))
> p2 <- ggplot_gtable(ggplot_build(p2))

Next the pivotal bit: find the widths of each of the plots, calculate the maximum and then apply it to each of them individually. This effectively applies a uniform layout to each of the plots.

> library(grid)                          # unit.pmax() comes from the grid package
> maxWidth <- unit.pmax(p1$widths[2:3], p2$widths[2:3])
> 
> p1$widths[2:3] <- maxWidth
> p2$widths[2:3] <- maxWidth

And the final result is a stacked plot with perfectly aligned vertical axes. Success!

> grid.arrange(p1, p2, heights = c(3, 2))

aligned-plot

R Recipe: Reordering Columns in a Flexible Way

Suppose you have a data frame with a number of columns.

> names(trading)
 [1] "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"         "TP"         "OpenPrice" 
 [9] "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"       "Duration"   "Trader"    
[17] "System"

You want to put the Trader and System columns first but you also want to do this in a flexible way. One approach would be to specify column numbers.

> trading = trading[, c(16:17, 1:15)]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

This does the job but it's not very flexible. After all, the number of columns might change. Rather do it by specifying column names.

> refcols <- c("Trader", "System")
> #
> trading <- trading[, c(refcols, setdiff(names(trading), refcols))]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

Recent Common Ancestors: Simple Model

An interesting paper (Modelling the recent common ancestry of all living humans, Nature, 431, 562–566, 2004) by Rohde, Olson and Chang concludes with the words:

Further work is needed to determine the effect of this common ancestry on patterns of genetic variation in structured populations. But to the extent that ancestry is considered in genealogical rather than genetic terms, our findings suggest a remarkable proposition: no matter the languages we speak or the colour of our skin, we share ancestors who planted rice on the banks of the Yangtze, who first domesticated horses on the steppes of the Ukraine, who hunted giant sloths in the forests of North and South America, and who laboured to build the Great Pyramid of Khufu.

This paper considered two models of our genealogical heritage. Neither of the models was exhaustive and they could not take into account all of the complexities involved in determining the genealogy of the Earth's population. However, they captured many of the essential features. And despite their relative simplicity, the models reveal an unexpected result: a genealogical Common Ancestor (CA) of the entire present day human population would have lived in the relatively recent past.

It would be interesting to replicate some of those results.

An Even Simpler Model

In an earlier paper, Recent common ancestors of all present-day individuals, Chang considered a simpler model which he described as follows:

We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made just as in the standard Wright–Fisher model (randomly and equally likely over the n possibilities) the only difference being that here each individual chooses twice instead of once. All choices are made independently. Thus, for example, it is possible that when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.

At first sight this appears to be an extremely crude and abstract model. However it captures many of the essential features required for modelling genealogical inheritance. It also has the advantage of not being too difficult to implement in R. If these details are not your cup of tea, feel free to skip forward to the results.

First we'll need a couple of helper functions to neatly generate labels for generations (G) and individuals (I).

> library(stringr)                       # provides str_pad()
> 
> label.individuals <- function(N) {
+   paste("I", str_pad(1:N, 2, pad = "0"), sep = "")
+ }
> 
> make.generation <- function(N, g) {
+   paste(paste0("G", str_pad(g, 2, pad = "0")), label.individuals(N), sep = "/")
+ }
> 
> make.generation(3, 5)
[1] "G05/I01" "G05/I02" "G05/I03"

Each individual is identified by a label of the form "G05/I01", which gives the generation number and also the specific identifier within that generation.

Next we'll have a function to generate the random links between individuals in successive generations. The function will take two arguments: the number of individuals per generation and the number of ancestor generations (those prior to the "current" generation).

> # N - number of people per generation
> # G - number of ancestor generations (there will be G+1 generations!)
> #
> generate.edges <- function(N, G) {
+   edges = lapply(0:(G-1),  function(g) {
+     children <- rep(make.generation(N, g), each = 2)
+     #
+     parents <- sample(make.generation(N, g+1), 2 * N, replace = TRUE)
+     #
+     data.frame(parents = parents, children = children, stringsAsFactors = FALSE)
+   })
+   do.call(rbind, edges)
+ }
> 
> head(generate.edges(3, 3), 10)
   parents children
1  G01/I02  G00/I01
2  G01/I01  G00/I01
3  G01/I01  G00/I02
4  G01/I02  G00/I02
5  G01/I01  G00/I03
6  G01/I03  G00/I03
7  G02/I02  G01/I01
8  G02/I01  G01/I01
9  G02/I01  G01/I02
10 G02/I02  G01/I02

So, for example, the data generated above links the child node G00/I01 (individual 1 in generation 0) to two parent nodes, G01/I01 and G01/I02 (individuals 1 and 2 in generation 1).

Finally we'll wrap all of that up in a graph-like object.

> library(igraph)
> generate.nodes <- function(N, G) {
+   lapply(0:G, function(g) make.generation(N, g))
+ }
> 
> generate.graph <- function(N, G) {
+   nodes = generate.nodes(N, G)
+   #
+   net <- graph.data.frame(generate.edges(N, G), directed = TRUE, vertices = unlist(nodes))
+   #
+   # Edge layout
+   #
+   x = rep(1:N, G+1)
+   y = rep(0:G, each = N)
+   #
+   net$layout = cbind(x, y)
+   #
+   net$individuals = label.individuals(N)
+   net$generations = nodes
+   #
+   net
+ }

Let's give it a whirl on a graph consisting of 10 ancestor generations (11 generations including the current generation) and 8 individuals per generation. The result is plotted below with the oldest generation (G10) at the top and the current generation (G00) at the bottom. The edges indicate parent/progeny links.

sample-graph

Okay, so it looks like all of the infrastructure is in place. Now we need to put together the functionality for analysing the relationships between generations. We are really only interested in how any one of the ancestor generations relates to the current generation, which makes things a little simpler. First we write a function to calculate the number of steps from an arbitrary node (identified by label) to each of the nodes in the current generation. This is simplified by the fact that igraph already has a shortest.paths() function.

> library(dplyr)                         # for first()
> 
> generation.descendants <- function(net, node) {
+   present.generation <- first(net$generations)
+   #
+   shortest.paths(net, v = node, to = present.generation, mode = "out")
+ }

Next a pair of helper functions which indicate whether a particular node is a CA of all nodes in the current generation or if it has gone extinct. Node G03/I08 in the above graph is an example of an extinct node since it has no child nodes.

> common.ancestor <- function(net, node) {
+   all(is.finite(generation.descendants(net, node)))
+ }
>   
> extinct <- function(net, node) {
+   all(is.infinite(generation.descendants(net, node)))
+ }

Let's test those. We'll generate another small graph for the test.

> set.seed(1)
> #
> net <- generate.graph(3, 5)
> #
> # This is an ancestor of all.
> #
> generation.descendants(net, "G05/I01")
        G00/I01 G00/I02 G00/I03
G05/I01       5       5       5
> common.ancestor(net, "G05/I01")
[1] TRUE
> #
> # This is an ancestor of all but G00/I02
> #
> generation.descendants(net, "G02/I03")
        G00/I01 G00/I02 G00/I03
G02/I03       2     Inf       2
> common.ancestor(net, "G02/I03")
[1] FALSE
> #
> # This node is extinct.
> #
> generation.descendants(net, "G03/I01")
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
> extinct(net, "G03/I01")
[1] TRUE

It would also be handy to have a function which gives the distance from every node in a particular generation to each of the nodes in the current generation.

> generation.distance <- function(net, g) {
+   current.generation <- net$generations[[g+1]]
+   #
+   do.call(rbind, lapply(current.generation, function(n) {generation.descendants(net, n)}))
+ }
> generation.distance(net, 3)
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
G03/I02       3       3       3
G03/I03       3       3       3

So here we see that G03/I02 and G03/I03 are both ancestors of each of the individuals in the current generation, while G03/I01 has gone extinct.

Finally we can pull all of this together with a single function that classifies individuals in previous generations according to whether they are an ancestor of at least one but not all of the current generation; a common ancestor of all of the current generation; or extinct.

> classify.generation <- function(net, g) {
+   factor(apply(generation.distance(net, g), 1, function(p) {
+     ifelse(all(is.infinite(p)), 1, ifelse(all(is.finite(p)), 2, 0))
+   }), levels = 0:2, labels = c("ancestor", "extinct", "common"))
+ }
> classify.generation(net, 3)
G03/I01 G03/I02 G03/I03 
extinct  common  common 
Levels: ancestor extinct common

Results

First let's have a look at how the relationship between previous generations and the current generation changes over time. The plot below indicates the proportion of ancestors, common ancestors and extinct individuals as a function of generation number. Note that time runs backwards along the horizontal axis, with the current generation on the left and successive ancestor generations moving towards the right.

The graph used to generate these data had 1000 individuals per generation. We can see that for the first 10 generations around 80% of the individuals were ancestors of one or more (but not all!) of the current generation. The remaining 20% or so were extinct. Then in generation 11 we have the first CA. This individual would be labelled the Most Recent Common Ancestor (MRCA). Going further back in time the fraction of CAs increases while the proportion of mere ancestors declines until around generation 19 when all of the ancestors are CAs.
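
Here's a sketch of how these proportions can be tabulated from a single realisation using the functions defined above (with a smaller population than the N = 1000 used for the plot, purely to keep the run time down):

> sim <- generate.graph(100, 25)
> 
> proportions <- sapply(1:25, function(g) {
+   table(classify.generation(sim, g)) / 100
+ })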

ancestor-evolution

There are two important epochs to identify on the plot above:

  1. the generation at which the MRCA emerged (this is denoted T_n); and
  2. the generation at which all individuals are either CAs or extinct (this is denoted U_n).

Chang's paper gives asymptotic expressions for how these epochs vary as a function of generation size. It also presents a table with sample values of T_n and U_n for generations of size 500, 1000, 2000 and 4000. Those samples support the asymptotic relationships.
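
Since all of the machinery is already in place, these epochs can be read off a single simulated genealogy with a few lines of code. The helper below is my own (only the names T_n and U_n come from Chang's paper); repeating it over many random graphs gives the distributions discussed next.

> find.epochs <- function(net, G) {
+   classes <- lapply(1:G, function(g) classify.generation(net, g))
+   #
+   # T_n: the first generation containing at least one common ancestor (the MRCA).
+   #
+   Tn <- which(sapply(classes, function(x) any(x == "common")))[1]
+   #
+   # U_n: the first generation in which everybody is either a CA or extinct.
+   #
+   Un <- which(sapply(classes, function(x) all(x != "ancestor")))[1]
+   #
+   c(Tn = Tn, Un = Un)
+ }
> 
> find.epochs(generate.graph(10, 20), 20)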

Since we have all of the infrastructure in place, we can conduct an independent test of the relationships. Below is a boxplot generated from 100 realisations each of calculations with generations of size 10, 100 and 1000. The blue boxes reflect our estimates for T_n, while the green boxes are the estimates of U_n. The two dashed lines reflect the asymptotic relationships. There's pretty good agreement between the theoretical and experimental results, although our estimates for U_n appear to be consistently larger than expected. The sample size was limited though... Furthermore, these are asymptotic relationships, so we really only expect them to hold exactly for large generation sizes.

Un-Tn-versus-N

When would our MRCA have been alive based on this simple model and the Earth's current population? Setting the generation size to the Earth's current population of around 7 billion people gives T_n at about 33 generations and a U_n of approximately 58 generations. I know, those 7 billion people are not from the same generation, but we are going for an estimate here...

What does a generation translate to in terms of years? That figure has varied significantly over time. It also depends on location. However, in developed nations it is currently around 30 years. WolframAlpha equates a generation to 28 years. So, based on these estimates we have T_n at around 1000 years and our estimate of U_n translates into less than 2 millennia.

Of course, these numbers need to be taken with a pinch of salt since they are based on a model which assumes that the population has been static at around 7 billion, which we know to be far from true! However these results are also not completely incompatible with the more sophisticated results of Rohde, Olson and Chang who found that the MRCA of all present-day humans lived just a few thousand years ago.

R Recipe: Making a Chord Diagram

With the circlize package, putting together a Chord Diagram is simple.

library(circlize)
library(RColorBrewer)

# Create a random adjacency matrix
#
adj = matrix(sample(c(1, 0), 26**2, replace = TRUE, prob = c(1, 9)),
             nrow = 26, dimnames = list(LETTERS, LETTERS))

adj = ifelse(adj == 1, runif(26**2), 0)

chordDiagram(adj, transparency = 0.4, grid.col = "midnightblue",
             col = colorRamp2(seq(0, 1, 0.2), brewer.pal(6, "Blues")))

chord-diagram

Comrades Marathon Finish Predictions

* If you see a bunch of [Math Processing Error] errors, you might want to try opening the page in a different browser. I have had some trouble with MathJax and Internet Explorer. Yet another reason to never use Windows.

There are various approaches to predicting Comrades Marathon finishing times. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Let's give them a test. I finished the 2013 Comrades Marathon in 09:41. Based on my marathon time from February that year, which was 03:38, Parry's formula suggests that I should have finished at around 09:07. Throwing in my Two Oceans time for that year, 04:59, and a 21.1 km time of 01:58 a few weeks before Comrades, the Sports Digest calculator gives a projected finish time of 08:59. Clearly, relative to both of those predictions, I under-performed that year! Either that or the predictions were way off the mark.

It seems to me that, given the volume of data we gather on our runs, we should be able to generate better predictions. If the thought of maths or data makes you want to doze off, feel free to jump ahead, otherwise read on.

Riegel's Formula

In 1977 Peter Riegel published a formula for predicting running times, which became popular due to its simplicity. The formula itself looks like this:

 \Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^{1.06}


which allows you to predict \Delta t_2, the time it will take you to run distance d_2, given that you know it takes you time \Delta t_1 to run distance d_1. Riegel called this his "endurance equation".
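
The formula translates directly into a one-line R function. This is just a sketch of my own, with the exponent exposed as a parameter so that it can be tweaked later.

> riegel <- function(t1, d1, d2, exponent = 1.06) {
+   # Predict the time over distance d2 from a known time t1 over distance d1.
+   t1 * (d2 / d1)^exponent
+ }
> 
> # Project an 89 km time from a 4 hour marathon (times in hours, distances in km).
> riegel(4, 42.2, 89)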

Reverse-Engineering Riegel's Model

Riegel's formula is an empirical model: it's based on data. In order to reverse engineer the model we are going to need some data too. Unfortunately I do not have access to data for a cohort of elite runners. However, I do have ample data for one particular runner: me. Since I come from the diametrically opposite end of the running spectrum (I believe the technical term would be "bog standard runner"), I think these data are probably more relevant to most runners anyway.

I compiled my data for the last three years based on the records kept by my trusty Garmin 910XT. A plot of time versus distance is given below.

training-data-linear

At first glance it looks like you could fit a straight line through those points. And you can, indeed, make a pretty decent linear fit.

> fit <- lm(TimeHours ~ Distance, data = training)
> 
> summary(fit)

Call:
lm(formula = TimeHours ~ Distance, data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64254 -0.04592 -0.00618  0.02361  1.24900 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1029964  0.0107648  -9.568   <2e-16 ***
Distance     0.1012847  0.0008664 116.902   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1394 on 442 degrees of freedom
Multiple R-squared:  0.9687,	Adjusted R-squared:  0.9686 
F-statistic: 1.367e+04 on 1 and 442 DF,  p-value: < 2.2e-16

However, applying a logarithmic transform to both axes gives a more uniform distribution of the data, which also now looks more linear.

training-data-log

Riegel observed that data for a variety of disciplines (running, swimming, cycling and race walking) conformed to the same pattern. Figure 1 from this paper is included below.

Figure 1 from Riegel's paper "Athletic Records and Human Endurance".

If we were to fit a straight line to the data on logarithmic axes then the relationship we'd be contemplating would have the form

\log \Delta t = m \log d + c


or, equivalently,

\Delta t = k d^m


which is a power law relating elapsed time to distance. It's pretty easy to get Riegel's formula from this. Taking two particular points on the power law, \Delta t_1 = k d_1^m and \Delta t_2 = k d_2^m, and eliminating k gives

\Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^m


which is Riegel's formula with an unspecified value for the exponent. We'll call the exponent the "fatigue factor" since it determines the degree to which a runner slows down as distance increases.

How do we get a value for the fatigue factor? Well, by fitting the data, of course!

> fit <- lm(log(TimeHours) ~ log(Distance), data = training)
> 
> summary(fit)

Call:
lm(formula = log(TimeHours) ~ log(Distance), data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27095 -0.04809 -0.01843  0.01552  0.80351 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -2.522669   0.018111  -139.3   <2e-16 ***
log(Distance)  1.045468   0.008307   125.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09424 on 442 degrees of freedom
Multiple R-squared:  0.9729,	Adjusted R-squared:  0.9728 
F-statistic: 1.584e+04 on 1 and 442 DF,  p-value: < 2.2e-16

The fitted value for the exponent is 1.05 (rounding up), which is pretty close to the value in Riegel's formula. The fitted model is included in the logarithmic plot above as the solid line, with the 95% prediction confidence interval indicated by the coloured ribbon. The linear plot below shows the data points used for training the model, the fit and confidence interval as well as dotted lines for constant paces ranging from 04:00 per km to 07:00 per km.

training-data-linear-model

Our model also provides us with an indication of the uncertainty in the fitted exponent: the 95% confidence interval extends from 1.029 to 1.062.

> confint(fit)
                  2.5 %    97.5 %
(Intercept)   -2.558264 -2.487074
log(Distance)  1.029142  1.061794

Estimating the Fatigue Factor

A fatigue factor of 1 would correspond to a straight line, which implies a constant pace regardless of distance. A value less than 1 implies a faster pace at longer distances (rather unlikely in practice). Finally, a value larger than 1 implies a progressively slower pace as distance increases.

The problem with the fatigue factor estimate above, which was based on a single model fit to all of the data, is that it's probably biased by the fact that most of the data are for runs of around 10 km. In this regime the relationship between time and distance is approximately linear, so that the resulting estimate of the fatigue factor is probably too small.

To get around this problem, I employed a bootstrap technique, creating a large number of subsets from the data. In each subset I weighted the samples to ensure that there was a more even distribution of distances in the mix. I calculated the fatigue factor for each subset, resulting in a range of estimates. Their distribution is plotted below.
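
Here's a rough sketch of how such a bootstrap could be set up, using the training data from above. The inverse-frequency weighting on binned distances is my stand-in for the actual weighting scheme, so treat the details as illustrative.

> set.seed(13)
> 
> # Weight each run by the inverse frequency of its (binned) distance.
> #
> bins <- cut(training$Distance, breaks = 10)
> weights <- as.numeric(1 / table(bins)[as.integer(bins)])
> 
> fatigue.factor <- replicate(1000, {
+   resample <- training[sample(nrow(training), replace = TRUE, prob = weights), ]
+   #
+   coef(lm(log(TimeHours) ~ log(Distance), data = resample))[2]
+ })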

fatigue-factor-bootstrap

According to this analysis, my personal fatigue factor is around 1.07 (the median value indicated by the dashed line in the plot above). The Shapiro-Wilk test suggests that the data are sufficiently non-Normal to justify a non-parametric estimate of the 95% confidence interval for the fatigue factor, which runs from 1.03 to 1.11.

> shapiro.test(fatigue.factor)

	Shapiro-Wilk normality test

data:  fatigue.factor
W = 0.9824, p-value = 1.435e-12

> median(fatigue.factor)
[1] 1.072006
> quantile(fatigue.factor, c(0.025, 0.975))
    2.5%    97.5% 
1.030617 1.107044 

Riegel's analysis also led to a range of values for the fatigue factor. As can be seen from the table below (extracted from his paper), the values range from 1.01 for Nordic skiing to 1.14 for roller skating. Values for running range from 1.05 to 1.08, depending on age group and gender.

Table 1 from Riegel's paper "Athletic Records and Human Endurance".

Generating Predictions

The rules mentioned above for predicting finishing times are generally applied to data for a single race (or a selection of three races). But, again, given that we have so much data on hand, would it not make sense to generate a larger set of predictions?

Predicted finishing times for the Comrades Marathon

The distributions above indicate the predictions for this year's Comrades Marathon (which is apparently going to be 89 km) based on all of my training data this year and using both the default (1.06) and personalised (1.07) values for the fatigue factor. The distributions are interesting, but what we are really interested in is the expected finish times, which are 08:59 and 09:18 depending on what value you use for the fatigue factor. I have a little more confidence in my personalised value, so I am going to be aiming for 09:18 this year.
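
Purely as a sketch of how such distributions can be generated, here's the idea using the training data and the riegel() helper from earlier (the figures above were produced with a bit more polish):

> predictions <- data.frame(
+   default  = riegel(training$TimeHours, training$Distance, 89, exponent = 1.06),
+   personal = riegel(training$TimeHours, training$Distance, 89, exponent = 1.07)
+ )
> 
> summary(predictions)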

Comrades is a long day and a variety of factors can affect your finish time. It's good to have a ball-park idea though. If you would like me to generate a set of personalised predictions for you, just get in touch via the replies below.

Postscript

I repeated the analysis for one of my friends and colleagues. His fatigue factor also comes out as 1.07 although, interestingly, the distribution is bi-modal. I think I understand the reason for this though: his runs are divided clearly into two groups: training runs and short runs back and forth between work and the gym.

fatigue-factor-bootstrap-karl-wilhelm

Comrades Runners Disqualified: I'm Not Convinced

Apparently 14 runners have been disqualified for failing to complete the full distance at the 2012 and 2013 Comrades Marathons.

I'm not convinced.

Although 20 runners were charged with misconduct, six of them had a valid story. These runners had retired from the race but the bailers' bus had dropped them back on the course and they were "forced" to cross the finish line. I find this story hard to digest. My understanding is that race numbers are either confiscated, destroyed or permanently marked when entering the bus. If this story is true then it should have been immediately obvious to officials that the runners in question had dropped out of the race. Their times should never have been recorded and they certainly should not have received medals (which presumably at least some of them did, since they have been instructed to return them!).

So that leaves the 14 runners who were disqualified. Were any of them among the group of runners with mysterious negative splits identified previously? Unless KZNA or the CMA releases the names or race numbers, I guess we'll never know.

Encyclopaedia: Meteorites in Antarctica

A contribution which I wrote for Antarctica and the Arctic Circle: A Geographic Encyclopedia of the Earth's Polar Regions.

Interplanetary space is not a vacuum: it is teeming with objects of various sizes, from microscopic electrons and ions to large pieces of space debris. Some of these objects end up on earth. A meteoroid is a natural object with size between 10 μm and 1 m moving through space. Asteroids are bigger than meteoroids but smaller than a planet, and without their own atmosphere. Most meteoroids have a mass of less than a few grams, originating as fragments from asteroids, comets, and planets. They are often concentrated in a meteoroid stream, which is the debris left in the trail of a comet. A meteoroid or asteroid that enters the earth’s atmosphere produces a meteor (or shooting star), which is the visible trail formed as it falls through the atmosphere. When the earth passes through a meteoroid stream, a meteor shower is observed.

Meteoroids enter the earth’s atmosphere at speeds up to about 156,585 mph (251,999 km/h). Speed, size, incidence angle, and degree of fragmentation determine whether or not a meteoroid reaches the ground. Those meteoroids that penetrate the surface of the earth and survive the impact are known as meteorites. Meteorites are found everywhere on earth, but most have been discovered in Antarctica where they are clearly visible on the snowy surface or embedded in ice.

The flux of meteoritic material to the surface of the earth is between 10⁴ and 10⁶ kg per year. A meteorite with a size of 1 m hits the earth on average once per year. Smaller meteorites are more common, while larger ones are relatively rare. Larger meteorites can produce extensive craters like the Vredefort crater in South Africa, which is 300 km in diameter. A meteorite with a size of 3,280 ft. (1,000 m) would have a cataclysmic effect, but fortunately, such impacts are rare: roughly once every million years. Many hundreds of meteorites, each with a mass of a few grams, strike the earth every day. The majority of meteorites originate from tiny meteoroids that are so light that they slow down rapidly in the atmosphere. These are deposited as microscopic dust.

Meteorites are classified as stony (94% of all meteorites; composed of silicate minerals), iron (5%; alloys of nickel and iron), or stony-iron (1%; composites of stony and metallic materials). Although stony meteorites are the most common, they are more difficult to identify since they are similar to normal terrestrial rocks. Iron meteorites are readily identified since they are magnetic and denser than most terrestrial rocks. They are also the most likely to reach the ground since they seldom fragment.

Initial estimates of meteorite flux were made from eyewitness accounts. Such meteorite falls (where the meteor trail was observed) are naturally more common in densely populated areas. Unfortunately, such estimates are bedeviled by large uncertainties. Recent meteorite flux data have been recorded by camera networks. These too suffer from difficulties in calibrating the images of meteor trails to meteorite masses. However, meteorite finds (where the meteor trail was not observed, but the meteorite itself was discovered) present a robust history that extends back thousands of years. The hot deserts and Antarctica have favorable conditions for meteorite accumulation: weathering occurs slowly, and meteorites are easily recognized against a uniform background that is relatively free of other rocks. Meteorites can survive in these regions for more than 100,000 years, allowing variations in meteorite flux and mass distribution to be estimated. Some meteorites found in Antarctica have been there for more than 1 million years. Furthermore, there is evidence to suggest that some of the meteorites found in Antarctica contain material that originated either from the Moon or Mars.

More than 30,000 meteorites have been retrieved from Antarctica by various expeditions during the past 30 years. This exceeds the entire population of meteorites collected over the rest of the earth. Differences in the populations of recent meteorite falls and Antarctic meteorite finds indicate that the characteristics of meteorites landing on earth are changing with time as debris from the early solar system is progressively swept up.

The fate of a meteorite that falls onto Antarctica depends on the nature of the surface and the typical weather conditions. If it falls on ice, then it may remain exposed for a long time. Alternatively, it might become covered by snow and ultimately trapped in solid ice. There is a general flow of ice toward the periphery of the continent, so that these meteorites are eventually deposited in the Southern Ocean. The motion of the ice sheet can be slowed by mountain ranges, and meteorites embedded in the ice become concentrated in stranding zones. If the ice either evaporates via sublimation or is ablated by the wind, trapped meteorites become exposed, leading to high meteorite concentrations. The history of meteorite falls in the Antarctic can be deduced from stranding zones and ice cores.

A Sankey Plot with Uniform Coloured Edges

Following up on my previous post about generating Sankey plots with the riverplot package: it's also possible to generate plots which have constant coloured edges.

Here's how (using some of the data structures from the previous post too):

library(riverplot)

# Re-using the nodes, edges and palette objects from the previous post.
edges$col = sample(palette, size = nrow(edges), replace = TRUE)
edges$edgecol <- "col"

river <- makeRiver(nodes, edges)
style <- list(nodestyle = "invisible")

plot(river, default_style = style)       # assuming the same plot call as before

riverplot-example-constant-edge

Encyclopaedia: Discovery Expedition

A contribution which my wife and I wrote for Antarctica and the Arctic Circle: A Geographic Encyclopedia of the Earth's Polar Regions.

The British National Antarctic Expedition, now known as the Discovery Expedition, took place between 1901 and 1904. During this time, other expeditions from Germany, Sweden, France, and Scotland also traveled to Antarctica. The Discovery Expedition was carried out under the auspices of the Royal Society and the Royal Geographical Society, and was intended to renew Great Britain’s involvement in the exploration of Antarctica.

The Discovery Expedition had both exploratory and scientific objectives. Specifically, they were to forge as far south as possible and to conduct oceanographic, meteorological, magnetic, biological, and geological observations en route.

The projected cost of the expedition was £90,000, half of which was sourced from the British Government, while the balance was raised by the two royal societies. The expedition vessel, the Discovery, was a three-masted wooden sailing ship, specifically designed for research in Antarctic conditions. She was constructed in Dundee at a cost of around £51,000.

The ship’s crew was recruited from the Royal Navy, the Merchant Marine, and the civilian population. Robert Falcon Scott (1868–1912) was appointed as commander of the Discovery and leader of the expedition. Other noteworthy members of the crew were Ernest Shackleton (1874–1922), Tom Crean (1877–1938), and John Robert Francis “Frank” Wild (1873–1939). The scientific team consisted of Louis Bernacchi (meteorologist and magnetic observer), Hartley Ferrar (geologist), Thomas Hodgson (marine biologist), Edward Wilson (junior doctor and zoologist), and Reginald Koettlitz (senior doctor). Of these, only Bernacchi had previous Antarctic experience.

Discovery departed from Cardiff on August 6, 1901, traveling via Cape Town to Lyttelton, New Zealand. There, she prepared for the journey into the Southern Ocean, taking on supplies that included 45 live sheep donated by local farmers. After three weeks in port, she departed, sailing south. She stopped at Cape Adare and then Cape Crozier, which was to serve as a point where messages detailing the location of the expedition could be relayed to relief ships. Discovery then proceeded eastward along the Victoria Barrier (now known as the Ross Ice Shelf), before entering McMurdo Sound on February 8, 1902. Here she dropped anchor and soon became frozen into the sea ice, where she would be trapped for two years. Huts were erected on a nearby rocky peninsula. The larger hut was intended to act as accommodation. However, it was not conducive to habitation, and the expedition personnel continued to live on the ship. Two smaller huts housed scientific instruments including magnetometers and a seismograph.

The scientists, who had made measurements on the journey south, continued to take meteorological and magnetic observations during the winter. Preparations were also made for work to be undertaken during the next summer.

The men, who had little or no experience of skiing or dog sledging, began to practice these skills. They underestimated the dangerous conditions and, on a training journey, a member of one of the parties fell to his death during a blizzard. Two major overland journeys took place in the spring and summer that followed. A team led by Royds traveled to Cape Crozier to leave a message for a relief ship. When they arrived, they discovered a colony of emperor penguins. Scott, Wilson, and Shackleton also set off across the ice shelf with the goal of getting as far south as possible. Due to their inexperience with polar travel, progress was slow and the condition of both men and dogs deteriorated rapidly. The weaker dogs were killed and fed to the other dogs. They reached 82º 17′S before turning back. This was the closest that anybody had yet got to the South Pole. After the last of the dogs died, the men were forced to haul the sleds themselves. Shackleton was unable to assist since he was overcome with scurvy. The team returned to the Discovery on February 3, 1903.

While Scott’s party was away, a relief ship arrived with supplies and a letter addressed to him authorizing the expedition to stay for a second year. A number of the men, including the ailing Shackleton, returned home with the relief ship. The Discovery spent a second winter in the ice.

In the spring of 1903, a group led by Scott once again set out, heading westward toward the South Magnetic Pole. Since there were no dogs left, it was a man-hauling journey. Their progress, however, was appreciably better than on the previous journey. After ascending a large glacier, they reached the Polar Plateau at a height of around 6,500 ft. (2,000 m). On the return journey, they discovered a valley completely free of snow. The existence of such dry valleys was not previously known.

Upon Scott’s return, the Discovery was still stuck in the ice despite extensive efforts to free her. Two relief vessels soon arrived bearing instructions to Scott to abandon the Discovery if he was not able to free her quickly. Explosives and ice saws were employed to break up the ice around the ship. When she still would not budge, the men started to transfer important items to the relief ships. However, suddenly on Valentine’s Day, 1904, the ice around the Discovery broke up, and she was free. On February 17, 1904, she began her journey home. Returning via Lyttelton, the Discovery arrived in Portsmouth on September 10, 1904, and soon sailed for London, where she docked with little fanfare.

After his return, Scott was promoted to captain and received an array of awards. A number of the officers and crew also received the Polar Medal. Among the various outcomes of the expedition, the highlights were the discovery of the Polar Plateau, a dry valley, and an emperor penguin colony; evidence suggesting that the ice barrier was not seated on land but floating on the sea; and magnetic measurements that allowed a better estimate of the location of the South Magnetic Pole. Numerous new geological and biological specimens were also found.