Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an “average runner” myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you’re interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

Results

The histograms below show graphically how the distribution of runners’ ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

age-decade-histogram

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varied somewhat erratically from year to year. Interestingly, there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year… Since 1990, though, there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

mean-age-year-regression

Conclusion

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

Analysis

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

> head(results)
       key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729

The decadal data were used to generate the histograms.
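A sketch of how such a histogram might have been produced with ggplot2 is given below. This is purely illustrative: the bin width, colours and line types are my assumptions rather than the original plotting code.

library(ggplot2)
#
ggplot(subset(results, !is.na(age)), aes(x = age, fill = gender)) +
  geom_histogram(binwidth = 1) +
  geom_vline(data = decade.statistics, aes(xintercept = mean.age, linetype = gender)) +
  facet_wrap(~ decade) +
  xlab("Age") + ylab("Count")

I then considered a selection of linear models applied to the annual data.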

> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

> summary(fit.1)

Call:
lm(formula = mean.age ~ year, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3181 -0.5322 -0.0118  0.4971  1.9897 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924,	Adjusted R-squared:  0.923 
F-statistic:  801 on 1 and 66 DF,  p-value: <2e-16

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly larger for males than for females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the abscissa (the year) takes values around 2000, it still contributes roughly 0.6 extra years to the average age for men.

> summary(fit.2)

Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)

Residuals:
   Min     1Q Median     3Q    Max 
-1.103 -0.522  0.024  0.388  2.287 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  481 on 2 and 65 DF,  p-value: <2e-16

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

> summary(fit.3)

Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1038 -0.5225  0.0259  0.3866  2.2885 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  480 on 2 and 65 DF,  p-value: <2e-16

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data do not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

> summary(fit.4)

Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0730 -0.5127 -0.0492  0.4225  2.1273 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26    
year:genderMale    0.0195     0.0168    1.17     0.25    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938,	Adjusted R-squared:  0.935 
F-statistic:  322 on 3 and 64 DF,  p-value: <2e-16

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect a slight gender difference in the rate at which the average age changes, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the average age of the men increasing slightly faster than that of the women.

> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
  Res.Df  RSS Df Sum of Sq     F  Pr(>F)    
1     66 36.2                               
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02                  
4     64 29.5  1      0.62  1.36 0.24833    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
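
As an aside, the "almost 45" figure quoted in the conclusion presumably comes from extrapolating the adopted model to a future year. A sketch of such an extrapolation (the year chosen here is purely illustrative):

predict(fit.2, data.frame(year = 2025, gender = "Male"))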

Lastly, I constructed a data frame based on the second model, which gives both the model prediction and a 95% prediction interval. This was used to generate the second set of plots.

> fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
> fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))
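
Purely for illustration, the second set of plots could be recreated along these lines (a sketch; the styling is an assumption and this is not necessarily the original plotting code):

library(ggplot2)
#
ggplot(fit.data, aes(x = year, colour = gender)) +
  geom_point(data = year.statistics, aes(y = mean.age)) +
  geom_line(aes(y = fit)) +
  geom_line(aes(y = lwr), linetype = "dashed") +
  geom_line(aes(y = upr), linetype = "dashed") +
  xlab("Year") + ylab("Mean age")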

Where to Put EAs and Indicators in New MT4 Builds

If you are creating an EA or indicator from scratch, then the MetaTrader editor places the files in the correct location and the terminal is automatically able to find them. However, if the files originate from a third party then you will need to know where to insert them so that they show up in the terminal. For older builds of MetaTrader 4 the directory structure was fairly simple. Everything was to be found in a folder under Program Files. Its contents looked like this:

MT4-old-folders-windows

EAs would go in the experts folder, while indicators would go one level further down in the indicators sub-folder.

But everything is different with the new builds of MetaTrader 4.

Windows

On Windows (and here I am referring specifically to Windows 7, although the setup will be similar on other flavours) the structure of the MetaTrader folder (still found under Program Files) now looks like this:

MT4-new-folders-windows

There seems to be something missing, right? Where do the EA and indicator files go? To find these you will need to look elsewhere. There is now a separate directory hierarchy for each user, in which these files are stored. Browsing under the AppData folder for your user, you should find the following:

MT4-personal-folder-windows

Locating this folder can be a little tricky. First look under Users to find the folder associated with your user name. Within this folder you may or may not be able to see an AppData folder. If you can’t see it then you will need to reveal hidden files by changing the folder options.

folder-options-hidden-files

Once you have dug down to your Terminal folder it should contain a set of sub-folders like this:

MT4-personal-folder-contents-windows

If you descend further into one of those obscurely named folders, then you will find further sub-folders.

MT4-personal-instance-contents-windows

Finally, within the MT4 folder, you will find this:

MT4-mt4-folder-windows

The Experts and Indicators folders are where you will insert those EAs and indicators.

Linux

If you are running MetaTrader under Wine on Linux, then the new directory structure looks like this:

MT4-new-folders-linux

Evidently the structure is a little simpler here, because all of the files and directories are still lumped in a single location. The Experts and Indicators directories are to be found under MQL4.

Comrades Marathon Negative Splits: Cheat Strikes Again

It looks like one of the suspect runners from my previous posts cheated again in this year’s Comrades Marathon. Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year’s Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. So I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

My data for the 2013 up run give times with one-second precision, so these questions could be answered if I relaxed the constraint from “exactly the same time” to “within one second of each other”. We’ll call such simultaneous pairs of runners “twins”, and simultaneous threesomes will be known as “tripods”. How many twins are there? How many tripods? The answers are somewhat surprising. What’s even more surprising is another category: “phantoms”.

If you are not interested in the details of the analysis (and I’m guessing that you probably aren’t), please skip forward to the pictures and the results.

Looking at the Data

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

> simultaneous = subset(splits,
+                      year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
> simultaneous = simultaneous[complete.cases(simultaneous),]
> #
> rownames(simultaneous) = simultaneous$key
> simultaneous$key <- NULL
> head(simultaneous)
         drummond.time race.time
4bdcb291        320.15    712.42
4e488aab        294.65    656.90
ab59fc97        304.62    643.67
89d3e09b        270.32    646.78
fc728816        211.27    492.95
7b761740        274.60    584.37

Next we calculate the “distance” (this is a distance in time and not in space) between runners, which is effectively the Euclidean distance computed from the differences between the halfway and finish times for each pair of runners. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.

> simultaneous = dist(simultaneous)
> library(reshape2)
> simultaneous = melt(as.matrix(simultaneous))
> head(simultaneous)
      Var1     Var2   value
1 4bdcb291 4bdcb291   0.000
2 4e488aab 4bdcb291  61.093
3 ab59fc97 4bdcb291  70.483
4 89d3e09b 4bdcb291  82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910

We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

> simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))

Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

> simultaneous = subset(simultaneous, value == 0)
> head(simultaneous)
            Var1     Var2 value
623174  5217dfc9 75a78d04     0
971958  d8c9c403 e6e0d6e3     0
2024105 2e8f7778 9acc46ee     0
2464116 5f18d86f 9a1697ff     0
2467712 63033429 9a1697ff     0
3538608 54a92b96 f574be97     0

We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:
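
The merge itself is not shown; it would have looked something along these lines (a sketch with hypothetical names: a data frame results.2013 of individual results and a lookup table runner.details holding key, name, race number and finish time):

# Hypothetical lookup table: one row per runner.
runner.details = results.2013[, c("key", "name", "race.number", "race.time")]
#
# Merge onto each of the two runner keys; shared columns get .x and .y suffixes.
simultaneous = merge(simultaneous, runner.details, by.x = "Var1", by.y = "key")
simultaneous = merge(simultaneous, runner.details, by.x = "Var2", by.y = "key",
                     suffixes = c(".x", ".y"))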

> simultaneous = simultaneous[order(simultaneous$race.time),]
> head(simultaneous)[, c(4, 6, 8)]
    race.number.x race.number.y race.time
133         59235         56915  07:54:21
9           26132         23470  08:06:55
62          44008         31833  08:25:58
61          25035         36706  08:35:42
54          28868         25910  08:46:42
26          47703         31424  08:47:08
> tail(simultaneous)[, c(4, 6, 8)]
   race.number.x race.number.y race.time
71         54689         16554  11:55:59
60          8846         23003  11:56:26
44          9235         49251  11:56:47
38         53354         53352  11:56:56
28         19268         59916  11:57:49
20         22499         40754  11:58:26

Twins

As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.

Tripods

Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club, two of them are in the 30-39 age category and the third is in the 60+ group. There’s a clear family resemblance, so I’m guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!

tripod-11832

The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.

tripod-53713

And now things get more interesting…

Phantom #1

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below.

splits-26132

splits-23470

Not only did they pass through halfway and the finish at the same time, but they crossed every mat along the route at precisely the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs…

phantom-26132-A

phantom-26132-B

phantom-26132-C

You might notice that there is another runner alongside 26132 in all three of the images above. That’s not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.

splits-28151

If you search the race photographs for the phantom’s race number (23470), you will find that there are no pictures of him at all! That’s right, nineteen photographs of 26132 and not a single photograph of 23470.

Phantom #2

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at precisely the same time.

splits-53367

splits-27587

Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.

phantom-53367-B

phantom-53367-C

phantom-53367-D

phantom-53367-A

I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.

Conclusion

Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.

Concatenating a list of data frames

It’s something that I do surprisingly often: concatenating a list of data frames into a single (possibly quite enormous) data frame. Until now my naive solution worked pretty well. However, today I needed to deal with a list of over 6 million elements. The result was hours of page thrashing before my R session finally surrendered. I suppose I should be happy that my hard disk survived.

I did a bit of research and found that there are a few solutions which are much (much!) more efficient.

The Problem

Let’s create some test data: a list consisting of 100 000 elements, each of which is a small data frame.

> data <- list()
> 
> N <- 100000
>
> for (n in 1:N) {
+   data[[n]] = data.frame(index = n, char = sample(letters, 1), z = runif(1))
+ }
> data[[1]]
  index char        z
1     1    t 0.221784

The Naive Solution

My naive solution to the problem was to use a combination of do.call() and rbind(). It gets the job done.

> head(do.call(rbind, data))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506

Alternative Solutions #1 and #2

The plyr package presents two options.

> library(plyr)
> 
> head(ldply(data, rbind))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506
> head(rbind.fill(data))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506

Both of these also do the job nicely.

Alternative Solution #3

The newer dplyr package provides some alternative solutions.

> library(dplyr)
> 
> head(rbind_all(data))
  index char          z
1     1    g 0.98735847
2     2    i 0.01427801
3     3    x 0.39046394
4     4    h 0.86044470
5     5    e 0.83855702
6     6    v 0.51332403
Warning message:
In rbind_all(data) : Unequal factor levels: coercing to character

A second function, rbind_list(), takes individual elements to be concatenated as arguments (rather than a single list).
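
For just a handful of data frames, calls would look something like this (do.call() can be used to splice an entire list into the arguments):

rbind_list(data[[1]], data[[2]], data[[3]])
do.call(rbind_list, data)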

Alternative Solution #4

Finally, a solution from the data.table package.

> library(data.table)
> 
> head(rbindlist(data))
   index char          z
1:     1    h 0.56891292
2:     2    x 0.90331644
3:     3    z 0.53675079
4:     4    h 0.04587779
5:     5    o 0.08608656
6:     6    l 0.26410506

Benchmarking

All of these alternatives produce the correct result. The solution of choice will be the fastest one (and the one causing the minimum of page thrashing!).

> library(rbenchmark)
> 
> benchmark(do.call(rbind, data), ldply(data, rbind), rbind.fill(data), rbind_all(data), rbindlist(data))
                  test replications  elapsed relative user.self sys.self user.child sys.child
1 do.call(rbind, data)          100 18943.38  609.308  11204.84     1.72         NA        NA
2   ldply(data, rbind)          100 16131.18  518.854   6529.10     1.56         NA        NA
3     rbind.fill(data)          100  4836.31  155.558   1936.55     0.37         NA        NA
4      rbind_all(data)          100  1627.84   52.359    111.79     0.10         NA        NA
5      rbindlist(data)          100    31.09    1.000     12.53     0.12         NA        NA

Thoughts on Performance

The naive solution uses the rbind.data.frame() method which is slow because it checks that the columns in the various data frames match by name and, if they don’t, will re-arrange them accordingly. rbindlist(), by contrast, does not perform such checks and matches columns by position.

rbindlist() is implemented in C, while rbind.data.frame() is coded in R.

In the most recent version of data.table (1.9.3, currently available from R-Forge), rbindlist() has two new arguments. One of them, use.names, forces rbindlist() to match column names and so works more like rbind.data.frame(), but since it is coded in C it is more efficient. Another related argument, fill, causes missing columns to be filled with NA.
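
Based on that description, usage would presumably look something like the following (untested against the development version):

rbindlist(data, use.names = TRUE, fill = TRUE)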

Both of the plyr solutions are an improvement on the naive solution. However, the dplyr solution is better than either of them. Relative to all of the other solutions, rbindlist() is far superior. It’s blisteringly fast. Little wonder that my naive solution bombed out with a list of 6 million data frames. Using rbindlist(), however, it was done before I had finished my cup of coffee.

Acknowledgements

Thanks to the various folk who provided feedback, which has been used to expand and improve this article.

Personalised Comrades Marathon Pacing Chart

Although I have been thinking vaguely about my Plan A goal of a Bill Rowan medal at the Comrades Marathon this year, I have not really put a rigorous pacing plan in place. I know from previous experience that I am likely to be quite a bit slower towards the end of the race. I also know that I am going to lose a few minutes at the start. So how fast does this mean I need to run in order to get from Pietermaritzburg to Durban in under 9 hours?

Well, suppose that it takes me 3 minutes from the gun to get across the starting line. And, furthermore, assume that I will be running around 5% slower towards the end of the race. To still get to Durban under 9 hours I would need to run at roughly 5:52 per km at the beginning and gradually ease back to about 6:11 per km towards the end.

I arrived at these figures using a pacing spreadsheet. To get an idea of your pace requirements you will need to specify your goal time, the number of minutes you anticipate losing before crossing the start line and an estimate of how much you think you will slow down during the course of the race. This is done by editing the blue fields indicated in the image below. The rest of the spreadsheet will update on the basis of your selections.

pace-calc-1

The spreadsheet uses a simple linear model which assumes that your pace will gradually decline at the rate you have specified. If you give 0% for your slowing down percentage then the calculations are performed on the basis of a uniform pace throughout the race. Of course, neither the linear model nor a uniform pace is truly realistic. We all know that our pace will vary continuously throughout the race as a function of congestion, topography, hydration, fatigue, motivation and all of the other factors which come into play. However, as noted by the eminent statistician George Box, “all models are wrong, but some are useful”. In this case the linear model is a useful way to account for the effects of fatigue.
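
For those who prefer code to a spreadsheet, the underlying arithmetic is simple enough to sketch in a few lines of R. The route distance of roughly 89 km is an assumption (it varies from year to year), as are the other inputs.

goal <- 9 * 60                 # goal time [minutes]
lost <- 3                      # time lost before crossing the start line [minutes]
slow <- 0.05                   # fractional slow down towards the end of the race
d    <- 89                     # assumed route distance [km]
#
# With a linearly declining pace the average pace lies midway between the start
# and end pace, so running time = d * start.pace * (1 + slow / 2).
#
start.pace <- (goal - lost) / (d * (1 + slow / 2))
end.pace   <- start.pace * (1 + slow)
#
# Convert decimal minutes per km into mm:ss.
#
format.pace <- function(p) sprintf("%d:%02.0f", as.integer(p), (p %% 1) * 60)
format.pace(start.pace)        # roughly 5:53 per km
format.pace(end.pace)          # roughly 6:11 per km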

The spreadsheet will give you an indication of the splits (both relative to the start of the race as well as time of day) and pace (instantaneous and average) required to achieve your goal time. There are also a pair of charts which will be updated with your projected timing and pace information.

pace-calc-2

My plan on race day is to run according to my average pace. This works well because it smooths out all the perturbations associated with tables and walking breaks.

Losing Time at the Start

One interesting thing to play around with on the spreadsheet is the effect of losing time at the start. If you vary this number you should see that it really does not have a massive influence on your pacing requirements for the rest of the race. For example, if I change my estimate from 3 minutes to 10 minutes then my required average pace decreases from 6:02 per km to 5:57 per km. Sure, this amounts to 5 seconds shaved off every km, but it is not unmanageable: the delay at the start gets averaged out over the rest of the race.
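
Using the same assumed 89 km distance (and the format.pace() helper) from the earlier sketch, the effect on the required average pace is easy to check:

avg.pace <- function(lost, goal = 9 * 60, d = 89) format.pace((goal - lost) / d)
avg.pace(3)                    # about 6:02 per km
avg.pace(10)                   # about 5:57 per km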

Naturally, the faster you are hoping to finish the race, the more significant a delay at the start is going to become. However, if you are aiming for a really fast time then presumably you are in a good seeding batch. For the majority of runners it is probably not going to make an enormous difference and so it is not worth stressing about.

The important thing is to make sure that you just keep on moving forward. Don’t stop. Just keep on putting one foot in front of the other.

Other Pacing Charts

The pacing chart by Dirk Cloete is based on the profile of the route. It breaks the route down into undulating, up and down sections and takes this into account when calculating splits. Don Oliver also has some static pacing charts.

What Can We Learn from the Commitments of Traders Report?

The Commitments of Traders (COT) report is issued weekly by the Commodity Futures Trading Commission (CFTC). It reflects the level of activity in the futures markets. The report, which is issued every Friday, contains the data from the previous Tuesday.

Amongst other data, the COT report gives the total number of long and short positions aggregated across three sectors:

  • Commercial traders (also “hedgers”) are “…engaged in business activities hedged by the use of the futures or option markets”,
  • Non-Commercial traders (the “large speculators”, mainly hedge funds and banks) and
  • Non-Reportable traders (the “small speculators”).

In general the largest positions are retained by Commercial traders who either produce or consume the underlying commodity or instrument. The speculators do not produce or consume, but buy or sell the futures contract in order to profit from changes in price. Speculators will not hold a contract to maturity. Large speculators hold positions with sizes which are above the threshold for reporting to the CFTC. Small speculators hold positions which are not large enough to require reporting to the CFTC.

Report Availability

The reports are available in either long or short format from the CFTC web site. Just select the date you are interested in and you will be taken through to a page where you can select from a range of reports. In addition to the viewable reports, you can also get the data as a CSV or xls file.

cftc-report-options

We will be focusing on FOREX, so we will be principally interested in the reports relating to the Chicago Mercantile Exchange.

Long and Short Positions

The plot below reflects the number of long (blue) and short (red) positions for a selection of currencies broken down into the three sectors mentioned above. All of these positions are traded against the USD. The superimposed black lines indicate the net number of long and short positions. Things to note here are that the Commercial and Non-Commercial sectors are generally trading in opposite directions and that the Non-Reportable sector trades much smaller volumes, although these positions make up the difference between the Commercial and Non-Commercial totals.

140513-open-positions

In this plot USD is always treated as the counter currency. So, for JPY we are actually looking at positions on JPY/USD. Of course, the JPY is not normally quoted this way against the USD. So, for pairs like USD/CAD, USD/CHF and USD/JPY we need to reverse the sense of the positions above. This is sorted out below, where we consider individual currency pairs.

Open Interest

Open Interest is the total number of current contracts. Since every long position is offset by a short position, the Open Interest is equal to the total number of either long or short positions. It shows the overall volume of contracts and is an indication of market interest. The black lines in the plot indicate the Open Interest for each currency and the red lines show the change from week to week.

140513-open-interest

These data become more interesting when we link them to the appropriate FOREX rates.

EUR/USD

140513-EURUSD

The top panel reflects the daily closing price for EUR/USD. The second panel shows the total open interest (black) and weekly change in open interest (red). The third panel shows the net number of positions held by each of the sectors. Here positive values indicate that a sector is net long, while negative values represent net short positions. The fourth panel is an indication of sentiment and reflects what portion of positions for each sector are either long or short. A completely bullish value indicates that a sector is 100% long, while a completely bearish value represents 100% short. A value close to 0% occurs when there is a balance between long and short positions.
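
To make the bottom two panels concrete, here is a toy calculation. The numbers and column names are invented purely for illustration.

cot <- data.frame(sector = c("Commercial", "Non-Commercial", "Non-Reportable"),
                  long   = c(55000, 30000, 15000),
                  short  = c(35000, 48000, 17000))
#
# Net positions (third panel): positive is net long, negative is net short.
#
cot$net <- cot$long - cot$short
#
# Sentiment (fourth panel): +100% is completely long, -100% is completely short.
#
cot$sentiment <- with(cot, 100 * (long - short) / (long + short))
cot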

GBP/USD

140513-GBPUSD

USD/JPY

140513-USDJPY

The fact that USD is the base currency and JPY is the counter currency has been taken into account by negating the data in the third and fourth panels.

Discussion

It has been suggested that small speculators are generally wrong. It is also held that Commercial traders have the best understanding of market conditions and are thus more likely to be right. I am not convinced that the data above support these hypotheses.

It is, however, interesting to note relationships between the COT data and FOREX prices in retrospect. The COT data will only become really useful if they can be employed as a predictive tool. I will be considering ways of doing this in future posts.

Race Statistics for Comrades Marathon Novice Runners: Corrigendum

There was some significant bias in the histogram from my previous post: the data from all years were lumped together. This is important because as of 2003 (when the Vic Clapham medal was introduced) the final cutoff for the Comrades Marathon was extended from 11:00 to 12:00. In 2000 they also applied an extended cutoff.

I have consequently partitioned the data according to “strict” and “extended” cutoffs.

> novices$extended = factor(novices$year == 2000 | novices$year >= 2003,
+                           labels = c("Strict Cutoff", "Extended Cutoff"))

novice-finish-times-hist

This paints a much more representative picture of the distribution of finish times now that the race has been extended to 12 hours.

The allocation of medals is complicated by the fact that new medals have been introduced at different times over recent years. Specifically, the Bill Rowan medal was first awarded in 2000, then the Vic Clapham medal was introduced in 2003 and, finally, 2007 saw the first Wally Hayward medals.

> novices$period = cut(novices$year, breaks = c(1900, 2000, 2003, 2007, 3000), right = FALSE,
+                      labels = c("before 2000", "2000 to 2002", "2003 to 2006", "after 2007"))
>
> novice.medals = table(novices$medal, novices$period)
> novice.medals = scale(novice.medals, scale = colSums(novice.medals), center = FALSE) * 100
> options(digits = 1)
> (novice.medals = t(novice.medals))
              
                Gold Wally Hayward Silver Bill Rowan Bronze Vic Clapham
  before 2000   0.07          0.00   4.80       0.00  95.13        0.00
  2000 to 2002  0.09          0.00   2.66      12.76  84.49        0.00
  2003 to 2006  0.15          0.00   4.05      17.51  47.63       30.66
  after 2007    0.08          0.03   2.60      12.28  46.40       38.62

So, currently, around 46% of novices get a Bronze medal while slightly fewer, almost 39%, get a Vic Clapham medal. A significant fraction, just over 12%, achieve a Bill Rowan, while only around 2.6% get a Silver medal. The number of Wally Hayward and Gold medals among novices is very small indeed.

Acknowledgement

Thanks to Daniel for pointing out this issue!

Race Statistics for Comrades Marathon Novice Runners

Most novice Comrades Marathon runners finish the race on their first attempt and the majority of them walk (shuffle, crawl?) away with Bronze medals.

What is a Novice?

To paraphrase the dictionary, a novice is “a person who is new to or inexperienced in the circumstances in which he or she is placed; a beginner”. In the context of the Comrades Marathon this definition can be interpreted in a few ways:

  1. a runner who has never run the Comrades Marathon (has never started the race);
  2. a runner who has never completed the Comrades Marathon (has never finished the race); or
  3. a runner who has not completed both an “up” and a “down” Comrades Marathon.

For the purposes of this article I will be adopting the first definition. This is probably the one of most interest to runners who are embarking on their first Comrades journey.

Identifying a Novice

I’ll be using the same data sets that I have discussed in previous articles. Before we focus on the data for the novices we’ll start by just retaining the fields of interest.

> novices = results[, c("key", "year", "category", "gender", "medal", "medal.count", "status", "ftime")]
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
3 100030f4 2013 Ages 20 - 29 Female        <NA>           1      DNS      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344

To satisfy our definition of novice we’ll need to exclude the “did not start” (DNS) records.

> novices = subset(novices, status != "DNS")
> head(novices)
       key year     category gender       medal medal.count   status   ftime
1 100030f4 2008 Ages 20 - 29 Female Vic Clapham           1 Finished 11.3728
2 100030f4 2009 Ages 20 - 29 Female        <NA>           1      DNF      NA
4 10007cb6 2005 Ages 26 - 39   Male      Bronze           1 Finished  9.1589
5 10007cb6 2006 Ages 30 - 39   Male  Bill Rowan           2 Finished  8.2564
6 10007cb6 2007 Ages 30 - 39   Male  Bill Rowan           3 Finished  8.0344
7 10007cb6 2008 Ages 30 - 39   Male  Bill Rowan           4 Finished  8.8514

Some runners do not finish the race on their first attempt but they bravely come back to run it again. We will retain only the first record for each runner, because on their second attempt they are (according to our definition) no longer novices, since they already have some race experience.

> novices = novices[order(novices$year),]
> novices <- novices[which(!duplicated(novices$key)),]

Percentage of Novice Finishers

I suppose that the foremost question going through the minds of many Comrades novices is “Will I finish?”.

> table(novices$status) / nrow(novices) * 100

Finished      DNF 
  80.035   19.965 

Well, there’s some good news: around 80% of all novices finish the race. Those are quite compelling odds. Of course, a number of factors can influence the success of each individual, but if you have done the training and you run sensibly, then the odds are in your favour.

Medal Distribution for Novice Finishers

What medal is a novice most likely to receive?

> table(novices$medal) / nrow(subset(novices, !is.na(medal))) * 100

         Gold Wally Hayward        Silver    Bill Rowan        Bronze   Vic Clapham 
    0.0829671     0.0051854     4.0264976     5.6469490    79.4708254    10.7675754 

The vast majority (again around 80%) claim a Bronze medal. There are also a significant proportion (just over 10%) who miss the eleven hour cutoff and get a Vic Clapham medal. Around 6% of novices achieve a Bill Rowan medal and a surprisingly large fraction, just over 4%, manage to finish in a Silver medal time of under seven and a half hours. There are very few Wally Hayward and Gold medals won by novices. The odds for a novice Gold medal are around one in 1200, all else being equal (which it very definitely isn’t!).

Distribution of Novice Finishing Times

novice-finish-times-hist

As one would expect, the chart slopes up towards the right: progressively more runners come in later in the day. There is very clear evidence of clustering of runners just before the medal cutoffs at 07:30, 09:00, 11:00 and 12:00. There is also a peak before the psychological cutoff at 10:00.
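
A sketch of how such a histogram might be generated from the novices data is given below (the bin width and styling are assumptions, and the actual plot may also distinguish medals):

library(ggplot2)
#
ggplot(subset(novices, !is.na(ftime)), aes(x = ftime)) +
  geom_histogram(binwidth = 1 / 12) +
  geom_vline(xintercept = c(7.5, 9, 10, 11, 12), linetype = "dashed") +
  xlab("Finish time [hours]") + ylab("Count")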

Take Away Message

The data for previous years indicate that the outlook for novices is rather good: 80% of them will finish the race and, of those, around 80% will receive Bronze medals.

How can you help ensure that you have a successful race? Here are some of the things I would think about:

  1. Start slowly. It’s going to be a long day.
  2. Take regular walking breaks and start doing this early on. A few minutes’ recovery will power you up for a number of kms.
  3. Stay hydrated. Take something at every water table. Just don’t overdo it.
  4. Be inspired by the other runners: they all have the guts to indulge in this madness with you and every one of them is fighting their own battle.
  5. Enjoy the support: the hordes of people beside the road have come out to see YOU run by. And they all want you to finish.
  6. Enjoy the day: as far as entertainment is concerned, the Comrades Marathon is about the best value for money that you can get.

See you in Pietermaritzburg at 05:30 on 1 June!

Acknowledgement

Thanks to Daniel for suggesting this article.

Hazardous and Benign Space Objects: Orbits in the Solar-Ecliptic Reference Frame

In two previous posts in this series I have wrangled NEO orbital data into R and then solved Kepler’s Equation to get the eccentric anomaly for each NEO. The final stage in the visualisation of the NEO orbits will be the transformation of locations from the respective orbital planes into a single reference frame.

Reference Frame

The heliocentric ecliptic reference frame has the Sun at the origin and the ecliptic as the fundamental plane. For visualisation purposes we will be representing positions with respect to this plane in terms of rectangular (or Cartesian) coordinates.

Transformation

The transformation between the orbital plane (X, Y, Z) and the heliocentric ecliptic coordinate system (x, y, z) is achieved with

\begin{pmatrix}X\\ Y\\ Z\end{pmatrix}=\begin{pmatrix}\cos \phi_0 \cos \Omega - \sin \phi_0 \cos i \sin \Omega&\cos \phi_0 \sin \Omega + \sin \phi_0 \cos i \cos \Omega&\sin \phi_0 \sin i\\ - \sin \phi_0 \cos \Omega - \cos \phi_0 \cos i \sin \Omega&- \sin \phi_0 \sin \Omega + \cos \phi_0 \cos i \cos \Omega&\cos \phi_0 \sin i\\ \sin i \sin \Omega&- \sin i \cos \Omega&\cos i\end{pmatrix}\begin{pmatrix}x\\ y\\ z\end{pmatrix}

where \phi_0 is the argument of perihelion, i is the inclination and \Omega is the longitude of the ascending node.

Cartesian Coordinates

But, before we can use this transformation, we need to move to Cartesian Coordinates. This is easily accomplished.

> orbits = within(orbits, {
+   Z = 0
+   Y = a * sqrt(1 - e**2) * sin(E)
+   X = a * cos(E) - a * e
+ })
> head(orbits)
             Object Epoch      a       e        i       w   Node       M     q.    Q    P    H     MOID class hazard       E  theta        X        Y Z
1  100004 (1983 VA) 56800 2.5959 0.69970 0.284254 0.21050 1.3498 1.40950 0.7795 4.41 4.18 16.4 0.176375   APO  FALSE 2.03512 2.6336 -2.97885  1.65824 0
2 100085 (1992 UY4) 56800 2.6398 0.62481 0.048926 0.66896 5.3826 0.22042 0.9904 4.29 4.29 17.8 0.015111   APO   TRUE 0.54358 1.0511  0.60993  1.06602 0
3 100756 (1998 FM5) 56800 2.2686 0.55202 0.201156 5.43895 3.0869 4.82500 1.0163 3.52 3.42 16.1 0.100211   APO  FALSE 4.31582 3.8282 -2.12857 -1.74482 0
4  100926 (1998 MQ) 56800 1.7827 0.40777 0.422864 2.42053 3.8599 3.42868 1.0558 2.51 2.38 16.7 0.128731   AMO  FALSE 3.34594 3.2744 -2.47255 -0.33031 0
5  101869 (1999 MM) 56800 1.6243 0.61071 0.083167 4.68848 1.9388 0.75843 0.6323 2.62 2.07 19.3 0.001741   APO   TRUE 1.35497 2.0445 -0.64413  1.25637 0
6 101873 (1999 NC5) 56800 2.0295 0.39332 0.798850 5.15317 2.2489 5.65453 1.2312 2.83 2.89 16.3 0.437678   AMO  FALSE 5.33502 4.9613  0.38531 -1.51575 0

Note that the Z component is zero for all of the orbits since this is the coordinate which is perpendicular to the orbital planes (and all of the orbits lie strictly within the fundamental plane of their respective coordinate systems).

Now we need to generate a transformation matrix for each orbit. We will wrap that up in a function.

> transformation.matrix <- function(w, Node, i) {
+   a11 =   cos(w) * cos(Node) - sin(w) * cos(i) * sin(Node)
+   a12 =   cos(w) * sin(Node) + sin(w) * cos(i) * cos(Node)
+   a13 =   sin(w) * sin(i)
+   
+   a21 = - sin(w) * cos(Node) - cos(w) * cos(i) * sin(Node)
+   a22 = - sin(w) * sin(Node) + cos(w) * cos(i) * cos(Node)
+   a23 =   cos(w) * sin(i)
+   
+   a31 =   sin(i) * sin(Node)
+   a32 = - sin(i) * cos(Node)
+   a33 =   cos(i)
+   
+   matrix(c(a11, a12, a13, a21, a22, a23, a31, a32, a33), nrow = 3, byrow = TRUE)
+ }

First we will try this out for a single object.

> (A = do.call(transformation.matrix, orbits[1, c("w", "Node", "i")]))
          [,1]       [,2]     [,3]
[1,]  0.018683  0.9981068 0.058599
[2,] -0.961656  0.0018992 0.274251
[3,]  0.273620 -0.0614756 0.959871
> #
> # We want to invert this transformation, so we need to invert the matrix. However, since this is a
> # rotation matrix, the inverse and the transpose are the same... (we will use this fact later!)
> #
> A = solve(A)
> #
> A %*% matrix(unlist(orbits[1, c("X", "Y", "Z")]), ncol = 1)
         [,1]
[1,] -1.65031
[2,] -2.97006
[3,]  0.28022

That looks pretty reasonable. Now we need to apply the same transformation systematically to all of the objects.

> require(plyr)
Loading required package: plyr
> 
> orbits = ddply(orbits, .(Object), function(df) {
+   A = transformation.matrix(df$w, df$Node, df$i)
+   #
+   # Invert transformation matrix
+   #
+   A = t(A)
+   #
+   # Transform to reference frame
+   #
+   r = A %*% matrix(unlist(df[, c("X", "Y", "Z")]), ncol = 1)
+   #
+   r = matrix(r, nrow = 1)
+   colnames(r) = c("x", "y", "z")
+   #
+   cbind(df, r)
+ })
> head(orbits)
             Object Epoch      a       e        i       w   Node       M     q.    Q    P    H     MOID class hazard       E  theta        X        Y Z        x        y         z
1  100004 (1983 VA) 56800 2.5959 0.69970 0.284254 0.21050 1.3498 1.40950 0.7795 4.41 4.18 16.4 0.176375   APO  FALSE 2.03512 2.6336 -2.97885  1.65824 0 -1.65031 -2.97006  0.280216
2 100085 (1992 UY4) 56800 2.6398 0.62481 0.048926 0.66896 5.3826 0.22042 0.9904 4.29 4.29 17.8 0.015111   APO   TRUE 0.54358 1.0511  0.60993  1.06602 0  0.83726  0.89659  0.059398
3 100756 (1998 FM5) 56800 2.2686 0.55202 0.201156 5.43895 3.0869 4.82500 1.0163 3.52 3.42 16.1 0.100211   APO  FALSE 4.31582 3.8282 -2.12857 -1.74482 0  2.69099 -0.57127  0.086300
4  100926 (1998 MQ) 56800 1.7827 0.40777 0.422864 2.42053 3.8599 3.42868 1.0558 2.51 2.38 16.7 0.128731   AMO  FALSE 3.34594 3.2744 -2.47255 -0.33031 0 -2.39321 -0.41532 -0.568055
5  101869 (1999 MM) 56800 1.6243 0.61071 0.083167 4.68848 1.9388 0.75843 0.6323 2.62 2.07 19.3 0.001741   APO   TRUE 1.35497 2.0445 -0.64413  1.25637 0 -1.02820  0.96622  0.050998
6 101873 (1999 NC5) 56800 2.0295 0.39332 0.798850 5.15317 2.2489 5.65453 1.2312 2.83 2.89 16.3 0.437678   AMO  FALSE 5.33502 4.9613  0.38531 -1.51575 0  1.29744 -0.50408 -0.713095

Looks good. Time to generate some plots. First we will look at the distribution of objects in the solar-ecliptic plane. Most of them are clustered within a few AU of the Sun, but there are a few stragglers at much larger distances.

> library(ggplot2)
> 
> ggplot(orbits, aes(x = x, y = y)) +
+   geom_point(alpha = 0.25) +
+   xlab("x [AU]") + ylab("y [AU]") +
+   scale_x_continuous(limits = c(-35, 35)) + scale_y_continuous(limits = c(-55, 15)) +
+   theme_classic() +
+   theme(aspect.ratio = 1)

reference-frame-xy-full

We can zoom in on the central region within 5 AU of the Sun. We will also produce a projection onto the x-z plane so that we can see how far these objects lie above or below the ecliptic plane.

> library(grid)
> 
> ggplot(orbits, aes(x = x, y = y)) +
+   geom_point(aes(colour = hazard), alpha = 0.25) +
+   scale_colour_discrete("Hazard", h = c(180, 0), c = 150, l = 40) +
+   xlab("x [AU]") + ylab("y [AU]") +
+   scale_x_continuous(limits = c(-5, 5)) + scale_y_continuous(limits = c(-5, 5)) +
+   annotation_custom(grob = circleGrob(
+     r = unit(0.5, "npc"), gp = gpar(fill = "transparent", col = "black", lwd = 2, lty = "dashed")),
+     xmin=-1, xmax=1, ymin=-1, ymax=1) +
+   theme_classic() +
+   theme(aspect.ratio = 1)
Warning: Removed 17 rows containing missing values (geom_point).
> 
> ggplot(orbits, aes(x = x, y = z)) +
+   geom_point(aes(colour = hazard), alpha = 0.25) +
+   scale_colour_discrete("Hazard", h = c(180, 0), c = 150, l = 40) +
+   xlab("x [AU]") + ylab("z [AU]") +
+   scale_x_continuous(limits = c(-5, 5)) + scale_y_continuous(limits = c(-5, 5)) +
+   theme_classic() +
+   theme(aspect.ratio = 1)
Warning: Removed 7 rows containing missing values (geom_point).

The dashed circle indicates the orbit of the Earth. Most of the objects are clustered around this orbit.

reference-frame-xy-zoom

Although there are numerous objects which lie a few AU on either side of the ecliptic plane, the vast majority are (like the planets) very close to or on the plane itself.

reference-frame-xz-zoom

Finally a three dimensional visualisation. I would have liked to rotate this plot to try and get a better perspective, but I could not manage that with scatterplot3d. It’s probably not the right tool for that particular job.

> library(scatterplot3d)
> 
> orbits.3d <- with(orbits, scatterplot3d(x, y, z, xlim = c(-5, 5), ylim = c(-5, 5), highlight.3d = TRUE))
> orbits.3d$plane3d(c(0, 0, 0))

reference-frame-zoom-scatter-3d
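
For an interactive view that can be rotated with the mouse, something like the rgl package could be used instead. This is just a pointer, not part of the original analysis.

library(rgl)
#
with(subset(orbits, abs(x) < 5 & abs(y) < 5),
     plot3d(x, y, z, xlim = c(-5, 5), ylim = c(-5, 5), zlim = c(-5, 5),
            col = ifelse(hazard, "red", "blue"), size = 3))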

I have enjoyed working with these data. My original plan was to leave the project at this point. But, since the data are neatly classified according to two different schemes, it presents a good opportunity to try out some classification models. More on that shortly.