Comrades Marathon Finish Predictions

* If you see a bunch of [Math Processing Error] errors, you might want to try opening the page in a different browser. I have had some trouble with MathJax and Internet Explorer. Yet another reason to never use Windows.

There are various approaches to predicting Comrades Marathon finishing times. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Let's give them a test. I finished the 2013 Comrades Marathon in 09:41. Based on my marathon time from February that year, which was 03:38, Parry's formula suggests that I should have finished at around 09:07. Throwing in my Two Oceans time for that year, 04:59, and a 21.1 km time of 01:58 a few weeks before Comrades, the Sports Digest calculator gives a projected finish time of 08:59. Clearly, relative to both of those predictions, I under-performed that year! Either that or the predictions were way off the mark.

It seems to me that, given the volume of data we gather on our runs, we should be able to generate better predictions. If the thought of maths or data makes you want to doze off, feel free to jump ahead, otherwise read on.

Riegel's Formula

In 1977 Peter Riegel published a formula for predicting running times, which became popular due to its simplicity. The formula itself looks like this:

 \Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^{1.06}


which allows you to predict \Delta t_2, the time it will take you to run distance d_2, given that you know it takes you time \Delta t_1 to run distance d_1. Riegel called this his "endurance equation".
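To make this concrete, here's the formula as a small R function (a minimal sketch; the riegel() helper and its argument names are my own):

riegel <- function(t1, d1, d2, exponent = 1.06) {
  # Riegel's endurance equation: predict the time over distance d2, given a
  # known time t1 over distance d1.
  t1 * (d2 / d1)^exponent
}

# For example, projecting a 03:38 marathon (about 3.63 hours over 42.2 km) out
# to the 89 km of Comrades gives roughly 8 hours.
riegel(3.63, 42.2, 89)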

Reverse-Engineering Riegel's Model

Riegel's formula is an empirical model: it's based on data. In order to reverse engineer the model we are going to need some data too. Unfortunately I do not have access to data for a cohort of elite runners. However, I do have ample data for one particular runner: me. Since I come from the diametrically opposite end of the running spectrum (I believe the technical term would be "bog standard runner"), I think these data are probably more relevant to most runners anyway.

I compiled my data for the last three years based on the records kept by my trusty Garmin 910XT. A plot of time versus distance is given below.

[Figure: training-data-linear]

At first glance it looks like you could fit a straight line through those points. And you can, indeed, make a pretty decent linear fit.

> fit <- lm(TimeHours ~ Distance, data = training)
> 
> summary(fit)

Call:
lm(formula = TimeHours ~ Distance, data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64254 -0.04592 -0.00618  0.02361  1.24900 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.1029964  0.0107648  -9.568   <2e-16 ***
Distance     0.1012847  0.0008664 116.902   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1394 on 442 degrees of freedom
Multiple R-squared:  0.9687,	Adjusted R-squared:  0.9686 
F-statistic: 1.367e+04 on 1 and 442 DF,  p-value: < 2.2e-16

However, applying a logarithmic transform to both axes gives a more uniform distribution of the data, which also now looks more linear.

[Figure: training-data-log]

Riegel observed that data for a variety of disciplines (running, swimming, cycling and race walking) conformed to the same pattern. Figure 1 from his paper is included below.

Figure 1 from Riegel's paper "Athletic Records and Human Endurance".

If we were to fit a straight line to the data on logarithmic axes then the relationship we'd be contemplating would have the form

\log \Delta t = m \log d + c


or, equivalently,

\Delta t = k d^m


which is a power law relating elapsed time to distance. It's pretty easy to get Riegel's formula from this. Taking two particular points on the power law, \Delta t_1 = k d_1^m and \Delta t_2 = k d_2^m, and eliminating k gives

\Delta t_2 = \Delta t_1 \left( \frac{d_2}{d_1} \right)^m


which is Riegel's formula with an unspecified value for the exponent. We'll call the exponent the "fatigue factor" since it determines the degree to which a runner slows down as distance increases.

How do we get a value for the fatigue factor? Well, by fitting the data, of course!

> fit <- lm(log(TimeHours) ~ log(Distance), data = training)
> 
> summary(fit)

Call:
lm(formula = log(TimeHours) ~ log(Distance), data = training)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27095 -0.04809 -0.01843  0.01552  0.80351 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -2.522669   0.018111  -139.3   <2e-16 ***
log(Distance)  1.045468   0.008307   125.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09424 on 442 degrees of freedom
Multiple R-squared:  0.9729,	Adjusted R-squared:  0.9728 
F-statistic: 1.584e+04 on 1 and 442 DF,  p-value: < 2.2e-16

The fitted value for the exponent is 1.05 (rounding up), which is pretty close to the value in Riegel's formula. The fitted model is included in the logarithmic plot above as the solid line, with the 95% prediction interval indicated by the coloured ribbon. The linear plot below shows the data points used to train the model, the fit and prediction interval, as well as dotted lines for constant paces ranging from 04:00 per km to 07:00 per km.

[Figure: training-data-linear-model]

Our model also provides us with an indication of the uncertainty in the fitted exponent: the 95% confidence interval extends from 1.029 to 1.062.

> confint(fit)
                  2.5 %    97.5 %
(Intercept)   -2.558264 -2.487074
log(Distance)  1.029142  1.061794
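Since the model was fitted on logarithmic scales, predictions on the original scale are obtained by exponentiating. A minimal sketch, assuming the fitted object from above:

# Predicted time (in hours) for an 89 km run, with a 95% prediction interval.
# predict() returns values on the log scale, so exp() maps back to hours.
exp(predict(fit, newdata = data.frame(Distance = 89), interval = "prediction"))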

Estimating the Fatigue Factor

A fatigue factor of 1 would correspond to a straight line, which implies a constant pace regardless of distance. A value less than 1 implies a faster pace at longer distances (rather unlikely in practice). Finally, a value larger than 1 implies a progressively slower pace at longer distances. For example, with an exponent of 1.06, doubling the distance multiplies the predicted time by 2^1.06 ≈ 2.08, which corresponds to a pace roughly 4% slower.

The problem with the fatigue factor estimate above, which was based on a single model fit to all of the data, is that it's probably biased by the fact that most of the data are for runs of around 10 km. In this regime the relationship between time and distance is approximately linear, so that the resulting estimate of the fatigue factor is probably too small.

To get around this problem, I employed a bootstrap technique, creating a large number of subsets from the data. In each subset I weighted the samples to ensure that there was a more even distribution of distances in the mix. I calculated the fatigue factor for each subset, resulting in a range of estimates. Their distribution is plotted below.
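I won't reproduce the full details here, but the procedure was along the following lines (a rough sketch: the inverse-density weighting and the number of replicates are my assumptions):

# Weight each run inversely to the local density of its distance, so that the
# abundant ~10 km runs do not dominate every resample.
dens <- density(training$Distance)
weights <- 1 / approx(dens$x, dens$y, xout = training$Distance)$y

# For each replicate, fit the log-log model to a weighted resample of the data
# and record the slope, which is the fatigue factor.
fatigue.factor <- sapply(1:10000, function(i) {
  resample <- training[sample(nrow(training), replace = TRUE, prob = weights), ]
  coef(lm(log(TimeHours) ~ log(Distance), data = resample))[2]
})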

[Figure: fatigue-factor-bootstrap]

According to this analysis, my personal fatigue factor is around 1.07 (the median value, indicated by the dashed line in the plot above). The Shapiro-Wilk test suggests that the data are sufficiently non-Normal to justify a non-parametric estimate of the 95% confidence interval for the fatigue factor, which runs from 1.03 to 1.11.

> shapiro.test(fatigue.factor)

	Shapiro-Wilk normality test

data:  fatigue.factor
W = 0.9824, p-value = 1.435e-12

> median(fatigue.factor)
[1] 1.072006
> quantile(fatigue.factor, c(0.025, 0.975))
    2.5%    97.5% 
1.030617 1.107044 

Riegel's analysis also led to a range of values for the fatigue factor. As can be seen from the table below (extracted from his paper), the values range from 1.01 for Nordic skiing to 1.14 for roller skating. Values for running range from 1.05 to 1.08, depending on age group and gender.

Table 1 from Riegel's paper "Athletic Records and Human Endurance".

Generating Predictions

The rules mentioned above for predicting finishing times are generally applied to data for a single race (or a selection of three races). But, again, given that we have so much data on hand, would it not make sense to generate a larger set of predictions?

[Figure: predicted finishing times for the Comrades Marathon]

The distributions above indicate the predictions for this year's Comrades Marathon (which is apparently going to be 89 km) based on all of my training data this year and using both the default (1.06) and personalised (1.07) values for the fatigue factor. The distributions are interesting, but what we are really interested in is the expected finish times, which are 08:59 and 09:18 depending on what value you use for the fatigue factor. I have a little more confidence in my personalised value, so I am going to be aiming for 09:18 this year.
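For those interested in the mechanics, a set of predictions like this can be generated by projecting every training run out to race distance (a sketch using the riegel() helper from earlier; the full analysis involves a little more than this):

# Project each training run to the 89 km Comrades distance using both the
# default and personalised fatigue factors.
pred.default  <- riegel(training$TimeHours, training$Distance, 89, exponent = 1.06)
pred.personal <- riegel(training$TimeHours, training$Distance, 89, exponent = 1.07)

# Summarise the resulting distributions of predicted finish times (in hours).
median(pred.default)
median(pred.personal)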

Comrades is a long day and a variety of factors can affect your finish time. It's good to have a ball-park idea though. If you would like me to generate a set of personalised predictions for you, just get in touch via the replies below.

Postscript

I repeated the analysis for one of my friends and colleagues. His fatigue factor also comes out at around 1.07 although, interestingly, the distribution is bi-modal. I think I understand the reason for this: his runs fall clearly into two groups, namely training runs and short runs back and forth between work and the gym.

[Figure: fatigue-factor-bootstrap-karl-wilhelm]

Comrades Runners Disqualified: I'm Not Convinced

Apparently 14 runners have been disqualified for failing to complete the full distance at the 2012 and 2013 Comrades Marathons.

I'm not convinced.

Although 20 runners were charged with misconduct, six of them had a valid story. These runners had retired from the race, but the bailers' bus had dropped them back on the course and they were "forced" to cross the finish line. I find this story a little hard to digest. My understanding is that race numbers are either confiscated, destroyed or permanently marked when entering the bus. If this story is true then it should have been immediately obvious to officials that the runners in question had dropped out of the race. Their times should never have been recorded and they certainly should not have received medals (which presumably at least some of them did, since they have been instructed to return them!).

So that leaves the 14 runners who were disqualified. Were any of them among the group of mysterious negative splits identified previously? Unless KZNA or the CMA releases the names or race numbers, I guess we'll never know.

Encyclopaedia: Meteorites in Antarctica

A contribution which I wrote for Antarctica and the Arctic Circle: A Geographic Encyclopedia of the Earth's Polar Regions.

Interplanetary space is not a vacuum: it is teeming with objects of various sizes, from microscopic electrons and ions to large pieces of space debris. Some of these objects end up on earth. A meteoroid is a natural object with size between 10 μm and 1 m moving through space. Asteroids are bigger than meteoroids but smaller than a planet, and without their own atmosphere. Most meteoroids have a mass of less than a few grams, originating as fragments from asteroids, comets, and planets. They are often concentrated in a meteoroid stream, which is the debris left in the trail of a comet. A meteoroid or asteroid that enters the earth’s atmosphere produces a meteor (or shooting star), which is the visible trail formed as it falls through the atmosphere. When the earth passes through a meteoroid stream, a meteor shower is observed.

Meteoroids enter the earth’s atmosphere at speeds up to about 156,585 mph (251,999 km/h). Speed, size, incidence angle, and degree of fragmentation determine whether or not a meteoroid reaches the ground. Those meteoroids that penetrate the surface of the earth and survive the impact are known as meteorites. Meteorites are found everywhere on earth, but most have been discovered in Antarctica where they are clearly visible on the snowy surface or embedded in ice.

The flux of meteoritic material to the surface of the earth is between 10⁴ and 10⁶ kg per year. A meteorite with a size of 3.3 ft. (1 m) hits the earth on average once per year. Smaller meteorites are more common, while larger ones are relatively rare. Larger meteorites can produce extensive craters like the Vredefort crater in South Africa, which is 186 miles (300 km) in diameter. A meteorite with a size of 3,280 ft. (1,000 m) would have a cataclysmic effect, but fortunately, such impacts are rare: roughly once every million years. Many hundreds of meteorites, each with a mass of a few grams, strike the earth every day. The majority of meteorites originate from tiny meteoroids that are so light that they slow down rapidly in the atmosphere. These are deposited as microscopic dust.

Meteorites are classified as stony (94% of all meteorites; composed of silicate minerals), iron (5%; alloys of nickel and iron), or stony-iron (1%; composites of stony and metallic materials). Although stony meteorites are the most common, they are more difficult to identify since they are similar to normal terrestrial rocks. Iron meteorites are readily identified since they are magnetic and denser than most terrestrial rocks. They are also the most likely to reach the ground since they seldom fragment.

Initial estimates of meteorite flux were made from eyewitness accounts. Such meteorite falls (where the meteor trail was observed) are naturally more common in densely populated areas. Unfortunately, such estimates are bedeviled by large uncertainties. Recent meteorite flux data have been recorded by camera networks. These too suffer from difficulties in calibrating the images of meteor trails to meteorite masses. However, meteorite finds (where the meteor trail was not observed, but the meteorite itself was discovered) present a robust history that extends back thousands of years. The hot deserts and Antarctica have favorable conditions for meteorite accumulation: weathering occurs slowly, and meteorites are easily recognized against a uniform background that is relatively free of other rocks. Meteorites can survive in these regions for more than 100,000 years, allowing variations in meteorite flux and mass distribution to be estimated. Some meteorites found in Antarctica have been there for more than 1 million years. Furthermore, there is evidence to suggest that some of the meteorites found in Antarctica contain material that originated either from the Moon or Mars.

More than 30,000 meteorites have been retrieved from Antarctica by various expeditions during the past 30 years. This exceeds the entire population of meteorites collected over the rest of the earth. Differences in the populations of recent meteorite falls and Antarctic meteorite finds indicate that the characteristics of meteorites landing on earth are changing with time as debris from the early solar system is progressively swept up.

The fate of a meteorite that falls onto Antarctica depends on the nature of the surface and the typical weather conditions. If it falls on ice, then it may remain exposed for a long time. Alternatively, it might become covered by snow and ultimately trapped in solid ice. There is a general flow of ice toward the periphery of the continent, so that these meteorites are eventually deposited in the Southern Ocean. The motion of the ice sheet can be slowed by mountain ranges, and meteorites embedded in the ice become concentrated in stranding zones. If the ice either evaporates via sublimation or is ablated by the wind, trapped meteorites become exposed, leading to high meteorite concentrations. The history of meteorite falls in the Antarctic can be deduced from stranding zones and ice cores.

A Sankey Plot with Uniform Coloured Edges

This is a follow-up to my previous post about generating Sankey plots with the riverplot package. It's also possible to generate plots which have uniformly coloured edges.

Here's how (using some of the data structures from the previous post too):

library(riverplot)

# Sample a colour for each edge and point riverplot at the "col" field, so that
# each edge gets a single solid colour rather than a gradient between nodes.
edges$col <- sample(palette, size = nrow(edges), replace = TRUE)
edges$edgecol <- "col"

river <- makeRiver(nodes, edges)
style <- list(nodestyle = "invisible")

# Finally, plot the diagram (a call along these lines):
riverplot(river, default_style = style)

[Figure: riverplot-example-constant-edge]

Encyclopaedia: Discovery Expedition

A contribution which my wife and I wrote for Antarctica and the Arctic Circle: A Geographic Encyclopedia of the Earth's Polar Regions.

The British National Antarctic Expedition, now known as the Discovery Expedition, took place between 1901 and 1904. During this time, other expeditions from Germany, Sweden, France, and Scotland also traveled to Antarctica. The Discovery Expedition was carried out under the auspices of the Royal Society and the Royal Geographical Society, and was intended to renew Great Britain’s involvement in the exploration of Antarctica.

The Discovery Expedition had both exploratory and scientific objectives: the expedition was to forge as far south as possible and to conduct oceanographic, meteorological, magnetic, biological, and geological observations en route.

The projected cost of the expedition was £90,000, half of which was sourced from the British Government, while the balance was raised by the two royal societies. The expedition vessel, the Discovery, was a three-masted wooden sailing ship, specifically designed for research in Antarctic conditions. She was constructed in Dundee at a cost of around £51,000.

The ship’s crew was recruited from the Royal Navy, the Merchant Marine, and the civilian population. Robert Falcon Scott (1868–1912) was appointed as commander of the Discovery and leader of the expedition. Other noteworthy members of the crew were Ernest Shackleton (1874–1922), Tom Crean (1877–1938), and John Robert Francis “Frank” Wild (1873–1939). The scientific team consisted of Louis Bernacchi (meteorologist and magnetic observer), Hartley Ferrar (geologist), Thomas Hodgson (marine biologist), Edward Wilson (junior doctor and zoologist), and Reginald Koettlitz (senior doctor). Of these, only Bernacchi had previous Antarctic experience.

Discovery departed from Cowes on August 6, 1901, traveling via Cape Town to Lyttelton, New Zealand. There, she prepared for the journey into the Southern Ocean, taking on supplies that included 45 live sheep donated by local farmers. After three weeks in port, she departed, sailing south. She stopped at Cape Adare and then Cape Crozier, which was to serve as a point where messages detailing the location of the expedition could be relayed to relief ships. Discovery then proceeded eastward along the Victoria Barrier (now known as the Ross Ice Shelf), before entering McMurdo Sound on February 8, 1902. Here she dropped anchor and soon became frozen into the sea ice, where she would be trapped for two years. Huts were erected on a nearby rocky peninsula. The larger hut was intended to act as accommodation. However, it was not conducive to habitation, and the expedition personnel continued to live on the ship. Two smaller huts housed scientific instruments including magnetometers and a seismograph.

The scientists, who had made measurements on the journey south, continued to take meteorological and magnetic observations during the winter. Preparations were also made for work to be undertaken during the next summer.

The men, who had little or no experience of skiing or dog sledging, began to practice these skills. They underestimated the dangerous conditions and, on a training journey, one member of the party fell to his death during a blizzard. Two overland journeys took place once the winter had passed. A team led by Charles Royds traveled to Cape Crozier to leave a message for a relief ship. When they arrived, they discovered a colony of emperor penguins. Scott, Wilson, and Shackleton also set off across the ice shelf with the goal of getting as far south as possible. Due to their inexperience with polar travel, progress was slow, and the condition of both men and dogs deteriorated rapidly. The weaker dogs were killed and fed to the other dogs. They reached 82°17′S before turning back. This was the closest that anybody had yet got to the South Pole. After the last of the dogs died, the men were forced to haul the sleds themselves. Shackleton was unable to assist since he was overcome with scurvy. The team returned to the Discovery on February 3, 1903.

While Scott’s party was away, a relief ship arrived with supplies and a letter addressed to him authorizing the expedition to stay for a second year. A number of the men, including the ailing Shackleton, returned home with the relief ship. The Discovery spent a second winter in the ice.

In the spring of 1903, a group led by Scott once again set out, heading westward toward the South Magnetic Pole. Since there were no dogs left, it was a man-hauling journey. Their progress, however, was appreciably better than on the previous journey. After ascending a large glacier, they reached the Polar Plateau at a height of around 6,500 ft. (2,000 m). On the return journey, they discovered a valley completely free of snow. The existence of such dry valleys was not previously known.

Upon Scott’s return, the Discovery was still stuck in the ice despite extensive efforts to free her. Two relief vessels soon arrived bearing instructions to Scott to abandon the Discovery if he was not able to free her quickly. Explosives and ice saws were employed to break up the ice around the ship. When she still would not budge, the men started to transfer important items to the relief ships. However, suddenly on Valentine’s Day, 1904, the ice around the Discovery broke up, and she was free. On February 17, 1904, she began her journey home. Returning via Lyttelton, the Discovery arrived in Portsmouth on September 10, 1904, and soon sailed for London, where she docked with little fanfare.

After his return, Scott was promoted to captain and received an array of awards. A number of the officers and crew also received the Polar Medal. Among the various outcomes of the expedition, the highlights were the discovery of the Polar Plateau, a dry valley, and an emperor penguin colony; evidence suggesting that the ice barrier was not seated on land but floating on the sea; and magnetic measurements that allowed a better estimate of the location of the South Magnetic Pole. Numerous new geological and biological specimens were also found.

Bags, Balls and the Hypergeometric Distribution

A friend came to me with a question. The original question was a little complicated, but in essence it could be explained in terms of the familiar urn problem. So, here's the problem: you have an urn with 50 white balls and 9 black balls. The black balls are individually numbered. Balls are drawn from the urn without replacement. What is the probability that

  1. the last ball drawn from the urn is a black ball (Event 1) and
  2. when the urn is refilled, the first ball drawn will be the same black ball (Event 2)?

My friend thought that this would be an extremely rare event. My intuition also suggested that it would be rather unlikely. As it turns out, we were both surprised.

Theoretical Approach

First we'll set up the parameters for the problem.

> NBALLS = 59
> NWHITE = 50
> NBLACK = NBALLS - NWHITE

We can calculate the probabilities for each of the events independently. Let's start with Event 2 since it's simpler: the probability of selecting a particular ball from the urn.

> P2 = 1 / NBALLS
> P2
[1] 0.01694915

The probability of Event 1 is a little more complicated. The Hypergeometric distribution describes the probability of achieving a specific number of successes in a specific number of draws from a finite population without replacement. It's precisely the distribution that we are after! We want to know the probability of drawing all of the white balls and all but one of the black balls, so that the last ball remaining is black.

> P1 = dhyper(NWHITE, NWHITE, NBLACK, NBALLS-1)
> P1
[1] 0.1525424

I think that's where my intuition let me down, because my feeling was that this number should have been smaller. The joint probability for the two events is then around 0.26%.

> P1 * P2
[1] 0.002585464

Simulation Approach

Feeling a little uncomfortable with the theoretical result and suspecting that I might have done something rash with the Hypergeometric distribution, I decided to validate the result with a simulation.

> BAG <- c(rep(NA, NWHITE), 1:NBLACK)
> 
> NSIM <- 1000000
> 
> hits <- lapply(1:NSIM, function(n) {
+ 	last <- sample(BAG)[NBALLS]
+ 	first <- sample(BAG, 1)
+ 	#
+ 	last == first
+ })
> #
> hits <- unlist(hits)
> #
> # The resulting vector has entries with the following possible values
> #
> # TRUE -> last ball is black and is the same ball that comes out first from repopulated bag
> # FALSE -> last ball is black and first ball from repopulated bag is black but different
> # NA -> either of these balls is white
> #
> hits <- ifelse(is.na(hits), FALSE, hits)

I also wanted to generate Binomial confidence intervals for the simulated result.

> library(binom)
> stats <- binom.confint(x = cumsum(hits), n = 1:NSIM, methods = "wilson")

The simulated results are plotted below with the theoretical result indicated by the dashed line. The pale blue ribbon indicates the range of the 95% confidence interval.

[Figure: black-white-balls]

Looks like both theory and simulation confirm that my intuition was wrong. I'd prefer not to say just how badly wrong it was.

Postscript: after the excitement of the Hypergeometric distribution had subsided, it occurred to me that the symmetry of the problem allowed for a simpler solution. The probability of Event 1 is actually the same as the probability of drawing any one of the black balls first out of a full urn. Don't believe me?

> NBLACK / NBALLS
[1] 0.1525424
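The same check can be run through the simulation machinery above: white balls are NA in the BAG representation, so the last ball is black precisely when it is not NA.

# Estimate P(last ball is black) directly; this should come out close to 0.1525.
mean(!is.na(replicate(100000, sample(BAG)[NBALLS])))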

Comrades Marathon Pacing Chart: Up Run

I've updated my Comrades Marathon pacing chart to include both the Up and Down runs. You can grab it here. The data for this year's race are not yet finalised (I think we will be running 87 km), but you can make changes when it's all confirmed.

The use of the chart is explained in a previous post. Any feedback on how this can be improved would be appreciated.

[Figure: comrades-pacing-2015]

The Price of Fuel: How Bad Could It Get?

The cost of fuel in South Africa (and I imagine pretty much everywhere else) is a contentious topic. It varies from month to month and, although it is clearly related to the price of crude oil and the exchange rate, various other forces play an influential role.

According to the Department of Energy, the majority of South Africa's fuel is refined from imported crude oil. The rest is synthetic fuel produced locally from coal and natural gas. The largest expense for a fuel refinery is raw materials, so it's little wonder that the cost of fuel is intimately linked to the price of crude oil. The price of crude oil on the international market is quoted in US Dollars (USD) per barrel, so the exchange rate between the South African Rand and the US Dollar (ZAR/USD) also exerts a strong influence on the South African fuel price.

I am going to adopt a simplistic model in which I'll assume that the price of South African fuel depends on only those two factors: crude oil price and the ZAR/USD exchange rate. We'll look at each of them individually before building the model.

Exchange Rate

The Rand reached a weak point against the USD in January 2002 when $1 cost around R11.70. After that it recovered and was relatively strong in 2005. However, since then it has been systematically weakening (with the exception of some brief intervals of recovery). At present (end of March 2015) you will pay around R12 for $1.

I guess this is part of the reason why new running shoes are so damn expensive.

[Figure: date-delta-ZARUSD]

Crude Oil Price

Crude oil has been getting progressively more expensive. The price of around $25 per barrel in 2000 escalated to a peak of roughly $135 per barrel in August 2008. This was followed by a massive decrease in price, bringing it down to only approximately $40 per barrel at the beginning of 2009. We have recently seen a similar dramatic decrease in the crude oil price (unfortunately, not as dramatic as the one in 2008).

[Figure: date-delta-crude-oil]

Fuel Price

Back in 2000 you could get a litre of fuel in South Africa for around R3. Those were halcyon days. Since then the cost of fuel has been steadily increasing, with the exception of a sharp spike in 2008 and the recent decline starting towards the end of 2014.

[Figure: diesel-petrol-time]

Building Simple Models

If you're not interested in the technical details of the model building process, feel free to skip forward to the results.

There is a clear linear relationship between the fuel price and the price of crude oil (plot on the left below, with Diesel in blue and Petrol in red). A power law relationship actually gives a tighter fit (see log-log plot on the right below). However, in the interests of simplicity we'll stick to linear relationships.

[Figure: diesel-petrol-crude-oil]

We'll put together three simple (and probably naive) linear models for the fuel price:

  • an additive linear model depending on exchange rate and crude oil price;
  • the same model but including the interaction between exchange rate and crude oil price; and
  • a model depending only on the above interaction.

The first model assumes that the influences of exchange rate and crude oil price are independent. A moment's thought reveals that this cannot be a particularly good model, since the effects of these two quantities must be inextricably linked. Nevertheless, as we will see, the model provides a relatively good fit to the data, although it probably will not extrapolate well.

> fit.1 <- lm(ULP_Inland_95 ~ DollarPerBarrel + ZARUSD, data = fuel.data)
> summary(fit.1)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel + ZARUSD, data = fuel.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1686 -0.2130  0.1724  0.4877  1.1874 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -6.210351   0.446422  -13.91   <2e-16 ***
DollarPerBarrel  0.075478   0.003091   24.42   <2e-16 ***
ZARUSD           1.093711   0.051210   21.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7103 on 105 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9305,	Adjusted R-squared:  0.9292 
F-statistic: 702.9 on 2 and 105 DF,  p-value: < 2.2e-16

> AIC(fit.1)
[1] 237.5634

The second model considers both the independent (additive) effects of exchange rate and crude oil price as well as their joint (multiplicative) effect. The p-values in the summary output below indicate that only the multiplicative term is statistically significant. This provides a motivation for the third model.

> fit.2 <- lm(ULP_Inland_95 ~ DollarPerBarrel * ZARUSD, data = fuel.data)
> summary(fit.2)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel * ZARUSD, data = fuel.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2018 -0.3598  0.1219  0.4863  1.0814 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.285115   1.599475   1.429    0.156    
DollarPerBarrel        -0.019958   0.017625  -1.132    0.260    
ZARUSD                  0.050663   0.195616   0.259    0.796    
DollarPerBarrel:ZARUSD  0.011591   0.002115   5.481 2.97e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6287 on 104 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9461,	Adjusted R-squared:  0.9445 
F-statistic: 608.3 on 3 and 104 DF,  p-value: < 2.2e-16

> AIC(fit.2)
[1] 212.1551

The third and final model uses only the interaction between exchange rate and crude oil price. The summary output below indicates that for this model the coefficients associated with both the intercept and interaction terms are significant.

> fit.3 <- lm(ULP_Inland_95 ~ DollarPerBarrel:ZARUSD, data = fuel.data)
> summary(fit.3)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel:ZARUSD, data = fuel.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.06270 -0.27476  0.08142  0.44747  1.35974 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.9443004  0.2043706   9.514 6.98e-16 ***
DollarPerBarrel:ZARUSD 0.0102018  0.0002601  39.224  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6808 on 106 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9355,	Adjusted R-squared:  0.9349 
F-statistic:  1539 on 1 and 106 DF,  p-value: < 2.2e-16

> AIC(fit.3)
[1] 227.4323

Based on standard metrics all three of these models are reasonable approximations to the data. The principle of parsimony suggests that of Models 2 and 3 the latter should be preferred (especially in light of the fact that the coefficients for the extra terms in Model 2 are not significant).
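For convenience, the AIC values quoted above can be gathered into a single comparison (smaller is better):

# Compare the three models by AIC. The values below are those reported above.
AIC(fit.1, fit.2, fit.3)
#        df      AIC
# fit.1   4 237.5634
# fit.2   5 212.1551
# fit.3   3 227.4323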

The quantile-quantile plots for the standardised residuals of the three models are given below. Both Model 1 and Model 2 have residuals which are skewed to the left. The distribution of residuals for Model 3 is close to normal but with heavy tails (large kurtosis).

[Figure: model-residuals-quantile]
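Plots along these lines can be produced directly from the fitted objects. A minimal sketch for Model 3:

# Quantile-quantile plot of the standardised residuals for Model 3.
qqnorm(rstandard(fit.3))
qqline(rstandard(fit.3))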

The next set of plots show the relative errors for Models 2 and 3. For Model 3 there is a gradient from negative errors in the top left corner of the plot to positive errors in the bottom right corner. This indicates that Model 3 is systematically failing to capture some of the underlying variation in the data. The relative error for Model 2 does not display a similar pattern.

[Figure: model-relative-error-scatter]

So, despite the simplicity of Model 3, Model 2 provides a better description of the data. The pattern of relative errors for Model 1 is similar to that for Model 2.

Model Projections

So what do these models tell us?

The plots below show the predictions based on Models 1 and 2. The points indicate the data used to derive the models. The dashed lines denote contours of constant fuel price. The linear and hyperbolic shapes of the contours reflect the form of the underlying models.

It's dangerous to extrapolate too far beyond the domain covered by your data. However, the general trend of both models is obvious: increases in either the crude oil price or the ZAR/USD exchange rate will result in higher fuel costs. Let's consider some individual test cases. Suppose the price of crude oil remained around 60 USD per barrel but the ZAR/USD exchange rate went up to R15 per $1. In this case Model 1 predicts that the fuel price would reach R14.70 per litre, while Model 2 predicts R13.45 per litre. We've been in this regime before and it wasn't pleasant! Another scenario to consider would be the ZAR/USD exchange rate staying at around R12 per $1 but the price of crude oil going up to 160 USD per barrel. In this case Model 1 predicts that the fuel price would reach R19.00 per litre, while Model 2 predicts R21.10 per litre. Either of those is rather daunting. Finally, what would happen if the price of crude oil went up to 160 USD per barrel and the ZAR/USD exchange rate hit 15? Under these conditions Model 1 predicts R22.30 per litre, while Model 2 gives R26.20 per litre. Those are both pretty appalling!
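The scenario predictions can be generated by applying predict() to the fitted models (a minimal sketch; the figures quoted above presumably came from something similar):

# The three scenarios discussed above: (crude price, ZAR/USD exchange rate) =
# (60, 15), (160, 12) and (160, 15).
scenarios <- data.frame(
  DollarPerBarrel = c(60, 160, 160),
  ZARUSD = c(15, 12, 15)
)

predict(fit.1, newdata = scenarios)  # Model 1 predictions (R per litre)
predict(fit.2, newdata = scenarios)  # Model 2 predictions (R per litre)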

[Figure: model-predictions]

Conclusions

These models only consider the effects of the international price of crude oil and exchange rates. Effects like the fuel price increases in March and April 2015, which were based on budgetary decisions, are of course not accounted for. Predictions should thus be treated with a healthy dose of skepticism. But as a general guide to the possible fuel prices we could expect in the future, it's quite illuminating. And not a little bit scary.

Encyclopaedia: Geospace

A contribution which I wrote for Antarctica and the Arctic Circle: A Geographic Encyclopedia of the Earth's Polar Regions. As a side note, I am normally quite pedantic about "Earth" and "Sun" being capitalised but the editor of the Encyclopaedia evidently had different views on this point.

The space immediately around the earth, including the upper atmosphere, ionosphere, and magnetosphere, is known as geospace. This region is of considerable scientific interest since it is the environment inhabited by most artificial (communication, navigation, meteorological, etc.) satellites. In addition to in situ measurements, observations of geospace are commonly made at high latitudes because the earth’s magnetic field focuses phenomena from a large volume of space down onto a relatively small portion of the earth’s surface. Both the Arctic and Antarctic are suitable sites. However, the Antarctic continent provides a better environment for the construction of research facilities. A significant number of countries with bases in Antarctica are engaged in geospace research.

Geospace is filled with plasma, which is a partially or fully ionized gas. As the fourth state of matter, plasma has characteristics that are completely different from those of neutral gas: the motions of charged particles in plasma are strongly influenced by externally imposed magnetic fields, as well as the electric fields from other charged particles and electromagnetic waves. This has important consequences for charged particle dynamics as well as the propagation of electromagnetic waves.

The plasma in geospace is permeated by the earth’s magnetic field and largely controlled by it. The earth’s intrinsic magnetic field is roughly dipolar (similar to that of a bar magnet). Magnetic field lines radiate outward from the surface of the earth near the North Magnetic Pole and extend a great distance into space before converging near the South Magnetic Pole. The magnetic field lines therefore cluster near the poles but are widely separated in the equatorial plane. For example, field lines separated in latitude by 60 miles (100 km) near the magnetic latitude of 60° on the surface of the earth are separated by about 900 miles (1500 km) in the equatorial plane. This means that a relatively small area on the earth’s surface maps to a very large volume of geospace. The magnetic field lines near the magnetic poles therefore act like an enormous funnel, concentrating the influences of most of geospace down onto only a small portion of the surface of the planet. Somewhat surprisingly, the earth’s North Magnetic Pole is currently located in the Southern Hemisphere (64° S 138° E), which is off the coast of Antarctica. The South Magnetic Pole is in the Canadian Arctic.

The geospace environment is linked to the sun by the interplanetary magnetic field (IMF) and the solar wind. The solar wind is plasma that flows at supersonic speeds (typically around 400 km/s) from the surface of the sun. The solar wind exerts pressure on the earth’s magnetic field, confining it to a cavity known as the magnetosphere. The orientation of the IMF and the speed and density of the solar wind, which vary with solar activity, have a marked effect on the state of the magnetosphere. The magnetosphere forms an enormous plasma physics laboratory, orders of magnitude larger than any terrestrial laboratory. This means that large-scale plasma physics phenomena, which are impossible to replicate on earth, can be studied.

Charged particles move easily parallel to magnetic field lines. However, the Lorentz force causes them to gyrate in a plane perpendicular to magnetic field lines. As a result, ions and electrons in the magnetosphere move on spiral paths along magnetic field lines. As a charged particle approaches a magnetic pole, the magnetic field strength increases and the speed of the particle along the magnetic field line is reduced. At some point, the particle is reflected back up the magnetic field line into the magnetosphere. This mirror point is at an altitude determined by the particle’s velocity. If the mirror point is within the neutral atmosphere, then there is a high probability that the particle will collide with a neutral atom or molecule, in which case it will lose energy and will not return to the magnetosphere. Such a particle is said to have precipitated. Positive and negative charged particles also drift in opposite directions around the earth, forming a ring current that causes measurable magnetic effects on the earth’s surface.

The aurora australis and aurora borealis (the southern and northern lights) are caused by energetic electron precipitation. Collisions between these electrons and neutral atoms or molecules in the upper atmosphere leave the neutrals in an excited state from which they subsequently decay by emitting light. Typically, the light is either green, red or violet, depending on the altitude of the emission. Green aurora, the most common, is due to the excitation of atomic oxygen.

Cosmic rays are extremely high-energy particles originating from outer space, generally outside the solar system. Most cosmic rays are atomic nuclei (predominantly hydrogen and helium atoms stripped of their electrons). They produce a cascade of secondary particles when they enter the earth’s atmosphere. Some of these secondaries are able to penetrate the earth’s surface. Since cosmic rays are charged particles, they too are able to access the earth’s lower atmosphere more readily near the magnetic poles. Only the most energetic cosmic rays can penetrate to the ground near the equator.

Certain electromagnetic waves are also affected by the earth’s magnetic field. Whistler-mode waves propagate along magnetic field lines through a plasma. Lightning strokes are the main source of whistler-mode waves. Some electromagnetic energy from lightning penetrates through the ionosphere and enters the magnetosphere, where it propagates in the whistler mode. Magnetic field lines guide the waves to the opposite hemisphere where they penetrate down through the ionosphere to reach the ground. The whistler mode is dispersive, which means that distinct frequencies travel at different speeds. The brief electromagnetic pulse generated by lightning is thus transformed as it passes through the magnetosphere into a unique signal that emerges in the opposite hemisphere. This is called a “whistler.” Measurements of whistler dispersion are used to determine plasma densities at great distances from the earth, which otherwise could be measured only by satellites.