Excel: Copying with Relative Links

Copying a range of cells but keeping the targets of their relative links? It doesn't quite work the way you'd expect. Or hope.

The green cells below each contain a relative link to the "data" cell five places to the left.

copy-rel-links-1

If we simply Copy and Paste those green cells then we see that the relative links move as well. This is often what you want to happen. Often, but not always. What if you want the links in the copied cells to be preserved? None of the various Paste options offered in Excel will do this.

copy-rel-links-2

But there is a way around this problem. First, Copy the range of cells in question and Paste it to a temporary location. Next, Cut the original range of cells and Paste it to the new location. The Cut and Paste action will not alter the relative links.

copy-rel-links-3

Finally Copy and Paste from the temporary location back to the original location. Delete the temporary copy.

copy-rel-links-4

Amazon EC2: Adding Swap

So, after upgrading to R 3.2.0 on my EC2 instance, I was installing newer versions of various packages and I ran into a problem with dplyr: virtual memory exhausted!

Seemed like a good time to add some swap.

Adding Swap and Turning it On

First become root and then make some swap files. I am in favour of creating a few smaller swap files rather than a single monolithic one.

# dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
# dd if=/dev/zero of=/var/swap.2 bs=1M count=1024
# dd if=/dev/zero of=/var/swap.3 bs=1M count=1024

Next you'll set up a swap area on each of these files.

# /sbin/mkswap /var/swap.1
# /sbin/mkswap /var/swap.2
# /sbin/mkswap /var/swap.3

Finally activate as many of the swap files as you require to give you sufficient virtual memory. I just needed one for starters.

# /sbin/swapon /var/swap.1

If you want the swap space to be activated again after reboot then you will need to add an entry to /etc/fstab. More information can be found here.
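Such an entry would look something like this (add one line per swap file that you want enabled at boot):

# <file system> <mount point> <type> <options> <dump> <pass>
/var/swap.1     none          swap   sw        0      0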

Turning it Off Again

When you are done with the memory-intensive operations you might want to disable the swap files.

# /sbin/swapoff /var/swap.1

Amazon EC2: Upgrading R

After installing R and Shiny on my EC2 instance I discovered that the default version of R was a little dated and I wanted to update to R 3.2.0. It's not terribly complicated, but here are the steps I took.

First, become root.

$ sudo /bin/bash

Remove the old version of R.

# apt-get remove r-base-core

Edit /etc/apt/sources.list and add the following:

deb http://cran.rstudio.com/bin/linux/ubuntu trusty/

Back in the shell, add the CRAN public key.

# gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
# gpg -a --export E084DAB9 | apt-key add -

Update the package list, upgrade the existing packages and install the new base R package.

# apt-get update
# apt-get upgrade
# apt-get install r-base

When you launch R you should find that it is the sparkling new version.

Now you will want to update all of the packages too, so launch R (as root) and then do:

> update.packages(lib.loc = "/usr/local/lib/R/site-library", ask = FALSE)

If you have packages installed in a user library, you should update those too.
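Something along the following lines should take care of a user library, assuming that R_LIBS_USER points to it (run this as your normal user rather than as root):

> update.packages(lib.loc = Sys.getenv("R_LIBS_USER"), ask = FALSE)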

Finally, if you run into memory problems with the package updates, then you'll need to add a swap file.

Disney: Quality over Quantity

I read an interesting article in the 10 June 2015 edition of The Wall Street Journal. The graphic below concisely illustrates the focused strategy being adopted by Disney: a steady decline in the number of movies released per year accompanied by a steady ascent in operating margin.

Quality over quantity for enhanced overall profit. Makes sense to me.

disney-movies

R Recipe: RStudio and UNC Paths

RStudio does not like Uniform Naming Convention (UNC) paths. This can be a problem if, for example, you install it under Citrix. The solution is to create a suitable environment file. This is what worked for me: I created an .Renviron file in my Documents folder on the Citrix remote drive. The file had the following contents:

R_LIBS_USER="H:/myCitrixFiles/Documents/R/win-library/3.1"
R_USER="H:/myCitrixFiles/Documents"

Simple. After that, the flurry of error and warning messages at startup disappeared and I was able to install packages without any problems.
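If you want to double check that the settings have been picked up, a quick query from the R console will confirm it:

> # These should echo the values from .Renviron.
> Sys.getenv(c("R_USER", "R_LIBS_USER"))
> # And the user library should now appear among the library paths.
> .libPaths()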

Hosting Shiny on Amazon EC2

I recently finished some work on a Shiny application which incorporated a Random Forest model. The model was stored in an RData file and loaded by server.R during initialisation. This worked fine when tested locally, but when I tried to deploy the application on shinyapps.io I ran into a problem: evidently you can only upload server.R and ui.R files. Nothing else.

Bummer.

I looked around for alternatives and found that Amazon Elastic Compute Cloud (EC2) was very viable indeed. I just needed to get it suitably configured. A helpful article documented the process from an OSX perspective. This is the analogous Ubuntu view (which really only pertains to the last few steps of connecting via SSH and uploading your code).

Before embarking on this adventure it might be worthwhile reading some of the material about Getting Started with AWS.

Create an Account

The first step is to create an account at aws.amazon.com. After you've logged into your account you should see a console like the one below. Select the EC2 link under Compute.

AWS Management Console-002.bmp

Next, from the EC2 Dashboard select Launch Instance.

EC2 Management Console-003.bmp

Step 1: There is an extensive range of machine images to choose from, but we will select the Ubuntu Server.

EC2 Management Console-004.bmp

Step 2: Select the default option. The same applies for Steps 3, 4 and 5.

EC2 Management Console-005.bmp

Step 6: Choose the security settings shown below. SSH access should be restricted to your local machine alone. When you are done, select Review & Launch. Further information on access control can be found here.

EC2 Management Console-007.bmp

Step 7: Create a new key pair. Download the key and store it somewhere safe! Now press Launch Instances.

EC2 Management Console-008.bmp

The launch status of your instance will then be confirmed.

EC2 Management Console-009.bmp

At any later time the status of your running instance(s) can be inspected from the EC2 dashboard.

EC2 Management Console-010.bmp

SSH Connection

Now, in order to install R and Shiny, we need to log in to our instance via SSH. In the command below you will need to substitute the name of your key file and the Public DNS of your instance as the host name (the latter is available from the EC2 Dashboard).

$ ssh -i AWS-key.pem ubuntu@ec2-57-29-93-35.us-west-2.compute.amazonaws.com

More detailed information on SSH access can be found here.

Installing R

Once you have the SSH connection up and running, execute the following on your remote instance:

sudo apt-get update
sudo apt-get install r-base
sudo apt-get install r-base-dev

Installing Shiny

To install the Shiny package, execute the following on your remote instance:

sudo su - -c "R -e \"install.packages('shiny', repos = 'http://cran.rstudio.com/')\""

During the installation a directory /srv/shiny-server/ will have been created, where your applications will be stored.

Installing and Testing your Applications

Transfer your applications across to the remote instance using sftp or scp, then move them to a location under /srv/shiny-server/. You should now be ready to roll. The Shiny server listens on port 3838, so if, for example, your application resides in a sub-folder called medal-predictions, you would browse to http://ec2-52-24-93-52.us-west-2.compute.amazonaws.com:3838/medal-predictions/.

Comrades Marathon Medal Predictions

IMG_5239

With only a few days to go until race day, most Comrades Marathon athletes will be focusing on resting, getting enough sleep, hydrating, eating and giving a wide berth to anybody who looks even remotely ill.

They will probably also be thinking a lot about Sunday's race. What will the weather be like? Will it be cold at the start? (Unlikely since it's been so warm in Durban.) How will they feel on the day? Will they manage to find their seconds along the route?

For the more performance-oriented among them (and, let's face it, that's most runners!), there will also be thoughts of what time they will do on the day and what medal they'll walk away with. I've considered ways of projecting finish times in a previous article. Today I'm going to focus on a somewhat simpler goal: making a Comrades Marathon medal prediction.

In the process I have put together a small application which will make medal predictions based on recent race times.

medal-prediction-interface

I'm not going to delve too deeply into the details, but if you really don't have the patience, feel free to skip forward to the results or click on the image above, which will take you to the application. If you have trouble accessing the application it's probable that you are sitting behind a firewall that is blocking it. Try again from home.

Raw Data

The data for this analysis were compiled from a variety of sources. I scraped the medal results off the Comrades Marathon Results Archive. Times for other distances were cobbled together from Two Oceans Marathon Results, RaceTec Results and the home pages of some well organised running clubs.

The distribution of the data is broken down below as a function of gender, Comrades Marathon medal and other distances for which I have data. For instance, I have data for 45 female runners who got a Bronze medal and for whom a 32 km race time was available.

medal-race-distance-count

Unfortunately the data are pretty sparse for Gold, Wally Hayward and Silver medalists, especially for females. I'll be collecting more data over the coming months and the coverage in these areas should improve. Athletes that are contenders for these medals should have a pretty good idea of what their likely prospects are anyway, so the model is not likely to be awfully interesting for them. This model is intended more for runners who are aiming at a Bill Rowan, Bronze or Vic Clapham medal.

Decision Tree

The first step in the modelling process was to build a decision tree. Primarily this was to check whether it was feasible to predict a medal class based on race times for other distances (I'm happy to say that it was!). The secondary motivation was to assess what the most important variables were. The resulting tree is plotted below. Open this plot in a new window so that you can zoom in on the details. As far as the labels on the tree are concerned, "min" stands for the minimum time over the corresponding distance, and the times on the branches are given in decimal hours.

medal-ctree
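For the curious, a tree along these lines can be grown with, for example, ctree() from the party package. The sketch below is purely indicative: the athletes data frame and its column names are hypothetical stand-ins for my actual data.

> library(party)
> #
> # Hypothetical data frame: one row per athlete, a 'medal' factor and the
> # minimum recent race time (in decimal hours) over each distance.
> #
> tree <- ctree(medal ~ min_10km + min_21km + min_42km + min_56km, data = athletes)
> plot(tree)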

The first thing to observe is that the most important predictor is 56 km race time. This dominates the first few levels in the tree hierarchy. Of slightly lesser importance is 42.2 km race time, followed by 25 km race time. It's interesting to note that 32 km and 10 km results do not feature at all in the tree, probably due to the relative scarcity of results over these distances in the data.

Some specific observations from the tree are:

  • Male runners who can do 56 km in less than 03:30 have around a 20% chance of getting a Gold medal.
  • Female runners who can do 56 km in less than 04:06 have about an 80% chance of getting a Gold medal.
  • Runners who can do 42.2 km in less than about 02:50 are very likely to get a Silver medal.
  • Somewhat more specifically, runners who do 56 km in less than 05:53 and 42.2 km in more than 04:49 are probably in line for a Vic Clapham.

Note that the first three observations above should be taken with a pinch of salt since, due to a lack of data, the model is not well trained for Gold, Wally Hayward and Silver medals.

You'd readily be forgiven for thinking that this decision tree is an awfully complex piece of apparatus for calculating something as simple as the colour of your medal.

Well, yes, it is. And I am going to make it simpler for you. But before I make it simpler, I am going to make it slightly more complicated.

A Forest of Decision Trees

Instead of just using a single decision tree, I built a Random Forest consisting of numerous trees, each of which was trained on a subset of the data. Unfortunately the resulting model is not as easy to visualise as a single decision tree, but the results are far more robust.
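Again purely as a sketch (using the same hypothetical athletes data frame as above), a forest like this can be grown with the randomForest package:

> library(randomForest)
> #
> # An ensemble of trees, each trained on a bootstrap sample of the data.
> #
> forest <- randomForest(medal ~ min_10km + min_21km + min_42km + min_56km,
+                        data = athletes, ntree = 500, na.action = na.omit)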

Medal Prediction Application

To make this a little more accessible I bundled the model up in a Shiny application which I deployed here on Amazon Web Services. Give it a try. You'll need to enter the times you achieved over one or more race distances during the last few months. Note that these are race times, not training run times. The latter are not good predictors for your Comrades medal.

Let's have a quick look at some sample predictions. Suppose that you are a male athlete with recent times of 00:45, 01:45, 04:00 and 05:00 for 10, 21.1, 42.2 and 56 km races respectively. According to the model you then have a 77% probability of getting a Bronze medal and around an 11% chance of getting either a Bill Rowan or a Vic Clapham medal. There's a small chance (less than 1%) that you might be in the running for a Silver medal.

medal-predictions-bog-standard
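Behind the Shiny interface, a prediction like the one above presumably boils down to little more than a single predict() call. A sketch, using the same hypothetical names as before and with the times converted to decimal hours:

> # Recent race times of 00:45, 01:45, 04:00 and 05:00 in decimal hours.
> athlete <- data.frame(min_10km = 0.75, min_21km = 1.75, min_42km = 4.0, min_56km = 5.0)
> #
> # Probability of each medal class.
> predict(forest, newdata = athlete, type = "prob")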

What about a male runner who recently ran 03:20 for 56 km? There is around a 20% chance that he would get a Gold medal. Failing that, he would most likely (60% chance) get a Silver.

medal-predictions-gold-male

If you happen to have race results for the last few years that I could incorporate into the model, please get in touch. I'm keen to collaborate on improving this tool.

R Recipe: Aligning Axes in ggplot2

Faceted plots in ggplot2 are phenomenal. They give you a simple way to break down an array of plots according to the values of one or more categorical variables. But what if you want to stack plots of different variables? Not quite so simple. But certainly possible. I gathered together this solution from a variety of sources on stackoverflow, notably this one and this other one. A similar issue for vertical alignment is addressed here.

To illustrate, first let's generate some sample data.

> data <- data.frame(x = seq(0, 100, 1))
> #
> data <- transform(data,
+                   y1 = sin(x * pi / 10),
+                   y2 = x^2
+                   )

Then we'll import a few libraries: ggplot2, of course; gridExtra, which will give us the stacking functionality; and grid, which provides the unit arithmetic we'll need further down.

> library(ggplot2)
> library(gridExtra)
> library(grid)

We'll generate the two plots.

> p1 <- ggplot(data, aes(x = x)) + geom_line(aes(y = y1)) + theme_classic()
> p2 <- ggplot(data, aes(x = x)) + geom_bar(aes(y = y2), stat = "identity") + theme_classic()

We could just slap these plots onto a grid, which would produce an acceptable plot. But my inner perfectionist is not happy with the fact that the two vertical axes do not line up and, consequently, the horizontal scales are slightly different.
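That naive version would be just a direct call to grid.arrange() on the two ggplot objects:

> grid.arrange(p1, p2, heights = c(3, 2))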

aligned-plot-naive

The first step towards fixing this small issue is to take the plots and convert them into gtables.

> p1 <- ggplot_gtable(ggplot_build(p1))
> p2 <- ggplot_gtable(ggplot_build(p2))

Next the pivotal bit: find the widths of each of the plots, calculate the maximum and then apply it to each of them individually. This effectively applies a uniform layout to each of the plots.

> maxWidth <- unit.pmax(p1$widths[2:3], p2$widths[2:3])
> 
> p1$widths[2:3] <- maxWidth
> p2$widths[2:3] <- maxWidth

And the final result is a stacked plot with perfectly aligned vertical axes. Success!

> grid.arrange(p1, p2, heights = c(3, 2))

aligned-plot

R Recipe: Reordering Columns in a Flexible Way

Suppose you have a data frame with a number of columns.

> names(trading)
 [1] "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"         "TP"         "OpenPrice" 
 [9] "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"       "Duration"   "Trader"    
[17] "System"

You want to put the Trader and System columns first but you also want to do this in a flexible way. One approach would be to specify column numbers.

> trading <- trading[, c(16:17, 1:15)]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

This does the job but it's not very flexible. After all, the number of columns might change. Rather do it by specifying column names.

> refcols <- c("Trader", "System")
> #
> trading <- trading[, c(refcols, setdiff(names(trading), refcols))]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"
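If you happen to be using dplyr then select() with the everything() helper achieves the same result. This is just an alternative (assuming the same trading data frame), not what I used above:

> library(dplyr)
> #
> trading <- select(trading, Trader, System, everything())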

Recent Common Ancestors: Simple Model

An interesting paper (Modelling the recent common ancestry of all living humans, Nature, 431, 562–566, 2004) by Rohde, Olson and Chang concludes with the words:

Further work is needed to determine the effect of this common ancestry on patterns of genetic variation in structured populations. But to the extent that ancestry is considered in genealogical rather than genetic terms, our findings suggest a remarkable proposition: no matter the languages we speak or the colour of our skin, we share ancestors who planted rice on the banks of the Yangtze, who first domesticated horses on the steppes of the Ukraine, who hunted giant sloths in the forests of North and South America, and who laboured to build the Great Pyramid of Khufu.

This paper considered two models of our genealogical heritage. Neither of the models was exhaustive and they could not take into account all of the complexities involved in determining the genealogy of the Earth's population. However, they captured many of the essential features. And despite their relative simplicity, the models reveal an unexpected result: a genealogical Common Ancestor (CA) of the entire present day human population would have lived in the relatively recent past.

It would be interesting to replicate some of those results.

An Even Simpler Model

In an earlier paper, Recent common ancestors of all present-day individuals, Chang considered a simpler model which he described as follows:

We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made just as in the standard Wright–Fisher model (randomly and equally likely over the n possibilities) the only difference being that here each individual chooses twice instead of once. All choices are made independently. Thus, for example, it is possible that when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.

At first sight this appears to be an extremely crude and abstract model. However it captures many of the essential features required for modelling genealogical inheritance. It also has the advantage of not being too difficult to implement in R. If these details are not your cup of tea, feel free to skip forward to the results.

First we'll need a couple of helper functions to neatly generate labels for generations (G) and individuals (I).

> library(stringr)                       # provides str_pad()
> 
> label.individuals <- function(N) {
+   paste("I", str_pad(1:N, 2, pad = "0"), sep = "")
+ }
> 
> make.generation <- function(N, g) {
+   paste(paste0("G", str_pad(g, 2, pad = "0")), label.individuals(N), sep = "/")
+ }
> 
> make.generation(3, 5)
[1] "G05/I01" "G05/I02" "G05/I03"

Each individual is identified by a label of the form "G05/I01", which gives the generation number and also the specific identifier within that generation.

Next we'll have a function to generate the random links between individuals in successive generations. The function will take two arguments: the number of individuals per generation and the number of ancestor generations (those prior to the "current" generation).

> # N - number of people per generation
> # G - number of ancestor generations (there will be G+1 generations!)
> #
> generate.edges <- function(N, G) {
+   edges = lapply(0:(G-1),  function(g) {
+     children <- rep(make.generation(N, g), each = 2)
+     #
+     parents <- sample(make.generation(N, g+1), 2 * N, replace = TRUE)
+     #
+     data.frame(parents = parents, children = children, stringsAsFactors = FALSE)
+   })
+   do.call(rbind, edges)
+ }
> 
> head(generate.edges(3, 3), 10)
   parents children
1  G01/I02  G00/I01
2  G01/I01  G00/I01
3  G01/I01  G00/I02
4  G01/I02  G00/I02
5  G01/I01  G00/I03
6  G01/I03  G00/I03
7  G02/I02  G01/I01
8  G02/I01  G01/I01
9  G02/I01  G01/I02
10 G02/I02  G01/I02

So, for example, the data generated above link the child node G00/I01 (individual 1 in generation 0) to two parent nodes, G01/I01 and G01/I02 (individuals 1 and 2 in generation 1).

Finally we'll wrap all of that up in a graph-like object.

> library(igraph)
> generate.nodes <- function(N, G) {
+   lapply(0:G, function(g) make.generation(N, g))
+ }
> 
> generate.graph <- function(N, G) {
+   nodes = generate.nodes(N, G)
+   #
+   net <- graph.data.frame(generate.edges(N, G), directed = TRUE, vertices = unlist(nodes))
+   #
+   # Vertex layout
+   #
+   x = rep(1:N, G+1)
+   y = rep(0:G, each = N)
+   #
+   net$layout = cbind(x, y)
+   #
+   net$individuals = label.individuals(N)
+   net$generations = nodes
+   #
+   net
+ }

Let's give it a whirl on a graph consisting of 10 ancestor generations (11 generations including the current generation) and 8 individuals per generation. The result is plotted below with the oldest generation (G10) at the top and the current generation (G00) at the bottom. The edges indicate parent/progeny links.
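Generating and plotting that graph goes something like this (the plotting parameters below are my guess at what produced the figure):

> net <- generate.graph(8, 10)
> #
> plot(net, layout = net$layout, vertex.size = 10, vertex.label.cex = 0.5,
+      edge.arrow.size = 0.3)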

sample-graph

Okay, so it looks like all of the infrastructure is in place. Now we need to put together the functionality for analysing the relationships between generations. We are really only interested in how any one of the ancestor generations relates to the current generation, which makes things a little simpler. First we write a function to calculate the number of steps from an arbitrary node (identified by its label) to each of the nodes in the current generation. This is simplified by the fact that igraph already has a shortest.paths() function.

> library(dplyr)                         # provides first()
> 
> generation.descendants <- function(net, node) {
+   present.generation <- first(net$generations)
+   #
+   shortest.paths(net, v = node, to = present.generation, mode = "out")
+ }

Next a pair of helper functions which indicate whether a particular node is a CA of all nodes in the current generation or if it has gone extinct. Node G03/I08 in the above graph is an example of an extinct node since it has no child nodes.

> common.ancestor <- function(net, node) {
+   all(is.finite(generation.descendants(net, node)))
+ }
>   
> extinct <- function(net, node) {
+   all(is.infinite(generation.descendants(net, node)))
+ }

Let's test those. We'll generate another small graph for the test.

> set.seed(1)
> #
> net <- generate.graph(3, 5)
> #
> # This is an ancestor of all.
> #
> generation.descendants(net, "G05/I01")
        G00/I01 G00/I02 G00/I03
G05/I01       5       5       5
> common.ancestor(net, "G05/I01")
[1] TRUE
> #
> # This is an ancestor of all but G00/I02
> #
> generation.descendants(net, "G02/I03")
        G00/I01 G00/I02 G00/I03
G02/I03       2     Inf       2
> common.ancestor(net, "G02/I03")
[1] FALSE
> #
> # This node is extinct.
> #
> generation.descendants(net, "G03/I01")
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
> extinct(net, "G03/I01")
[1] TRUE

It would also be handy to have a function which gives the distance from every node in a particular generation to each of the nodes in the current generation.

> generation.distance <- function(net, g) {
+   current.generation <- net$generations[[g+1]]
+   #
+   do.call(rbind, lapply(current.generation, function(n) {generation.descendants(net, n)}))
+ }
> generation.distance(net, 3)
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
G03/I02       3       3       3
G03/I03       3       3       3

So here we see that G03/I02 and G03/I03 are both ancestors of each of the individuals in the current generation, while G03/I01 has gone extinct.

Finally we can pull all of this together with a single function that classifies individuals in previous generations according to whether they are an ancestor of at least one but not all of the current generation; a common ancestor of all of the current generation; or extinct.

> classify.generation <- function(net, g) {
+   factor(apply(generation.distance(net, g), 1, function(p) {
+     ifelse(all(is.infinite(p)), 1, ifelse(all(is.finite(p)), 2, 0))
+   }), levels = 0:2, labels = c("ancestor", "extinct", "common"))
+ }
> classify.generation(net, 3)
G03/I01 G03/I02 G03/I03 
extinct  common  common 
Levels: ancestor extinct common

Results

First let's have a look at how the relationship between previous generations and the current generation changes over time. The plot below indicates the proportion of ancestors, common ancestors and extinct individuals as a function of generation number. Note that time runs backwards along the horizontal axis, with the current generation on the left and successive ancestor generations moving towards the right.

The graph used to generate these data had 1000 individuals per generation. We can see that for the first 10 generations around 80% of the individuals were ancestors of one or more (but not all!) of the current generation. The remaining 20% or so were extinct. Then in generation 11 we have the first CA. This individual would be labelled the Most Recent Common Ancestor (MRCA). Going further back in time the fraction of CAs increases while the proportion of mere ancestors declines until around generation 19 when all of the ancestors are CAs.

ancestor-evolution
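The numbers behind that plot can be pulled together with classify.generation(). A minimal sketch, here using 1000 individuals per generation and 30 ancestor generations:

> net <- generate.graph(1000, 30)
> #
> # One column per ancestor generation; the rows give the proportion of
> # (partial) ancestors, extinct individuals and common ancestors.
> #
> proportions <- sapply(1:30, function(g) table(classify.generation(net, g)) / 1000)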

There are two important epochs to identify on the plot above:

  1. the generation at which the MRCA emerged (this is denoted T_n); and
  2. the generation at which all individuals are either CAs or extinct (this is denoted U_n).

Chang's paper gives asymptotic expressions for how these epochs vary as a function of generation size. It also presents a table with sample values of T_n and U_n for generations of size 500, 1000, 2000 and 4000. Those samples support the asymptotic relationships.

Since we have all of the infrastructure in place, we can conduct an independent test of the relationships. Below is a boxplot generated from 100 realisations each of calculations with generations of size 10, 100 and 1000. The blue boxes reflect our estimates for T_n, while the green boxes are the estimates of U_n. The two dashed lines reflect the asymptotic relationships. There's pretty good agreement between the theoretical and experimental results, although our estimates for U_n appear to be consistently larger than expected. The sample size was limited though... Furthermore, these are asymptotic relationships, so we really only expect them to hold exactly for large generation sizes.

Un-Tn-versus-N

When would our MRCA have been alive based on this simple model and the Earth's current population? Setting the generation size to the Earth's current population of around 7 billion people gives T_n at about 33 generations and a U_n of approximately 58 generations. I know, those 7 billion people are not from the same generation, but we are going for an estimate here...
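For the record, the asymptotic expressions from Chang's paper are, to a good approximation, T_n ≈ log2(n) and U_n ≈ 1.77 log2(n). Plugging in n = 7 billion reproduces the numbers above:

> n <- 7e9
> log2(n)           # T_n: generations back to the MRCA
[1] 32.70471
> 1.77 * log2(n)    # U_n: generations back to where everyone is either a CA or extinct
[1] 57.88733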

What does a generation translate to in terms of years? That figure has varied significantly over time. It also depends on location. However, in developed nations it is currently around 30 years. WolframAlpha equates a generation to 28 years. So, based on these estimates we have T_n at around 1000 years and our estimate of U_n translates into less than 2 millennia.

Of course, these numbers need to be taken with a pinch of salt since they are based on a model which assumes that the population has been static at around 7 billion, which we know to be far from true! However these results are also not completely incompatible with the more sophisticated results of Rohde, Olson and Chang who found that the MRCA of all present-day humans lived just a few thousand years ago.