Plotting Flows with riverplot

I have been looking for an intuitive way to plot flows or connections between states in a process. An obvious choice is a Sankey Plot, but I could not find a satisfactory implementation in R… until I read the riverplot post by January Weiner. His riverplot package does precisely what I am need.

Getting your data into the right format is a slightly clunky procedure. However, my impression is that the package is still a work in progress and it’s likely that this process will change in the future. For now though, here is an illustration of how a multi-level plot can be constructed.

Assembling the Data

The plan for this example is to have four nodes at each of six layers, with flows between layers. The data are a little contrived, but they illustrate the procedure quite nicely and they produce a result which is not dissimilar to the final plot I was after. We have to create data structures for both nodes and edges. I will start with the edges and then use these data to extract the nodes.

The edges data frame consists of records with a “from” node (N1) and a “to” node (N2) as well as a value for the flow between them. Here I systematically construct a grid of random flows and remove some records to break the symmetry.

> edges = data.frame(N1 = paste0(rep(LETTERS[1:4], each = 4), rep(1:5, each = 16)),
+                    N2 = paste0(rep(LETTERS[1:4], 4), rep(2:6, each = 16)),
+                    Value = runif(80, min = 2, max = 5) * rep(c(1, 0.8, 0.6, 0.4, 0.3), each = 16),
+                    stringsAsFactors = F)
> 
> edges = edges[sample(c(TRUE, FALSE), nrow(edges), replace = TRUE, prob = c(0.8, 0.2)),]
> head(edges)
   N1 N2  Value
1  A1 A2 2.3514
4  A1 D2 2.2052
5  B1 A2 3.0959
7  B1 C2 2.8756
9  C1 A2 4.5099
10 C1 B2 4.1782

The names of the nodes are then extracted from the edge data frame. Horizontal and vertical locations for the nodes are calculated based on the labels. These locations are not strictly necessary because the package will work out sensible default values for you.

> nodes = data.frame(ID = unique(c(edges$N1, edges$N2)), stringsAsFactors = FALSE)
> #
> nodes$x = as.integer(substr(nodes$ID, 2, 2))
> nodes$y = as.integer(sapply(substr(nodes$ID, 1, 1), charToRaw)) - 65
> #
> rownames(nodes) = nodes$ID
> head(nodes)
   ID x y
A1 A1 1 0
B1 B1 1 1
C1 C1 1 2
D1 D1 1 3
A2 A2 2 0
B2 B2 2 1

Finally we construct a list of styles which will be applied to each node. It’s important to choose suitable colours and introduce transparency for overlaps (which is done here by pasting “60″ onto the RGB strings).

> library(RColorBrewer)
> #
> palette = paste0(brewer.pal(4, "Set1"), "60")
> #
> styles = lapply(nodes$y, function(n) {
+   list(col = palette[n+1], lty = 0, textcol = "black")
+ })
> names(styles) = nodes$ID

Constructing the riverplot Object

Now we are in a position to construct the riverplot object. We do this by joining the node, edge and style data structures into a list and then adding “riverplot” to the list of class attributes.

> library(riverplot)
> 
> rp <- list(nodes = nodes, edges = edges, styles = styles)
> #
> class(rp) <- c(class(rp), "riverplot")

Producing the plot is then simple.

> plot(rp, plot_area = 0.95, yscale=0.06)

riverplot-example

Conclusion

I can think of a whole host of applications for figures like this, so I am very excited about the prospects. I know that I am going to have to figure out how to add additional labels to the figures, but I’m pretty sure that will not be too much of an obstacle.

The current version of riverplot is v0.3. Incidentally, when I stumbled on a small bug in v0.2 of riverplot, January was very quick to respond with a fix.

Commitments of Traders: Moves in the Last Week

In my previous post I gave some background information on the Commitments of Traders report along with a selection of summary plots.

One of the more interesting pieces of information that one can glean from these reports is the shift in trading sentiment from week to week. Below is a plot reflecting the relative change in the number of long and short positions held by traders in each of the sectors (Commercial, Non-Commercial and Non-Reportable).

The changes are normalised to the total number of positions (both long and short) held in the previous week. To illustrate how this works, consider the JPY.

> tail(subset(OP, name == "JPY" & sector == "Commercial"), 2)
      name       date     sector   long   shrt
11842  JPY 2014-05-13 Commercial 125523 -35537
11845  JPY 2014-05-20 Commercial 117310 -48851

This indicates that the number of positions that are long relative to the JPY has decreased while the number of positions that are short on the JPY (given by a negative number) has increased. Both of these changes are consistent with the fact that traders are selling the JPY in favour of other currencies.

A synopsis of these data for a range of currencies is given in the plot below. This is how the plot works. Again we will consider Commercial trades involving the JPY. We are thus looking at the second to last row and first column. Here there are two cells: long trades on the left and short trades on the right. The coloured bars indicate the relative change for long (blue) and short (orange) trades. The relative changes are normalised to the total number of trades for the currency and sector in the previous week. We can see here that the orange bar is broader than the blue bar indicating that the change in short trades is larger than the change in long trades. The grey boxes show the 95% confidence interval for the expected range of these changes. The closer the bars come to the edge of the boxes, the more significant the change. So, in this case, the change in the number of short trades is significant and is probably an indication of a change in sentiment regarding the JPY.

If anyone is interested in updated charts like this, please just let me know.

140520-weekly-change

Whistlers and Volcanic Lightning

A paper authored with my very talented student, Claire Antel, entitled “Investigating Dunedin whistlers using volcanic lightning” has just been published in Geophysical Research Letters. The paper looks at electromagnetic signals (“whistlers”) received at Dunedin, New Zealand, caused by lightning over volcanoes thousands of km away near the Aleutian Chain.

Abstract

Whistlers detected at Dunedin, New Zealand are an anomaly: there is little lightning around Dunedin’s conjugate point yet whistlers appear in relatively large numbers. These surplus whistlers have consequently inspired investigations into their origins. Dunedin’s lightning-sparse conjugate point lies in the Aleutian Islands, a region populated with active volcanoes. Their presence has allowed us to perform a novel analysis: the correlation of whistlers to volcanic lightning. We report on our investigation, which successfully yielded the first observations of “volcanic whistlers.” It was found that the single July 2008 Mount Okmok eruption had an impressive effect on the number of whistlers at Dunedin. The eruptions at Mount Redoubt in 2009 also caused a sporadic flow of whistlers in Dunedin.

In Other Media

Claire has also given an interview about the paper.

Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an “average runner” myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you’re interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

Results

The histograms below show graphically how the distribution of runners’ ages at the Comrades Marathon has changed every decade starting in the 1980s and proceeding through to the 2010s. The data are encoded using blue for male and pink for female runners (apologies for the banality!). It is readily apparent how the distributions have shifted consistently towards older ages with the passing of the decades. The vertical lines in each panel indicate the average age for male (dashed line) and female (solid line) runners. Whereas in the 1980s the average age for both genders was around 34, in the 2010s it has shifted to over 40 for females and almost 42 for males.

age-decade-histogram

Maybe clumping the data together into decades is hiding some of the details. The plot below shows the average age for each gender as a function of the race year. The plotted points are the observed average age, the solid line is a linear model fitted to these data and the dashed lines delineate a 95% confidence interval.

Prior to 1990 the average age for both genders was around 35 and varies somewhat erratically from year to year. Interestingly there is a pronounced decrease in the average age for both genders around 1990. Evidently something attracted more young runners that year… Since 1990 though there has been a consistent increase in average age. In 2013 the average age for men was fractionally less than 42, while for women it was over 40.

mean-age-year-regression

Conclusion

Of course, the title of this article is hyperbolic. The Comrades Marathon is a long way from being a race for geriatrics. However, there is very clear evidence that the average age of runners is getting higher every year. A linear model, which is a reasonably good fit to the data, indicates that the average age increases by 0.26 years annually and is generally 0.6 years higher for men than women. If this trend continues then, by the time of the 100th edition of the race, the average age will be almost 45.

Is the aging Comrades Marathon field a problem and, if so, what can be done about it?

Analysis

As before I have used the Comrades Marathon results from 1980 through to 2013. Since my last post on this topic I have refactored these data, which now look like this:

> head(results)
       key year age gender category   status  medal direction medal_count decade
1  6a18da7 1980  39   Male   Senior Finished Bronze         D          20   1980
2   6570be 1980  39   Male   Senior Finished Bronze         D          16   1980
3 4371bd17 1980  29   Male   Senior Finished Bronze         D           9   1980
4 58792c25 1980  24   Male   Senior Finished Silver         D          25   1980
5 16fe5d63 1980  58   Male   Master Finished Bronze         D           9   1980
6 541c273e 1980  43   Male  Veteran Finished Silver         D          18   1980

The first step in the analysis was to compile decadal and annual summary statistics using plyr.

> decade.statistics = ddply(results, .(decade, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> #
> year.statistics = ddply(results, .(year, gender), summarize,
+                           median.age = median(age, na.rm = TRUE),
+                           mean.age = mean(age, na.rm = TRUE))
> head(decade.statistics)
  decade gender median.age mean.age
1   1980 Female         34   34.352
2   1980   Male         34   34.937
3   1990 Female         36   36.188
4   1990   Male         36   36.440
5   2000 Female         39   39.364
6   2000   Male         39   39.799
> head(year.statistics)
  year gender median.age mean.age
1 1980 Female       35.0   35.061
2 1980   Male       33.0   34.091
3 1981 Female       33.5   34.096
4 1981   Male       34.0   34.528
5 1982 Female       34.5   35.032
6 1982   Male       34.0   34.729

The decadal data were used to generate the histograms. I then considered a selection of linear models applied to the annual data.

> fit.1 <- lm(mean.age ~ year, data = year.statistics)
> fit.2 <- lm(mean.age ~ year + year:gender, data = year.statistics)
> fit.3 <- lm(mean.age ~ year + gender, data = year.statistics)
> fit.4 <- lm(mean.age ~ year + year * gender, data = year.statistics)

The first model applies a simple linear relationship between average age and year. There is no discrimination between genders. The model summary (below) indicates that the average age increases by about 0.26 years annually. Both the intercept and slope coefficients are highly significant.

> summary(fit.1)

Call:
lm(formula = mean.age ~ year, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3181 -0.5322 -0.0118  0.4971  1.9897 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.83e+01   -26.2   <2e-16 ***
year         2.59e-01   9.15e-03    28.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.74 on 66 degrees of freedom
Multiple R-squared:  0.924,	Adjusted R-squared:  0.923 
F-statistic:  801 on 1 and 66 DF,  p-value: <2e-16

The second model considers the effect on the slope of an interaction between year and gender. Here we see that the slope is slightly large for males than females. Although this interaction coefficient is statistically significant, it is extremely small relative to the slope coefficient itself. However, given that the value of the abscissa is around 2000, it still contributes roughly 0.6 extra years to the average age for men.

> summary(fit.2)

Call:
lm(formula = mean.age ~ year + year:gender, data = year.statistics)

Residuals:
   Min     1Q Median     3Q    Max 
-1.103 -0.522  0.024  0.388  2.287 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.80e+02   1.68e+01  -28.57  < 2e-16 ***
year             2.59e-01   8.41e-03   30.78  < 2e-16 ***
year:genderMale  3.00e-04   8.26e-05    3.63  0.00056 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  481 on 2 and 65 DF,  p-value: <2e-16

The third model considers an offset on the intercept based on gender. Here, again, we see that the effect of gender is small, with the fit for males being shifted slightly upwards. Again, although this effect is statistically significant, it has only a small effect on the model. Note that the value of this coefficient (5.98e-01 years) is consistent with the effect of the interaction term (0.6 years for typical values of the abscissa) in the second model above.

> summary(fit.3)

Call:
lm(formula = mean.age ~ year + gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1038 -0.5225  0.0259  0.3866  2.2885 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.80e+02   1.68e+01  -28.58  < 2e-16 ***
year         2.59e-01   8.41e-03   30.79  < 2e-16 ***
genderMale   5.98e-01   1.65e-01    3.62  0.00057 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 65 degrees of freedom
Multiple R-squared:  0.937,	Adjusted R-squared:  0.935 
F-statistic:  480 on 2 and 65 DF,  p-value: <2e-16

The fourth and final model considers both an interaction between year and gender as well as an offset of the intercept based on gender. Here we see that the data does not differ sufficiently on the basis of gender to support both of these effects, and neither of the resulting coefficients is statistically significant.

> summary(fit.4)

Call:
lm(formula = mean.age ~ year + year * gender, data = year.statistics)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0730 -0.5127 -0.0492  0.4225  2.1273 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -460.3631    23.6813  -19.44   <2e-16 ***
year               0.2491     0.0119   21.00   <2e-16 ***
genderMale       -38.4188    33.4904   -1.15     0.26    
year:genderMale    0.0195     0.0168    1.17     0.25    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.679 on 64 degrees of freedom
Multiple R-squared:  0.938,	Adjusted R-squared:  0.935 
F-statistic:  322 on 3 and 64 DF,  p-value: <2e-16

On the basis of the above discussion, the fourth model can be immediately abandoned. But how do we choose between the three remaining models? An ANOVA indicates that the second model is a significant improvement over the first model. There is little to choose, however, between the second and third models. I find the second model more intuitive, since I would expect there to be a slight gender difference in the rate of aging, rather than a simple offset. We will thus adopt the second model, which indicates that the average age of runners increases by about 0.259 years annually, with the men aging slightly faster than the women.

> anova(fit.1, fit.2, fit.3, fit.4)
Analysis of Variance Table

Model 1: mean.age ~ year
Model 2: mean.age ~ year + year:gender
Model 3: mean.age ~ year + gender
Model 4: mean.age ~ year + year * gender
  Res.Df  RSS Df Sum of Sq     F  Pr(>F)    
1     66 36.2                               
2     65 30.1  1      6.09 13.23 0.00055 ***
3     65 30.1  0     -0.02                  
4     64 29.5  1      0.62  1.36 0.24833    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Lastly, I constructed a data frame based on the second model which gives both the model prediction and a 95% uncertainty interval. This was used to generate the second set of plots.

fit.data <- data.frame(year = rep(1980:2020, each = 2), gender = c("Female", "Male"))
fit.data <- cbind(fit.data, predict(fit.2, fit.data, level = 0.95, interval = "prediction"))

Where to Put EAs and Indicators in New MT4 Builds

If you are creating an EA or indicator from scratch, then the MetaTrader editor places the files in the correct location and the terminal is automatically able to find them. However, if the files originate from a third party then you will need to know where to insert them so that they show up in the terminal. For older builds of MetaTrader 4 the directory structure was fairly simple. Everything was to be found in a folder under Program Files. Its contents looked like this:

MT4-old-folders-windows

EAs would go in the experts folder, while indicators would go one level further down in the indicators sub-folder.

But everything is different with the new builds of MetaTrader 4.

Windows

On Windows (and here I am referring specifically to Windows 7, although the setup will be similar on other flavours) the structure of the MetaTrader folder (still found under Program Files) now looks like this:

MT4-new-folders-windows

There seems to be something missing, right? Where do the EA and indicator files go? To find these you will need to look elsewhere. There is now a separate directory hierarchy for each user, in which these files are stored. Browsing under the AppData folder for your user you should find

MT4-personal-folder-windows

Locating this folder can be a little tricky. First look under Users to find the folder associated with your user name. Within this folder you may or may not be able to see an AppData folder. If you can’t see it then you will need to reveal hidden files by changing the folder options.

folder-options-hidden-files

Once you have dug down to your Terminal folder it should contain a set of sub-folders like this:

MT4-personal-folder-contents-windows

If you descend further into one of those obscurely named folders, then you will find further sub-folders.

MT4-personal-instance-contents-windows

Finally, within the MT4 folder, you will find this:

MT4-mt4-folder-windows

The Experts and Indicators folders are where you will insert those EAs and indicators.

Linux

If you are running MetaTrader under Wine on Linux, then the new directory structure looks like this:

MT4-new-folders-linux

Evidently the structure is a little simpler here, because all of the files and directories are still lumped in a single location. The Experts and Indicators directories are to be found under MQL4.

Comrades Marathon Negative Splits: Cheat Strikes Again

It looks likes one of the suspect runners from my previous posts cheated again in this year’s Comrades Marathon. Brad Brown has published evidence that the runner in question, Kitty Chutergon (race number 25058), had another athlete running with his number for most of the race. This is a change in strategy from last year, where he appears to have been assisted along the route, having missed the timing mats at Camperdown and Polly Shortts.

Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year’s Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. So I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

My data for the 2013 up run gives times with one second precision, so these questions could be answered if I relaxed the constraints from “exactly the same time” to “within one second of each other”. We’ll call such simultaneous pairs of runners “twins” and simultaneous threesomes will be known as “tripods”. How many twins are there? How many tripods? The answers are somewhat surprising. What’s even more surprising is another category: “phantoms”.

If you are not interested in the details of the analysis (and I’m guessing that you probably aren’t), please skip forward to the pictures and analysis.

Looking at the Data

The first step is to subset the data, leaving a data frame containing only the times at halfway and the finish, indexed by a unique runner key.

> simultaneous = subset(splits,
+                      year == 2013 & !is.na(medal))[, c("key", "drummond.time", "race.time")]
> simultaneous = simultaneous[complete.cases(simultaneous),]
> #
> rownames(simultaneous) = simultaneous$key
> simultaneous$key <- NULL
> head(simultaneous)
         drummond.time race.time
4bdcb291        320.15    712.42
4e488aab        294.65    656.90
ab59fc97        304.62    643.67
89d3e09b        270.32    646.78
fc728816        211.27    492.95
7b761740        274.60    584.37

Next we calculate the “distance” (this is a distance in time and not in space) between runners, which is effectively the squared difference between the halfway and finish times for each pair of runners. This yields a rather large matrix with rows and columns labelled by runner key. These data are then transformed into a format where each row represents a pair of runners.

> simultaneous = dist(simultaneous)
> library(reshape2)
> simultaneous = melt(as.matrix(simultaneous))
> head(simultaneous)
      Var1     Var2   value
1 4bdcb291 4bdcb291   0.000
2 4e488aab 4bdcb291  61.093
3 ab59fc97 4bdcb291  70.483
4 89d3e09b 4bdcb291  82.408
5 fc728816 4bdcb291 244.992
6 7b761740 4bdcb291 135.910

We can immediately see that there are some redundant entries. We need to remove the matrix diagonal (obviously the times match when a runner is compared to himself!) and keep only one half of the matrix.

> simultaneous = subset(simultaneous, as.character(Var1) < as.character(Var2))

Finally we retain only the records for those pairs of runners who crossed both mats simultaneously (in retrospect, this could have been done earlier!).

> simultaneous = subset(simultaneous, value == 0)
> head(simultaneous)
            Var1     Var2 value
623174  5217dfc9 75a78d04     0
971958  d8c9c403 e6e0d6e3     0
2024105 2e8f7778 9acc46ee     0
2464116 5f18d86f 9a1697ff     0
2467712 63033429 9a1697ff     0
3538608 54a92b96 f574be97     0

We can then merge in the data for race numbers and names, leaving us with an (anonymised) data set that looks like this:

> simultaneous = simultaneous[order(simultaneous$race.time),]
> head(simultaneous)[, c(4, 6, 8)]
    race.number.x race.number.y race.time
133         59235         56915  07:54:21
9           26132         23470  08:06:55
62          44008         31833  08:25:58
61          25035         36706  08:35:42
54          28868         25910  08:46:42
26          47703         31424  08:47:08
> tail(simultaneous)[, c(4, 6, 8)]
   race.number.x race.number.y race.time
71         54689         16554  11:55:59
60          8846         23003  11:56:26
44          9235         49251  11:56:47
38         53354         53352  11:56:56
28         19268         59916  11:57:49
20         22499         40754  11:58:26

Twins

As it turns out, there are a remarkably large number of Comrades twins. In the 2013 race there were more than 100 such pairs. So they are not as rare as I had assumed they would be.

Tripods

Although there were relatively many Comrades twins, there were only two tripods. In both cases, all three members of the tripod shared the same surname, so they are presumably related.

The members of the first tripod all belong to the same running club, two of them are in the 30-39 age category and the third is in the 60+ group. There’s a clear family resemblance, so I’m guessing that they are father and sons. Dad had gathered 9 medals, while the sons had 2 and 3 medals respectively. What a day they must have had together!

tripod-11832

The second tripod also consisted of three runners from the same club. Based on gender and age groups, I suspect that they are Mom, Dad and son. The parents had collected 8 medals each, while junior had 3. What a privilege to run the race with your folks! Lucky guy.

tripod-53713

And now things get more interesting…

Phantom #1

The runner with race number 26132 appears to have run all the way from Durban to Pietermaritzburg with runner 23470! Check out the splits below.

splits-26132

splits-23470

Not only did they pass through halfway and the finish at the same time, but they crossed every mat along the route at precisely the same time. Yet, somewhat mysteriously, there is no sign of 23470 in the race photographs…

phantom-26132-A

phantom-26132-B

phantom-26132-C

You might notice that there is another runner with 26132 in all three of the images above. That’s not 23470. He has race number 28151 and he is not the phantom! His splits below show that he only started running with 26132 somewhere between Camperdown and Polly Shortts.

splits-28151

If you search the race photographs for the phantom’s race number (23470), you will find that there are no pictures of him at all! That’s right, nineteen photographs of 26132 and not a single photograph of 23470.

Phantom #2

The runner with race number 53367 was also accompanied by a phantom with race number 27587. Again, as can be seen from the splits below, these two crossed every mat on the course at precisely the same time.

splits-53367

splits-27587

Yet, despite the fact that 53367 is quite evident in the race photos, there is no sign of 27587.

phantom-53367-B

phantom-53367-C

phantom-53367-D

phantom-53367-A

I would have expected to see a photograph of 53367 embracing his running mate at the finish, yet we find him pictured with two other runners. In fact, if you search the race photographs for 27587 you will find that there are no photographs of him at all. You will, however, find twelve photographs of 53367.

Conclusion

Well done to the tripods, I think you guys are awesome! As for the phantoms (and their running mates), you have some explaining to do.