A very high-level talk about Random Number Generators which I gave at a recent DERIVCO Brown Bag session.

# Simulating Intricate Branching Patterns with DLA

Manfred Schroeder's book *Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise* is a fruitful source of interesting topics and projects. He gives a thorough description of Diffusion-Limited Aggregation (DLA), a technique for simulating physical processes which produce intricate branching structures. Examples, as illustrated below, include Lichtenberg figures, dielectric breakdown, electrodeposition and Hele-Shaw flow.

## Diffusion-Limited Aggregation

DLA is conceptually simple. A seed particle is fixed at the origin of the coordinate system. Another particle is introduced at a relatively large distance from the origin, which then proceeds to move on a random walk. Either it will wander off to infinity or it will come into contact with the particle at the origin, to which it sticks irreversibly. Now another particle is introduced and the process repeats itself. As successive particles are added to the system, a portion of them become bound to the growing cluster of particles at the origin.

The objects which evolve from this process are intrinsically random, yet have self-similar structure across a range of scales. There is also an element of positive feedback, where once a protuberance has formed on the cluster, further particles are more likely to adhere to it since they will probably encounter it first.

## A Simple Implementation in R

First we need to construct a grid. We will start small: a 20 by 20 grid filled with NA except for four seed locations at the centre.

```r
> # Dimensions of grid
> W <- 20
> grid <- matrix(NA, nrow = W, ncol = W)
> grid[W/2 + c(0, 1), W/2 + c(0, 1)] = 0
> grid
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [2,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [3,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [4,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [5,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [6,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [7,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [8,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [9,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[10,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA
[11,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA
[12,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[13,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[14,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[15,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[16,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[17,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[18,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[19,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[20,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
```

We need to generate two-dimensional random walks. To do this I created a table of possible moves, from which individual steps could be sampled at random. The table presently only caters for the cells immediately above, below, left or right of the current cell. It would be a simple matter to extend the table to allow diagonal moves as well, but these more distant moves would need to be weighted accordingly.

```r
> moves <- data.frame(dx = c(0, 0, +1, -1), dy = c(+1, -1, 0, 0))
> #
> M = nrow(moves)
> moves
  dx dy
1  0  1
2  0 -1
3  1  0
4 -1  0
```

Next we need a function to transport a particle from its initial location until it either leaves the grid or adheres to the cluster at the origin. Again, a possible refinement here would be to allow sticking to next-nearest neighbours as well.

```r
> diffuse <- function(p) {
+   count = 0
+   #
+   while (TRUE) {
+     p = p + moves[sample(M, 1),]
+     #
+     count = count + 1
+     #
+     # Black boundary conditions
+     #
+     if (p$x > W | p$y > W | p$x < 1 | p$y < 1) return(NA)
+     #
+     # Check if it sticks (to nearest neighbour)
+     #
+     if (p$x < W && !is.na(grid[p$x+1, p$y])) break
+     if (p$x > 1 && !is.na(grid[p$x-1, p$y])) break
+     if (p$y < W && !is.na(grid[p$x, p$y+1])) break
+     if (p$y > 1 && !is.na(grid[p$x, p$y-1])) break
+   }
+   #
+   return(c(p, count = count))
+ }
```

Finally we are ready to apply this procedure to a batch of particles.

```r
> library(foreach)
>
> # Number of particles per batch
> #
> PBATCH <- 5000
> #
> # Select starting position
> #
> phi = runif(PBATCH, 0, 2 * pi)
> #
> x = round((1 + cos(phi)) * W / 2 + 0.5)
> y = round((1 + sin(phi)) * W / 2 + 0.5)
> #
> particles <- data.frame(x, y)
>
> result = foreach(n = 1:PBATCH) %do% diffuse(particles[n,])
>
> lapply(result, function(p) {if (length(p) == 3) grid[p$x, p$y] <<- p$count})
```

The resulting grid shows all of the locations where particles have adhered to the cluster. The number at each location is the diffusion time, which indicates the number of steps required for the particle to move from its initial location to its final resting place. The shape of the cluster is a little boring at present: essentially a circular disk centred on the origin. This is due to the small size of the problem; we will need a much larger grid to produce more interesting structure.

```r
> grid
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [2,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [3,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     2    18    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [4,]   NA   NA   NA   NA   NA   NA   NA   NA    8     6    17    10    NA    NA    NA    NA    NA    NA    NA    NA
 [5,]   NA   NA   NA   NA   NA   NA   NA    6   11     7    54    25    15    NA    NA    NA    NA    NA    NA    NA
 [6,]   NA   NA   NA   NA   NA   NA   16   10   11    58    69    18    31    16    NA    NA    NA    NA    NA    NA
 [7,]   NA   NA   NA   NA   NA   20   19   10   21    24    32    50    24    65     8    NA    NA    NA    NA    NA
 [8,]   NA   NA   NA   NA   18   12   55   26   13   151    86    20    21    26    27    34    NA    NA    NA    NA
 [9,]   NA   NA   NA   56   21   43   19   53   43    30    26    37    66    52    30    22    10    NA    NA    NA
[10,]   NA   NA   29    9    9   23   70   38   48     0     0   122    26    44    22    10    27     5    NA    NA
[11,]   NA   NA    2    9   10   36   32   38   24     0     0    54    14    21    65    14    30    29    NA    NA
[12,]   NA   NA   NA    5   10   16   13   83   52    43    23    42    39    23    66     9    32    NA    NA    NA
[13,]   NA   NA   NA   NA   21   70   28   31   NA    41    61    15    17    29    25    17    NA    NA    NA    NA
[14,]   NA   NA   NA   NA   NA    8   29   19    7    47   119    37    19     9    10    NA    NA    NA    NA    NA
[15,]   NA   NA   NA   NA   NA   NA   15   33   68    26    38    13    33     8    NA    NA    NA    NA    NA    NA
[16,]   NA   NA   NA   NA   NA   NA   NA   10   12    12    15    35    11    NA    NA    NA    NA    NA    NA    NA
[17,]   NA   NA   NA   NA   NA   NA   NA   NA   20     8     6     5    NA    NA    NA    NA    NA    NA    NA    NA
[18,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    18    20    NA    NA    NA    NA    NA    NA    NA    NA    NA
[19,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[20,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
```

Taking a look at the distribution of diffusion times below, we can see that there is a strong peak between 10 and 15. The majority of particles diffuse in less than 40 steps, while the longest diffusion time is 151 steps. The grid above shows that, as one would expect, smaller diffusion times are found close to the surface of the cluster, while longer times are embedded closer to the core.
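Pulling the diffusion times out of the result list is straightforward. The sketch below uses a toy stand-in for `result` so that it runs on its own; on the real data one would simply reuse the `result` list built above.

```r
# Toy stand-in for the 'result' list: a stuck particle is a length-3
# record (x, y, count); a particle which escaped the grid is just NA.
result <- list(list(x = 10, y = 12, count = 34),
               NA,
               list(x = 9, y = 10, count = 151))
#
# Extract diffusion times for particles which adhered to the cluster.
#
times <- unlist(lapply(result, function(p) if (length(p) == 3) p$count))
#
# Histogram of diffusion times.
#
hist(times, breaks = 20, main = "Diffusion Times", xlab = "Steps")
```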

## Scaling it Up

To produce a more interesting cluster we need to scale the grid size up considerably. But this also means that the run time escalates enormously. So, to make this even remotely tractable, I had to parallelise the algorithm. I did this using the SNOW package and ran it on an MPI cluster. The changes to the code are trivial, involving only the creation and initialisation of the cluster and changing %do% to %dopar% in the foreach() loop.
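The change is sketched below. Here a SOCK cluster is used so that the snippet runs on a single machine (and a trivial function stands in for `diffuse()`); the actual run used `type = "MPI"` on the MPI cluster, but the structure is identical.

```r
library(foreach)
library(doSNOW)
#
# Create and register a cluster (SOCK here for illustration; MPI in the
# real run). Worker count is illustrative.
#
cluster <- makeCluster(2, type = "SOCK")
registerDoSNOW(cluster)
#
# Toy workload standing in for diffuse(particles[n,]).
#
square <- function(n) n^2
#
# The only change to the main loop: %do% becomes %dopar%.
#
result <- foreach(n = 1:8) %dopar% square(n)
#
stopCluster(cluster)
```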

## Conclusion

This is nowhere near being an efficient implementation of DLA. But it gets the job done and is illustrative of how the algorithm works. To do production runs one would want to code the algorithm in a low-level language like C or C++ and take advantage of the inherent parallelism of the algorithm.

# Lessons I've Learned from Physics

These are the slides from a short talk that I gave to a group of developers at DERIVCO Dev night. The title of the talk seemed a little pompous in retrospect and I considered retitling it *The Physics of Cats and Mayonnaise*. One of my colleagues came up with the title *Levers and Gravity and Shit*, which I rather like too.

The content of the talk is a little hard to follow from the slides alone, but the general gist of it is that there are some broad principles in Physics which can be interpreted (tenuously extended?) to the world of software development.

A very nice TED talk in a similar vein was given by Dan Cobley, entitled *What physics taught me about marketing*.

# Rosetta's Rendezvous with Comet 67P

On 12 November 2014 the Philae lander from the Rosetta robotic space probe successfully landed on comet 67P/Churyumov–Gerasimenko. Here are a couple of great comics commemorating this momentous achievement.

# Zacks Data on Quandl

Data from Zacks Research have just been made available on Quandl. Registered Quandl users have free preview access to these data, which cover the following:

- Earnings Estimates: forward-looking consensus forecasts;
- Earnings Surprises: estimated future and actual historical earnings;
- Earnings Announcements: predictions for earnings announcement dates, parameters, and supplementary data;
- Sales Estimates: analogous to earnings estimates, but for sales;
- Dividend Data: dividend history and future announcement dates.

These data describe over 5000 publicly traded US and Canadian companies and are updated daily.

## Finding the Data

If you are not already a registered Quandl user, now is the time to sign up. You will find links to all of the data sets mentioned above from the Quandl vendors page. Then, for example, from the Earnings Estimates page you can search for a particular company. I selected Hewlett Packard, which links to a page giving summary data on the Earnings per Share (EPS) for the next three years. These data are presented both in tabular format as well as an interactive plot.

Browsing the data via the Quandl web site gives you a good appreciation of what is available and the general characteristics of the data. However, to do something meaningful you would probably want to download data into an offline analysis package.

## Getting the Data into R

I am going to focus on accessing the data through R using the Quandl package.

```r
> library(Quandl)
```

Obtaining the data is remarkably simple. First you need to authenticate yourself.

```r
> Quandl.auth("ZxixTxUxTxzxyxwxFxyx")
```

You will find your authorisation token under the Account Settings on the Quandl web site.

Grabbing the data is done via the Quandl() function, to which you need to provide the appropriate data set code.

Beneath the data set code you will also find a number of links which pop up the precise code fragment required for downloading the data in a variety of formats and on a selection of platforms (notably R, Python and Matlab, although there are interfaces for a variety of other platforms too).

```r
> # Annual estimates
> #
> Quandl("ZEE/HPQ_A", trim_start="2014-10-31", trim_end="2017-10-31")[,1:5]
        DATE EPS_MEAN_EST EPS_MEDIAN_EST EPS_HIGH_EST EPS_LOW_EST
1 2017-10-31         3.90          3.900         3.90        3.90
2 2016-10-31         4.09          4.100         4.31        3.87
3 2015-10-31         3.94          3.945         4.01        3.88
4 2014-10-31         3.73          3.730         3.74        3.70
> # Quarterly estimates
> #
> HP <- Quandl("ZEE/HPQ_Q", trim_start="2014-10-31", trim_end="2017-10-31")
> HP[,1:5]
        DATE EPS_MEAN_EST EPS_MEDIAN_EST EPS_HIGH_EST EPS_LOW_EST
1 2015-10-31         1.10           1.10         1.14        1.04
2 2015-07-31         0.97           0.98         1.00        0.94
3 2015-04-30         0.95           0.95         1.00        0.91
4 2015-01-31         0.92           0.92         1.00        0.85
5 2014-10-31         1.05           1.05         1.07        1.03
```

Here we see a subset of the EPS data available for Hewlett Packard, giving the maximum and minimum as well as the mean and median projections of EPS at both annual and quarterly resolution.

Next we'll look at a comparison of historical actual and estimated earnings.

```r
> Quandl("ZES/HPQ", trim_start="2011-11-21", trim_end="2014-08-20")[,1:6]
         DATE EPS_MEAN_EST EPS_ACT EPS_ACT_ADJ EPS_AMT_DIFF_SURP EPS_PCT_DIFF_SURP
1  2014-08-20         0.89    0.89       -0.37              0.00              0.00
2  2014-05-22         0.88    0.88       -0.22              0.00              0.00
3  2014-02-20         0.85    0.90       -0.16              0.05              5.88
4  2013-11-26         1.00    1.01       -0.28              0.01              1.00
5  2013-08-21         0.87    0.86       -0.15             -0.01             -1.15
6  2013-05-22         0.81    0.87       -0.32              0.06              7.41
7  2013-02-21         0.71    0.82       -0.19              0.11             15.49
8  2012-11-20         1.14    1.16       -4.65              0.02              1.75
9  2012-08-22         0.99    1.00       -5.50              0.01              1.01
10 2012-05-23         0.91    0.98       -0.18              0.07              7.69
11 2012-02-22         0.87    0.92       -0.18              0.05              5.75
12 2011-11-21         1.13    1.17       -1.05              0.04              3.54
```

The last column gives the EPS surprise (the difference between the actual and estimated EPS) as a percentage. It's clear that the estimates are generally rather good.
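To put a number on "rather good", one can transcribe the surprise percentages from the table above and look at the typical size of the miss (a quick sketch; the vector below is just the EPS_PCT_DIFF_SURP column):

```r
# EPS surprise percentages (EPS_PCT_DIFF_SURP) from the table above.
surprise <- c(0.00, 0.00, 5.88, 1.00, -1.15, 7.41, 15.49,
              1.75, 1.01, 7.69, 5.75, 3.54)
#
# Mean absolute surprise: on average the actual EPS differs from the
# consensus estimate by only a few percent.
#
mean(abs(surprise))
```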

The last thing that we are going to look at is dividend data.

```r
> Quandl("ZDIV/HPQ", trim_start="2014-11-07", trim_end="2014-11-07")[,1:6]
       AS_OF DIV_ANNOUNCE_DATE DIV_EX_DATE DIV_PAY_DATE DIV_REC_DATE DIV_AMT
1 2014-11-07          20140717    20140908     20141001     20140910    0.16
```

Here we see that a $0.16 per share dividend was announced on 17 July 2014 and paid on 1 October 2014.
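Note that the dates come back as integers in YYYYMMDD form, so a small conversion is handy before working with them (a sketch using the announcement and payment dates from the record above):

```r
# Convert the YYYYMMDD integer dates from the dividend record into Date
# objects, which are much easier to work with.
announce <- as.Date(as.character(20140717), format = "%Y%m%d")
pay      <- as.Date(as.character(20141001), format = "%Y%m%d")
#
# Days between announcement and payment.
#
as.numeric(pay - announce)
```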

Having access to these data for a wide range of companies promises to be an enormously useful resource. Unfortunately access to the preview data is fairly limited, but if you plan on making full use of the data, then the premium access starting at $100 per month seems like a reasonable deal.

# Creating More Effective Graphs

A few years ago I ordered a copy of the 2005 edition of Creating More Effective Graphs by Naomi Robbins. Somewhat shamefully I admit that the book got buried beneath a deluge of papers and other books and never received the attention it was due. Having recently discovered the R Graph Catalog, which implements many of the plots from the book using ggplot2, I had to dig it out and give it some serious attention.

Both the book and web site are excellent resources if you are looking for informative ways to present your data.

Being a big fan of xkcd, I rather enjoyed the example plot in xkcd style (which I don't think is covered in the book...). The code provided on the web site is used as the basis for the plot below.

This plot is broadly consistent with the data from the Public Data archive on Google, but the effects of smoothing in the xkcd style plot can be clearly seen. Is this really important? Well, I suppose that depends on the objective of the plot. If it's just to inform (and look funky in the process), then the xkcd plot is perfectly fine. If you are looking for something more precise, then a more conventional plot without smoothing would be more appropriate.

I like the xkcd style plot though and here's the code for generating it, loosely derived from the code on the web site.

```r
> library(ggplot2)
> library(xkcd)
>
> countries <- c("Rwanda", "South Africa", "Norway", "Swaziland", "Brazil")
>
> hdf <- droplevels(subset(read.delim(file = "http://tiny.cc/gapminder"),
+                          country %in% countries))
>
> direct_label <- data.frame(year = 2009,
+                            lifeExp = hdf$lifeExp[hdf$year == 2007],
+                            country = hdf$country[hdf$year == 2007])
>
> set.seed(123)
>
> ggplot() +
+   geom_smooth(data = hdf,
+               aes(x = year, y = lifeExp, group = country, linetype = country),
+               se = FALSE, color = "black") +
+   geom_text(aes(x = year + 2.5, y = lifeExp + 3, label = country),
+             data = direct_label, hjust = 1, vjust = 1, family = "xkcd", size = 7) +
+   theme(legend.position = "none") +
+   ylab("Life Expectancy") +
+   xkcdaxis(c(1952, 2010), c(20, 83))
```

# Standard Bank: Striving for Mediocrity

Recently I was in my local Standard Bank branch. After finally reaching the front of the queue and being helped by a reasonably courteous young man, I was asked if I would mind filling out a survey. Sure. No problem. I had been in the bank for 30 minutes, I could probably afford another 30 seconds.

And then I was handed this abomination:

So, if I was deliriously satisfied with the service that I had received, then I would award them a 10. If I was neither impressed nor dismayed, I would give them a 9. But if I was not happy at all, then I would give them an 8.

Let me repeat that so that the horror sinks in: if I was completely dissatisfied with their service then I would give them an 8! Out of 10. That's 80%.

80% for shoddy service!

Whoever is managing this little piece of supposed market research should be ashamed. What a load of rubbish.

# Commitments of Traders: Update 19 August 2014

# Commitments of Traders: Update 12 August 2014

# Plotting Flows with riverplot

I have been looking for an intuitive way to plot flows or connections between states in a process. An obvious choice is a Sankey Plot, but I could not find a satisfactory implementation in R... until I read the riverplot post by January Weiner. His riverplot package does precisely what I need.

Getting your data into the right format is a slightly clunky procedure. However, my impression is that the package is still a work in progress and it's likely that this process will change in the future. For now though, here is an illustration of how a multi-level plot can be constructed.

## Assembling the Data

The plan for this example is to have four nodes at each of six layers, with flows between layers. The data are a little contrived, but they illustrate the procedure quite nicely and they produce a result which is not dissimilar to the final plot I was after. We have to create data structures for both nodes and edges. I will start with the edges and then use these data to extract the nodes.

The edges data frame consists of records with a "from" node (N1) and a "to" node (N2) as well as a value for the flow between them. Here I systematically construct a grid of random flows and remove some records to break the symmetry.

```r
> edges = data.frame(N1 = paste0(rep(LETTERS[1:4], each = 4), rep(1:5, each = 16)),
+                    N2 = paste0(rep(LETTERS[1:4], 4), rep(2:6, each = 16)),
+                    Value = runif(80, min = 2, max = 5) *
+                            rep(c(1, 0.8, 0.6, 0.4, 0.3), each = 16),
+                    stringsAsFactors = F)
>
> edges = edges[sample(c(TRUE, FALSE), nrow(edges), replace = TRUE, prob = c(0.8, 0.2)),]
> head(edges)
   N1 N2  Value
1  A1 A2 2.3514
4  A1 D2 2.2052
5  B1 A2 3.0959
7  B1 C2 2.8756
9  C1 A2 4.5099
10 C1 B2 4.1782
```

The names of the nodes are then extracted from the edge data frame. Horizontal and vertical locations for the nodes are calculated based on the labels. These locations are not strictly necessary because the package will work out sensible default values for you.

```r
> nodes = data.frame(ID = unique(c(edges$N1, edges$N2)), stringsAsFactors = FALSE)
> #
> nodes$x = as.integer(substr(nodes$ID, 2, 2))
> nodes$y = as.integer(sapply(substr(nodes$ID, 1, 1), charToRaw)) - 65
> #
> rownames(nodes) = nodes$ID
> head(nodes)
   ID x y
A1 A1 1 0
B1 B1 1 1
C1 C1 1 2
D1 D1 1 3
A2 A2 2 0
B2 B2 2 1
```

Finally we construct a list of styles which will be applied to each node. It's important to choose suitable colours and introduce transparency for overlaps (which is done here by appending "60", a hexadecimal alpha value, to the RGB strings).

```r
> library(RColorBrewer)
> #
> palette = paste0(brewer.pal(4, "Set1"), "60")
> #
> styles = lapply(nodes$y, function(n) {
+   list(col = palette[n+1], lty = 0, textcol = "black")
+ })
> names(styles) = nodes$ID
```

## Constructing the riverplot Object

Now we are in a position to construct the riverplot object. We do this by joining the node, edge and style data structures into a list and then adding "riverplot" to the list of class attributes.

```r
> library(riverplot)
>
> rp <- list(nodes = nodes, edges = edges, styles = styles)
> #
> class(rp) <- c(class(rp), "riverplot")
```

Producing the plot is then simple.

```r
> plot(rp, plot_area = 0.95, yscale = 0.06)
```

## Conclusion

I can think of a whole host of applications for figures like this, so I am very excited about the prospects. I know that I am going to have to figure out how to add additional labels to the figures, but I'm pretty sure that will not be too much of an obstacle.

The current version of riverplot is v0.3. Incidentally, when I stumbled on a small bug in v0.2 of riverplot, January was very quick to respond with a fix.