Rafal Lukawiecki – Putting Science into the Business of Data Science

A talk by Rafal Lukawiecki at the Microsoft Machine Learning & Data Science Summit (2016).

Data science relies on the scientific method of reasoning to help make business decisions based on analytics. Let Rafal explain how his customers apply the trusted processes and the principles of hypothesis testing with machine learning and statistics towards solving their day-to-day, practical business problems. Rafal will speak from his 10 years of experience in data mining and statistics, using the Microsoft data platform for high-value customer identification, recommendation and gap analysis, customer paths and acquisition modelling, price optimization and other forms of advanced analytics. He will share how this work leads to measurable results that not only make his customers more successful but also keep the data scientist true to his principles. This session will also help you understand how to start and justify data science projects in ways that business decision makers find attractive, and how to do it all with the help of the Microsoft platform.

R, HDF5 Data and Lightning

I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and LIS/OTD being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since that was the format used for the LIS/OTD data, I went to review those old scripts. It was very pleasant rediscovering work I did some time ago.

LIS and OTD

The Optical Transient Detector (OTD) and Lightning Imaging Sensor (LIS) were instruments for detecting lightning discharges from Low Earth Orbit. OTD was launched in 1995 on the MicroLab-1 satellite into a near polar orbit with inclination 70°. OTD achieved global (spatial) coverage for the period May 1995 to April 2000 with roughly 60% uptime. LIS was an instrument on the TRMM satellite, launched into a 35° inclination orbit during 1997. Data from LIS were thus confined to more tropical latitudes. The TRMM mission ended only in April 2015.

The seminal work using data from OTD, Global frequency and distribution of lightning as observed from space by the Optical Transient Detector, was published by Hugh Christian and his collaborators in 2003. It’s open access and well worth a read if you are interested in where and when lightning happens across the Earth’s surface.

Preprocessing HDF4 to HDF5

The LIS/OTD data are available as HDF4 files from http://thunder.msfc.nasa.gov/data/. To load them into R I first converted to HDF5 using a tool from the h5utils suite:

$ h5fromh4 -d lrfc LISOTD_LRFC_V2.3.2014.hdf

Loading HDF5 in R

Support for HDF5 in R appears to have evolved appreciably in recent years. I originally used the hdf5 package. Then, some time later, I transitioned to the h5r package. Neither of these appear on CRAN at present. Current support for HDF5 is via the h5 package. This package depends on the HDF5 C++ library (h5c++), which I first needed to install.

$ sudo apt-get install libhdf5-dev

Then, back in R I installed and loaded the h5 package.

> install.packages("h5")
> library(h5)

Ready to roll!

Low Resolution Full Climatology

The next step was to interrogate the contents of one of the HDF files. A given file may contain multiple data sets (this is part of the “hierarchical” nature of HDF), so we’ll check on what data sets are packed into one of those files. Let’s look at the Low Resolution Full Climatology (LRFC) data.

> file = h5file(name = "data/LISOTD_LRFC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> cat("datasets:", dataset, "\n")
datasets: /lrfc

Just a single data set, but that’s by design: we only extracted one using h5fromh4 above. What are the characteristics of that data set?

> print(file[dataset])
DataSet 'lrfc' (72 x 144)
type: numeric
chunksize: NA
maxdim: 72 x 144

It contains numerical data and has dimensions 72 by 144, which means that it has been projected onto a latitude/longitude grid with 2.5° resolution (180°/72 = 360°/144 = 2.5°). We’ll just go ahead and read in those data.

> lrfc = readDataSet(file[dataset])
> class(lrfc)
[1] "matrix"
> dim(lrfc)
[1]  72 144

That wasn’t too hard. And it’s not much more complicated if there are multiple data sets per file.
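
If a file does hold several data sets, a sketch of handling them (an assumption on my part, using the same h5 functions as above) is simply to iterate over the names returned by list.datasets():

datasets <- list.datasets(file)
#
# Read every data set into a named list.
#
contents <- lapply(datasets, function(d) readDataSet(file[d]))
names(contents) <- datasets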

Below is a ggplot showing the annualised distribution of lightning across the Earth’s surface. It’s apparent that most lightning occurs over land in the tropics, with the highest concentration in Central Africa. The units of the colour scale are flashes per square km per year. Higher resolution data can be found in the HRFC (High Resolution Full Climatology) file, but the LRFC is quite sufficient to get a flavour of the data.

lis-otd-flash-density
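
For reference, a map along those lines can be put together with ggplot2. This is just a sketch: the grid orientation and the reshape2 dependency are my assumptions.

library(ggplot2)
library(reshape2)
#
# Convert the matrix to long format and attach cell-centre coordinates,
# assuming rows run from 90° S to 90° N and columns from 180° W to 180° E.
#
flash <- melt(lrfc, varnames = c("row", "col"), value.name = "density")
flash$lat <- -90 + 2.5 * (flash$row - 0.5)
flash$lon <- -180 + 2.5 * (flash$col - 0.5)
#
ggplot(flash, aes(x = lon, y = lat, fill = density)) +
  geom_raster() +
  scale_fill_gradient("Flashes / km² / year", low = "white", high = "red") +
  xlab("Longitude") + ylab("Latitude") +
  theme_classic()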

Low Resolution Annual Climatology

The Low Resolution Annual Climatology (LRAC) data have the same spatial resolution as LRFC but the data are broken down by day of year. This allows us to see how lightning activity varies at a particular location during the course of a year.

> file = h5file(name = "data/LISOTD_LRAC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> lrac = readDataSet(file[dataset])
> dim(lrac)
[1]  72 144 365

The data are now packed into a three-dimensional array, where the first two dimensions are spatial (as for LRFC) and the third dimension corresponds to day of year.

We’ll look at two specific grid cells, one centred at 28.75° S 28.75° E (near the northern border of Lesotho, which according to the plot above is a region of relatively intense lightning activity) and the other at 31.25° S 31.25° E (in the Indian Ocean, just off the coast of Margate, South Africa). The annualised time series are plotted below. Lesotho has a clear annual cycle, with peak lightning activity during the Summer months but extending well into Spring and Autumn. There is very little lightning activity in Lesotho during Winter due to extremely dry and cold conditions. The cell over the Indian Ocean has a relatively high level of sustained lightning activity throughout the year. This is due to the presence of the warm Agulhas Current flowing down the east coast of South Africa. We wrote a paper about this, Processes driving thunderstorms over the Agulhas Current. It’s also open access, so go ahead and check it out.
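
For reference, those two series can be pulled out of the lrac array along these lines (a sketch: the helper function and the grid orientation are my assumptions).

# Convert a cell-centre latitude/longitude into array indices on the 2.5° grid,
# assuming rows run from south to north and columns from west to east.
#
cell.index <- function(lat, lon) {
  c(row = (lat + 91.25) / 2.5, col = (lon + 181.25) / 2.5)
}
#
lesotho <- lrac[cell.index(-28.75, 28.75)["row"], cell.index(-28.75, 28.75)["col"], ]
margate <- lrac[cell.index(-31.25, 31.25)["row"], cell.index(-31.25, 31.25)["col"], ]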

lis-otd-time-series

Although the data in the plot above are rather jagged, with some aggregation and mild filtering they become pleasingly smooth and regular. We observed that you could actually fit the resulting curves rather well with just a pair of sinusoids. That work was documented in A harmonic model for the temporal variation of lightning activity over Africa. Like the other two papers, it’s open access. Enjoy (if that sort of thing is your cup of tea).
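
To give a flavour of that harmonic model, here is a minimal sketch (not the code from the paper) which fits an annual and a semi-annual sinusoid to a smoothed daily series.

# Assumed input: `flashes`, a smoothed daily flash density series for one cell.
#
day <- 1:365
#
# Annual (365 day) and semi-annual (182.5 day) harmonics.
#
fit <- lm(flashes ~ sin(2 * pi * day / 365) + cos(2 * pi * day / 365) +
                    sin(4 * pi * day / 365) + cos(4 * pi * day / 365))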

Conclusion

My original intention with this post was to show how to handle HDF data in R. But in retrospect it has achieved a second objective, showing that it’s possible to do some meaningful Science with data that’s in the public domain.

Recent Common Ancestors: Simple Model

An interesting paper (Modelling the recent common ancestry of all living humans, Nature, 431, 562–566, 2004) by Rohde, Olson and Chang concludes with the words:

Further work is needed to determine the effect of this common ancestry on patterns of genetic variation in structured populations. But to the extent that ancestry is considered in genealogical rather than genetic terms, our findings suggest a remarkable proposition: no matter the languages we speak or the colour of our skin, we share ancestors who planted rice on the banks of the Yangtze, who first domesticated horses on the steppes of the Ukraine, who hunted giant sloths in the forests of North and South America, and who laboured to build the Great Pyramid of Khufu.

This paper considered two models of our genealogical heritage. Neither of the models was exhaustive and they could not take into account all of the complexities involved in determining the genealogy of the Earth’s population. However, they captured many of the essential features. And despite their relative simplicity, the models reveal an unexpected result: a genealogical Common Ancestor (CA) of the entire present day human population would have lived in the relatively recent past.

It would be interesting to replicate some of those results.

An Even Simpler Model

In an earlier paper, Recent common ancestors of all present-day individuals, Chang considered a simpler model which he described as follows:

We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made just as in the standard Wright–Fisher model (randomly and equally likely over the n possibilities) the only difference being that here each individual chooses twice instead of once. All choices are made independently. Thus, for example, it is possible that when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.

At first sight this appears to be an extremely crude and abstract model. However it captures many of the essential features required for modelling genealogical inheritance. It also has the advantage of not being too difficult to implement in R. If these details are not your cup of tea, feel free to skip forward to the results.

First we’ll need a couple of helper functions to neatly generate labels for generations (G) and individuals (I).

> library(stringr)  # provides str_pad()
> 
> label.individuals <- function(N) {
+   paste("I", str_pad(1:N, 2, pad = "0"), sep = "")
+ }
> 
> make.generation <- function(N, g) {
+   paste(paste0("G", str_pad(g, 2, pad = "0")), label.individuals(N), sep = "/")
+ }
> 
> make.generation(3, 5)
[1] "G05/I01" "G05/I02" "G05/I03"

Each individual is identified by a label of the form “G05/I01”, which gives the generation number and also the specific identifier within that generation.

Next we’ll have a function to generate the random links between individuals in successive generations. The function will take two arguments: the number of individuals per generation and the number of ancestor generations (those prior to the “current” generation).

> # N - number of people per generation
> # G - number of ancestor generations (there will be G+1 generations!)
> #
> generate.edges <- function(N, G) {
+   edges = lapply(0:(G-1),  function(g) {
+     children <- rep(make.generation(N, g), each = 2)
+     #
+     parents <- sample(make.generation(N, g+1), 2 * N, replace = TRUE)
+     #
+     data.frame(parents = parents, children = children, stringsAsFactors = FALSE)
+   })
+   do.call(rbind, edges)
+ }
> 
> head(generate.edges(3, 3), 10)
   parents children
1  G01/I02  G00/I01
2  G01/I01  G00/I01
3  G01/I01  G00/I02
4  G01/I02  G00/I02
5  G01/I01  G00/I03
6  G01/I03  G00/I03
7  G02/I02  G01/I01
8  G02/I01  G01/I01
9  G02/I01  G01/I02
10 G02/I02  G01/I02

So, for example, the data generated above links the child node G00/I01 (individual 1 in generation 0) to two parent nodes, G01/I01 and G01/I02 (individuals 1 and 2 in generation 1).

Finally we’ll wrap all of that up in a graph-like object.

> library(igraph)
> generate.nodes <- function(N, G) {
+   lapply(0:G, function(g) make.generation(N, g))
+ }
> 
> generate.graph <- function(N, G) {
+   nodes = generate.nodes(N, G)
+   #
+   net <- graph.data.frame(generate.edges(N, G), directed = TRUE, vertices = unlist(nodes))
+   #
+   # Edge layout
+   #
+   x = rep(1:N, G+1)
+   y = rep(0:G, each = N)
+   #
+   net$layout = cbind(x, y)
+   #
+   net$individuals = label.individuals(N)
+   net$generations = nodes
+   #
+   net
+ }

Let’s give it a whirl on a graph consisting of 10 ancestor generations (11 generations including the current generation) and 8 individuals per generation. The result is plotted below with the oldest generation (G10) at the top and the current generation (G00) at the bottom. The edges indicate parent/progeny links.
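
A sketch of how that graph might be generated and plotted (the plotting parameters and orientation are my assumptions):

set.seed(1)
#
net <- generate.graph(8, 10)
#
plot(net, layout = net$layout, vertex.size = 4, vertex.label = NA,
     edge.arrow.size = 0.2)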

sample-graph

Okay, so it looks like all of the infrastructure is in place. Now we need to put together the functionality for analysing the relationships between generations. We are really only interested in how any one of the ancestor generations relates to the current generation, which makes things a little simpler. First we write a function to calculate the number of steps from an arbitrary node (identified by label) to each of the nodes in the current generation. This is simplified by the fact that igraph already has a shortest.paths() function.

> library(dplyr)  # provides first()
> 
> generation.descendants <- function(net, node) {
+   present.generation <- first(net$generations)
+   #
+   shortest.paths(net, v = node, to = present.generation, mode = "out")
+ }

Next a pair of helper functions which indicate whether a particular node is a CA of all nodes in the current generation or if it has gone extinct. Node G03/I08 in the above graph is an example of an extinct node since it has no child nodes.

> common.ancestor <- function(net, node) {
+   all(is.finite(generation.descendants(net, node)))
+ }
>   
> extinct <- function(net, node) {
+   all(is.infinite(generation.descendants(net, node)))
+ }

Let’s test those. We’ll generate another small graph for the test.

> set.seed(1)
> #
> net <- generate.graph(3, 5)
> #
> # This is an ancestor of all.
> #
> generation.descendants(net, "G05/I01")
        G00/I01 G00/I02 G00/I03
G05/I01       5       5       5
> common.ancestor(net, "G05/I01")
[1] TRUE
> #
> # This is an ancestor of all but G00/I02
> #
> generation.descendants(net, "G02/I03")
        G00/I01 G00/I02 G00/I03
G02/I03       2     Inf       2
> common.ancestor(net, "G02/I03")
[1] FALSE
> #
> # This node is extinct.
> #
> generation.descendants(net, "G03/I01")
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
> extinct(net, "G03/I01")
[1] TRUE

It would also be handy to have a function which gives the distance from every node in a particular generation to each of the nodes in the current generation.

> generation.distance <- function(net, g) {
+   current.generation <- net$generations[[g+1]]
+   #
+   do.call(rbind, lapply(current.generation, function(n) {generation.descendants(net, n)}))
+ }
> generation.distance(net, 3)
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
G03/I02       3       3       3
G03/I03       3       3       3

So here we see that G03/I02 and G03/I03 are both ancestors of each of the individuals in the current generation, while G03/I01 has gone extinct.

Finally we can pull all of this together with a single function that classifies individuals in previous generations according to whether they are an ancestor of at least one but not all of the current generation; a common ancestor of all of the current generation; or extinct.

> classify.generation <- function(net, g) {
+   factor(apply(generation.distance(net, g), 1, function(p) {
+     ifelse(all(is.infinite(p)), 1, ifelse(all(is.finite(p)), 2, 0))
+   }), levels = 0:2, labels = c("ancestor", "extinct", "common"))
+ }
> classify.generation(net, 3)
G03/I01 G03/I02 G03/I03 
extinct  common  common 
Levels: ancestor extinct common

Results

First let’s have a look at how the relationship between previous generations and the current generation changes over time. The plot below indicates the proportion of ancestors, common ancestors and extinct individuals as a function of generation number. Note that time runs backwards along the horizontal axis, with the current generation on the left and successive ancestor generations moving towards the right.

The graph used to generate these data had 1000 individuals per generation. We can see that for the first 10 generations around 80% of the individuals were ancestors of one or more (but not all!) of the current generation. The remaining 20% or so were extinct. Then in generation 11 we have the first CA. This individual would be labelled the Most Recent Common Ancestor (MRCA). Going further back in time the fraction of CAs increases while the proportion of mere ancestors declines until around generation 19 when all of the ancestors are CAs.
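
A sketch of how those proportions might be computed (the number of ancestor generations here is an assumption, and with 1000 individuals per generation this takes a while to run):

N <- 1000
G <- 25
#
net <- generate.graph(N, G)
#
# Proportion of ancestors, extinct individuals and common ancestors per generation.
#
proportions <- sapply(1:G, function(g) table(classify.generation(net, g)) / N)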

ancestor-evolution

There are two important epochs to identify on the plot above:

  1. the generation at which the MRCA emerged (this is denoted \(T_n\)); and
  2. the generation at which all individuals are either CAs or extinct (this is denoted \(U_n\)).

Chang’s paper gives asymptotic expressions for how these epochs vary as a function of generation size. It also presents a table with sample values of \(T_n\) and \(U_n\) for generations of size 500, 1000, 2000 and 4000. Those samples support the asymptotic relationships.

Since we have all of the infrastructure in place, we can conduct an independent test of the relationships. Below is a boxplot generated from 100 realisations each of calculations with generations of size 10, 100 and 1000. The blue boxes reflect our estimates for \(T_n\), while the green boxes are the estimates of \(U_n\). The two dashed lines reflect the asymptotic relationships. There’s pretty good agreement between the theoretical and experimental results, although our estimates for \(U_n\) appear to be consistently larger than expected. The sample size was limited though… Furthermore, these are asymptotic relationships, so we really only expect them to hold exactly for large generation sizes.
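
For completeness, a sketch of how \(T_n\) and \(U_n\) might be estimated from a single realisation (the helper below is my own; G must be large enough for both epochs to be reached):

estimate.epochs <- function(N, G) {
  net <- generate.graph(N, G)
  #
  classes <- lapply(1:G, function(g) classify.generation(net, g))
  #
  # T_n: first generation (counting back) containing a common ancestor.
  # U_n: first generation in which every individual is either a CA or extinct.
  #
  Tn <- min(which(sapply(classes, function(x) any(x == "common"))))
  Un <- min(which(sapply(classes, function(x) all(x != "ancestor"))))
  #
  c(Tn = Tn, Un = Un)
}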

Un-Tn-versus-N

When would our MRCA have been alive based on this simple model and the Earth’s current population? Setting the generation size to the Earth’s current population of around 7 billion people gives \(T_n\) at about 33 generations and a \(U_n\) of approximately 58 generations. I know, those 7 billion people are not from the same generation, but we are going for an estimate here…
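
Those numbers are consistent with Chang’s asymptotic expressions, roughly \(T_n \approx \log_2 n\) and \(U_n \approx 1.77 \log_2 n\):

log2(7e9)          # T_n: roughly 33 generations
1.77 * log2(7e9)   # U_n: roughly 58 generations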

What does a generation translate to in terms of years? That figure has varied significantly over time. It also depends on location. However, in developed nations it is currently around 30 years. WolframAlpha equates a generation to 28 years. So, based on these estimates we have \(T_n\) at around 1000 years and our estimate of \(U_n\) translates into less than 2 millennia.

Of course, these numbers need to be taken with a pinch of salt since they are based on a model which assumes that the population has been static at around 7 billion, which we know to be far from true! However these results are also not completely incompatible with the more sophisticated results of Rohde, Olson and Chang who found that the MRCA of all present-day humans lived just a few thousand years ago.

Simulating Intricate Branching Patterns with DLA

Manfred Schroeder’s book Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise is a fruitful source of interesting topics and projects. He gives a thorough description of Diffusion-Limited Aggregation (DLA) as a technique for simulating physical processes which produce intricate branching structures. Examples, as illustrated below, include Lichtenberg Figures, dielectric breakdown, electrodeposition and Hele-Shaw flow.

dla-post-illustration

Diffusion-Limited Aggregation

DLA is conceptually simple. A seed particle is fixed at the origin of the coordinate system. Another particle is introduced at a relatively large distance from the origin, which then proceeds to move on a random walk. Either it will wander off to infinity or it will come into contact with the particle at the origin, to which it sticks irreversibly. Now another particle is introduced and the process repeats itself. As successive particles are added to the system, a portion of them become bound to the growing cluster of particles at the origin.

The objects which evolve from this process are intrinsically random, yet have self-similar structure across a range of scales. There is also an element of positive feedback, where once a protuberance has formed on the cluster, further particles are more likely to adhere to it since they will probably encounter it first.

A Simple Implementation in R

First we need to construct a grid. We will start small: a 20 by 20 grid filled with NA except for four seed locations at the centre.

> # Dimensions of grid
> W <- 20
> grid <- matrix(NA, nrow = W, ncol = W)
> grid[W/2 + c(0, 1), W/2 + c(0, 1)] = 0
> grid
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [2,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [3,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [4,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [5,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [6,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [7,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [8,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [9,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[10,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA
[11,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     0     0    NA    NA    NA    NA    NA    NA    NA    NA    NA
[12,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[13,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[14,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[15,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[16,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[17,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[18,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[19,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[20,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

We need to generate two dimensional random walks. To do this I created a table of possible moves, from which individual steps could be sampled at random. The table presently only caters for the cells immediately above, below, left or right of the current cell. It would be a simple matter to extend the table to allow diagonal moves as well, but these more distant moves would need to be weighted accordingly.

> moves <- data.frame(dx = c(0, 0, +1, -1), dy = c(+1, -1, 0, 0))
> #
> M = nrow(moves)
> moves
  dx dy
1  0  1
2  0 -1
3  1  0
4 -1  0

Next, a function to transport a particle from its initial location until it either leaves the grid or adheres to the cluster at the origin. Again, a possible refinement here would be to allow sticking to next-nearest neighbours as well.

> diffuse <- function(p) {
+   count = 0
+   #
+   while (TRUE) {
+     p = p + moves[sample(M, 1),]
+     #
+     count = count + 1
+     #
+     # Black boundary conditions
+     #
+     if (p$x > W | p$y > W | p$x < 1 | p$y < 1) return(NA)
+     #
+     # Check if it sticks (to nearest neighbour)
+     #
+     if (p$x < W && !is.na(grid[p$x+1, p$y])) break
+     if (p$x > 1 && !is.na(grid[p$x-1, p$y])) break
+     if (p$y < W && !is.na(grid[p$x, p$y+1])) break
+     if (p$y > 1 && !is.na(grid[p$x, p$y-1])) break
+   }
+   #
+   return(c(p, count = count))
+ }

Finally we are ready to apply this procedure to a batch of particles.

> library(foreach)
>
> # Number of particles per batch
> #
> PBATCH <- 5000
> #
> # Select starting position
> #
> phi = runif(PBATCH, 0, 2 * pi)
> #
> x = round((1 + cos(phi)) * W / 2 + 0.5)
> y = round((1 + sin(phi)) * W / 2 + 0.5)
> #
> particles <- data.frame(x, y)
> 
> result = foreach(n = 1:PBATCH) %do% diffuse(particles[n,])
> 
> lapply(result, function(p) {if (length(p) == 3) grid[p$x, p$y] <<- p$count})

The resulting grid shows all of the locations where particles have adhered to the cluster. The number at each location is the diffusion time, which indicates the number of steps required for the particle to move from its initial location to its final resting place. The shape of the cluster is a little boring at present: essentially a circular disk centred on the origin. This is due to the size of the problem and we will need to have a much larger grid to produce more interesting structure.

> grid
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [2,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [3,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     2    18    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [4,]   NA   NA   NA   NA   NA   NA   NA   NA    8     6    17    10    NA    NA    NA    NA    NA    NA    NA    NA
 [5,]   NA   NA   NA   NA   NA   NA   NA    6   11     7    54    25    15    NA    NA    NA    NA    NA    NA    NA
 [6,]   NA   NA   NA   NA   NA   NA   16   10   11    58    69    18    31    16    NA    NA    NA    NA    NA    NA
 [7,]   NA   NA   NA   NA   NA   20   19   10   21    24    32    50    24    65     8    NA    NA    NA    NA    NA
 [8,]   NA   NA   NA   NA   18   12   55   26   13   151    86    20    21    26    27    34    NA    NA    NA    NA
 [9,]   NA   NA   NA   56   21   43   19   53   43    30    26    37    66    52    30    22    10    NA    NA    NA
[10,]   NA   NA   29    9    9   23   70   38   48     0     0   122    26    44    22    10    27     5    NA    NA
[11,]   NA   NA    2    9   10   36   32   38   24     0     0    54    14    21    65    14    30    29    NA    NA
[12,]   NA   NA   NA    5   10   16   13   83   52    43    23    42    39    23    66     9    32    NA    NA    NA
[13,]   NA   NA   NA   NA   21   70   28   31   NA    41    61    15    17    29    25    17    NA    NA    NA    NA
[14,]   NA   NA   NA   NA   NA    8   29   19    7    47   119    37    19     9    10    NA    NA    NA    NA    NA
[15,]   NA   NA   NA   NA   NA   NA   15   33   68    26    38    13    33     8    NA    NA    NA    NA    NA    NA
[16,]   NA   NA   NA   NA   NA   NA   NA   10   12    12    15    35    11    NA    NA    NA    NA    NA    NA    NA
[17,]   NA   NA   NA   NA   NA   NA   NA   NA   20     8     6     5    NA    NA    NA    NA    NA    NA    NA    NA
[18,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    18    20    NA    NA    NA    NA    NA    NA    NA    NA    NA
[19,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[20,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Taking a look at the distribution of diffusion times below we can see that there is a strong peak between 10 and 15. The majority of particles diffuse in less than 40 steps, while the longest diffusion time is 151 steps. The grid above shows that, as one would expect, smaller diffusion times are found close to the surface of the cluster, while longer times are embedded closer to the core.
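
The histogram can be produced directly from the grid (a sketch; the seed cells, which hold a count of zero, are excluded):

times <- grid[!is.na(grid) & grid > 0]
#
hist(times, breaks = 20, xlab = "Diffusion time [steps]", main = "")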

dla-big-steps-hist-sample

Scaling it Up

To produce a more interesting cluster we need to scale the grid size up considerably. But this also means that the run time escalates enormously. So, to make this even remotely tractable, I had to parallelise the algorithm. I did this using the SNOW package and ran it on an MPI cluster. The changes to the code are trivial, involving only the creation and initialisation of the cluster and changing %do% to %dopar% in the foreach() loop.
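
A sketch of what that setup might look like (the doSNOW registration and the worker count are my assumptions):

library(doSNOW)
#
# Create and register an MPI cluster (worker count is arbitrary).
#
cluster <- makeCluster(16, type = "MPI")
registerDoSNOW(cluster)
#
result <- foreach(n = 1:PBATCH) %dopar% diffuse(particles[n,])
#
stopCluster(cluster)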

dla-big

Conclusion

This is nowhere near being an efficient implementation of DLA. But it gets the job done and is illustrative of how the algorithm works. To do production runs one would want to code the algorithm in a low-level language like C or C++ and take advantage of its inherent parallelism.

Whistlers and Volcanic Lightning

A paper authored with my very talented student, Claire Antel, entitled “Investigating Dunedin whistlers using volcanic lightning” has just been published in Geophysical Research Letters. The paper looks at electromagnetic signals (“whistlers”) received at Dunedin, New Zealand, caused by lightning over volcanoes thousands of km away near the Aleutian Chain.

Abstract

Whistlers detected at Dunedin, New Zealand are an anomaly: there is little lightning around Dunedin’s conjugate point yet whistlers appear in relatively large numbers. These surplus whistlers have consequently inspired investigations into their origins. Dunedin’s lightning-sparse conjugate point lies in the Aleutian Islands, a region populated with active volcanoes. Their presence has allowed us to perform a novel analysis: the correlation of whistlers to volcanic lightning. We report on our investigation, which successfully yielded the first observations of “volcanic whistlers.” It was found that the single July 2008 Mount Okmok eruption had an impressive effect on the number of whistlers at Dunedin. The eruptions at Mount Redoubt in 2009 also caused a sporadic flow of whistlers in Dunedin.

In Other Media

Claire has also given an interview about the paper.

Hazardous and Benign Space Objects: Orbits in the Solar-Ecliptic Reference Frame

In two previous posts in this series I have wrangled NEO orbital data into R and then solved Kepler’s Equation to get the eccentric anomaly for each NEO. The final stage in the visualisation of the NEO orbits will be the transformation of locations from the respective orbital planes into a single reference frame.

Reference Frame

The heliocentric ecliptic reference frame has the Sun at the origin and the ecliptic as the fundamental plane. For visualisation purposes we will be representing positions with respect to this plane in terms of rectangular (or Cartesian) coordinates.

Transformation

The transformation between the orbital plane (X, Y, Z) and the heliocentric ecliptic coordinate system (x, y, z) is achieved with

\[
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} =
\begin{pmatrix}
\cos \phi_0 \cos \Omega - \sin \phi_0 \cos i \sin \Omega & \cos \phi_0 \sin \Omega + \sin \phi_0 \cos i \cos \Omega & \sin \phi_0 \sin i \\
-\sin \phi_0 \cos \Omega - \cos \phi_0 \cos i \sin \Omega & -\sin \phi_0 \sin \Omega + \cos \phi_0 \cos i \cos \Omega & \cos \phi_0 \sin i \\
\sin i \sin \Omega & -\sin i \cos \Omega & \cos i
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
\]

where \(\phi_0\) is the argument of perihelion (the w column in our data), \(i\) is the inclination and \(\Omega\) is the longitude of the ascending node (the Node column).

Cartesian Coordinates

But, before we can use this transformation, we need to move to Cartesian Coordinates. This is easily accomplished.
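
In the orbital plane the position relative to the focus follows directly from the eccentric anomaly:

\[ X = a (\cos E - e), \qquad Y = a \sqrt{1 - e^2} \, \sin E, \qquad Z = 0. \]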

> orbits = within(orbits, {
+   Z = 0
+   Y = a * sqrt(1 - e**2) * sin(E)
+   X = a * cos(E) - a * e
+ })
> head(orbits)
             Object Epoch      a       e        i       w   Node       M     q.    Q    P    H     MOID class hazard       E  theta        X        Y Z
1  100004 (1983 VA) 56800 2.5959 0.69970 0.284254 0.21050 1.3498 1.40950 0.7795 4.41 4.18 16.4 0.176375   APO  FALSE 2.03512 2.6336 -2.97885  1.65824 0
2 100085 (1992 UY4) 56800 2.6398 0.62481 0.048926 0.66896 5.3826 0.22042 0.9904 4.29 4.29 17.8 0.015111   APO   TRUE 0.54358 1.0511  0.60993  1.06602 0
3 100756 (1998 FM5) 56800 2.2686 0.55202 0.201156 5.43895 3.0869 4.82500 1.0163 3.52 3.42 16.1 0.100211   APO  FALSE 4.31582 3.8282 -2.12857 -1.74482 0
4  100926 (1998 MQ) 56800 1.7827 0.40777 0.422864 2.42053 3.8599 3.42868 1.0558 2.51 2.38 16.7 0.128731   AMO  FALSE 3.34594 3.2744 -2.47255 -0.33031 0
5  101869 (1999 MM) 56800 1.6243 0.61071 0.083167 4.68848 1.9388 0.75843 0.6323 2.62 2.07 19.3 0.001741   APO   TRUE 1.35497 2.0445 -0.64413  1.25637 0
6 101873 (1999 NC5) 56800 2.0295 0.39332 0.798850 5.15317 2.2489 5.65453 1.2312 2.83 2.89 16.3 0.437678   AMO  FALSE 5.33502 4.9613  0.38531 -1.51575 0

Note that the Z component is zero for all of the orbits since this is the coordinate which is perpendicular to the orbital planes (and all of the orbits lie strictly within the fundamental plane of their respective coordinate systems).

Now we need to generate a transformation matrix for each orbit. We will wrap that up in a function.

> transformation.matrix <- function(w, Node, i) {
+   a11 =   cos(w) * cos(Node) - sin(w) * cos(i) * sin(Node)
+   a12 =   cos(w) * sin(Node) + sin(w) * cos(i) * cos(Node)
+   a13 =   sin(w) * sin(i)
+   
+   a21 = - sin(w) * cos(Node) - cos(w) * cos(i) * sin(Node)
+   a22 = - sin(w) * sin(Node) + cos(w) * cos(i) * cos(Node)
+   a23 =   cos(w) * sin(i)
+   
+   a31 =   sin(i) * sin(Node)
+   a32 = - sin(i) * cos(Node)
+   a33 =   cos(i)
+   
+   matrix(c(a11, a12, a13, a21, a22, a23, a31, a32, a33), nrow = 3, byrow = TRUE)
+ }

First we will try this out for a single object.

> (A = do.call(transformation.matrix, orbits[1, c("w", "Node", "i")]))
          [,1]       [,2]     [,3]
[1,]  0.018683  0.9981068 0.058599
[2,] -0.961656  0.0018992 0.274251
[3,]  0.273620 -0.0614756 0.959871
> #
> # We want to invert this transformation, so we need to invert the matrix. However, since this is a
> # rotation matrix, the inverse and the transpose are the same... (we will use this fact later!)
> #
> A = solve(A)
> #
> A %*% matrix(unlist(orbits[1, c("X", "Y", "Z")]), ncol = 1)
         [,1]
[1,] -1.65031
[2,] -2.97006
[3,]  0.28022

That looks pretty reasonable. Now we need to apply the same transformation systematically to all of the objects.

> require(plyr)
Loading required package: plyr
> 
> orbits = ddply(orbits, .(Object), function(df) {
+   A = transformation.matrix(df$w, df$Node, df$i)
+   #
+   # Invert transformation matrix
+   #
+   A = t(A)
+   #
+   # Transform to reference frame
+   #
+   r = A %*% matrix(unlist(df[, c("X", "Y", "Z")]), ncol = 1)
+   #
+   r = matrix(r, nrow = 1)
+   colnames(r) = c("x", "y", "z")
+   #
+   cbind(df, r)
+ })
> head(orbits)
             Object Epoch      a       e        i       w   Node       M     q.    Q    P    H     MOID class hazard       E  theta        X        Y Z        x        y         z
1  100004 (1983 VA) 56800 2.5959 0.69970 0.284254 0.21050 1.3498 1.40950 0.7795 4.41 4.18 16.4 0.176375   APO  FALSE 2.03512 2.6336 -2.97885  1.65824 0 -1.65031 -2.97006  0.280216
2 100085 (1992 UY4) 56800 2.6398 0.62481 0.048926 0.66896 5.3826 0.22042 0.9904 4.29 4.29 17.8 0.015111   APO   TRUE 0.54358 1.0511  0.60993  1.06602 0  0.83726  0.89659  0.059398
3 100756 (1998 FM5) 56800 2.2686 0.55202 0.201156 5.43895 3.0869 4.82500 1.0163 3.52 3.42 16.1 0.100211   APO  FALSE 4.31582 3.8282 -2.12857 -1.74482 0  2.69099 -0.57127  0.086300
4  100926 (1998 MQ) 56800 1.7827 0.40777 0.422864 2.42053 3.8599 3.42868 1.0558 2.51 2.38 16.7 0.128731   AMO  FALSE 3.34594 3.2744 -2.47255 -0.33031 0 -2.39321 -0.41532 -0.568055
5  101869 (1999 MM) 56800 1.6243 0.61071 0.083167 4.68848 1.9388 0.75843 0.6323 2.62 2.07 19.3 0.001741   APO   TRUE 1.35497 2.0445 -0.64413  1.25637 0 -1.02820  0.96622  0.050998
6 101873 (1999 NC5) 56800 2.0295 0.39332 0.798850 5.15317 2.2489 5.65453 1.2312 2.83 2.89 16.3 0.437678   AMO  FALSE 5.33502 4.9613  0.38531 -1.51575 0  1.29744 -0.50408 -0.713095

Looks good. Time to generate some plots. First we will look at the distribution of objects in the solar-ecliptic plane. Most of them are clustered within a few AU of the Sun, but there are a few stragglers at much larger distances.

> library(ggplot2)
> 
> ggplot(orbits, aes(x = x, y = y)) +
+   geom_point(alpha = 0.25) +
+   xlab("x [AU]") + ylab("y [AU]") +
+   scale_x_continuous(limits = c(-35, 35)) + scale_y_continuous(limits = c(-55, 15)) +
+   theme_classic() +
+   theme(aspect.ratio = 1)

reference-frame-xy-full

We can zoom in on the central region within 5 AU of the Sun. We will also produce a projection onto the x-z plane so that we can see how far these objects lie above or below the ecliptic plane.

> library(grid)
> 
> ggplot(orbits, aes(x = x, y = y)) +
+   geom_point(aes(colour = hazard), alpha = 0.25) +
+   scale_colour_discrete("Hazard", h = c(180, 0), c = 150, l = 40) +
+   xlab("x [AU]") + ylab("y [AU]") +
+   scale_x_continuous(limits = c(-5, 5)) + scale_y_continuous(limits = c(-5, 5)) +
+   annotation_custom(grob = circleGrob(
+     r = unit(0.5, "npc"), gp = gpar(fill = "transparent", col = "black", lwd = 2, lty = "dashed")),
+     xmin=-1, xmax=1, ymin=-1, ymax=1) +
+   theme_classic() +
+   theme(aspect.ratio = 1)
Warning: Removed 17 rows containing missing values (geom_point).
> 
> ggplot(orbits, aes(x = x, y = z)) +
+   geom_point(aes(colour = hazard), alpha = 0.25) +
+   scale_colour_discrete("Hazard", h = c(180, 0), c = 150, l = 40) +
+   xlab("x [AU]") + ylab("z [AU]") +
+   scale_x_continuous(limits = c(-5, 5)) + scale_y_continuous(limits = c(-5, 5)) +
+   theme_classic() +
+   theme(aspect.ratio = 1)
Warning: Removed 7 rows containing missing values (geom_point).

The dashed circle indicates the orbit of the Earth. Most of the objects are clustered around this orbit.

reference-frame-xy-zoom

Although there are numerous objects which lie a few AU on either side of the ecliptic plane, the vast majority are (like the planets) very close to or on the plane itself.

reference-frame-xz-zoom

Finally a three dimensional visualisation. I would have liked to rotate this plot to try and get a better perspective, but I could not manage that with scatterplot3d. It’s probably not the right tool for that particular job.

> library(scatterplot3d)
> 
> orbits.3d <- with(orbits, scatterplot3d(x, y, z, xlim = c(-5, 5), ylim = c(-5, 5), highlight.3d = TRUE))
> orbits.3d$plane3d(c(0, 0, 0))

reference-frame-zoom-scatter-3d
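
One possible alternative (a suggestion only, not part of the original analysis) is the rgl package, which produces an interactive plot that can be rotated freely:

library(rgl)
#
with(orbits, plot3d(x, y, z, xlim = c(-5, 5), ylim = c(-5, 5), zlim = c(-5, 5),
                    col = ifelse(hazard, "red", "blue"), size = 2))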

I have enjoyed working with these data. My original plan was to leave the project at this point. But, since the data are neatly classified according to two different schemes, it presents a good opportunity to try out some classification models. More on that shortly.

Hazardous and Benign Space Objects: Solving Kepler’s Equation

Following on from my previous post about Near Earth Objects, today we are going to solve Kepler’s Equation to find the eccentric anomaly, which is the next step towards plotting the positions of these NEOs relative to Earth.

The Eccentric, True and Mean Anomalies

The relationship between the eccentric and true anomalies is depicted in the figure below (courtesy of Wikipedia). We are thinking about the object located at P, which is moving along an elliptical orbit (the red curve). The true anomaly, \(\theta\), is the angular position of P measured around the focus of the orbit (the location of the Sun in this case) relative to the direction of periapsis (the extension to the right of the line CF). The eccentric anomaly, \(E\), is measured relative to the same line, but the angle is taken at the centre of the ellipse. The mean anomaly, \(M\), does not have a direct geometric interpretation. Although it has a value between 0 and \(2\pi\), it is not an angle. The mean anomaly relates position and time via Kepler’s Second Law, and is proportional to the area swept out by the line FP.

Eccentric_and_true_anomaly

Kepler’s Equation

The eccentric anomaly, mean anomaly and orbit eccentricity are related by Kepler’s Equation:

\[ M = E - e \sin E. \]

This is a transcendental equation, meaning that it does not have an analytical solution. So we will need to use numerical techniques to solve it.
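
Newton-Raphson is a natural choice: writing \(f(E) = E - e \sin E - M\), each iteration updates the estimate via

\[ E_{k+1} = E_k - \frac{E_k - e \sin E_k - M}{1 - e \cos E_k}, \]

which typically converges in a handful of iterations for the eccentricities in our data.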

Solving Kepler’s Equation

Recall that our NEO orbital data look like this:

> head(orbits)
             Object Epoch        a         e          i         w     Node         M     q     Q    P    H     MOID class hazard
1  100004 (1983 VA) 56800 2.595899 0.6997035 0.28425360 0.2105033 1.349791 1.4094981 0.7795 4.41 4.18 16.4 0.176375   APO  FALSE
2 100085 (1992 UY4) 56800 2.639837 0.6248138 0.04892634 0.6689567 5.382550 0.2204248 0.9904 4.29 4.29 17.8 0.015111   APO   TRUE
3 100756 (1998 FM5) 56800 2.268602 0.5520201 0.20115638 5.4389541 3.086864 4.8250011 1.0163 3.52 3.42 16.1 0.100211   APO  FALSE
4  100926 (1998 MQ) 56800 1.782705 0.4077730 0.42286374 2.4205309 3.859910 3.4286821 1.0558 2.51 2.38 16.7 0.128731   AMO  FALSE
5  101869 (1999 MM) 56800 1.624303 0.6107117 0.08316718 4.6884786 1.938777 0.7584286 0.6323 2.62 2.07 19.3 0.001741   APO   TRUE
6 101873 (1999 NC5) 56800 2.029466 0.3933192 0.79884961 5.1531708 2.248914 5.6545349 1.2312 2.83 2.89 16.3 0.437678   AMO  FALSE

We are going to use the Newton-Raphson method via nleqslv() to solve Kepler’s Equation for each record.

> library(nleqslv)
> library(plyr)
> 
> # Solve Kepler's Equation to find the eccentric anomaly, E. This is a transcendental equation so we need to solve it
> # numerically using the Newton-Raphson method.
> #
> orbits = ddply(orbits, .(Object), mutate, E = {
+   kepler <- function(E) {
+     M - E + e * sin(E)
+   }
+   
+   nleqslv(0, kepler, method = "Newton")$x
+ })
> head(orbits)[, c("Object", "E")]
             Object        E
1         (1979 XB) 5.632016
2         (1982 YA) 3.252744
3         (1983 LC) 2.280627
4         (1985 WA) 5.814292
5         (1986 NA) 2.591137
6        (1987 SF3) 5.268315

Plotting the Results

We will plot the resulting relationship between the mean and eccentric anomalies using colour to indicate eccentricity.

library(ggplot2)
library(scales)

degrees <- function(x, ...) {parse(text = paste0(x, "*degree"))}

ggplot(orbits, aes(x = M / pi * 180, y = E / pi * 180, colour = e)) +
  geom_point() +
  scale_colour_gradient("Eccentricity (e)", limits=c(0, 1), high = "red", low = "blue") +
  xlab("Mean Anomaly (M)") + ylab("Eccentric Anomaly (E)") +
  scale_x_continuous(labels = degrees, breaks = seq(0, 360, 90)) +
  scale_y_continuous(labels = degrees, breaks = seq(0, 360, 90)) +
  theme_classic()

mean-eccentric-anomaly

As one would expect, when the eccentricity is zero (a circular orbit) the two anomalies are identical, so the relationship is linear.

Finding the True Anomaly

Now it is a simple matter to get the true anomaly as well.

> orbits = transform(orbits, theta = 2 * atan2(sqrt(1 + e) * sin (E/2), sqrt(1 - e) * cos (E/2)))
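
The conversion used above is the standard relation

\[ \theta = 2 \arctan\!\left( \sqrt{\frac{1 + e}{1 - e}} \, \tan \frac{E}{2} \right), \]

written in terms of atan2() so that \(\theta\) lands in the correct quadrant.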

eccentric-true-anomaly

The Next Step

Now that we have the eccentric anomaly, we can calculate the Cartesian location of the objects in their respective orbital planes. We will then transform from the numerous orbital planes to a single reference frame.