International Open Data Day

As part of International Open Data Day we spent the morning with a group of like-minded people poring over some open Census South Africa data. Excellent initiative, @opendatadurban, I’m very excited to see where this is all going and look forward to contributing to the journey!


The data above show the distribution of ages in a segment of the South African population who have either no schooling (blue) or have completed Grade 12 (orange). Click the image to access the interactive plot. Being a subset of the complete data set, this does not tell the full story, but it’s still a rather interesting view on the state of education in South Africa.

R, HDF5 Data and Lightning

I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and LIS/OTD being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since that was the format used for the LIS/OTD data, I went to review those old scripts. It was very pleasant rediscovering work I did some time ago.


The Optical Transient Detector (OTD) and Lightning Imaging Sensor (LIS) were instruments for detecting lightning discharges from Low Earth Orbit. OTD was launched in 1995 on the MicroLab-1 satellite into a near-polar orbit with inclination 70°. OTD achieved global (spatial) coverage for the period May 1995 to April 2000 with roughly 60% uptime. LIS was an instrument on the TRMM satellite, launched into a 35° inclination orbit during 1997. Data from LIS were thus confined to more tropical latitudes. The TRMM mission only ended in April 2015.

The seminal work using data from OTD, Global frequency and distribution of lightning as observed from space by the Optical Transient Detector, was published by Hugh Christian and his collaborators in 2003. It’s open access and well worth a read if you are interested in where and when lightning happens across the Earth’s surface.

Preprocessing HDF4 to HDF5

The LIS/OTD data are available as HDF4 files. To load them into R I first converted to HDF5 using a tool from the h5utils suite:

$ h5fromh4 -d lrfc LISOTD_LRFC_V2.3.2014.hdf

Loading HDF5 in R

Support for HDF5 in R appears to have evolved appreciably in recent years. I originally used the hdf5 package. Then some time later transitioned to the h5r package. Neither of these appear on CRAN at present. Current support for HDF5 is via the h5 package. This package depends on the h5c++ library, which I needed to grab.

$ sudo apt-get install libhdf5-dev

Then, back in R I installed and loaded the h5 package.

> install.packages("h5")
> library(h5)

Ready to roll!

Low Resolution Full Climatology

The next step was to interrogate the contents of one of the HDF files. A given file may contain multiple data sets (this is part of the “hierarchical” nature of HDF), so we’ll check on what data sets are packed into one of those files. Let’s look at the Low Resolution Full Climatology (LRFC) data.

> file = h5file(name = "data/LISOTD_LRFC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> cat("datasets:", dataset, "\n")
datasets: /lrfc

Just a single data set, but that’s by design: we only extracted one using h5fromh4 above. What are the characteristics of that data set?

> print(file[dataset])
DataSet 'lrfc' (72 x 144)
type: numeric
chunksize: NA
maxdim: 72 x 144

It contains numerical data and has dimensions 72 by 144, which means that it has been projected onto a latitude/longitude grid with 2.5° resolution. We’ll just go ahead and read in those data.

> lrfc = readDataSet(file[dataset])
> class(lrfc)
[1] "matrix"
> dim(lrfc)
[1]  72 144

That wasn’t too hard. And it’s not much more complicated if there are multiple data sets per file.

Below is a ggplot showing the annualised distribution of lightning across the Earth’s surface. It’s apparent that most lightning occurs over land in the tropics, with the highest concentration in Central Africa. The units of the colour scale are flashes per square km per year. Higher resolution data can be found in the HRFC file (High Resolution Full Climatology), but the LRFC is quite sufficient to get a flavour of the data.
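To produce a map like that, the matrix needs to be reshaped into long format for ggplot2. Here’s a minimal sketch. It assumes that rows span latitude and columns span longitude on the 2.5° grid, with cell centres offset by 1.25°; a synthetic matrix stands in for the real lrfc read with readDataSet() above.

```r
library(ggplot2)

# Synthetic stand-in for the 72 x 144 LRFC matrix (replace with the real data).
lrfc <- matrix(runif(72 * 144), nrow = 72)

# Centres of the 2.5 degree grid cells (orientation assumed; check file metadata).
grid <- expand.grid(
  lat = seq(-88.75, 88.75, by = 2.5),
  lon = seq(-178.75, 178.75, by = 2.5)
)
# as.vector() unrolls the matrix column by column, so latitude varies fastest,
# matching the ordering produced by expand.grid() above.
grid$rate <- as.vector(lrfc)

ggplot(grid, aes(x = lon, y = lat, fill = rate)) +
  geom_raster() +
  labs(fill = "flashes/km²/yr")
```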


Low Resolution Annual Climatology

The Low Resolution Annual Climatology (LRAC) data have the same spatial resolution as LRFC but the data are broken down by day of year. This allows us to see how lightning activity varies at a particular location during the course of a year.

> file = h5file(name = "data/LISOTD_LRAC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> lrac = readDataSet(file[dataset])
> dim(lrac)
[1]  72 144 365

The data are now packed into a three-dimensional array, where the first two dimensions are spatial (as for LRFC) and the third dimension corresponds to day of year.

We’ll look at two specific grid cells, one centred at 28.75° S 28.75° E (near the northern border of Lesotho, which according to the plot above is a region of relatively intense lightning activity) and the other at 31.25° S 31.25° E (in the Indian Ocean, just off the coast of Margate, South Africa). The annualised time series are plotted below. Lesotho has a clear annual cycle, with peak lightning activity during the Summer months but extending well into Spring and Autumn. There is very little lightning activity in Lesotho during Winter due to extremely dry and cold conditions. The cell over the Indian Ocean has a relatively high level of sustained lightning activity throughout the year. This is due to the presence of the warm Agulhas Current flowing down the east coast of South Africa. We wrote a paper about this, Processes driving thunderstorms over the Agulhas Current. It’s also open access, so go ahead and check it out.
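Pulling out the daily series for a given grid cell is just a matter of converting coordinates to array indices. A sketch, assuming the array’s first dimension runs from 90° S to 90° N and the second from 180° W to 180° E in 2.5° cells (worth confirming against the file metadata); a synthetic array stands in for the real lrac.

```r
# Synthetic stand-in for the 72 x 144 x 365 LRAC array.
lrac <- array(runif(72 * 144 * 365), dim = c(72, 144, 365))

# Map a coordinate to its 2.5 degree cell index, counting from the grid origin.
cell_index <- function(coord, origin) {
  floor((coord - origin) / 2.5) + 1
}

i <- cell_index(-28.75, -90)    # latitude index for 28.75 S
j <- cell_index(28.75, -180)    # longitude index for 28.75 E

lesotho <- lrac[i, j, ]          # daily flash rate series for that cell
length(lesotho)                  # 365
```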


Although the data in the plot above are rather jagged, with some aggregation and mild filtering they become pleasingly smooth and regular. We observed that you could actually fit the resulting curves rather well with just a pair of sinusoids. That work was documented in A harmonic model for the temporal variation of lightning activity over Africa. Like the other two papers, it’s open access. Enjoy (if that sort of thing is your cup of tea).
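A fit of that general kind can be done with ordinary least squares, because a sinusoid of unknown phase is linear in its sine and cosine coefficients. This is a sketch on a synthetic daily series, not the model from the paper: annual plus semi-annual harmonics via lm().

```r
# Synthetic daily lightning series with annual and semi-annual components.
set.seed(13)
day <- 1:365
rate <- 5 + 3 * sin(2 * pi * day / 365 - 1) + sin(4 * pi * day / 365) +
  rnorm(365, sd = 0.3)

# A pair of sinusoids, each expressed as a sin/cos pair so the phase is
# absorbed into the coefficients and the model stays linear.
fit <- lm(rate ~ sin(2 * pi * day / 365) + cos(2 * pi * day / 365) +
                 sin(4 * pi * day / 365) + cos(4 * pi * day / 365))

summary(fit)$r.squared
```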


My original intention with this post was to show how to handle HDF data in R. But in retrospect it has achieved a second objective, showing that it’s possible to do some meaningful Science with data that’s in the public domain.

Ira Glass on the Creative Process

Nobody tells this to people who are beginners, I wish someone told me. All of us who do creative work, we get into it because we have good taste. But there is this gap. For the first couple years you make stuff, it’s just not that good. It’s trying to be good, it has potential, but it’s not. But your taste, the thing that got you into the game, is still killer. And your taste is why your work disappoints you. A lot of people never get past this phase, they quit. Most people I know who do interesting, creative work went through years of this. We know our work doesn’t have this special thing that we want it to have. We all go through this. And if you are just starting out or you are still in this phase, you gotta know it’s normal and the most important thing you can do is do a lot of work. Put yourself on a deadline so that every week you will finish one story. It is only by going through a volume of work that you will close that gap, and your work will be as good as your ambitions. And I took longer to figure out how to do this than anyone I’ve ever met. It’s gonna take awhile. It’s normal to take awhile. You’ve just gotta fight your way through.

Ira Glass

Automating R scripts under Windows

Setting up an automated job under Linux is a cinch thanks to cron. Doing the same under Windows is a little more tricky, but still eminently doable.
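For comparison, under Linux a single crontab entry does the job. The schedule and path below are purely illustrative:

```
# Run the script every morning at 06:00.
0 6 * * * Rscript /home/andrew/projects/script.R
```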

I found an online tutorial that got me 99% of the way there. The missing piece was a batch file to trigger the script. The critical element was changing to the correct location before invoking R.

@echo off
cd "C:\Users\AndrewCo\Projects\"
R CMD BATCH script.R
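With the batch file saved (say as run-script.bat, a name chosen here for illustration), it can be registered with the Windows Task Scheduler. One way is via schtasks on the command line; the task name and schedule below are illustrative:

```
schtasks /Create /SC DAILY /ST 06:00 /TN "DailyRScript" /TR "C:\Users\AndrewCo\Projects\run-script.bat"
```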

Durban R Users Group Meetup: 24 February 2016 @ The Green Door

We’re kicking off the inaugural meeting of the Durban R Users Group with a live video presentation by Andrie de Vries (Senior Programme Manager, R Community Projects at Microsoft / Revolution Analytics). Andrie will be talking about “Demonstration of using R in the cloud together with Azure Machine Learning”. If you’ve kept up with Microsoft’s recent contributions to R then you’ll know that this is going to be an interesting talk.

We’ll also have Professor Bruce Page (University of KwaZulu-Natal, School of Biological and Conservation Sciences) talking about “An Introduction to Spatial Analysis in R”, which will cover using the T-LoCoH package for tracking space/time data. Bruce is a recognised expert on monitoring the movements and changes in home ranges of wild animals.

Andrew Collier will give a talk entitled “From Klueless to Kaggle in 15 Minutes” which will demonstrate how you can rapidly get up to speed on Data Analysis and Predictive Modelling in R.

Get the details and sign up here.

flipsideR: Support for ASX Option Chain Data

I previously wrote about some ad hoc R code for downloading Option Chain data from Google Finance. I finally wrapped it up into a package called flipsideR, which is now available via GitHub. Since I last wrote on this topic I’ve also added support for downloading option data from the Australian Securities Exchange (ASX).


Installation is straightforward using devtools.

> library(devtools)
> install_github('DataWookie/flipsideR')

You’re ready to roll!


As I mentioned previously, there’s already functionality in quantmod for retrieving option chain data.

> library(quantmod)
> AAPL = getOptionChain('AAPL')
> head(AAPL$calls)
                    Strike  Last  Chg   Bid   Ask Vol OI
AAPL160205C00070000     70 25.71 2.06 26.95 27.25   1 53
AAPL160205C00075000     75 20.50 0.00 21.95 22.25   1  2
AAPL160205C00080000     80 15.65 1.85 16.95 17.25  75 50
AAPL160205C00085000     85 10.61 1.42 11.95 12.25 151 44
AAPL160205C00086000     86 10.72 3.47 10.95 11.25  99 52
AAPL160205C00087000     87 10.30 0.00  9.95 10.25   1  1
> head(AAPL$puts)
                    Strike Last   Chg  Bid  Ask  Vol   OI
AAPL160205P00070000   70.0 0.01  0.00 0.00 0.02   13  428
AAPL160205P00075000   75.0 0.02  0.00 0.00 0.02    2  464
AAPL160205P00080000   80.0 0.02 -0.02 0.00 0.04  156 2774
AAPL160205P00083000   83.0 0.01 -0.06 0.01 0.05  102  625
AAPL160205P00085000   85.0 0.03 -0.07 0.03 0.04 1390 2999
AAPL160205P00085500   85.5 0.05 -0.15 0.02 0.06   10  248
> detach('package:quantmod', unload = TRUE)

The data that you’ll get with flipsideR are pretty similar. The major differences are:

  • it’s all in a single data frame (as opposed to being split into separate frames for put and call options);
  • there’s additional information regarding the expiry date for the option and the date and time at which these data were retrieved.

Let’s grab the AAPL data using flipsideR.

> library(flipsideR)
> AAPL = getOptionChain('AAPL')   
> head(AAPL)
  symbol type     expiry strike premium   bid   ask volume open.interest           retrieved
1   AAPL Call 2016-02-05     50      NA 46.95 47.25     NA             0 2016-01-31 06:03:30
2   AAPL Call 2016-02-05     55   42.05 41.95 42.25      9             0 2016-01-31 06:03:30
3   AAPL Call 2016-02-05     60      NA 36.95 37.25     NA             0 2016-01-31 06:03:30
4   AAPL Call 2016-02-05     65      NA 31.95 32.25     NA             0 2016-01-31 06:03:30
5   AAPL Call 2016-02-05     70   25.71 26.95 27.25     NA            53 2016-01-31 06:03:30
6   AAPL Call 2016-02-05     75   20.50 21.95 22.25     NA             2 2016-01-31 06:03:30
> tail(AAPL)
     symbol type     expiry strike premium   bid   ask volume open.interest           retrieved
1225   AAPL  Put 2018-01-19    155   58.30 60.15 62.50     NA            71 2016-01-31 06:03:46
1226   AAPL  Put 2018-01-19    160   63.95 64.30 66.70     NA            84 2016-01-31 06:03:46
1227   AAPL  Put 2018-01-19    165   71.75 68.95 71.35     17           197 2016-01-31 06:03:46
1228   AAPL  Put 2018-01-19    170   73.35 72.50 77.00     NA          3022 2016-01-31 06:03:46
1229   AAPL  Put 2018-01-19    175   66.55 77.30 81.35     NA            68 2016-01-31 06:03:46
1230   AAPL  Put 2018-01-19    180   88.00 82.00 86.80     NA          1074 2016-01-31 06:03:46
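Having everything in one data frame makes downstream manipulation simple with plain subsetting. A sketch, using a tiny synthetic frame with the same columns as the output above (the mid-price column is my own addition, not something flipsideR returns):

```r
# Tiny synthetic option chain mimicking the flipsideR column layout.
chain <- data.frame(
  type   = c("Call", "Call", "Put", "Put"),
  strike = c(70, 75, 70, 75),
  bid    = c(26.95, 21.95, 0.00, 0.00),
  ask    = c(27.25, 22.25, 0.02, 0.02)
)

# Pull out the calls and compute a mid-market premium.
calls <- subset(chain, type == "Call")
calls$mid <- (calls$bid + calls$ask) / 2
calls
```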

The AAPL data were retrieved from the default exchange, NASDAQ. However, it’s also possible to specify an alternative exchange. For example, CVX data from the NYSE.

> CVX = getOptionChain('CVX', 'NYSE')

Finally, it’s also now possible to grab data from ASX.

> OZL = getOptionChain('OZL', 'ASX')