As part of International Open Data Day we spent the morning with a bunch of like-minded people poring over some open South African Census data. Excellent initiative, @opendatadurban, I’m very excited to see where this is all going and look forward to contributing to the journey!
The data above show the distribution of ages in a segment of the South African population who have either no schooling (blue) or have completed Grade 12 (orange). Click the image to access the interactive plot. Being a subset of the complete data set, this does not tell the full story, but it’s still a rather interesting view on the state of education in South Africa.
I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and LIS/OTD being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since that was the format used for the LIS/OTD data, I went to review those old scripts. It was very pleasant rediscovering work I did some time ago.
LIS and OTD
The Optical Transient Detector (OTD) and Lightning Imaging Sensor (LIS) were instruments for detecting lightning discharges from Low Earth Orbit. OTD was launched in 1995 on the MicroLab-1 satellite into a near polar orbit with inclination 70°. OTD achieved global (spatial) coverage for the period May 1995 to April 2000 with roughly 60% uptime. LIS was an instrument on the TRMM satellite, launched into a 35° inclination orbit during 1997. Data from LIS were thus confined to more tropical latitudes. The TRMM mission only ended in April 2015.
The seminal work using data from OTD, Global frequency and distribution of lightning as observed from space by the Optical Transient Detector, was published by Hugh Christian and his collaborators in 2003. It’s open access and well worth a read if you are interested in where and when lightning happens across the Earth’s surface.
Preprocessing HDF4 to HDF5
The LIS/OTD data are available as HDF4 files from http://thunder.msfc.nasa.gov/data/. To load them into R I first converted to HDF5 using a tool from the h5utils suite:
$ h5fromh4 -d lrfc LISOTD_LRFC_V2.3.2014.hdf
Loading HDF5 in R
Support for HDF5 in R has evolved appreciably in recent years. I originally used the hdf5 package, then some time later transitioned to the h5r package. Neither of these appears on CRAN at present. Current support for HDF5 is via the h5 package, which depends on the h5c++ library, so I needed to install that first.
$ sudo apt-get install libhdf5-dev
Then, back in R I installed and loaded the h5 package.
> install.packages("h5")
> library(h5)
Ready to roll!
Low Resolution Full Climatology
The next step was to interrogate the contents of one of the HDF files. A given file may contain multiple data sets (this is part of the “hierarchical” nature of HDF), so we’ll check on what data sets are packed into one of those files. Let’s look at the Low Resolution Full Climatology (LRFC) data.
> file = h5file(name = "data/LISOTD_LRFC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> cat("datasets:", dataset, "\n")
datasets: /lrfc
Just a single data set, but that’s by design: we only extracted one using h5fromh4 above. What are the characteristics of that data set?
> print(file[dataset])
DataSet 'lrfc' (72 x 144)
type: numeric
chunksize: NA
maxdim: 72 x 144
It contains numerical data and has dimensions 72 by 144, which means that it has been projected onto a latitude/longitude grid with 2.5° resolution. We’ll just go ahead and read in those data.
> lrfc = readDataSet(file[dataset])
> class(lrfc)
[1] "matrix"
> dim(lrfc)
[1]  72 144
That wasn’t too hard. And it’s not much more complicated if there are multiple data sets per file.
Below is a ggplot showing the annualised distribution of lightning across the Earth’s surface. It’s apparent that most lightning occurs over land in the tropics, with the highest concentration in Central Africa. The units of the colour scale are flashes per square km per year. Higher resolution data can be found in the HRFC (High Resolution Full Climatology) file, but the LRFC is quite sufficient to get a flavour of the data.
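A map along these lines can be generated with ggplot2. The sketch below assumes that the rows of lrfc are latitude and the columns longitude, and that cell centres run from −88.75° to 88.75° and −178.75° to 178.75° on the 2.5° grid — worth checking against the LIS/OTD documentation before trusting the orientation.

```r
library(ggplot2)

# Build a data frame of grid cell centres. expand.grid() varies its first
# argument fastest, which matches column-major as.vector() on the matrix
# (rows, i.e. latitude, vary fastest).
grid <- expand.grid(
  lat = seq(-88.75, 88.75, by = 2.5),
  lon = seq(-178.75, 178.75, by = 2.5)
)
grid$rate <- as.vector(lrfc)

ggplot(grid, aes(x = lon, y = lat, fill = rate)) +
  geom_raster() +
  scale_fill_gradient(name = "flashes/km²/yr", low = "white", high = "red") +
  coord_fixed() +
  theme_minimal()
```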
Low Resolution Annual Climatology
The Low Resolution Annual Climatology (LRAC) data have the same spatial resolution as the LRFC but are broken down by day of year. This allows us to see how lightning activity varies at a particular location during the course of a year. After converting the LRAC file from HDF4 in the same way as before, we load it:

> file = h5file(name = "data/LISOTD_LRAC_V2.3.2014.h5", mode = "r")
> dataset = list.datasets(file)
> lrac = readDataSet(file[dataset])
> dim(lrac)
[1]  72 144 365
The data are now packed into a three-dimensional array, where the first two dimensions are spatial (as for the LRFC) and the third corresponds to day of year.
We’ll look at two specific grid cells, one centred at 28.75° S 28.75° E (near the northern border of Lesotho, which according to the plot above is a region of relatively intense lightning activity) and the other at 31.25° S 31.25° E (in the Indian Ocean, just off the coast of Margate, South Africa). The annualised time series are plotted below. Lesotho has a clear annual cycle, with peak lightning activity during the Summer months but extending well into Spring and Autumn. There is very little lightning activity in Lesotho during Winter due to extremely dry and cold conditions. The cell over the Indian Ocean has a relatively high level of sustained lightning activity throughout the year. This is due to the presence of the warm Agulhas Current flowing down the east coast of South Africa. We wrote a paper about this, Processes driving thunderstorms over the Agulhas Current. It’s also open access, so go ahead and check it out.
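The time series for individual cells can be pulled out by converting coordinates to grid indices. This is a sketch, assuming lrac as read above and the same grid layout guessed at for the LRFC (2.5° cells with centres from −88.75° latitude and −178.75° longitude):

```r
# Map a (lat, lon) pair to the nearest grid cell indices.
cell.index <- function(lat, lon) {
  c(which.min(abs(seq(-88.75, 88.75, by = 2.5) - lat)),
    which.min(abs(seq(-178.75, 178.75, by = 2.5) - lon)))
}

lesotho <- cell.index(-28.75, 28.75)   # northern border of Lesotho
ocean   <- cell.index(-31.25, 31.25)   # Indian Ocean, off Margate

# Plot the two annual cycles on the same axes.
plot(lrac[lesotho[1], lesotho[2], ], type = "l",
     xlab = "Day of Year", ylab = "Flash Rate Density")
lines(lrac[ocean[1], ocean[2], ], col = "blue")
```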
Although the data in the plot above are rather jagged, with some aggregation and mild filtering they become pleasingly smooth and regular. We observed that you could actually fit the resulting curves rather well with just a pair of sinusoids. That work was documented in A harmonic model for the temporal variation of lightning activity over Africa. Like the other two papers, it’s open access. Enjoy (if that sort of thing is your cup of tea).
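For a flavour of what such a fit looks like (this is an illustration of the idea, not the exact model from the paper), a pair of sinusoids — annual and semi-annual harmonics — can be fit to a daily series with ordinary least squares:

```r
# Fit annual and semi-annual harmonics to a daily time series y.
harmonic.fit <- function(y) {
  t <- seq_along(y) / length(y)          # time as fraction of a year
  s1 <- sin(2 * pi * t); c1 <- cos(2 * pi * t)
  s2 <- sin(4 * pi * t); c2 <- cos(4 * pi * t)
  lm(y ~ s1 + c1 + s2 + c2)
}

# Applied to a smoothed daily series, the fit can be overlaid with
# lines(fitted(harmonic.fit(series)), col = "red").
```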
My original intention with this post was to show how to handle HDF data in R. But in retrospect it has achieved a second objective, showing that it’s possible to do some meaningful Science with data that’s in the public domain.
Stephen Lund combines two of my passions: technology and exercise. Awesome. Durban Doodles coming soon.
Nobody tells this to people who are beginners, I wish someone told me. All of us who do creative work, we get into it because we have good taste. But there is this gap. For the first couple years you make stuff, it’s just not that good. It’s trying to be good, it has potential, but it’s not. But your taste, the thing that got you into the game, is still killer. And your taste is why your work disappoints you. A lot of people never get past this phase, they quit. Most people I know who do interesting, creative work went through years of this. We know our work doesn’t have this special thing that we want it to have. We all go through this. And if you are just starting out or you are still in this phase, you gotta know it’s normal and the most important thing you can do is do a lot of work. Put yourself on a deadline so that every week you will finish one story. It is only by going through a volume of work that you will close that gap, and your work will be as good as your ambitions. And I took longer to figure out how to do this than anyone I’ve ever met. It’s gonna take a while. It’s normal to take a while. You’ve just gotta fight your way through.

— Ira Glass
Setting up an automated job under Linux is a cinch thanks to cron. Doing the same under Windows is a little more tricky, but still eminently doable.
I found this tutorial helpful:
That got me 99% of the way there. I wrote a batch file to trigger the script. The critical element was changing to the correct location before invoking R.
@echo off
cd "C:\Users\AndrewCo\Projects\"
R CMD BATCH script.R
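With the batch file in place, it can be registered with the Windows Task Scheduler from the command line using schtasks. The task name, batch file name and schedule below are purely illustrative:

```shell
schtasks /create /tn "Daily R Script" /tr "C:\Users\AndrewCo\Projects\run-script.bat" /sc daily /st 06:00
```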
We’re kicking off the inaugural meeting of the Durban R Users Group with a live video presentation by Andrie de Vries (Senior Programme Manager, R Community Projects at Microsoft / Revolution Analytics). Andrie will be talking about “Demonstration of using R in the cloud together with Azure Machine Learning”. If you’ve kept up with Microsoft’s recent contributions to R then you’ll know that this is going to be an interesting talk.
We’ll also have Professor Bruce Page (University of KwaZulu-Natal, School of Biological and Conservation Sciences) talking about “An Introduction to Spatial Analysis in R”, which will cover using the T-LoCoH package for analysing space/time tracking data. Bruce is a recognised expert on monitoring the movements and changes in home ranges of wild animals.
Andrew Collier will give a talk entitled “From Klueless to Kaggle in 15 Minutes” which will demonstrate how you can rapidly get up to speed on Data Analysis and Predictive Modelling in R.
Get the details and sign up here.
I previously wrote about some ad hoc R code for downloading Option Chain data from Google Finance. I finally wrapped it up into a package called flipsideR, which is now available via GitHub. Since I last wrote on this topic I’ve also added support for downloading option data from the Australian Securities Exchange (ASX).
Installation is straightforward using devtools.
> library(devtools)
> install_github('DataWookie/flipsideR')
You’re ready to roll!
As I mentioned previously, there’s already functionality in quantmod for retrieving option chain data.
> library(quantmod)
> AAPL = getOptionChain('AAPL')
> head(AAPL$calls)
                    Strike  Last  Chg   Bid   Ask Vol OI
AAPL160205C00070000     70 25.71 2.06 26.95 27.25   1 53
AAPL160205C00075000     75 20.50 0.00 21.95 22.25   1  2
AAPL160205C00080000     80 15.65 1.85 16.95 17.25  75 50
AAPL160205C00085000     85 10.61 1.42 11.95 12.25 151 44
AAPL160205C00086000     86 10.72 3.47 10.95 11.25  99 52
AAPL160205C00087000     87 10.30 0.00  9.95 10.25   1  1
> head(AAPL$puts)
                    Strike Last   Chg  Bid  Ask  Vol   OI
AAPL160205P00070000   70.0 0.01  0.00 0.00 0.02   13  428
AAPL160205P00075000   75.0 0.02  0.00 0.00 0.02    2  464
AAPL160205P00080000   80.0 0.02 -0.02 0.00 0.04  156 2774
AAPL160205P00083000   83.0 0.01 -0.06 0.01 0.05  102  625
AAPL160205P00085000   85.0 0.03 -0.07 0.03 0.04 1390 2999
AAPL160205P00085500   85.5 0.05 -0.15 0.02 0.06   10  248
> detach('package:quantmod', unload = TRUE)
The data that you’ll get with flipsideR is pretty similar. The major differences are:
- it’s all in a single data frame (as opposed to being split into separate frames for put and call options);
- there’s additional information regarding the expiry date for the option and the date and time at which these data were retrieved.
Let’s grab the AAPL data using flipsideR.
> library(flipsideR)
> AAPL = getOptionChain('AAPL')
> head(AAPL)
  symbol type     expiry strike premium   bid   ask volume open.interest           retrieved
1   AAPL Call 2016-02-05     50      NA 46.95 47.25     NA             0 2016-01-31 06:03:30
2   AAPL Call 2016-02-05     55   42.05 41.95 42.25      9             0 2016-01-31 06:03:30
3   AAPL Call 2016-02-05     60      NA 36.95 37.25     NA             0 2016-01-31 06:03:30
4   AAPL Call 2016-02-05     65      NA 31.95 32.25     NA             0 2016-01-31 06:03:30
5   AAPL Call 2016-02-05     70   25.71 26.95 27.25     NA            53 2016-01-31 06:03:30
6   AAPL Call 2016-02-05     75   20.50 21.95 22.25     NA             2 2016-01-31 06:03:30
> tail(AAPL)
     symbol type     expiry strike premium   bid   ask volume open.interest           retrieved
1225   AAPL  Put 2018-01-19    155   58.30 60.15 62.50     NA            71 2016-01-31 06:03:46
1226   AAPL  Put 2018-01-19    160   63.95 64.30 66.70     NA            84 2016-01-31 06:03:46
1227   AAPL  Put 2018-01-19    165   71.75 68.95 71.35     17           197 2016-01-31 06:03:46
1228   AAPL  Put 2018-01-19    170   73.35 72.50 77.00     NA          3022 2016-01-31 06:03:46
1229   AAPL  Put 2018-01-19    175   66.55 77.30 81.35     NA            68 2016-01-31 06:03:46
1230   AAPL  Put 2018-01-19    180   88.00 82.00 86.80     NA          1074 2016-01-31 06:03:46
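Because everything lives in a single data frame, standard R subsetting applies directly. A couple of illustrative examples:

```r
# Pull out the put options that have a recorded volume.
puts <- subset(AAPL, type == "Put" & !is.na(volume))

# Or recover a quantmod-style split into separate call and put frames.
chain <- split(AAPL, AAPL$type)
```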
The AAPL data were retrieved from the default exchange, NASDAQ. However, it’s also possible to specify an alternative exchange. For example, CVX data from the NYSE.
> CVX = getOptionChain('CVX', 'NYSE')
Finally, it’s also now possible to grab data from ASX.
> OZL = getOptionChain('OZL', 'ASX')