Hazardous and Benign Space Objects: Solving Kepler's Equation
Following on from my previous post about Near Earth Objects, today we are going to solve Kepler’s Equation to find the eccentric anomaly, which is the next step towards plotting the positions of these NEOs relative to Earth.Read more
Comrades Marathon: Negative Splits and Cheating
With this year’s Comrades Marathon just less than a month away, I was reminded of a story from earlier in the year. Mark Dowdeswell, a statistician at Wits University, found evidence of cheating by some middle and back of the pack Comrades runners. He identified a group of 20 athletes who had suspicious negative splits: they ran much faster in the second half of the race. There was one runner in particular whose splits were just too good to be true. When the story was publicised, this particular runner claimed that it was a conspiracy.Read more
Hazardous and Benign Space Objects: Getting the Data
The recent story about a skydiver nearly being hit by falling meteor got me thinking about all the pieces of rock floating around in near-Earth space. Despite the fact that the supposed meteor was probably just a chunk of rock mistakenly packed in with a parachute, the fact that something like that could actually happen is quite intriguing. And not a little frightening.Read more
R Interface to Myfxbook
Myfxbook provides an interface to your FOREX trading accounts as well as an active trading community.Read more
Earthquakes: Land / Ocean Distribution
The next stage in my earthquake analysis project is to partition the events into groups with epicentre over land or water.Read more
Largest Volcanoes in Recorded History (and other statistics)
Around 199 years ago the largest volcano in recorded history, Mount Tambora, erupted, spewing an enormous volume of molten rock and ash into the atmosphere and onto the surrounding land.Read more
Earthquakes: Magnitude / Depth Chart
I am working on a project related to secondary effects of earthquakes. To guide me in the analysis I need a chart showing the location, magnitude and depth of recent earthquakes. There are a host of such charts available already, but since I had the required data on hand, it seemed like a good idea to take a stab at it myself.Read more
Daylight Saving Effect on Financial Indices
Does the transition to and from Daylight Saving Time (DST) have a (significant) effect on the stock market?Read more
Filtering Data with L1 Regularisation
Filtering Data with L2 Regularisation
I have just finished reading Momentum Strategies with L1 Filter by Tung-Lam Dao. The smoothing results presented in this paper are interesting and I thought it would be cool to implement the L1 and L2 filtering schemes in R. We’ll start with the L2 scheme here because it has an exact solution and I will follow up with the L1 scheme later on.Read more
Real Time Lightning Data
A map of real time lightning data from the World Wide Lightning Location Network (WWLLN) is now available here.Read more
How Much Time to Conceive?
This morning my wife presented me with a rather interesting statistic: a healthy couple has a 25% chance of conception every month , and that this should result in a 75% to 85% chance of conception after a year. This sounded rather interesting and it occurred to me that it really can’t be that simple. There are surely a lot of variables which influence this probability. Certainly age should be a factor and, after a short search, I found some more age-specific information which indicated that for a woman in her thirties, the probability is only around 15% [2,3].Read more
Processing EXIF Data
Updated Comrades Winners Chart
On Friday I received my copy of The Official Results Brochure for the 2013 Comrades Marathon. Always makes for a diverting half an hour’s reading. And the tables at the front provide some very interesting statistics. Seemed like a good opportunity to update my Chart of Comrades Winners.Read more
Contour and Density Layers with ggmap
Amy Cuddy: Your body language shapes who you are...
… but be careful with your plots: they might be misinterpreted.
Amy Cuddy gives a great talk. Provided me with lots to think about and I will happily confess that I have struck a few power poses (but only after ensuring that I am quite alone)!Read more
Deriving a Priority Queue from a Plain Vanilla Queue
Implementing a Queue as a Reference Class
I am working on a simulation for an Automatic Repeat-reQuest (ARQ) algorithm. After trying various options, I concluded that I would need an implementation of a queue to make this problem tractable. R does not have a native queue data structure, so this seemed like a good opportunity to implement one and learn something about Reference Classes in the process.
We use setRefClass() to create a generator function which will create objects of the Queue class.
Methods of the Generator Function
Let’s look at some of the methods for the generator function. The first of these is new(), which will create a fresh instance of the class. We will return to more examples of this later.
Next we have methods() and fields() which, not surprisingly, return vectors of the names of the class’s methods and data fields respectively.
Lastly the help() method which, by default, returns some general information about the class.
However, if provided with a method name as an argument, it will provide a help string for the specifed method. These help strings are embedded in the class definition using a syntax which will be familiar to Python programmers.
Creating Objects (and some Reference Class Features)
Each Queue object can be assigned a name. I am not sure whether this will be required, but it seemed like a handy feature to have. This can be done after the object has been created…
… or during the creation process.
We have seen that the name field can be accessed using the dollar operator. We can also use the accessors() method for the generator function to create methods for reading and writing to the name field.
Standard Queue Functionality
Let’s get down to the actual queue functionality. First we add elements to the queue using the push() method.
The queue now contains three items (two strings and a number). The items are stored in the data field. Ideally this should not be directly accessible, but the Reference Class implementation does not provide for private data. So you can access the items directly (but this defeats the object of a queue!).
Next we can start to access and remove items from the queue. The first way to do this uses pop(), which returns the item at the head of the queue and at the same time removes it from the queue.
What if we just want to have a look at the first item in the queue but not remove it? Use peek().
Next we have poll(), which is very much like pop() in that it removes the first item from the queue. But, as we will see in a moment, it behaves differently when the queue is empty.
So we have ejected all three items from the queue and we should expect the behaviour of these functions to be a little different now. pop() generates an error message: obviously we cannot remove an item from the queue if it is empty!
This might not be the best behaviour for harmonious code, so peek() and poll() more elegantly indicate that that the queue is exhausted by returning NULL.
Non-Standard Queue Functionality
I am going to need to be able to take a look at multiple items in the queue, so the peek() method accepts an argument which specifies item locations. To illustrate, first let’s push a number of items onto a queue.
As we have already seen, the default peek() behaviour is just to take a look at the item at the head of the queue.
But we can also choose items further down the queue as well as ranges of items.
If the requested range of locations extends beyond the end of the queue then the method returns NULL as before.
That is all the infrastructure I will need for my ARQ project. Well, almost. I would actually prefer a Priority Queue. I’ll be back with the implementation of that in a day or two. In the meantime though, if anyone has any ideas for additional functionality or has suggestions for how this code might be improved, I would be very pleased to hear them.
This code is now published in the liqueueR package.Read more
Lightning Activity Predictions For Single Buoy Moorings
A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are a little short on text because I like the audience to hear the content rather than read it. The objective with this project was to develop a model which would predict the occurrence of lightning in the vicinity of a Single Buoy Mooring (SBM). Analysis and visualisations were all done in R. I used data from the World Wide Lightning Location Network (WWLLN) and considered four possible models: Neural Network, Conditional Inference Tree, Support Vector Machine and Random Forest. Of the four, Random Forests produced the best performance. The preliminary results from the Random Forests model are very promising: there is good agreement between predicted and observed lightning occurrence in the vicinity of the SBM.Read more
Iterators in R
According to Wikipedia, an iterator is “an object that enables a programmer to traverse a container”. A collection of items (stashed in a container) can be thought of as being “iterable” if there is a logical progression from one element to the next (so a list is iterable, while a set is not). An iterator is then an object for moving through the container, one item at a time.
Iterators are a fundamental part of contemporary Python programming, where they form the basis for loops, list comprehensions and generator expressions. Iterators facilitate navigation of container classes in the C++ Standard Template Library (STL). They are also to be found in C#, Java, Scala, Matlab, PHP and Ruby.
The iterators package is written and maintained by Revolution Analytics. It contains all of the basic iterator functionality.
Iterating over a Vector
The iter() function transforms a container into an iterator. Using it we can make an iterator from a vector.
Here iteration is performed manually by calling nextElem(): each call yields the next element in the vector. And when we fall off the end of the vector, we get a StopIteration error.
Iterating over Data Frames and Matrices
Similarly, we can make an iterator from a data frame.
We can iterate over the data frame by column (the default behaviour)…
… or by row.
To my mind iterating by row is likely to be the most common application, so I am a little puzzled by the fact that it is not the default. No doubt the creators had a solid reason for this design choice though!
Not surprisingly, we can iterate through the rows and columns of a matrix as well. But, taking things one step further, we will do this in parallel using the foreach library. First we set up the requisites for multicore processing.
Then, for illustration purposes, we create a matrix of normally distributed random numbers. Our parallel loop will use an iterator to run through the matrix, computing the mean for every row. Of course, there are other ways of accomplishing the same task, but this syntax is rather neat and intuitive.
Another way of slicing up a matrix uses iapply(), which is analogous to the native apply() function. First, let’s create a small and orderly matrix.
Choosing margin = 1 gives us the rows, while margin = 2 yields the columns.
Iterating through a File
An iterator to read lines from a file is extremely useful. This is certainly the iterator pattern that I have used most often in Python. In R a file iterator is created using ireadLines().
Each call to nextElem() returns a new line from the file. We can put this into a loop and traverse the entire file.
Hmmmm. The code for the loop is a little clunky and it ends in a rather undignified fashion. Surely there is a better way to do this? Indeed there is! More on that later.
Filtering with an Iterator
iter() allows you to specify a filter function which can be used to select particular items from a container. This function should return a Boolean (or a value which can be coerced to Boolean) indicating whether (TRUE) or not (FALSE) an item should be accepted. To illustrate this, we will pick out prime numbers from the first one hundred integers. The gmp package has an isprime() function which is perfectly suited to this job.
First we’ll look at the basic functionality of isprime().
A return value of 2 indicates that a number is definitely prime, while 0 indicates a composite number. For small numbers these are the only two options, however, for larger numbers the Miller-Rabin primality test is applied, in which case a return value of 1 indicates that a number is probably prime.
So, here we go, grabbing the first four primes:
An Iterable Version of split()
The native function split() accepts two arguments: the first is a vector and the second is a factor which dictates how the vector’s elements will be divided into groups. The return value is a list of vectors corresponding to each of the groups. The isplit() function accepts the same arguments and results in an iterator which steps through each of the groups, returning a list with two fields: the key (one of the levels of the factor) and the corresponding values extracted from the vector.
Functions as Iterators
It is also possible to have a function packaged as an iterator. The function is called every time that the iterator is advanced.
Here each succesive element returned by the iterator is generated by a call to an anonymous function which extracts a single sample from our list of names. I am not quite sure how I would use this facility in a practical situation (why is this superior to simply calling the function?), but I am sure that a good application will reveal itself in due course.
Rolling Your Own: A Fibonacci Iterator
An iterator to generate Fibonacci Numbers is readily implemented in Python.
An analogous iterator can be implemented in R. Perhaps the syntax is not quite as succinct, but it certainly does the job.
Now we can instantiate a Fibonacci iterator and start generating terms…
Again, you might be wondering what advantages this provides over simply having a function which generates the sequence. Consider a situation where you need to have two independent sequences. Certainly the state of each sequence could be stored and passed as an argument. An object oriented solution would be natural. However, iterators will do the job very nicely too.
We have generated some terms from each of the iterators. Not surprisingly, when we return to generate further terms, we find that each of the iterators picks up at just the right position in the sequence.
The itertools package is written and maintained by Steve Weston (Yale University) and Hadley Wickham (RStudio). It provides a range of extensions to the basic iterator functionality.
Stop When You Reach the End
The Fibonacci iterator implemented above will keep on generating terms indefinitely. What if we wanted it to terminate after a finite number of terms? We could introduce a parameter for the required number of terms and generate a StopIteration error when we try to exceed this limit.
In practice this works fine…
… but it generates a rather rude error message if you try to iterate beyond the prescribed number of terms. How can we handle this more elegantly? Enter the ihasNext() wrapper which enables a hasNext() function for an iterator.
So, instead of blindly trying to proceed to the next item, we first check whether there are still items available in the iterator. This pattern can be applied to any situation where the iterator has only a finite number of terms. Iterating through the lines in a file would be a prime example!
That’s much better!
Limiting the Limitless
Adapting our Fibonacci iterator to stop after a finite number of terms was a bit of a kludge. Not to worry: the ilimit() wrapper allows us to stipulate a limit on the number of terms.
We have seen how to truncate an effectively infinite series of iterates. How about replicating a finite series? Using recycle() you can run through the iterable a number of times. As a first example, we create an iterator which will run through a integer sequence (1, 2 and 3) twice.
Next we generate the first six terms from the Fibonacci sequence and repeat them twice.
The enumerate() function returns an iterator which returns each element of the iterable as a list with two elements: an index and the corresponding value. So, for example, we can number each of the elements in our list of names.
Zip Them Up!
A generalisation of enumeration is created by zipping up arbitrary iterables.
Naturally, we can use this to replicate the functionality of enumerate().
What about returning groups of values from an iterator? The ichunk() wrapper takes two arguments: the first is an iterator, the second is the number of terms to be wrapped in a chunk. Here is an example of generating groups of three successive terms from the Fibonacci sequence.
Until I Say “Stop!”
Finally, an iterator which continues to generate terms until a time limit is reached. How many terms will our Fibonacci iterator generate in 0.1 seconds?
I have already identified a few older projects where the code can be significantly improved by the use of iterators. And I will certainly be using them in new projects. I hope that you too will be able to apply them in your work. Iterate and prosper!
- Weston, S. (2012). Writing Custom Iterators. </ul>