# Categorically Variable

Only search Categorically Variable.

## Real Time Lightning Data

A map of real time lightning data from the World Wide Lightning Location Network (WWLLN) is now available here.

Read more

## How Much Time to Conceive?

This morning my wife presented me with a rather interesting statistic: a healthy couple has a 25% chance of conception every month [1], and that this should result in a 75% to 85% chance of conception after a year. This sounded rather interesting and it occurred to me that it really can’t be that simple. There are surely a lot of variables which influence this probability. Certainly age should be a factor and, after a short search, I found some more age-specific information which indicated that for a woman in her thirties, the probability is only around 15% [2,3].

Read more

## Processing EXIF Data

I got quite inspired by the EXIF with R post on the Timely Portfolio blog and decided to do a similar analysis with my photographic database.

Read more

## Updated Comrades Winners Chart

On Friday I received my copy of The Official Results Brochure for the 2013 Comrades Marathon. Always makes for a diverting half an hour’s reading. And the tables at the front provide some very interesting statistics. Seemed like a good opportunity to update my Chart of Comrades Winners.

Read more

## Amy Cuddy: Your body language shapes who you are...

… but be careful with your plots: they might be misinterpreted.

Amy Cuddy gives a great talk. Provided me with lots to think about and I will happily confess that I have struck a few power poses (but only after ensuring that I am quite alone)!

Read more

## Deriving a Priority Queue from a Plain Vanilla Queue

Following up on my recent post about implementing a queue as a reference class, I am going to derive a Priority Queue class.

Read more

## Implementing a Queue as a Reference Class

I am working on a simulation for an Automatic Repeat-reQuest (ARQ) algorithm. After trying various options, I concluded that I would need an implementation of a queue to make this problem tractable. R does not have a native queue data structure, so this seemed like a good opportunity to implement one and learn something about Reference Classes in the process.

# The Implementation

We use setRefClass() to create a generator function which will create objects of the Queue class.

# Methods of the Generator Function

Let’s look at some of the methods for the generator function. The first of these is new(), which will create a fresh instance of the class. We will return to more examples of this later.

Next we have methods() and fields() which, not surprisingly, return vectors of the names of the class’s methods and data fields respectively.

Lastly the help() method which, by default, returns some general information about the class.

However, if provided with a method name as an argument, it will provide a help string for the specifed method. These help strings are embedded in the class definition using a syntax which will be familiar to Python programmers.

# Creating Objects (and some Reference Class Features)

Each Queue object can be assigned a name. I am not sure whether this will be required, but it seemed like a handy feature to have. This can be done after the object has been created…

… or during the creation process.

We have seen that the name field can be accessed using the dollar operator. We can also use the accessors() method for the generator function to create methods for reading and writing to the name field.

# Standard Queue Functionality

Let’s get down to the actual queue functionality. First we add elements to the queue using the push() method.

The queue now contains three items (two strings and a number). The items are stored in the data field. Ideally this should not be directly accessible, but the Reference Class implementation does not provide for private data. So you can access the items directly (but this defeats the object of a queue!).

Next we can start to access and remove items from the queue. The first way to do this uses pop(), which returns the item at the head of the queue and at the same time removes it from the queue.

What if we just want to have a look at the first item in the queue but not remove it? Use peek().

Next we have poll(), which is very much like pop() in that it removes the first item from the queue. But, as we will see in a moment, it behaves differently when the queue is empty.

So we have ejected all three items from the queue and we should expect the behaviour of these functions to be a little different now. pop() generates an error message: obviously we cannot remove an item from the queue if it is empty!

This might not be the best behaviour for harmonious code, so peek() and poll() more elegantly indicate that that the queue is exhausted by returning NULL.

## Non-Standard Queue Functionality

I am going to need to be able to take a look at multiple items in the queue, so the peek() method accepts an argument which specifies item locations. To illustrate, first let’s push a number of items onto a queue.

As we have already seen, the default peek() behaviour is just to take a look at the item at the head of the queue.

But we can also choose items further down the queue as well as ranges of items.

If the requested range of locations extends beyond the end of the queue then the method returns NULL as before.

# Conclusion

That is all the infrastructure I will need for my ARQ project. Well, almost. I would actually prefer a Priority Queue. I’ll be back with the implementation of that in a day or two. In the meantime though, if anyone has any ideas for additional functionality or has suggestions for how this code might be improved, I would be very pleased to hear them.

This code is now published in the liqueueR package.

Read more

## Lightning Activity Predictions For Single Buoy Moorings

A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are a little short on text because I like the audience to hear the content rather than read it. The objective with this project was to develop a model which would predict the occurrence of lightning in the vicinity of a Single Buoy Mooring (SBM). Analysis and visualisations were all done in R. I used data from the World Wide Lightning Location Network (WWLLN) and considered four possible models: Neural Network, Conditional Inference Tree, Support Vector Machine and Random Forest. Of the four, Random Forests produced the best performance. The preliminary results from the Random Forests model are very promising: there is good agreement between predicted and observed lightning occurrence in the vicinity of the SBM.

Read more

## Iterators in R

According to Wikipedia, an iterator is “an object that enables a programmer to traverse a container”. A collection of items (stashed in a container) can be thought of as being “iterable” if there is a logical progression from one element to the next (so a list is iterable, while a set is not). An iterator is then an object for moving through the container, one item at a time.

Iterators are a fundamental part of contemporary Python programming, where they form the basis for loops, list comprehensions and generator expressions. Iterators facilitate navigation of container classes in the C++ Standard Template Library (STL). They are also to be found in C#, Java, Scala, Matlab, PHP and Ruby.

Although iterators are not part of the native R language, they are implemented in the iterators and itertools packages.

# iterators Package

The iterators package is written and maintained by Revolution Analytics. It contains all of the basic iterator functionality.

## Iterating over a Vector

The iter() function transforms a container into an iterator. Using it we can make an iterator from a vector.

Here iteration is performed manually by calling nextElem(): each call yields the next element in the vector. And when we fall off the end of the vector, we get a StopIteration error.

## Iterating over Data Frames and Matrices

Similarly, we can make an iterator from a data frame.

We can iterate over the data frame by column (the default behaviour)…

… or by row.

To my mind iterating by row is likely to be the most common application, so I am a little puzzled by the fact that it is not the default. No doubt the creators had a solid reason for this design choice though!

Not surprisingly, we can iterate through the rows and columns of a matrix as well. But, taking things one step further, we will do this in parallel using the foreach library. First we set up the requisites for multicore processing.

Then, for illustration purposes, we create a matrix of normally distributed random numbers. Our parallel loop will use an iterator to run through the matrix, computing the mean for every row. Of course, there are other ways of accomplishing the same task, but this syntax is rather neat and intuitive.

Another way of slicing up a matrix uses iapply(), which is analogous to the native apply() function. First, let’s create a small and orderly matrix.

Choosing margin = 1 gives us the rows, while margin = 2 yields the columns.

## Iterating through a File

An iterator to read lines from a file is extremely useful. This is certainly the iterator pattern that I have used most often in Python. In R a file iterator is created using ireadLines().

Each call to nextElem() returns a new line from the file. We can put this into a loop and traverse the entire file.

Hmmmm. The code for the loop is a little clunky and it ends in a rather undignified fashion. Surely there is a better way to do this? Indeed there is! More on that later.

## Filtering with an Iterator

iter() allows you to specify a filter function which can be used to select particular items from a container. This function should return a Boolean (or a value which can be coerced to Boolean) indicating whether (TRUE) or not (FALSE) an item should be accepted. To illustrate this, we will pick out prime numbers from the first one hundred integers. The gmp package has an isprime() function which is perfectly suited to this job.

First we’ll look at the basic functionality of isprime().

A return value of 2 indicates that a number is definitely prime, while 0 indicates a composite number. For small numbers these are the only two options, however, for larger numbers the Miller-Rabin primality test is applied, in which case a return value of 1 indicates that a number is probably prime.

So, here we go, grabbing the first four primes:

## An Iterable Version of split()

The native function split() accepts two arguments: the first is a vector and the second is a factor which dictates how the vector’s elements will be divided into groups. The return value is a list of vectors corresponding to each of the groups. The isplit() function accepts the same arguments and results in an iterator which steps through each of the groups, returning a list with two fields: the key (one of the levels of the factor) and the corresponding values extracted from the vector.

## Functions as Iterators

It is also possible to have a function packaged as an iterator. The function is called every time that the iterator is advanced.

Here each succesive element returned by the iterator is generated by a call to an anonymous function which extracts a single sample from our list of names. I am not quite sure how I would use this facility in a practical situation (why is this superior to simply calling the function?), but I am sure that a good application will reveal itself in due course.

## Rolling Your Own: A Fibonacci Iterator

An iterator to generate Fibonacci Numbers is readily implemented in Python.

An analogous iterator can be implemented in R. Perhaps the syntax is not quite as succinct, but it certainly does the job.

Now we can instantiate a Fibonacci iterator and start generating terms…

Again, you might be wondering what advantages this provides over simply having a function which generates the sequence. Consider a situation where you need to have two independent sequences. Certainly the state of each sequence could be stored and passed as an argument. An object oriented solution would be natural. However, iterators will do the job very nicely too.

We have generated some terms from each of the iterators. Not surprisingly, when we return to generate further terms, we find that each of the iterators picks up at just the right position in the sequence.

# itertools Package

The itertools package is written and maintained by Steve Weston (Yale University) and Hadley Wickham (RStudio). It provides a range of extensions to the basic iterator functionality.

## Stop When You Reach the End

The Fibonacci iterator implemented above will keep on generating terms indefinitely. What if we wanted it to terminate after a finite number of terms? We could introduce a parameter for the required number of terms and generate a StopIteration error when we try to exceed this limit.

In practice this works fine…

… but it generates a rather rude error message if you try to iterate beyond the prescribed number of terms. How can we handle this more elegantly? Enter the ihasNext() wrapper which enables a hasNext() function for an iterator.

So, instead of blindly trying to proceed to the next item, we first check whether there are still items available in the iterator. This pattern can be applied to any situation where the iterator has only a finite number of terms. Iterating through the lines in a file would be a prime example!

That’s much better!

## Limiting the Limitless

Adapting our Fibonacci iterator to stop after a finite number of terms was a bit of a kludge. Not to worry: the ilimit() wrapper allows us to stipulate a limit on the number of terms.

## Recycle Everything

We have seen how to truncate an effectively infinite series of iterates. How about replicating a finite series? Using recycle() you can run through the iterable a number of times. As a first example, we create an iterator which will run through a integer sequence (1, 2 and 3) twice.

Next we generate the first six terms from the Fibonacci sequence and repeat them twice.

## Enumerating

The enumerate() function returns an iterator which returns each element of the iterable as a list with two elements: an index and the corresponding value. So, for example, we can number each of the elements in our list of names.

## Zip Them Up!

A generalisation of enumeration is created by zipping up arbitrary iterables.

Naturally, we can use this to replicate the functionality of enumerate().

## Getting Chunky!

What about returning groups of values from an iterator? The ichunk() wrapper takes two arguments: the first is an iterator, the second is the number of terms to be wrapped in a chunk. Here is an example of generating groups of three successive terms from the Fibonacci sequence.

## Until I Say “Stop!”

Finally, an iterator which continues to generate terms until a time limit is reached. How many terms will our Fibonacci iterator generate in 0.1 seconds?

# Conclusion

I have already identified a few older projects where the code can be significantly improved by the use of iterators. And I will certainly be using them in new projects. I hope that you too will be able to apply them in your work. Iterate and prosper!

Read more

## Introduction to Fractals

A short while ago I was contracted to write a short piece entitled “Introduction to Fractals”. The article can be found here. Admittedly it is hard to do justice to the topic in less than 1000 words.

Read more

## Percolation Threshold: Including Next-Nearest Neighbours

In my previous post about estimating the Percolation Threshold on a square lattice, I only considered flow from a given cell to its four nearest neighbours. It is a relatively simple matter to extend the recursive flow algorithm to include other configurations as well.

Malarz and Galam (2005) considered the problem of percolation on a square lattice for various ranges of neighbor links. Below is their illustration of (a) nearest neighbour “NN” and (b) next-nearest neighbour “NNN” links.

# Implementing Next-Nearest Neighbours

There were at least two options for modifying the flow function to accommodate different configurations of neighbours: either they could be given as a function parameter or defined as a global variable. Although the first option is the best from an encapsulation perspective, it does make the recursive calls more laborious. I got lazy and went with the second option.

We will add a third example grid to illustrate the effect.

Generating the compound plot with two different values of the global variable required a little thought. But fortunately the scoping rules in R allow for a rather nice implementation.

Here we have two plots for the same grid showing (left) NN and (right) NN+NNN percolation. Including the possibility of “diagonal” percolation extends the range of cells that are reachable and this grid, which does not percolate with just NN, does support percolation with NN+NNN.

No modifications are required to the percolation function.

# Effect on Percolation Threshold

Finally we can see the effect of including next-nearest neighbours on the percolation threshold. We perform the same logistic fit as previously.

This agrees well with the result for NN+NNN from Table 1 in Malarz and Galam (2005).

# A Larger Lattice

Finally, let’s look at a larger lattice at the percolation threshold.

# References

• Malarz, K., & Galam, S. (2005). Square-lattice site percolation at increasing ranges of neighbor bonds. Physical Review E, 71(1), 016125. doi:10.1103/PhysRevE.71.016125.
Read more

## Percolation Threshold on a Square Lattice

Manfred Schroeder touches on the topic of percolation a number of times in his encyclopaedic book on fractals (Schroeder, M. (1991) Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. Percolation has numerous practical applications, the most interesting of which (from my perspective) is the flow of hot water through ground coffee! The problem of percolation can be posed as follows: suppose that a liquid is poured onto a solid block of some substance. If the substance is porous then it is possible for the liquid to seep through the pores and make it all the way to the bottom of the block. Whether or not this happens is determined by the connectivity of the pores within the substance. If it is extremely porous then it is very likely that there will be an open path of pores connecting the top to the bottom and the liquid will flow freely. If, on the other hand, the porosity is low then such a path may not exist. Evidently there is a critical porosity threshold which divides these two regimes.

Read more

Read more

## Plotting Times of Discrete Events

I recently enjoyed reading O’Hara, R. B., & Kotze, D. J. (2010). _Do not log-transform count data_. Methods in Ecology and Evolution, 1(2), 118–122. doi:10.1111/j.2041-210X.2010.00021.x.</blockquote>

Read more

## Applying the Same Operation to a Number of Variables

Just a quick note on a short hack that I cobbled together this morning. I have an analysis where I need to perform the same set of operations to a list of variables. In order to do this in a compact and robust way, I wanted to write a loop that would run through the variables and apply the operations to each of them in turn. This can be done using get() and assign().

# Simple Illustration

To illustrate the procedure, I will use the simple example of squaring the numerical values stored in three variables. First we initialise the variables.

Then we loop over the variable names (as strings), creating a temporary copy of each one and applying the operation to the copy. Then the copy is assigned back to the original variable.

Finally we check that the operation has been executed as expected.

This is perhaps a little wasteful in terms of resources (creating the temporary variables), but does the job. Obviously in practice you would only implement this sort of solution if there were either a large number of variables to be transformed or the transformation required a relatively complicated set of operations.

# Alternative Implementations

Following up on the numerous insightful responses to this post, there are a number of other ways of skinning the same cat. But, I should point out that the solution above is still optimal for my particular application where I had a series of operations to be applied to each of the variables, some of which involved conditional branches, making a solution using vectorised operations rather messy. Furthermore, I did not want to have to pack and unpack from a list.

# Usage Case

To give a better idea of the type of scenario that I was looking at, consider a situation in which you have a number of data frames. Let’s call them A, B, C and D. The data in each is similar, yet each pertains to a distinct population. And, for whatever reason, you want to keep these data separate rather than consolidating them into a single data frame. Now suppose further that you wanted to perform a set of operations on each of them:

• retain only a subset of the columns;
• rename the remaining columns; and
• derive new columns using transformations of the existing columns.

Using the framework above you could achieve all of these objectives without any replication of code.

Read more

## Mounting a sshfs volume via the crontab

I need to mount a directory from my laptop on my desktop machine using sshfs. At first I was not making the mount terribly regularly, so I did it manually each time that I needed it. However, the frequency increased over time and I was eventually mounting it every day (or multiple times during the course of a day!). This was a perfect opportunity to employ some automation.

Read more

## Top 250 Movies at IMDb

Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine.

To start with I could tick off a number of books that I had already read. That left me with around 75 books outstanding. So I knuckled down. The Lord of the Rings had been on my reading list for a number of years, so this was my first project. A little unfair for this trilogy to count as just one book… but I consumed it with gusto! One down. Other books followed. They were also great reads. And then I hit a couple of books that were just, well, to put it plainly, heavy going. I am sure that they were great books and my lack of enjoyment was entirely a reflection on me and not the quality of the books. No doubt I learned a lot from reading them. But it was hard work! At this stage it occurred to me that the book list was constructed from a rather specific perspective of what constituted a great book. A perspective which is quite different from my own. So I had to admit defeat: my literary tastes will have to mature a bit before I attack this list again!

Then last week I was reading through a back issue of The Linux Journal and came across an article which used shell tools to download and process the IMDb list of Top 250 Movies. This list is constructed from IMDb users’ votes and so represents a fairly democratic and egalitarian perspective. Working through a list of movies seems to me to be a lot easier than a list of books, so this appealed to my inner sloth. And gave me an idea for a quick little R script.

We will use the XML library to retrieve the page from IMDb and parse out the appropriate table.

The output reflects the content of the rating table exactly. However, the rank column is redundant since the same information is captured by the row labels. We can remove this column to make the data more concise.

There are still a few issues with the data:

• the years are bundled up with the titles;
• the rating data are strings;
• the votes data are also strings and have embedded commas.

All of these problems are easily fixed though.

I am happy to see that The Good, the Bad and the Ugly rates at number 5. This is one of my favourite movies! Clearly I am not alone.

Finally, to gain a little perspective on the relationship between the release year, votes and rating we can put together a simple bubble plot.

When I have some more time on my hands I am going to use the IMDb API to grab some additional information on each of these movies and see if anything interesting emerges from the larger data set.

Read more

## Flushing Live MetaTrader Logs to Disk

The logs generated by expert advisors and indicators when running live on MetaTrader are displayed in the Experts tab at the bottom of the terminal window. Sometimes it is more convenient to analyse these logs offline (especially since the order of the records in the terminal runs in a rather counter-intuitive bottom-to-top order!). However, because writing to the log files is buffered, there can be a delay before what you see in the terminal is actually written to disk.

Read more

## Clustering Lightning Discharges to Identify Storms

A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are relatively devoid of text because I like the audience to hear the content rather than read it. The central message of the presentation is that clustering lightning discharges into storms is not a trivial task, but still a worthwhile challenge because it can lead to some very interesting science!

Read more