# Categorically Variable


## Applying the Same Operation to a Number of Variables

Just a quick note on a short hack that I cobbled together this morning. I have an analysis where I need to perform the same set of operations on a list of variables. To do this in a compact and robust way, I wanted to write a loop that would run through the variables and apply the operations to each of them in turn. This can be done using get() and assign().

### Simple Illustration

To illustrate the procedure, I will use the simple example of squaring the numerical values stored in three variables. First we initialise the variables.

Then we loop over the variable names (as strings), creating a temporary copy of each one and applying the operation to the copy. Then the copy is assigned back to the original variable.

Finally we check that the operation has been executed as expected.
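A minimal sketch of the three steps described above. The variable names x, y and z and their starting values are assumptions chosen purely for illustration.

```r
# Initialise three numeric variables (arbitrary illustrative values).
x <- 1
y <- 2
z <- 3

# Loop over the variable names (as strings).
for (name in c("x", "y", "z")) {
  tmp <- get(name)    # temporary copy of the value stored under this name
  tmp <- tmp^2        # apply the operation to the copy
  assign(name, tmp)   # write the result back to the original variable
}

# Check that each variable now holds the square of its original value.
print(c(x = x, y = y, z = z))
```

Any sequence of operations, including conditional branches, could replace the single squaring step inside the loop.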

This is perhaps a little wasteful in terms of resources (creating the temporary variables), but does the job. Obviously in practice you would only implement this sort of solution if there were either a large number of variables to be transformed or the transformation required a relatively complicated set of operations.

### Alternative Implementations

Following up on the numerous insightful responses to this post, there are a number of other ways of skinning the same cat. But, I should point out that the solution above is still optimal for my particular application where I had a series of operations to be applied to each of the variables, some of which involved conditional branches, making a solution using vectorised operations rather messy. Furthermore, I did not want to have to pack and unpack from a list.
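For comparison, here is a rough sketch of one list-based alternative of the kind mentioned above. It is not necessarily what any commenter proposed: it assumes the same illustrative variables x, y and z, packs them into a list with mget(), transforms the list with lapply() and unpacks the result with list2env().

```r
# Illustrative variables (assumed values).
x <- 1
y <- 2
z <- 3

env <- environment()   # the environment in which the variables live

# Pack the variables into a named list and square each element.
squared <- lapply(mget(c("x", "y", "z"), envir = env), function(v) v^2)

# Unpack the named list back into the original variables.
list2env(squared, envir = env)

print(c(x = x, y = y, z = z))
```

This is compact for a single vectorised transformation, but it involves exactly the packing and unpacking that the loop with get() and assign() avoids.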

### Usage Case

To give a better idea of the type of scenario that I was looking at, consider a situation in which you have a number of data frames. Let’s call them A, B, C and D. The data in each is similar, yet each pertains to a distinct population. And, for whatever reason, you want to keep these data separate rather than consolidating them into a single data frame. Now suppose further that you wanted to perform a set of operations on each of them:

• retain only a subset of the columns;
• rename the remaining columns; and
• derive new columns using transformations of the existing columns.

Using the framework above you could achieve all of these objectives without any replication of code.
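A sketch of how that might look, using only two small data frames (A and B) to keep things short. The column names, the values and the derived bmi column are all hypothetical; only the pattern of subsetting, renaming and deriving inside a get()/assign() loop comes from the text.

```r
# Two hypothetical data frames with the same layout.
A <- data.frame(id = 1:3, height_cm = c(150, 160, 170),
                mass_kg = c(50, 60, 70), junk = NA)
B <- data.frame(id = 4:5, height_cm = c(180, 190),
                mass_kg = c(80, 90), junk = NA)

for (name in c("A", "B")) {
  df <- get(name)

  # Retain only a subset of the columns.
  df <- df[, c("height_cm", "mass_kg")]

  # Rename the remaining columns.
  names(df) <- c("height", "mass")

  # Derive a new column from the existing columns.
  df$bmi <- df$mass / (df$height / 100)^2

  assign(name, df)
}

str(A)
str(B)
```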

## Mounting a sshfs volume via the crontab

I need to mount a directory from my laptop on my desktop machine using sshfs. At first I was not making the mount terribly regularly, so I did it manually each time that I needed it. However, the frequency increased over time and I was eventually mounting it every day (or multiple times during the course of a day!). This was a perfect opportunity to employ some automation.

## Top 250 Movies at IMDb

Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine.

To start with I could tick off a number of books that I had already read. That left me with around 75 books outstanding. So I knuckled down. The Lord of the Rings had been on my reading list for a number of years, so this was my first project. A little unfair for this trilogy to count as just one book… but I consumed it with gusto! One down. Other books followed. They were also great reads. And then I hit a couple of books that were just, well, to put it plainly, heavy going. I am sure that they were great books and my lack of enjoyment was entirely a reflection on me and not the quality of the books. No doubt I learned a lot from reading them. But it was hard work! At this stage it occurred to me that the book list was constructed from a rather specific perspective of what constituted a great book. A perspective which is quite different from my own. So I had to admit defeat: my literary tastes will have to mature a bit before I attack this list again!

Then last week I was reading through a back issue of The Linux Journal and came across an article which used shell tools to download and process the IMDb list of Top 250 Movies. This list is constructed from IMDb users’ votes and so represents a fairly democratic and egalitarian perspective. Working through a list of movies seems to me to be a lot easier than a list of books, so this appealed to my inner sloth. And gave me an idea for a quick little R script.

We will use the XML library to retrieve the page from IMDb and parse out the appropriate table.

The output reflects the content of the rating table exactly. However, the rank column is redundant since the same information is captured by the row labels. We can remove this column to make the data more concise.

There are still a few issues with the data:

• the years are bundled up with the titles;
• the rating data are strings;
• the votes data are also strings and have embedded commas.

All of these problems are easily fixed though.
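A sketch of the sort of clean-up involved, applied to a tiny made-up data frame rather than the real IMDb table. The column names title, rating and votes, and every value in them, are assumptions for illustration only.

```r
# Hypothetical rows in the same shape as the scraped table.
movies <- data.frame(
  title  = c("Movie A (1994)", "Movie B (1972)"),
  rating = c("9.2", "9.1"),
  votes  = c("1,000,000", "750,000"),
  stringsAsFactors = FALSE
)

# Pull the four-digit year out of the trailing "(YYYY)" ...
movies$year <- as.integer(sub("^.*\\(([0-9]{4})\\) *$", "\\1", movies$title))
# ... and then strip it from the title.
movies$title <- sub(" *\\([0-9]{4}\\) *$", "", movies$title)

# Convert the rating strings to numbers.
movies$rating <- as.numeric(movies$rating)

# Remove the embedded commas before converting the votes to integers.
movies$votes <- as.integer(gsub(",", "", movies$votes))

str(movies)
```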

I am happy to see that The Good, the Bad and the Ugly rates at number 5. This is one of my favourite movies! Clearly I am not alone.

Finally, to gain a little perspective on the relationship between the release year, votes and rating we can put together a simple bubble plot.

When I have some more time on my hands I am going to use the IMDb API to grab some additional information on each of these movies and see if anything interesting emerges from the larger data set.

## Flushing Live MetaTrader Logs to Disk

The logs generated by expert advisors and indicators when running live on MetaTrader are displayed in the Experts tab at the bottom of the terminal window. Sometimes it is more convenient to analyse these logs offline (especially since the order of the records in the terminal runs in a rather counter-intuitive bottom-to-top order!). However, because writing to the log files is buffered, there can be a delay before what you see in the terminal is actually written to disk.

## Clustering Lightning Discharges to Identify Storms

A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are relatively devoid of text because I like the audience to hear the content rather than read it. The central message of the presentation is that clustering lightning discharges into storms is not a trivial task, but still a worthwhile challenge because it can lead to some very interesting science!

## Clustering the Words of William Shakespeare

In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.

### Preparing the Data

I will start with the Term Document Matrix (TDM) consisting of 71 words commonly used by Shakespeare.

This matrix is first converted from a sparse data format into a conventional matrix.

Next the TDM is normalised so that the rows sum to unity. Each entry in the normalised TDM then represents the number of times that a word occurs in a particular document relative to the number of occurrences across all of the documents.
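The normalisation can be sketched on a toy matrix as follows. The counts and document names are invented, and the matrix is assumed to have already been converted from the sparse format into an ordinary matrix.

```r
# Toy term document matrix: rows are words, columns are documents.
tdm <- matrix(c(2, 0, 2,
                1, 3, 0,
                0, 5, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("love", "thee", "thou"),
                              c("doc1", "doc2", "doc3")))

# Divide each row by its total so that every row sums to one.
# The vector of row sums is recycled down the columns, so element [i, j]
# is divided by the sum of row i.
tdm_norm <- tdm / rowSums(tdm)

print(tdm_norm)
print(rowSums(tdm_norm))
```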

### Clustering

We will be using a hierarchical clustering technique which operates on a dissimilarity matrix. We will use the Euclidean distance between each of the rows in the TDM, where each row is treated as a vector in a space of 182 dimensions.

Finally we perform agglomerative clustering using agnes() from the cluster package.
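A small sketch of these two steps on an invented, already normalised matrix with four words and three documents (the real one has 182 columns). It uses agnes() when the cluster package is installed and otherwise falls back on hclust() from base R with average linkage, which I am assuming is an acceptable stand-in since average linkage is the default for agnes().

```r
# Hypothetical normalised term document matrix (each row sums to one).
tdm_norm <- rbind(
  enter  = c(0.50, 0.40, 0.10),
  exeunt = c(0.48, 0.42, 0.10),
  thee   = c(0.10, 0.20, 0.70),
  thou   = c(0.12, 0.18, 0.70)
)

# Dissimilarity matrix: Euclidean distance between each pair of rows.
d <- dist(tdm_norm, method = "euclidean")

# Agglomerative clustering on the dissimilarities.
if (requireNamespace("cluster", quietly = TRUE)) {
  tree <- as.hclust(cluster::agnes(d))   # default method is "average"
} else {
  tree <- hclust(d, method = "average")  # base R fallback
}

# Cut the tree into two groups to see which words end up together.
groups <- cutree(tree, k = 2)
print(groups)
```

Calling plot(tree) on the result draws the corresponding dendrogram.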

### Plotting a Dendrogram

Let’s have a look at the results of our labours.

This dendrogram reflects the tree-like structure of the word clusters. We can see that the words “enter”, “exeunt” and “scene” are clustered together, which makes sense since they are related to stage directions. Also “thee” and “thou” have similar usage. In the previous analysis we found that the occurrences of “love” and “eye” were highly correlated and consequently we find them clustered here too.

This is rather cool. No doubt a similar analysis applied to contemporary literature would yield extremely different results. Anybody keen on clustering the Complete Works of Terry Pratchett?

Time zones on MetaTrader can be slightly confusing. There are two important time zones:

• the time zone of the broker’s server and

And these need not be the same.

## What can be learned from 5 million books

This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.

## Presenting Conformance Statistics

A client came to me with some conformance data. She was having a hard time making sense of it in a spreadsheet. I had a look at a couple of ways of presenting it that would bring out the important points.

### The Data

The data came as a spreadsheet with multiple sheets. Each of the sheets had a slightly different format, so the easiest thing to do was to save each one as a CSV file and then import them individually into R.

After some preliminary manipulation, this is what the data looked like:

Each record indicates the number of incidents per date and employee for each of 15 different manufacturing problems. The names of the employees have been anonymised to protect their dignities.

My initial instructions were something to the effect of “Don’t worry about the dates, just aggregate the data over years” (I’m paraphrasing, but that was the gist of it). As it turns out, the date information tells us something rather useful. But more of that later.

### Employee / Problem View

I first had a look at the number of incidences of each problem per employee.

To produce the tiled plot that I was after, I first had to transform the data into a tidy format. To do this I used melt() from the reshape2 library. I then derived year and day of week (DOW) columns from the date column and deleted the latter.

Next I used ddply() from the plyr package to consolidate the counts by employee, problem and year.
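To show the shape of the transformation without depending on extra packages, here is a sketch that swaps in base R for melt() and ddply(): a long data frame is built column by column and the counts are consolidated with aggregate(). The data, the problem names (scratch, misalign) and the employee labels are all invented.

```r
# Hypothetical wide data: one row per date and employee, one column per problem.
wide <- data.frame(
  date     = as.Date(c("2012-03-05", "2012-03-12", "2012-03-06", "2013-01-07")),
  employee = c("E1", "E1", "E2", "E1"),
  scratch  = c(1, 2, 0, 2),
  misalign = c(0, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Reshape to a tidy (long) format: one row per date, employee and problem.
problems <- setdiff(names(wide), c("date", "employee"))
long <- do.call(rbind, lapply(problems, function(p) {
  data.frame(date = wide$date, employee = wide$employee,
             problem = p, count = wide[[p]], stringsAsFactors = FALSE)
}))

# Derive year and day of week (1 = Monday) and then drop the date.
long$year <- as.integer(format(long$date, "%Y"))
long$dow  <- as.integer(format(long$date, "%u"))
long$date <- NULL

# Consolidate the counts by employee, problem and year.
by_year <- aggregate(count ~ employee + problem + year, data = long, FUN = sum)
print(by_year)
```

Swapping the formula to count ~ dow + employee + year gives the day of week consolidation in the same way.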

Time to make a quick plot to check that everything is on track.

That’s not too bad. Three panels, one for each year. Employee names on the y-axis and problem type on the x-axis. The colour scale indicates the number of issues per year, employee and problem. Numbers are overlaid on the coloured tiles because apparently the employees are a little pedantic about exact figures!

But it’s all a little disorderly. It might make more sense if we sorted the employees and problems according to the number of issues. First generate counts per employee and per problem. Then sort and extract ordered names. Finally use the ordered names when generating factors.
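A sketch of that ordering step on a small invented data set. The employee labels, problem names and counts are hypothetical, and the direction of the sort (ascending for employees, descending for problems) is just one plausible choice.

```r
# Hypothetical consolidated counts.
counts <- data.frame(
  employee = c("E1", "E1", "E2", "E2", "E3", "E3"),
  problem  = c("scratch", "misalign", "scratch", "misalign", "scratch", "misalign"),
  count    = c(1, 0, 4, 3, 2, 2),
  stringsAsFactors = FALSE
)

# Total number of issues per employee and per problem.
per_employee <- tapply(counts$count, counts$employee, sum)
per_problem  <- tapply(counts$count, counts$problem, sum)

# Sort the totals and extract the names in that order.
employee_order <- names(per_employee)[order(per_employee)]
problem_order  <- names(per_problem)[order(per_problem, decreasing = TRUE)]

# Use the ordered names as factor levels so that plots follow this order.
counts$employee <- factor(counts$employee, levels = employee_order)
counts$problem  <- factor(counts$problem, levels = problem_order)

print(levels(counts$employee))
print(levels(counts$problem))
```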

The new plot is much more orderly.

We can easily see who the worst culprits are and what problems crop up most often. The data for 2013 don’t look as bad as the previous years, but the year is not complete and the counts have not been normalised.

### Employee / Day of Week View

Although I had been told to ignore the date information, I suspected that there might be something interesting in there: perhaps some employees perform worse on certain days of the week?

Using ddply() again, I consolidated the counts by day of week, employee and year.

I then generated a similar plot.

Now that’s rather interesting: for a few of the employees there is a clear pattern of poor performance at the beginning of the week.

### Conclusion

I am not sure what my client is going to do with these plots, but it seems to me that there is quite a lot of actionable information in them, particularly with respect to which of her employees perform poorly on particular days of the week and in doing some specific tasks.

## Finding Correlations in Data with Uncertainty: Classical Solution

This post follows up on my previous one, prompted by an excellent suggestion from Andrej Spiess. The data are indeed very heteroscedastic! Andrej suggested that an alternative way to attack this problem would be to use weighted correlation, with the weights being the inverse of the measurement variance.
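As a rough sketch of what a weighted correlation with inverse-variance weights looks like, here is a small hand-written function applied to invented data. The function name weighted_cor and the values of x, mu and sigma are assumptions; this is one straightforward definition, not necessarily the exact calculation Andrej had in mind.

```r
# Weighted Pearson correlation between x and y with observation weights w.
weighted_cor <- function(x, y, w) {
  w  <- w / sum(w)                      # normalise the weights to sum to one
  mx <- sum(w * x)                      # weighted means
  my <- sum(w * y)
  cov_xy <- sum(w * (x - mx) * (y - my))
  var_x  <- sum(w * (x - mx)^2)
  var_y  <- sum(w * (y - my)^2)
  cov_xy / sqrt(var_x * var_y)
}

# Hypothetical measurements: means (mu) and standard deviations (sigma).
x     <- c(1, 2, 3, 4, 5)
mu    <- c(1.1, 1.9, 3.2, 3.8, 6.0)
sigma <- c(0.1, 0.2, 0.4, 0.8, 1.6)

# Weights are the inverse of the measurement variance.
r_weighted   <- weighted_cor(x, mu, 1 / sigma^2)
r_unweighted <- cor(x, mu)

print(c(weighted = r_weighted, unweighted = r_unweighted))
```

Observations with large error bars contribute very little to the weighted result, which is the point of the approach for heteroscedastic data.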

## Finding Correlations in Data with Uncertainty: Bootstrap Solution

A week or so ago a colleague of mine asked if I knew how to calculate correlations for data with uncertainties. Now, if we are going to be honest, then all data should have some level of experimental or measurement error. However, I suspect that in the majority of cases these uncertainties are ignored when considering correlations. To what degree are uncertainties important? A moment’s thought would suggest that if the uncertainties are large enough then they should have a rather significant effect on correlation, or more properly, the uncertainty measure associated with the correlation. So, what is the best (or at least correct) way to proceed? Somewhat surprisingly a quick Google search did not turn up anything too helpful.

Let’s make this problem a little more concrete. My colleague’s data are plotted below. The independent variable is assumed to be well known but the dependent variable has measurement error. For each value of the independent variable multiple measurements of the dependent variable have been made. The average (mu) and standard deviation (sigma) of these measurements have then been recorded. There is a systematic trend in the measurement uncertainties, with larger error bars generally occurring for larger values of the independent variable (although there are some obvious exceptions!).

Now since I can’t publish those data, we will need to construct a synthetic data set in order to explore this issue.

The direct approach to calculating the correlation would be to just use the average values for each measurement.

This looks eminently reasonable: a correlation coefficient of 0.351 (significant at the 1% level) and a 95% confidence interval extending from 0.166 to 0.512.

We can assess the influence of the uncertainties by performing a bootstrap calculation. Let’s keep things simple to start with, using only the mean values.

Note that we are still only using the measurement means! The new bootstrap values for the correlation coefficient and its confidence interval are in good agreement with the direct results above. But that is no surprise because nothing has really changed. Yet.

Next we will adapt the bootstrap function so that it generates data by random sampling from normal distributions with means and standard deviations extracted from the data.
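The idea can be sketched with a hand-rolled bootstrap. Note the assumptions: the synthetic data below are invented and are not the data set used in this post, the resampling is done with replicate() rather than boot(), and the intervals are simple percentile intervals.

```r
set.seed(42)

# Hypothetical synthetic data: uncertainty (sigma) grows with x.
n     <- 100
x     <- seq(1, 10, length.out = n)
mu    <- 0.3 * x + rnorm(n)
sigma <- 0.2 + 0.1 * x

# One bootstrap replicate: resample the observations with replacement and,
# optionally, redraw each y from a normal distribution with the recorded
# mean and standard deviation instead of using the mean directly.
boot_cor <- function(x, mu, sigma, use_sigma = TRUE) {
  idx <- sample(length(x), replace = TRUE)
  y <- if (use_sigma) {
    rnorm(length(idx), mean = mu[idx], sd = sigma[idx])
  } else {
    mu[idx]
  }
  cor(x[idx], y)
}

r_means <- replicate(2000, boot_cor(x, mu, sigma, use_sigma = FALSE))
r_noisy <- replicate(2000, boot_cor(x, mu, sigma, use_sigma = TRUE))

# Bootstrap estimates and 95% percentile intervals for the two cases.
print(c(means_only = mean(r_means), with_uncertainty = mean(r_noisy)))
print(quantile(r_means, c(0.025, 0.975)))
print(quantile(r_noisy, c(0.025, 0.975)))
```

Redrawing the dependent variable adds scatter to every replicate, so the estimate that accounts for uncertainty would generally be expected to sit closer to zero than the one based on the means alone.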

The bootstrap estimate of the correlation, 0.270, is quite different to the direct and simple bootstrap results. However, we now also have access to the bootstrap confidence intervals which take into account the uncertainty in the observations.

The 95% confidence interval for the correlation, taking into account uncertainties in the measurements, extends from 0.059 to 0.417. The correlation is still significant at the 5% level, but barely so!

Returning now to the original data, we apply the same analysis. First we go the direct route.

Next we look at the bootstrap approach.

Hmmmm. That’s no good: it breaks because there is a single record which has missing data for sigma.

To deal with this small hitch we make a change to the bootstrap function to include only complete observations.

The warnings are generated because rnorm() is still producing NAs. Maybe a better approach would have been to only pass complete observations to boot() using complete.cases(). The bootstrap estimate of the correlation is quite different from what we obtained using the direct method!
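A sketch of that alternative: drop the incomplete records with complete.cases() before any resampling takes place. The data frame below is invented, with a single missing sigma.

```r
# Hypothetical observations, one of which has no recorded sigma.
obs <- data.frame(
  x     = 1:4,
  mu    = c(1.0, 2.1, 2.9, 4.2),
  sigma = c(0.1, NA, 0.3, 0.2)
)

# Keep only the rows in which every column is present.
clean <- obs[complete.cases(obs), ]

print(clean)
```

Passing only the filtered data to the bootstrap means that rnorm() is never asked to sample with a missing standard deviation.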

The bootstrap 95% confidence interval for the correlation does not include zero, but it comes rather close! We can still conclude that the correlation is significant, although it might be a mistake to place too much faith in it.

I am not foolish enough to assert that this is the best (or even correct) way for dealing with this situation. But at least to me it seems to be feasible. I would be extremely pleased to receive feedback on problems with this approach and suggestions for how it might be improved.

Debugging an indicator or expert advisor (EA) can be a tricky business. Especially when you are doing the debugging remotely. So I write my MQL code to include copious amounts of debugging information to log files. The contents of these log files can be used to diagnose any problems. This article tells you where you can find those files.

## Ngoje Trail Run

Team Pro-Print/Exegetic at the early morning start of the 45 km Ngoje Trail run near Eshowe on 3 August 2013.

## A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men’s and ladies’ races since 1980. Click on the image below to see a larger version.

## Modelling the Age of the Oldest Person You Know

The blog post How old is the oldest person you know? by Arthur Charpentier was inspired by Prudential’s stickers campaign which asks you to record the age of the oldest person you know by placing a blue sticker on a number line. The result is a histogram of ages. The original experiment was carried out using 400 real stickers in a park in Austin.

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

Here the drummond.time and finish.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).
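For reference, converting a clock time into decimal hours is simple arithmetic: hours plus minutes divided by 60 plus seconds divided by 3600. A small sketch follows; the helper name to_hours is hypothetical and is not part of the original analysis.

```r
# Convert times in "HH:MM:SS" format to decimal hours.
to_hours <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

# 06:10:00 is 6 + 10/60 hours, or roughly 6.167 hours.
print(round(to_hours("06:10:00"), 3))
```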

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

The image below is a little small. You will want to click on it to bring up a larger version.

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged less than 49 years and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hours: roughly 90% of the 5419 runners in this group finished the race.

Constructing a model for medal allocation is done in a similar fashion.

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.

Apologies for the bit of information overload. A plot brings out the salient information though.

Again you will want to click on the image below to make it legible.

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in less than 3.708 hours.

Shifting across to the other end of the plot, consider the runners who reached half-way in more than 5.266 hours. These are further divided into a group whose half-way time was more than 5.41 hours: these almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first half pace. Ladies who get to half-way at between 05:00 and 05:30 seem to have hit the sweet spot though with close to 100% success rate. Nice!