Categorically Variable


Applying the Same Operation to a Number of Variables

Just a quick note on a short hack that I cobbled together this morning. I have an analysis where I need to perform the same set of operations on a number of variables. To do this in a compact and robust way, I wanted to write a loop that would run through the variables and apply the operations to each of them in turn. This can be done using get() and assign().

Simple Illustration

To illustrate the procedure, I will use the simple example of squaring the numerical values stored in three variables. First we initialise the variables.

> x = 1
> y = 2
> z = 3

Then we loop over the variable names (as strings), creating a temporary copy of each one and applying the operation to the copy. Then the copy is assigned back to the original variable.

> for (n in c("x", "y", "z")) {
+   v = get(n)
+   #
+   v = v**2
+   #
+   assign(n, v)
+ }

Finally we check that the operation has been executed as expected.

> x
[1] 1
> y
[1] 4
> z
[1] 9

This is perhaps a little wasteful in terms of resources (creating the temporary variables), but does the job. Obviously in practice you would only implement this sort of solution if there were either a large number of variables to be transformed or the transformation required a relatively complicated set of operations.

Alternative Implementations

Following up on the numerous insightful responses to this post, there are a number of other ways of skinning the same cat. But, I should point out that the solution above is still optimal for my particular application where I had a series of operations to be applied to each of the variables, some of which involved conditional branches, making a solution using vectorised operations rather messy. Furthermore, I did not want to have to pack and unpack from a list.
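For completeness, here is a minimal sketch of the list-based alternative (the packing and unpacking is precisely what I wanted to avoid): pack the variables into a list, transform them all with lapply() and then unpack back into the global environment with list2env().

> vars <- list(x = 1, y = 2, z = 3)
> vars <- lapply(vars, function(v) v^2)
> invisible(list2env(vars, envir = globalenv()))
> y
[1] 4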

Use Case

To give a better idea of the type of scenario that I was looking at, consider a situation in which you have a number of data frames. Let’s call them A, B, C and D. The data in each is similar, yet each pertains to a distinct population. And, for whatever reason, you want to keep these data separate rather than consolidating them into a single data frame. Now suppose further that you wanted to perform a set of operations on each of them:

  • retain only a subset of the columns;
  • rename the remaining columns; and
  • derive new columns using transformations of the existing columns.

Using the framework above you could achieve all of these objectives without any replication of code.
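To make this concrete, here is a minimal sketch of how the loop might look for the data frames above (the column names P and Q and the log() transformation are purely hypothetical):

> for (n in c("A", "B", "C", "D")) {
+   df = get(n)
+   #
+   df = df[, c("P", "Q")]              # retain only a subset of the columns
+   names(df) = c("value", "group")     # rename the remaining columns
+   df$log.value = log(df$value)        # derive a new column from an existing one
+   #
+   assign(n, df)
+ }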

 Read more

Mounting a sshfs volume via the crontab

I need to mount a directory from my laptop on my desktop machine using sshfs. At first I was not making the mount terribly regularly, so I did it manually each time that I needed it. However, the frequency increased over time and I was eventually mounting it every day (or multiple times during the course of a day!). This was a perfect opportunity to employ some automation.

 Read more

Top 250 Movies at IMDb

Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine.

To start with I could tick off a number of books that I had already read. That left me with around 75 books outstanding. So I knuckled down. The Lord of the Rings had been on my reading list for a number of years, so this was my first project. A little unfair for this trilogy to count as just one book… but I consumed it with gusto! One down. Other books followed. They were also great reads. And then I hit a couple of books that were just, well, to put it plainly, heavy going. I am sure that they were great books and my lack of enjoyment was entirely a reflection on me and not the quality of the books. No doubt I learned a lot from reading them. But it was hard work! At this stage it occurred to me that the book list was constructed from a rather specific perspective of what constituted a great book. A perspective which is quite different from my own. So I had to admit defeat: my literary tastes will have to mature a bit before I attack this list again!

Then last week I was reading through a back issue of The Linux Journal and came across an article which used shell tools to download and process the IMDb list of Top 250 Movies. This list is constructed from IMDb users’ votes and so represents a fairly democratic and egalitarian perspective. Working through a list of movies seems to me to be a lot easier than a list of books, so this appealed to my inner sloth. And gave me an idea for a quick little R script.

We will use the XML library to retrieve the page from IMDb and parse out the appropriate table.

> library(XML)
>
> url <- "http://www.imdb.com/chart/top"
> best.movies <- readHTMLTable(url, which = 2, stringsAsFactors = FALSE)
>
> head(best.movies)
  Rank Rating                                 Title     Votes
1   1.    9.2       The Shawshank Redemption (1994) 1,065,332
2   2.    9.2                  The Godfather (1972)   746,693
3   3.    9.0         The Godfather: Part II (1974)   484,761
4   4.    8.9                   Pulp Fiction (1994)   825,063
5   5.    8.9 The Good, the Bad and the Ugly (1966)   319,222
6   6.    8.9                The Dark Knight (2008) 1,039,499

The output reflects the content of the rating table exactly. However, the rank column is redundant since the same information is captured by the row labels. We can remove this column to make the data more concise.

> best.movies[, 1] <- NULL
>
> head(best.movies)
  Rating                                 Title     Votes
1    9.2       The Shawshank Redemption (1994) 1,065,332
2    9.2                  The Godfather (1972)   746,693
3    9.0         The Godfather: Part II (1974)   484,761
4    8.9                   Pulp Fiction (1994)   825,063
5    8.9 The Good, the Bad and the Ugly (1966)   319,222
6    8.9                The Dark Knight (2008) 1,039,499

There are still a few issues with the data:

  • the years are bundled up with the titles;
  • the rating data are strings;
  • the votes data are also strings and have embedded commas.

All of these problems are easily fixed though.

> pattern = "(.*) \\((.*)\\)$"
>
> best.movies = transform(best.movies,
+                       Rating = as.numeric(Rating),
+                       Year   = as.integer(substr(gsub(pattern, "\\2", Title), 1, 4)),
+                       Title  = gsub(pattern, "\\1", Title),
+                       Votes  = as.integer(gsub(",", "", Votes))
+ )
>
> best.movies = best.movies[, c(4, 2, 3, 1)]
>
> head(best.movies)
  Year                          Title   Votes Rating
1 1994       The Shawshank Redemption 1065332    9.2
2 1972                  The Godfather  746693    9.2
3 1974         The Godfather: Part II  484761    9.0
4 1994                   Pulp Fiction  825063    8.9
5 1966 The Good, the Bad and the Ugly  319222    8.9
6 2008                The Dark Knight 1039499    8.9

I am happy to see that The Good, the Bad and the Ugly rates at number 5. This is one of my favourite movies! Clearly I am not alone.

Finally, to gain a little perspective on the relationship between the release year, votes and rating we can put together a simple bubble plot.

> library(ggplot2)
>
> ggplot(best.movies, aes(x = Year, y = Rating)) +
+   geom_point(aes(size = Votes), alpha = 0.5, position = "jitter", color = "darkgreen") +
+   scale_size(range = c(3, 15)) +
+   theme_classic()

When I have some more time on my hands I am going to use the IMDb API to grab some additional information on each of these movies and see if anything interesting emerges from the larger data set.

 Read more

Flushing Live MetaTrader Logs to Disk

The logs generated by expert advisors and indicators when running live on MetaTrader are displayed in the Experts tab at the bottom of the terminal window. Sometimes it is more convenient to analyse these logs offline (especially since the order of the records in the terminal runs in a rather counter-intuitive bottom-to-top order!). However, because writing to the log files is buffered, there can be a delay before what you see in the terminal is actually written to disk.

 Read more

Clustering Lightning Discharges to Identify Storms

A short talk that I gave at the LIGHTS 2013 Conference (Johannesburg, 12 September 2013). The slides are relatively devoid of text because I like the audience to hear the content rather than read it. The central message of the presentation is that clustering lightning discharges into storms is not a trivial task, but still a worthwhile challenge because it can lead to some very interesting science!

 Read more

Clustering the Words of William Shakespeare

In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.

Preparing the Data

I will start with the Term Document Matrix (TDM) consisting of 71 words commonly used by Shakespeare.

> inspect(TDM.common[1:10,1:10])
A term-document matrix (10 terms, 10 documents)

Non-/sparse entries: 94/6
Sparsity           : 6%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms     1 2  3  4  5  6  7  8 9 10
  act     1 4  7  9  6  3  2 14 1  0
  art    53 0  9  3  5  3  2 17 0  6
  away   18 5  8  4  2 10  5 13 1  7
  call   17 1  4  2  2  1  6 17 3  7
  can    44 8 12  5 10  6 10 24 1  5
  come   19 9 16 17 12 15 14 89 9 15
  day    43 2  2  4  1  5  3 17 2  3
  enter   0 7 12 11 10 10 14 87 4  6
  exeunt  0 3  8  8  5  4  7 49 1  4
  exit    0 6  8  5  6  5  3 31 3  2

This matrix is first converted from a sparse data format into a conventional matrix.

> TDM.dense <- as.matrix(TDM.common)
> dim(TDM.dense)
[1] 71 182

Next the TDM is normalised so that the rows sum to unity. Each entry in the normalised TDM then represents the number of times that a word occurs in a particular document relative to the number of occurrences across all of the documents.

> TDM.scaled <- TDM.dense / rowSums(TDM.dense)

Clustering

We will be using a hierarchical clustering technique which operates on a dissimilarity matrix. We will use the Euclidean distance between each of the rows in the TDM, where each row is treated as a vector in a space of 182 dimensions.

> TDM.dist = dist(TDM.scaled)

Finally we perform agglomerative clustering using agnes() from the cluster package.

> library(cluster)
>
> hclusters <- agnes(x = TDM.dist, method = "complete")
> hclusters
Call:	 agnes(x = TDM.dist, method = "complete")
Agglomerative coefficient:  0.6256247
Order of objects:
 [1] act    great  way    away   hand   stand  life   can    hath   yet
[11] look   see    leav   let    shall  make   take   thus   made   till
[21] come   well   will   good   ill    like   now    give   upon   know
[31] may    must   man    much   think  hear   speak  never  one    say
[41] tell   enter  exeunt scene  exit   tis    mean   fear   men    keep
[51] word   name   lord   call   two    old    sir    first  art    thee
[61] thou   thi    day    live   heart  mine   time   part   true   eye
[71] love
Height (summary):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.02495 0.04509 0.05722 0.06050 0.06897 0.14260

Available components:
[1] "order"     "height"    "ac"        "merge"     "diss"      "call"
[7] "method"    "order.lab"

Plotting a Dendrogram

Let’s have a look at the results of our labours.

> plot(hclusters, which.plots = 2, main = "", sub = "", xlab = "")

This dendrogram reflects the tree-like structure of the word clusters. We can see that the words “enter”, “exeunt” and “scene” are clustered together, which makes sense since they are related to stage directions. Also “thee” and “thou” have similar usage. In the previous analysis we found that the occurrences of “love” and “eye” were highly correlated and consequently we find them clustered here too.
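If you wanted to go beyond the dendrogram and assign each word to a discrete cluster, something like the following would do it (the choice of five clusters is entirely arbitrary):

> # Convert the agnes object to an hclust object and cut the tree into five groups.
> word.clusters <- cutree(as.hclust(hclusters), k = 5)
> table(word.clusters)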

This is rather cool. No doubt a similar analysis applied to contemporary literature would yield extremely different results. Anybody keen on clustering the Complete Works of Terry Pratchett?

 Read more

MetaTrader Time Zones

Time zones on MetaTrader can be slightly confusing. There are two important time zones:

  • the time zone of the broker’s server and
  • your local time zone.

And these need not be the same.

 Read more

Text Mining The Complete Works Of William Shakespeare

 Read more

What can be learned from 5 million books

This talk by Jean-Baptiste Michel and Erez Lieberman Aiden is phenomenal. The associated article is also well worth checking out: Michel, J.-B., et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331, 176–182.

 Read more

Presenting Conformance Statistics

A client came to me with some conformance data. She was having a hard time making sense of it in a spreadsheet. I had a look at a couple of ways of presenting it that would bring out the important points.

The Data

The data came as a spreadsheet with multiple sheets. Each of the sheets had a slightly different format, so the easiest thing to do was to save each one as a CSV file and then import them individually into R.

After some preliminary manipulation, this is what the data looked like:

> dim(P)
[1] 1487   17
> names(P)
 [1] "date"               "employee"           "Sorting Colour"     "Extra Material"
 [5] "Fluff"              "Plate Run Blind"    "Plate Maker"        "Colour Short"
 [9] "Stock Issued Short" "Catchup"            "Carton Marked"      "Die Cutting"
[13] "Glueing"            "Damaged Stock"      "Folding Problem"    "Sorting Setoff"
[17] "Bad Lays"
> head(P[, 1:7])
        date employee Sorting Colour Extra Material Fluff Plate Run Blind Plate Maker
1 2011-01-11      E01              0              1     0               0           0
2 2011-01-11      E01              0              1     0               0           0
3 2011-01-11      E37              0              0     0               0           0
4 2011-01-11      E41              0              1     0               0           0
5 2011-01-12      E42              0              1     0               0           0
6 2011-01-17      E01              0              1     0               0           0

Each record indicates the number of incidents per date and employee for each of 15 different manufacturing problems. The names of the employees have been anonymised to protect their dignities.

My initial instructions were something to the effect of “Don’t worry about the dates, just aggregate the data over years” (I’m paraphrasing, but that was the gist of it). As it turns out, the date information tells us something rather useful. But more of that later.

Employee / Problem View

I first had a look at the number of incidents of each problem per employee.

> library(reshape2)
> library(plyr)
>
> Q = melt(P, id.vars = c("employee", "date"), variable.name = "problem")
> #
> # Remove "empty" rows (non-events)
> #
> Q = subset(Q, value == 1)
> #
> Q$year = strftime(Q$date, "%Y")
> Q$DOW = strftime(Q$date, "%A")
> Q$date <- NULL
>
> head(Q)
   employee        problem value year     DOW
46      E11 Sorting Colour     1 2011  Friday
47      E15 Sorting Colour     1 2011  Friday
53      E26 Sorting Colour     1 2011  Friday
67      E26 Sorting Colour     1 2011  Monday
68      E26 Sorting Colour     1 2011  Monday
70      E01 Sorting Colour     1 2011 Tuesday

To produce the tiled plot that I was after, I first had to transform the data into a tidy format. To do this I used melt() from the reshape2 library. I then derived year and day of week (DOW) columns from the date column and deleted the latter.

Next I used ddply() from the plyr package to consolidate the counts by employee, problem and year.

> problem.table = ddply(Q, .(employee, problem, year), summarise, count = sum(value))
> head(problem.table)
  employee        problem year count
1      E01 Sorting Colour 2011    17
2      E01 Sorting Colour 2012     2
3      E01 Sorting Colour 2013     2
4      E01 Extra Material 2011    50
5      E01 Extra Material 2012    58
6      E01 Extra Material 2013    13

Time to make a quick plot to check that everything is on track.

> library(ggplot2)
> ggplot(problem.table, aes(x = problem, y = employee, fill = count)) +
+     geom_tile(colour = "white") +
+     xlab("") + ylab("") +
+     facet_grid(. ~ year) +
+     geom_text(aes(label = count), angle = 0, size = rel(3)) +
+     scale_fill_gradient(high="#FF0000" , low="#0000FF") +
+     theme(panel.background = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1)) +
+     theme(legend.position = "none")

That’s not too bad. Three panels, one for each year. Employee names on the y-axis and problem type on the x-axis. The colour scale indicates the number of issues per year, employee and problem. Numbers are overlaid on the coloured tiles because apparently the employees are a little pedantic about exact figures!

But it’s all a little disorderly. It might make more sense if we sorted the employees and problems according to the number of issues. First generate counts per employee and per problem. Then sort and extract the ordered names. Finally use the ordered names when generating factors.

> CEMPLOYEE = with(problem.table, tapply(count, employee, sum))
> CPROBLEM  = with(problem.table, tapply(count, problem, sum))
> #
> FEMPLOYEE = names(sort(CEMPLOYEE, decreasing = TRUE))
> FPROBLEM  = names(sort(CPROBLEM, decreasing = TRUE))
>
> problem.table = transform(problem.table,
+                          employee = factor(employee, levels = FEMPLOYEE),
+                          problem  = factor(problem,  levels = FPROBLEM)
+                          )

The new plot is much more orderly.

We can easily see who the worst culprits are and what problems crop up most often. The data for 2013 don’t look as bad as the previous years, but the year is not complete and the counts have not been normalised.

Employee / Day of Week View

Although I had been told to ignore the date information, I suspected that there might be something interesting in there: perhaps some employees perform worse on certain days of the week?

Using ddply() again, I consolidated the counts by day of week, employee and year.

> problem.table = ddply(Q, .(DOW, employee, year), summarise, count = sum(value))

Then generated a similar plot.
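The plotting code for this view is essentially the same as before; a sketch (the explicit ordering of the DOW factor is an assumption to keep the days of the week in their natural order):

> problem.table$DOW = factor(problem.table$DOW,
+                            levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
+                                       "Friday", "Saturday", "Sunday"))
> ggplot(problem.table, aes(x = DOW, y = employee, fill = count)) +
+     geom_tile(colour = "white") +
+     xlab("") + ylab("") +
+     facet_grid(. ~ year) +
+     geom_text(aes(label = count), size = rel(3)) +
+     scale_fill_gradient(high = "#FF0000", low = "#0000FF") +
+     theme(panel.background = element_blank(), legend.position = "none")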

Now that’s rather interesting: for a few of the employees there is a clear pattern of poor performance at the beginning of the week.

Conclusion

I am not sure what my client is going to do with these plots, but it seems to me that there is quite a lot of actionable information in them, particularly with respect to which of her employees perform poorly on particular days of the week and in doing some specific tasks.

 Read more

The Wonders Of Foreach

 Read more

Fitting A Model By Maximum Likelihood

 Read more

Finding Correlations in Data with Uncertainty: Classical Solution

Following up on my previous post as a result of an excellent suggestion from Andrej Spiess. The data are indeed very heteroscedastic! Andrej suggested that an alternative way to attack this problem would be to use weighted correlation with weights being the inverse of the measurement variance.
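For the record, a minimal sketch of what such a weighted correlation might look like in base R using cov.wt(), where the weights are the inverse of the measurement variance on the dependent variable (the handling of the one incomplete record is my own assumption):

> # Keep only complete observations (one record has a missing sigma.y).
> ok <- complete.cases(original)
> #
> # Weight each observation by the inverse of its measurement variance.
> w <- 1 / original$sigma.y[ok]^2
> #
> cov.wt(cbind(x = original$mu.x[ok], y = original$mu.y[ok]), wt = w, cor = TRUE)$cor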

 Read more

Finding Correlations in Data with Uncertainty: Bootstrap Solution

A week or so ago a colleague of mine asked if I knew how to calculate correlations for data with uncertainties. Now, if we are going to be honest, then all data should have some level of experimental or measurement error. However, I suspect that in the majority of cases these uncertainties are ignored when considering correlations. To what degree are uncertainties important? A moment’s thought would suggest that if the uncertainties are large enough then they should have a rather significant effect on correlation, or more properly, the uncertainty measure associated with the correlation. So, what is the best (or at least correct) way to proceed? Somewhat surprisingly a quick Google search did not turn up anything too helpful.

Let’s make this problem a little more concrete. My colleague’s data are plotted below. The independent variable is assumed to be well known but the dependent variable has measurement error. For each value of the independent variable multiple measurements of the dependent variable have been made. The average (mu) and standard deviation (sigma) of these measurements have then been recorded. There is a systematic trend in the measurement uncertainties, with larger error bars generally occurring for larger values of the independent variable (although there are some obvious exceptions!).

> head(original)
  mu.x sigma.x  mu.y sigma.y
1 93.7       0 56.80    83.5
2 48.5       0 15.20    24.0
3 85.0       0 79.10   150.0
4 38.3       0  9.17    11.2
5 97.3       0 43.10    77.0
6 44.2       0 36.70    76.2
>
> library(ggplot2)
> ggplot(original, aes(x=mu.x, y=mu.y)) + 
+     geom_errorbar(aes(ymin=mu.y-sigma.y, ymax=mu.y+sigma.y),
+                   colour="black", width = .05, alpha = 0.5) +
+     geom_point(size=3, shape=21, fill="lightgreen", alpha = 0.75) +
+     xlab("x") + ylab("y") +
+     theme_bw() +
+     scale_x_log10()

Now since I can’t publish those data, we will need to construct a synthetic data set in order to explore this issue.

> set.seed(1)
>
> N <- 100
> synthetic <- data.frame(mu.x = runif(N), sigma.x = 0)
> synthetic <- transform(synthetic, mu.y = mu.x + rnorm(N), sigma.y = runif(N))
> head(synthetic)
       mu.x sigma.x       mu.y    sigma.y
1 0.2655087       0  0.6636145 0.67371223
2 0.3721239       0 -0.2399025 0.09485786
3 0.5728534       0  0.9139731 0.49259612
4 0.9082078       0 -0.2211553 0.46155184
5 0.2016819       0  1.6347056 0.37521653
6 0.8983897       0  2.8787896 0.99109922

The direct approach to calculating the correlation would be to just use the average values for each measurement.

> attach(synthetic)
>
> cor.test(mu.x, mu.y)

	Pearson's product-moment correlation

data:  mu.x and mu.y
t = 3.7129, df = 98, p-value = 0.0003405
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1662162 0.5122542
sample estimates:
      cor
0.3511694

>
> detach(synthetic)

This looks eminently reasonable: a correlation coefficient of 0.351 (significant at the 1% level) and a 95% confidence interval extending from 0.166 to 0.512.

We can assess the influence of the uncertainties by performing a bootstrap calculation. Let’s keep things simple to start with, using only the mean values.

> library(boot)
>
> cor.mu <- function(df, n) {
+     df = df[n,]
+     x <- df$mu.x
+     y <- df$mu.y
+     return(cor(x, y))
+ }
> 
> (boot.cor.mu = boot(synthetic, cor.mu, R = 100000))

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = synthetic, statistic = cor.mu, R = 1e+05)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.3511694 -0.001089294  0.09143083
> boot.ci(boot.cor.mu, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.cor.mu, type = c("norm", "basic", "perc"))

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.1731,  0.5315 )   ( 0.1827,  0.5394 )   ( 0.1630,  0.5196 )  
Calculations and Intervals on Original Scale

Note that we are still only using the measurement means! The new bootstrap values for the correlation coefficient and its confidence interval are in good agreement with the direct results above. But that is no surprise because nothing has really changed. Yet.

Next we will adapt the bootstrap function so that it generates data by random sampling from normal distributions with means and standard deviations extracted from the data.

> cor.mu.sigma <- function(df, n) {
+     df = df[n,]
+     x <- rnorm(nrow(df), mean = df$mu.x, sd = df$sigma.x)
+     y <- rnorm(nrow(df), mean = df$mu.y, sd = df$sigma.y)
+     return(cor(x, y))
+ }
> 
> (boot.cor.mu.sigma = boot(synthetic, cor.mu.sigma, R = 100000))

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = synthetic, statistic = cor.mu.sigma, R = 1e+05)


Bootstrap Statistics :
     original     bias    std. error
t1* 0.2699615 0.03208843  0.09144613

The bootstrap estimate of the correlation, 0.270, is quite different to the direct and simple bootstrap results. However, we now also have access to the bootstrap confidence intervals which take into account the uncertainty in the observations.

> boot.ci(boot.cor.mu.sigma, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.cor.mu.sigma, type = c("norm", "basic", 
    "perc"))

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.0586,  0.4171 )   ( 0.0672,  0.4232 )   ( 0.1167,  0.4727 )  
Calculations and Intervals on Original Scale

The 95% confidence interval for the correlation, taking into account uncertainties in the measurements, extends from 0.059 to 0.417. The correlation is still significant at the 5% level, but barely so!

Returning now to the original data, we apply the same analysis. First we take the direct route.

> attach(original)
> 
> cor.test(mu.x, mu.y)

	Pearson's product-moment correlation

data:  mu.x and mu.y
t = 13.5402, df = 445, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4710180 0.6027208
sample estimates:
      cor 
0.5401686 

> 
> detach(original)

Next we look at the bootstrap approach.

> (boot.cor.mu.sigma = boot(original, cor.mu.sigma, R = 100000))

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = original, statistic = cor.mu.sigma, R = 1e+05)

Bootstrap Statistics :
WARNING: All values of t1* are NA
There were 50 or more warnings (use warnings() to see the first 50)

Hmmmm. That’s no good: it breaks because there is a single record which has missing data for sigma.

> original[rowSums(is.na(original)) > 0,]
   mu.x sigma.x mu.y sigma.y
80 52.2       0 47.6     NaN

To deal with this small hitch we make a change to the bootstrap function to include only complete observations.

> cor.mu.sigma <- function(df, n) {
+     df = df[n,]
+     x <- rnorm(nrow(df), mean = df$mu.x, sd = df$sigma.x)
+     y <- rnorm(nrow(df), mean = df$mu.y, sd = df$sigma.y)
+     return(cor(x, y, use = "complete.obs"))
+ }
> (boot.cor.mu.sigma = boot(original, cor.mu.sigma, R = 100000))

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = original, statistic = cor.mu.sigma, R = 1e+05)


Bootstrap Statistics :
     original     bias    std. error
t1* 0.1938031 0.01834959  0.08378789
There were 50 or more warnings (use warnings() to see the first 50)

The warnings are generated because rnorm() is still producing NAs. Maybe a better approach would have been to only pass complete observations to boot() using complete.cases(). The bootstrap estimate of the correlation is quite different from what we obtained using the direct method!
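As an aside, that complete.cases() variant would look something like this (a sketch of the alternative, not what was actually run):

> # Drop the incomplete record up front so that rnorm() never sees a missing standard deviation.
> original.complete <- original[complete.cases(original), ]
> (boot.cor.mu.sigma = boot(original.complete, cor.mu.sigma, R = 100000))

In any case, pressing on with the NA-tolerant version, the confidence intervals are shown below.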

> boot.ci(boot.cor.mu.sigma, type = c("norm", "basic", "perc"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot.cor.mu.sigma, type = c("norm", "basic", 
    "perc"))

Intervals : 
Level      Normal              Basic              Percentile     
95%   ( 0.0112,  0.3397 )   ( 0.0135,  0.3414 )   ( 0.0462,  0.3741 )  
Calculations and Intervals on Original Scale
>
> plot(boot.cor.mu.sigma)

The bootstrap 95% confidence interval for the correlation does not include zero, but it comes rather close! We can still conclude that the correlation is significant, although it might be a mistake to place too much faith in it.

I am not foolish enough to assert that this is the best (or even correct) way for dealing with this situation. But at least to me it seems to be feasible. I would be extremely pleased to receive feedback on problems with this approach and suggestions for how it might be improved.

 Read more

Finding Your MetaTrader Log Files

Debugging an indicator or expert advisor (EA) can be a tricky business. Especially when you are doing the debugging remotely. So I write my MQL code to include copious amounts of debugging information to log files. The contents of these log files can be used to diagnose any problems. This articles tells you where you can find those files.

 Read more

Ngoje Trail Run

Team Pro-Print/Exegetic at the early morning start of the 45 km Ngoje Trail run near Eshowe on 3 August 2013.

 Read more

A Chart of Recent Comrades Marathon Winners

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men's and ladies' races since 1980. Click on the image below to see a larger version.

 Read more

Modelling the Age of the Oldest Person You Know

The blog post How old is the oldest person you know? by Arthur Charpentier was inspired by Prudential’s stickers campaign which asks you to record the age of the oldest person you know by placing a blue sticker on a number line. The result is a histogram of ages. The original experiment was carried out using 400 real stickers in a park in Austin.

 Read more

Comrades Marathon Inference Trees

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

> head(splits.2013)
           gender age.category drummond.time race.time   status       medal
2013-10014   Male        50-59      5.510833        NA      DNF        <NA>
2013-10016   Male        60-69      6.070833        NA      DNF        <NA>
2013-10019   Male        20-29      5.335833  11.87361 Finished Vic Clapham
2013-10031   Male        20-29      4.910833  10.94833 Finished      Bronze
2013-10047   Male        50-59      5.076944  10.72778 Finished      Bronze
2013-10049   Male        50-59      5.729444        NA      DNF        <NA>

Here the drummond.time and race.time fields are expressed in decimal hours and correspond to the time taken to reach the half-way mark and the finish respectively. The status field indicates whether a runner finished the race or did not finish (DNF).

I am going to consider two models. The first will look at the probability of finishing and the second will look at the distribution of medals. The features which will be used to predict these outcomes will be gender, age category and half-way time at Drummond. To build the first model, first load the party library and then call ctree.

> library(party)
> tree.status = ctree(status ~ gender + age.category + drummond.time, data = splits.2013,
+                     control = ctree_control(minsplit = 750))
> tree.status

	 Conditional inference tree with 17 terminal nodes

Response:  status 
Inputs:  gender, age.category, drummond.time 
Number of observations:  13917 

1) drummond.time <= 5.669167; criterion = 1, statistic = 2985.908
  2) drummond.time <= 5.4825; criterion = 1, statistic = 494.826
    3) age.category <= 40-49; criterion = 1, statistic = 191.12
      4) drummond.time <= 5.078611; criterion = 1, statistic = 76.962
        5) gender == {Male}; criterion = 1, statistic = 73.4
          6)*  weights = 5419 
        5) gender == {Female}
          7)*  weights = 836 
      4) drummond.time > 5.078611
        8) gender == {Male}; criterion = 1, statistic = 63.347
          9) drummond.time <= 5.379722; criterion = 1, statistic = 15.55
            10)*  weights = 1123 
          9) drummond.time > 5.379722
            11)*  weights = 447 
        8) gender == {Female}
          12)*  weights = 634 
    3) age.category > 40-49
      13) drummond.time <= 5.038056; criterion = 1, statistic = 68.556
        14) age.category <= 50-59; criterion = 1, statistic = 40.471
          15) gender == {Female}; criterion = 1, statistic = 32.419
            16)*  weights = 118 
          15) gender == {Male}
            17)*  weights = 886 
        14) age.category > 50-59
          18)*  weights = 170 
      13) drummond.time > 5.038056
        19)*  weights = 701 
  2) drummond.time > 5.4825
    20) gender == {Male}; criterion = 1, statistic = 56.149
      21) age.category <= 40-49; criterion = 0.995, statistic = 9.826
        22)*  weights = 636 
      21) age.category > 40-49
        23)*  weights = 259 
    20) gender == {Female}
      24)*  weights = 352 
1) drummond.time > 5.669167
  25) drummond.time <= 5.811389; criterion = 1, statistic = 301.482
    26) age.category <= 30-39; criterion = 1, statistic = 37.006
      27)*  weights = 315 
    26) age.category > 30-39
      28)*  weights = 553 
  25) drummond.time > 5.811389
    29) drummond.time <= 5.940556; criterion = 1, statistic = 75.164
      30) age.category <= 30-39; criterion = 1, statistic = 25.519
        31)*  weights = 299 
      30) age.category > 30-39
        32)*  weights = 475 
    29) drummond.time > 5.940556
      33)*  weights = 694 

There is a deluge of information in the textual representation of the model. Making sense of this is a lot easier with a plot.

> plot(tree.status)

The image below is a little small. You will want to click on it to bring up a larger version.

To interpret the tree, start at the top node (Node 1) labelled drummond.time, indicating that of the features considered, the most important variable in determining a successful outcome at the race is the time to the half-way mark. We are presented with two options: times that are either less than or greater than 5.669 hours. The cutoff time at Drummond is 6.167 hours (06:10:00), so runners reaching half-way after 5.669 hours are already getting quite close to the cutoff time. Suppose that we take the > 5.669 branch. The next node again depends on the half-way time, in this case dividing the population at 5.811 hours. If we take the left branch then we are considering runners who got to Drummond after 5.669 hours but before 5.811 hours. The next node depends on age category. The two branches here are for runners who are 39 and younger (left branch) and older runners (right branch). If we take the right branch then we reach the terminal node. There were 553 runners in this category and the spine plot indicates that around 35% of those runners successfully finished the race.

Rummaging around in this tree, there is a lot of interesting information to be found. For example, female runners who are aged less than 49 years and pass through Drummond in a time of between 5.079 and 5.482 hours are around 95% likely to finish the race. In fact, this is the most successful group of runners (there were 634 of them in the field). The next best group was male runners in the same age category who got to half-way in less than 5.079 hour: roughly 90% of the 5419 runners in this group finished the race.

Constructing a model for medal allocation is done in a similar fashion.

> splits.2013.finishers = subset(splits.2013, status == "Finished" & !is.na(medal))
> #
> levels(splits.2013.finishers$medal) <- c("G", "WH", "S", "BR", "B", "VC")

Here I first extracted the subset of runners who finished the race (and for whom I have information on the medal allocated). Then, to make the plotting a little easier, the names of the levels in the medal factor are changed to a more compact representation.

> tree.medal = ctree(medal ~ gender + age.category + drummond.time, data = splits.2013.finishers,
+                    control = ctree_control(minsplit = 750))
> tree.medal

	 Conditional inference tree with 19 terminal nodes

Response:  medal 
Inputs:  gender, age.category, drummond.time 
Number of observations:  10221 

1) drummond.time <= 4.124167; criterion = 1, statistic = 7452.85
  2) drummond.time <= 3.438889; criterion = 1, statistic = 1031.778
    3)*  weights = 571 
  2) drummond.time > 3.438889
    4) drummond.time <= 3.812222; criterion = 1, statistic = 342.628
      5) drummond.time <= 3.708056; criterion = 1, statistic = 53.658
        6)*  weights = 549 
      5) drummond.time > 3.708056
        7)*  weights = 250 
    4) drummond.time > 3.812222
      8) drummond.time <= 3.976111; criterion = 1, statistic = 37.853
        9)*  weights = 386 
      8) drummond.time > 3.976111
        10)*  weights = 431 
1) drummond.time > 4.124167
  11) drummond.time <= 5.043611; criterion = 1, statistic = 4144.845
    12) drummond.time <= 4.55; criterion = 1, statistic = 596.673
      13) drummond.time <= 4.288333; criterion = 1, statistic = 81.996
        14)*  weights = 603 
      13) drummond.time > 4.288333
        15) gender == {Male}; criterion = 0.996, statistic = 10.468
          16)*  weights = 993 
        15) gender == {Female}
          17)*  weights = 148 
    12) drummond.time > 4.55
      18) drummond.time <= 4.862778; criterion = 1, statistic = 77.052
        19) gender == {Male}; criterion = 1, statistic = 34.077
          20) drummond.time <= 4.653611; criterion = 0.994, statistic = 9.583
            21)*  weights = 353 
          20) drummond.time > 4.653611
            22)*  weights = 762 
        19) gender == {Female}
          23)*  weights = 237 
      18) drummond.time > 4.862778
        24) gender == {Male}; criterion = 1, statistic = 45.95
          25)*  weights = 756 
        24) gender == {Female}
          26)*  weights = 193 
  11) drummond.time > 5.043611
    27) drummond.time <= 5.265833; criterion = 1, statistic = 544.833
      28) gender == {Male}; criterion = 1, statistic = 54.559
        29) drummond.time <= 5.174444; criterion = 1, statistic = 26.917
          30)*  weights = 545 
        29) drummond.time > 5.174444
          31)*  weights = 402 
      28) gender == {Female}
        32)*  weights = 327 
    27) drummond.time > 5.265833
      33) drummond.time <= 5.409722; criterion = 1, statistic = 88.926
        34) gender == {Male}; criterion = 1, statistic = 40.693
          35)*  weights = 675 
        34) gender == {Female}
          36)*  weights = 277 
      33) drummond.time > 5.409722
        37)*  weights = 1763 

Apologies for the bit of information overload. A plot brings out the salient information though.

> plot(tree.medal)

Again you will want to click on the image below to make it legible.

Again the most important feature is the time at the half-way mark. If we look at the terminal node on the left (Node 3), which is the only one which contains athletes who received either Gold or Wally Hayward medals, then we see that they all passed through Drummond in a time of less than 3.439 hours. Almost all of the Silver medal athletes were also in this group, along with a good number of Bill Rowan runners. There are still a few Silver medal athletes in Node 6, which corresponds to runners who got to Drummond in less than 3.708 hours.

Shifting across to the other end of the plot, we look at runners who reached half-way in more than 5.266 hours. These are further divided into a group whose half-way time was more than 5.41 hours: these runners almost all got Vic Clapham medals. Interestingly, the outcome for athletes whose time at Drummond was greater than 5.266 hours but less than 5.41 hours depends on gender: the ladies achieved a higher proportion of Bronze medals than the men.

I could pore over these plots for hours. The take home message from this is that your outcome at the Comrades Marathon is most strongly determined by your pace in the first half of the race. Gender and age don’t seem to be particularly important, although they do exert an influence on your first half pace. Ladies who get to half-way at between 05:00 and 05:30 seem to have hit the sweet spot though with close to 100% success rate. Nice!

 Read more

Optimising a Noisy Objective Function

I am busy with a project where I need to calibrate the Heston Model to some Asian options data. The model has been implemented as a function which executes a Monte Carlo (MC) simulation. As a result, the objective function is rather noisy. There are a number of algorithms for dealing with this sort of problem, and here I simply give a brief overview of some of them.
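To make the idea concrete, here is a toy sketch (entirely illustrative, nothing to do with the Heston calibration itself) of a noisy objective and the bluntest possible mitigation: averaging repeated evaluations before handing the result to optim().

> # A toy objective: a smooth quadratic plus simulation noise.
> noisy <- function(x) (x - 2)^2 + rnorm(1, sd = 0.1)
> #
> # Averaging replicate evaluations reduces the noise seen by the optimiser.
> smoothed <- function(x) mean(replicate(25, noisy(x)))
> #
> optim(0, smoothed, method = "Brent", lower = -10, upper = 10)$par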

 Read more
