Data Science Teams

And even that insanely curious data scientist, if he or she insists on working alone, won’t be able to produce the most valuable insights. Those come from high-performing teams combining individuals who are individually curious and naturally creative, but also collaborative in their approach to the art and science of experimentation. A great data science team is like a jazz quartet, where individuals are always riffing off of one another, and each takes the music to a new and unexpected place.
Josh Sullivan, “Get the Right Data Scientists Asking the ‘Wrong’ Questions”, Harvard Business Review, 2014

I’m a big believer in heterogeneous teams and I thrive when working alongside people with different education, background and experience. I think that Sullivan’s simile above is extremely apt.

Thanks to @rlnel, who pointed me to this article, which expresses similar sentiments.

Searching Database for Column Names

How do you find tables with a column name matching a certain pattern? I was trying to find my way around a new SQL Server database and wanted to locate all columns with the word “Default” in their name.

SELECT tables.name AS TableName, columns.name AS ColumnName
FROM
	sys.columns AS columns
INNER JOIN
	sys.tables AS tables
ON tables.object_id = columns.object_id
WHERE
	columns.name LIKE '%Default%';

Constructing a Word Cloud for ICML 2015

Word clouds have become a bit cliché, but I still think that they have a place in giving a high level overview of the content of a corpus. Here are the steps I took in putting together the word cloud for the International Conference on Machine Learning (2015).

word-cloud

  1. Extract the hyperlinks to the PDFs of all of the papers from the Conference Programme web site using a pipeline of grep and uniq.
  2. In R, extract the text from each of these PDFs using the Rpoppler package.
  3. Split the text for each page into lines and remove the first line (corresponding to the running title) from every page except the first.
  4. Intelligently handle word breaks and then concatenate lines within each document.
  5. Transform text to lower case then remove punctuation, digits and stop words using the tm package.
  6. Compile the words for all of the documents into a single data.frame.
  7. Using the dplyr package, count the number of times that each word occurs across the entire corpus, as well as the number of documents which contain that word. This is what the top end of the resulting data.frame looks like:
    > head(word.counts)
           word count doc
    1       can  6496 270
    2  learning  5818 270
    3 algorithm  5762 252
    4      data  5687 254
    5     model  4956 242
    6       set  4040 269
    
  8. Finally, construct a word cloud with the tagcloud package, using the word count to weight the word size and the document count to determine the grey scale. A condensed sketch of these steps is given below.
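For the curious, here is a condensed sketch of steps 2 to 8. It’s not the exact code I used: pdf.urls stands in for the vector of links gathered in step 1, and the running title removal and word break handling (steps 3 and 4) are omitted for brevity.

library(Rpoppler)
library(tm)
library(dplyr)
library(tagcloud)

# Extract the text of each paper, one element per page (downloading the files first may be necessary).
pages <- lapply(pdf.urls, PDF_text)

# Flatten each document to a single lower case string, clean it up and split into words.
words <- lapply(pages, function(doc) {
  text <- tolower(paste(doc, collapse = " "))
  text <- removeWords(removeNumbers(removePunctuation(text)), stopwords("english"))
  unlist(strsplit(text, "\\s+"))
})

# Compile the words for all of the documents into a single data frame.
corpus <- bind_rows(lapply(seq_along(words), function(i) {
  data.frame(doc = i, word = words[[i]], stringsAsFactors = FALSE)
}))

# Count occurrences across the corpus and the number of documents containing each word.
word.counts <- corpus %>%
  filter(word != "") %>%
  group_by(word) %>%
  summarise(count = n(), doc = n_distinct(doc)) %>%
  arrange(desc(count))

# Word size is weighted by total count; grey scale is determined by document count.
top <- head(word.counts, 300)
tagcloud(top$word, weights = top$count, col = grey(0.7 * (1 - top$doc / max(top$doc))))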

The first cloud above contains the top 300 words. The larger cloud below is the top 1000 words.

word-cloud-large

R Recipe: Reordering Columns in a Flexible Way

Suppose you have a data frame with a number of columns.

> names(trading)
 [1] "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"         "TP"         "OpenPrice" 
 [9] "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"       "Duration"   "Trader"    
[17] "System"

You want to put the Trader and System columns first but you also want to do this in a flexible way. One approach would be to specify column numbers.

> trading <- trading[, c(16:17, 1:15)]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

This does the job but it’s not very flexible. After all, the number of columns might change. Rather do it by specifying column names.

> refcols <- c("Trader", "System")
> #
> trading <- trading[, c(refcols, setdiff(names(trading), refcols))]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

R Recipe: Making a Chord Diagram

With the circlize package, putting together a Chord Diagram is simple.

library(circlize)
library(RColorBrewer)

# Create a random adjacency matrix (roughly 10% of the entries are non-zero).
#
adj = matrix(sample(c(1, 0), 26**2, replace = TRUE, prob = c(1, 9)),
             nrow = 26, dimnames = list(LETTERS, LETTERS))

# Replace the non-zero entries with random weights.
#
adj = ifelse(adj == 1, runif(26**2), 0)

# Note: colorRamp2() comes from circlize, not RColorBrewer.
#
chordDiagram(adj, transparency = 0.4, grid.col = "midnightblue",
             col = colorRamp2(seq(0, 1, 0.2), brewer.pal(6, "Blues")))

chord-diagram

A Sankey Plot with Uniform Coloured Edges

Following up on my previous post about generating Sankey plots with the riverplot package: it’s also possible to generate plots which have uniformly coloured edges.

Here’s how (using some of the data structures from the previous post too):

# Assign a fixed colour to each edge, sampled from the palette defined in the previous post.
edges$col <- sample(palette, size = nrow(edges), replace = TRUE)
# Use the constant colour in "col" rather than a gradient.
edges$edgecol <- "col"

river <- makeRiver(nodes, edges)
style <- list(nodestyle = "invisible")

# Render the plot (plot() dispatches to riverplot()).
plot(river, default_style = style)

riverplot-example-constant-edge

Bags, Balls and the Hypergeometric Distribution

A friend came to me with a question. The original question was a little complicated, but in essence it could be explained in terms of the familiar urn problem. So, here’s the problem: you have an urn with 50 white balls and 9 black balls. The black balls are individually numbered. Balls are drawn from the urn without replacement. What is the probability that

  1. the last ball drawn from the urn is a black ball (Event 1) and
  2. when the urn is refilled, the first ball drawn will be the same black ball (Event 2).

My friend thought that this would be an extremely rare event. My intuition also suggested that it would be rather unlikely. As it turns out, we were both surprised.

Theoretical Approach

First we’ll set up the parameters for the problem.

> NBALLS = 59
> NWHITE = 50
> NBLACK = NBALLS - NWHITE

We can calculate the probabilities for each of the events independently. Let’s start with Event 2 since it’s simpler: the probability of selecting a particular ball from the urn.

> P2 = 1 / NBALLS
> P2
[1] 0.01694915

The probability of Event 1 is a little more complicated. The Hypergeometric distribution describes the probability of achieving a specific number of successes in a specific number of draws from a finite population without replacement. It’s precisely the distribution that we are after! We want to know the probability of drawing all of the white balls and all but one of the black balls, so that the last ball remaining is black.

> P1 = dhyper(NWHITE, NWHITE, NBLACK, NBALLS-1)
> P1
[1] 0.1525424
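
As a quick sanity check, the same value can be written directly in terms of binomial coefficients: all 50 white balls and 8 of the 9 black balls must come out in the first 58 draws.

> choose(NWHITE, NWHITE) * choose(NBLACK, NBLACK - 1) / choose(NBALLS, NBALLS - 1)
[1] 0.1525424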

I think that’s where my intuition let me down, because my feeling was that this number should have been smaller. The joint probability for the two events is then around 0.26%.

> P1 * P2
[1] 0.002585464

Simulation Approach

Feeling a little uncomfortable with the theoretical result and suspecting that I might have done something rash with the Hypergeometric distribution, I decided to validate the result with a simulation.

> BAG <- c(rep(NA, NWHITE), 1:NBLACK)
> 
> NSIM <- 1000000
> 
> hits <- lapply(1:NSIM, function(n) {
+ 	last <- sample(BAG)[NBALLS]
+ 	first <- sample(BAG, 1)
+ 	#
+ 	last == first
+ })
> #
> hits <- unlist(hits)
> #
> # The resulting vector has entries with the following possible values:
> #
> # TRUE  -> last ball is black and is the same ball that comes out first from the repopulated bag
> # FALSE -> last ball is black and the first ball from the repopulated bag is black but different
> # NA    -> either of these balls is white
> #
> hits <- ifelse(is.na(hits), FALSE, hits)

I also wanted to generate Binomial confidence intervals for the simulated result.

> library(binom)
> stats <- binom.confint(x = cumsum(hits), n = 1:NSIM, methods = "wilson")

The simulated results are plotted below with the theoretical result indicated by the dashed line. The pale blue ribbon indicates the range of the 95% confidence interval.
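
A minimal ggplot2 sketch of how such a figure can be assembled (binom.confint() returns the columns n, mean, lower and upper used here):

library(ggplot2)

ggplot(stats, aes(x = n)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "lightblue") +
  geom_line(aes(y = mean)) +
  geom_hline(yintercept = P1 * P2, linetype = "dashed") +
  labs(x = "Number of Simulations", y = "Probability")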

black-white-balls

Looks like both theory and simulation confirm that my intuition was wrong. I’d prefer not to say just how badly wrong it was.

Postscript: after the excitement of the Hypergeometric distribution had subsided, it occurred to me that the symmetry of the problem allows for a simpler solution. Every ball is equally likely to end up in the last position, so the probability of Event 1 is just the probability of drawing any one of the black balls first out of a full urn. Don’t believe me?

> NBLACK / NBALLS
[1] 0.1525424

The Price of Fuel: How Bad Could It Get?

The cost of fuel in South Africa (and I imagine pretty much everywhere else) is a contentious topic. It varies from month to month and, although it is clearly related to the price of crude oil and the exchange rate, various other forces play an influential role.

According to the Department of Energy, the majority of South Africa’s fuel is refined from imported crude oil. The rest is synthetic fuel produced locally from coal and natural gas. The largest expense for a fuel refinery is raw materials, so it’s little wonder that the cost of fuel is intimately linked to the price of crude oil. The price of crude oil on the international market is quoted in US Dollars (USD) per barrel, so the exchange rate between the South African Rand and the US Dollar (ZAR/USD) also exerts a strong influence on the South African fuel price.

I am going to adopt a simplistic model in which I’ll assume that the price of South African fuel depends on only those two factors: crude oil price and the ZAR/USD exchange rate. We’ll look at each of them individually before building the model.

Exchange Rate

The Rand reached a weak point against the USD in January 2002 when $1 cost around R11.70. After that it recovered and was relatively strong in 2005. However, since then it has been systematically weakening (with the exception of some brief intervals of recovery). At present (end of March 2015) you will pay around R12 for $1.

I guess this is part of the reason why new running shoes are so damn expensive.

date-delta-ZARUSD

Crude Oil Price

Over the long term crude oil has been getting progressively more expensive. The price of around $25 per barrel in 2000 escalated to a peak of roughly $135 per barrel in August 2008. This was followed by a massive decrease, which brought the price down to only around $40 per barrel at the beginning of 2009. We have recently seen a similarly dramatic decrease in the crude oil price (unfortunately, not quite as dramatic as the one in 2008).

date-delta-crude-oil

Fuel Price

Back in 2000 you could get a litre of fuel in South Africa for around R3. Those were halcyon days. Since then the cost of fuel has been steadily increasing, with the exception of a sharp spike in 2008 and the recent decline starting towards the end of 2014.

diesel-petrol-time

Building Simple Models

If you’re not interested in the technical details of the model building process, feel free to skip forward to the results.

There is a clear linear relationship between the fuel price and the price of crude oil (plot on the left below, with Diesel in blue and Petrol in red). A power law relationship actually gives a tighter fit (see log-log plot on the right below). However, in the interests of simplicity we’ll stick to linear relationships.

diesel-petrol-crude-oil

We’ll put together three simple (and probably naive) linear models for the fuel price:

  • an additive linear model depending on exchange rate and crude oil price;
  • the same model but including the interaction between exchange rate and crude oil price; and
  • a model depending only on the above interaction.

The first model assumes that the influences of exchange rate and crude oil price are independent. A moment’s thought reveals that this cannot be a particularly good model, since the effects of these two quantities must be inextricably linked. As we will see, however, the model provides a relatively good fit to the data, although it probably will not extrapolate well.

> fit.1 <- lm(ULP_Inland_95 ~ DollarPerBarrel + ZARUSD, data = fuel.data)
> summary(fit.1)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel + ZARUSD, data = fuel.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1686 -0.2130  0.1724  0.4877  1.1874 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -6.210351   0.446422  -13.91   <2e-16 ***
DollarPerBarrel  0.075478   0.003091   24.42   <2e-16 ***
ZARUSD           1.093711   0.051210   21.36   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7103 on 105 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9305,	Adjusted R-squared:  0.9292 
F-statistic: 702.9 on 2 and 105 DF,  p-value: < 2.2e-16

> AIC(fit.1)
[1] 237.5634

The second model considers both the independent (additive) effects of exchange rate and crude oil price as well as their joint (multiplicative) effect. The p-values in the summary output below indicate that only the multiplicative term is statistically significant. This provides a motivation for the third model.

> fit.2 <- lm(ULP_Inland_95 ~ DollarPerBarrel * ZARUSD, data = fuel.data)
> summary(fit.2)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel * ZARUSD, data = fuel.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2018 -0.3598  0.1219  0.4863  1.0814 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.285115   1.599475   1.429    0.156    
DollarPerBarrel        -0.019958   0.017625  -1.132    0.260    
ZARUSD                  0.050663   0.195616   0.259    0.796    
DollarPerBarrel:ZARUSD  0.011591   0.002115   5.481 2.97e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6287 on 104 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9461,	Adjusted R-squared:  0.9445 
F-statistic: 608.3 on 3 and 104 DF,  p-value: < 2.2e-16

> AIC(fit.2)
[1] 212.1551

The third and final model uses only the interaction between exchange rate and crude oil price. The summary output below indicates that for this model the coefficients associated with both the intercept and interaction terms are significant.

> fit.3 <- lm(ULP_Inland_95 ~ DollarPerBarrel:ZARUSD, data = fuel.data)
> summary(fit.3)

Call:
lm(formula = ULP_Inland_95 ~ DollarPerBarrel:ZARUSD, data = fuel.data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.06270 -0.27476  0.08142  0.44747  1.35974 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.9443004  0.2043706   9.514 6.98e-16 ***
DollarPerBarrel:ZARUSD 0.0102018  0.0002601  39.224  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6808 on 106 degrees of freedom
  (72 observations deleted due to missingness)
Multiple R-squared:  0.9355,	Adjusted R-squared:  0.9349 
F-statistic:  1539 on 1 and 106 DF,  p-value: < 2.2e-16

> AIC(fit.3)
[1] 227.4323

Based on standard metrics, all three of these models are reasonable approximations to the data. The principle of parsimony suggests that, of Models 2 and 3, the latter should be preferred (especially in light of the fact that the coefficients for the extra terms in Model 2 are not significant).

The quantile-quantile plots for the standardised residuals of the three models are given below. Both Model 1 and Model 2 have residuals which are skewed to the left. The distribution of residuals for Model 3 is close to normal but with heavy tails (large kurtosis).
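
For reference, a panel like the one for Model 3 can be reproduced from the standardised residuals using base graphics; a minimal sketch:

# Standardised residuals for Model 3.
qqnorm(rstandard(fit.3), main = "Model 3")
qqline(rstandard(fit.3))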

model-residuals-quantile

The next set of plots show the relative errors for Models 2 and 3. For Model 3 there is a gradient from negative errors in the top left corner of the plot to positive errors in the bottom right corner. This indicates that Model 3 is systematically failing to capture some of the underlying variation in the data. The relative error for Model 2 does not display a similar pattern.

model-relative-error-scatter

So, despite the simplicity of Model 3, Model 2 provides a better description of the data. The pattern of relative errors for Model 1 is similar to that for Model 2.

Model Projections

So what do these models tell us?

The plots below show the predictions based on Models 1 and 2. The points indicate the data used to derive the models. The dashed lines denote contours of constant fuel price. The linear and hyperbolic shapes of the contours reflect the form of the underlying models.

It’s dangerous to extrapolate too far beyond the domain covered by your data. However, the general trend of both models is obvious: increases in either the crude oil price or the ZAR/USD exchange rate will result in higher fuel costs. Let’s consider some individual test cases.

  • Suppose the price of crude oil remained around 60 USD per barrel but the ZAR/USD exchange rate went up to R15 per $1. In this case Model 1 predicts that the fuel price would reach R14.70 per litre, while Model 2 predicts R13.45 per litre. We’ve been in this regime before and it wasn’t pleasant!
  • Suppose the ZAR/USD exchange rate stayed at around R12 per $1 but the price of crude oil went up to 160 USD per barrel. In this case Model 1 predicts that the fuel price would reach R19.00 per litre, while Model 2 predicts R21.10 per litre. Either of those is rather daunting.
  • Finally, what would happen if the price of crude oil went up to 160 USD per barrel and the ZAR/USD exchange rate hit R15 per $1? Under these conditions Model 1 predicts R22.30 per litre, while Model 2 gives R26.20 per litre. Those are both pretty appalling!
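Scenario predictions like these can be generated directly with predict(); a minimal sketch for the three cases above, using the models fitted earlier:

# The three scenarios: (crude oil price in USD per barrel, ZAR/USD exchange rate).
scenarios <- data.frame(
  DollarPerBarrel = c(60, 160, 160),
  ZARUSD          = c(15, 12, 15)
)

predict(fit.1, newdata = scenarios)
predict(fit.2, newdata = scenarios)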

model-predictions

Conclusions

These models only consider the effects of the international price of crude oil and the exchange rate. Effects like the fuel price increases in March and April 2015, which were based on budgetary decisions, are of course not accounted for. Predictions should thus be treated with a healthy dose of skepticism. But as a general guide to the fuel prices we could expect in the future, these models are quite illuminating. And not a little scary.