# Simple School Maths Problem

A simple problem sent through to me by one of my running friends:

There are 6 red cards and 1 black card in a box. Busi and Khanha take turns to draw a card at random from the box, with Busi being the first one to draw. The first person who draws the black card will win the game (assume that the game can go on indefinitely). If the cards are drawn with replacement, determine the probability that Khanha will win, showing all working.

The problem was posed to matric school pupils and allocated 7 marks (which translates into 7 minutes).

## Per Game Analysis

Every time somebody plays the game they have a 1 in 7 chance of winning. The fact that the cards are drawn with replacement means that every time the game is played the odds are precisely the same.

## Series of Games

Busi plays first. On her first try she has a 1/7 probability of winning.

Khanha plays next. Her probability of winning is 6/7 * 1/7, where 6/7 is the probability that Busi did not win previously and 1/7 is the probability that Khanha wins on her first try.

The next time that Busi plays her probability of winning is 6/7 * 6/7 * 1/7, where the first 6/7 is the probability that she did not win on her first try and the second 6/7 is the probability that Khanha didn’t win on the previous round either.

The process continues…

In the end the probability that Busi wins is

```
1/7 + (6/7 * 6/7) * 1/7 + (6/7 * 6/7)^2 * 1/7 + (6/7 * 6/7)^3 * 1/7 + …
```

This is an infinite geometric series. We’ll simplify it a bit:

```
1/7 * [1 + (6/7 * 6/7) + (6/7 * 6/7)^2 + (6/7 * 6/7)^3 + …]
= 1/7 * [1 + r + r^2 + r^3 + …]
= 1/7 * [1 / (1-r)]
= 1/7 * [49/13]
= 0.5384615
```

where r = 6/7 * 6/7 = 36/49.

What about the probability that Khanha wins? By similar reasoning this is

```
6/7 * 1/7 + (6/7 * 6/7) * 6/7 * 1/7 + (6/7 * 6/7)^2 * 6/7 * 1/7 + (6/7 * 6/7)^3 * 6/7 * 1/7 + …
= 6/7 * 1/7 * [1 + (6/7 * 6/7) + (6/7 * 6/7)^2 + (6/7 * 6/7)^3 + …]
= 6/49 * [49/13]
= 0.4615385
```

Importantly those two probabilities sum to one: 0.5384615 + 0.4615385 = 1.

The required answer would be 0.4615385. The calculation for Busi would not be necessary, but I’ve included it for completeness.
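As a quick numerical check on the series above, here is a minimal R sketch. It evaluates the closed-form sums and compares Khanha's result with a simulation, assuming that the number of the winning draw follows a geometric distribution (Khanha wins when that number is even). The variable names and the seed are illustrative only.

```r
# Probability that the player drawing on a given turn gets the black card.
p <- 1 / 7
# Probability that both players miss in one full round.
r <- (1 - p)^2

# Closed-form sums of the two geometric series.
busi.exact   <- p / (1 - r)            # 7/13
khanha.exact <- (1 - p) * p / (1 - r)  # 6/13

# Monte Carlo check: the number of the winning draw is geometric, and
# Khanha wins whenever that draw number is even.
set.seed(13)
draws <- rgeom(100000, p) + 1          # rgeom() counts failures before the success
khanha.simulated <- mean(draws %% 2 == 0)

c(exact = khanha.exact, simulated = khanha.simulated)
```

The simulated proportion should agree with 6/13 to within sampling error.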

## Conclusion

Although either player has the same chance of winning every time they play the game, because Busi plays first she has a greater chance of winning overall (simply by virtue of the fact that she plays before her opponent). By the same token, Khanha playing second puts her at a slight disadvantage. If both players played at the same time (for example, each drawing from their own box) then the probability would be 0.5 for both of them.

Note that Busi’s edge gets smaller as the number of red cards in the box increases. This is because her probability of winning on every game gets smaller and so the “first play” advantage weakens.
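To see how the first-play advantage shrinks, the same geometric series can be evaluated for boxes with different numbers of red cards. This is a small sketch; the function name `first.player.wins` and the chosen card counts are just for illustration.

```r
# Probability that the first player wins when the box holds n.red red
# cards and one black card, drawn with replacement.
first.player.wins <- function(n.red) {
  p <- 1 / (n.red + 1)   # chance of drawing the black card on any turn
  q <- 1 - p             # chance of missing
  p / (1 - q^2)          # sum of the geometric series
}

n.red <- c(1, 6, 20, 100)
data.frame(n.red, busi = first.player.wins(n.red))
```

With 6 red cards this reproduces 7/13, and the value approaches 0.5 from above as red cards are added.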

It seems like a fairly challenging problem for matric maths, especially for only 7 marks. Having said that, the fact that they are attacking these sorts of problems in school maths is great. We never did anything this practical when I was at school.

# Fitting a Statistical Distribution to Sampled Data

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest.

Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. So I had a look at the tools available in R for addressing this problem. The fitdistrplus package seemed like a good option. Here’s a sample workflow.

## Create some Data

To have something to work with, generate 1000 samples from a log-normal distribution.

```
> N <- 1000
>
> set.seed(37)
>
> x <- rlnorm(N, meanlog = 0, sdlog = 0.5)
```

## Skewness-Kurtosis Plot

Load up the package and generate a skewness-kurtosis plot.

```
> library(fitdistrplus)
>
> descdist(x)
summary statistics
------
min:  0.2391517   max:  6.735326
median:  0.9831923
mean:  1.128276
estimated sd:  0.6239416
estimated skewness:  2.137708
estimated kurtosis:  12.91741
```

There’s nothing magical in those summary statistics, but the plot is most revealing. The data are represented by the blue point. Various distributions are represented by symbols, lines and shaded areas.

We can see that our data point is close to the log-normal curve (no surprises there!), which indicates that it is the most likely distribution.

We don’t need to take this at face value though because we can fit a few distributions and compare the results.

## Fitting Distributions

We’ll start out by fitting a log-normal distribution using `fitdist()`.

```
> fit.lnorm = fitdist(x, "lnorm")
> fit.lnorm
Fitting of the distribution ' lnorm ' by maximum likelihood
Parameters:
            estimate Std. Error
meanlog -0.009199794 0.01606564
sdlog    0.508040297 0.01135993
> plot(fit.lnorm)
```

The quantile-quantile plot indicates that, as expected, a log-normal distribution gives a pretty good representation of our data. We can compare this to the results of fitting a normal distribution, where we see that there is significant divergence of the tails of the quantile-quantile plot.

## Comparing Distributions

If we fit a selection of plausible distributions then we can objectively evaluate the quality of those fits.
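The comparison relies on several fitted objects whose creation is not shown in this post. A plausible sketch, assuming each was produced with `fitdist()` in the same way as the log-normal fit and named after the distribution, would be:

```r
library(fitdistrplus)

# Recreate the sample from above so that this block stands alone.
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Candidate distributions, each fitted by maximum likelihood.
# The object names are assumed from the table of results.
fit.lnorm <- fitdist(x, "lnorm")
fit.norm  <- fitdist(x, "norm")
fit.exp   <- fitdist(x, "exp")
fit.gamma <- fitdist(x, "gamma")
fit.logis <- fitdist(x, "logis")
```

Each fitted object carries `aic` and `loglik` elements, which is what the comparison extracts.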

```
> metrics <- lapply(ls(pattern = "^fit\\."), function(variable) {
+   fit = get(variable, envir = .GlobalEnv)
+   with(fit, data.frame(name = variable, aic, loglik))
+ })
> do.call(rbind, metrics)
       name      aic     loglik
1   fit.exp 2243.382 -1120.6909
2 fit.gamma 1517.887  -756.9436
3 fit.lnorm 1469.088  -732.5442
4 fit.logis 1737.104  -866.5520
5  fit.norm 1897.480  -946.7398
```

According to these data the log-normal distribution is the optimal fit: smallest AIC and largest log-likelihood.

Of course, with real (as opposed to simulated) data, the situation will probably not be as clear cut. But with these tools it’s generally possible to select an appropriate distribution and derive appropriate parameters.

# Sportsbook Betting (Part 3): Evolving Odds

In previous instalments in this series I have not taken into account how odds can change over time. There are two main reasons for such a change:

1. a significant variation in the distribution of bets being placed on the various outcomes of the event (with the bookmakers thus trying to “balance” their books); and
2. other occurrences which have a direct effect on the probable outcome of the event.

The first of these is difficult to examine since bookmakers generally do not reveal the required data. The second is more accessible. We’ll consider one particular example.

## Olympic Women’s 800 metre Race

We’ll take a look at data for the women’s 800 metre race at the 2016 Olympic Games in Rio de Janeiro. Again the odds were scraped from Oddschecker using the gambleR package. I set up a batch job to grab those odds at 10 minute intervals. In retrospect that was overkill since the odds were static over much longer time scales. However, in principle, the odds for an event might change almost continuously as new information becomes available.

The plot below reflects how the bookmakers’ odds for the various athletes in contention for this event changed from the time that I started logging data on 15 August 2016 through to the final event on 20 August 2016. There were some problems with the scraping job on 15 and 16 August, which accounts for the periods of scarce data. Also there were periods before the heats as well as after the heats and the semi-finals where no odds data were available. Since there was a high degree of overplotting I have jittered the data to make the individual traces visible.

The vertical dashed lines indicate the time of the heats (10:55 on 17 August 2016), the semi-finals (21:15 on 18 August 2016) and the final (21:15 on 20 August 2016). All times are in UTC-3, the local time zone in Rio de Janeiro.

A total of 64 athletes took part in eight heats, after which the field was reduced to 24 athletes. These remaining athletes competed in three semi-finals to leave a field of only 8 athletes for the final. The phenomenal Caster Semenya trounced her competitors to win the final in 1:55.28.

Looking at the odds plotted above it’s clear that Semenya was the favourite to win from the start. A wager on her was almost a sure win, but the rewards were pretty small. There was some variability in the remaining athletes. After the heats and semi-finals odds were no longer quoted for those athletes eliminated from the competition. The odds against Joanna Jóźwik, an outsider prior to the competition, dropped substantially after the heats and semi-finals based on her excellent performance in both. The odds against Margaret Wambui also dropped after the semi-finals based on her comfortable victory. The odds for the remaining athletes who competed in the finals increased somewhat after the heats and semi-finals.

It’s apparent from the stepwise revisions in the odds in this event that they are not being continuously adjusted to take into account changes in the betting preferences of punters. In this case it seems that only the relative performance of the athletes in the races leading up to the final event had any influence on the odds.

# Sportsbook Betting (Part 2): Bookmakers’ Odds

In the first instalment of this series we gained an understanding of the various types of odds used in Sportsbook betting and the link between those odds and implied probabilities. We noted that the implied probabilities for all possible outcomes in an event may sum to more than 100%. At first sight this seems a bit odd. It certainly appears to violate the basic principles of statistics. However, this anomaly is the mechanism by which bookmakers assure their profits. A similar principle applies in a casino.

## Casino House Edge

Because the true probabilities of each outcome in casino games are well defined, this is a good place to start. In a casino game a winning wager receives a payout which is not quite consistent with the game’s true odds (how this is achieved varies from game to game). As a result, casino games are not “fair” from a gambler’s perspective. If they were, then a casino would not be a very profitable enterprise! Instead every casino game is slightly biased in favour of the house. On each round a gambler still stands a chance of winning. However, over time, the effect of this bias accumulates and the gambler inevitably loses money.

Let’s look at a couple of examples. We’ll start with a super simple game.

### Example: Rolling a Dice

Consider a dice game in which the player wins if the dice lands on six. The odds for this game are 5/1 and the player would expect to receive 5 times his wager if he won.

```
> odds.fractional = c(win = 5/1, lose = 1/5)
> (odds.decimal = odds.fractional + 1)
 win lose
 6.0  1.2
> (probability = 1 / odds.decimal)
    win    lose
0.16667 0.83333
```

The probability of winning is 1/6. Would a gambler expect to profit if he played this game many times?

```
> payout = c(5, -1)
> sum(probability * payout)
[1] 0.00
```

No! In the long run neither the gambler nor the casino would make money on a game like this. It’s a fair game: neither the house nor the gambler has any statistical advantage or “edge”.

If, however, the house paid out only 4 times the wager then the player’s expected profit would become

```
> payout = c(4, -1)
> sum(probability * payout)
[1] -0.16667
```

Now the game is stacked in favour of the house, since on average the player would expect to lose around 17% of his stake. Of course, on any one game the gambler would either win 4 times his stake or lose the entire stake. However, if he played the game many times then on average he would lose 17% of his stake per game.

The game outlined above would not represent a very attractive proposition for a gambler. Obviously a casino could not afford to be this greedy and the usual house edge in any casino game is substantially smaller. Let’s move on to a real casino game.

### Example: European Roulette

A European Roulette wheel has one zero and 36 non-zero numbers (18 odd and 18 even; 18 red and 18 black), making a total of 37 positions. Consider a wager on even numbers. The number of losing outcomes is 19 (the zero is treated as neither odd nor even: it’s the “house number”!), while the number of winning outcomes is 18. So the odds against are 19/18.

```
> odds.fractional = c(win = 19/18, lose = 18/19)
> (odds.decimal = odds.fractional + 1)
   win   lose
2.0556 1.9474
> (probability = 1 / odds.decimal)
    win    lose
0.48649 0.51351
```

The probability of winning is 18/(19+18) = 18/37 = 0.48649. So this is almost an even money game.

Based on a wager of 1 coin, a win would result in a net profit of 1 coin, while a loss would forfeit the stake. The player’s expected outcome is then

```
> payout = c(1, -1)
> sum(probability * payout)
[1] -0.027027
```

The house edge is 2.70%. On average a gambler would lose 2.7% of his stake per game. Of course, on any one game he would either win or lose, but this is the long term expectation. Another way of looking at this is to say that the Return To Player (RTP) is 97.3%, which means that on average a gambler would get back 97.3% of his stake on every game.

Below are the results of a simulation of 100 gamblers betting on even numbers. Each starts with an initial capital of 100. The red line represents the average for the cohort. After 1000 games two gamblers have lost all of their money. Of the remaining 98 players, only 24 have made money while the rest have lost some portion of their initial capital.

The code for this simulation is available here.
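For readers who want to experiment, a comparable simulation can be sketched in a few lines of R. This is not the original code. It assumes a stake of 1 coin per game and that a gambler whose capital reaches zero stops playing. The seed and variable names are arbitrary, so the counts will differ from those quoted above.

```r
set.seed(1)

n.gamblers <- 100
n.games    <- 1000
capital    <- 100            # starting capital per gambler
p.win      <- 18 / 37        # probability of an even number

simulate.gambler <- function() {
  # Each game wins or loses one coin.
  outcome <- sample(c(1, -1), n.games, replace = TRUE, prob = c(p.win, 1 - p.win))
  balance <- capital + cumsum(outcome)
  # Assumption: a gambler who reaches zero is ruined and stops playing.
  ruined <- which(balance <= 0)
  if (length(ruined) > 0) balance[ruined[1]:n.games] <- 0
  balance
}

# One column per gambler, one row per game.
balances <- replicate(n.gamblers, simulate.gambler())
final <- balances[n.games, ]

c(mean = mean(final), ruined = sum(final == 0), ahead = sum(final > capital))
```

The row means of `balances` give the cohort average, which drifts downwards at roughly the house edge per game.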

## Over-round, Vigorish and Juice

A bookmaker will aim to achieve an overall profit regardless of the outcome of the event. The general approach to doing this is to offer odds which are less than the true odds. As a result the payout on a successful wager is less than what would be mathematically dictated by the true odds. Because of the reciprocal relationship between odds and implied probabilities, this means that the corresponding implied probabilities are inflated. The margin by which the implied probabilities exceed 100% is known as the “over-round” (also vigorish or juice). The over-round determines the profit margin of the bookmaker. Bookmakers with a lower over-round also have a lower profit margin and hence offer a more equitable proposition to gamblers.

Since sports betting involves humans, there is no deterministic edge to the house or the gambler.

It’s useful to consider what we mean by “true odds” in the context of Sportsbook. Clearly for a casino game these odds can be calculated precisely (though with various degrees of difficulty, depending on the game). However, in Sportsbook the actual odds of each outcome cannot be known with great precision. This is simply a consequence of the fact that the events involve humans, and we are notoriously unpredictable.

Do bookmakers even care about the true odds? Not really. They are mostly just interested in offering odds which will provide them with an assured overall profit on an event.

There are a number of factors which contribute to determining the odds used in Sportsbook. Obviously there’s serious domain knowledge involved in deriving the initial odds on offer. But over time these odds should evolve to take into account the overall distribution of bets placed on the various outcomes (something like the wisdom of the crowd). It has been suggested that, as a result, Sportsbook odds are similar to an efficient market. Specifically, the distribution of wagers affects the odds, with the odds on the favourite getting smaller while those on the underdog(s) get larger. Eventually the odds will settle at values which reflect the market’s perceived probability of the outcome of the event.

Rather than reflecting the true odds, the odds are designed so that equal money is bet on both sides of the game. If more money is bet on one of the teams, the sportsbook runs the risk of losing money if that team were to win.

### Example: Horse Racing with a Round Book

A bookmaker is offering fractional odds of 4/1 (or decimal odds of 5.0) on each horse in a five horse race. The implied probability of each horse winning is 20%. If the bookmaker accepted the same volume of wagers on each horse then he would not make any money since the implied probabilities sum to 100%. This is known as a “round” book.

From a gambler’s perspective, a wager of 10 on any one of the horses would have an expected return of zero. From the bookmaker’s perspective, if he accepted 100 in wagers on each horse, then he would profit 400 on the losing horses and pay out 400 on the winning horse, yielding zero net profit.

Since the expected return is zero, this represents a fair game. However, such odds would never obtain in practice: the bookmaker always stands to make money. Enter the over-round.

### Example: Horse Racing with Over-Round

If the bookmaker offered fractional odds of 3/1 (or 4.0) on each horse, then the implied probabilities would change from 20% to 25%. Summing the implied probabilities gives 125%, which is 25% over-round.

Suppose that the bookmaker accepted 100 in wagers on each horse. He would then profit 400 on the losing horses and pay out only 300 on the winning horse, yielding a net profit of 100.
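The arithmetic for the two horse racing examples can be reproduced with a short R sketch. The helper `book()` is hypothetical and assumes, as in the examples, that every horse carries the same odds and attracts the same total stake.

```r
# Book summary for a race where every horse carries the same fractional
# odds and attracts the same total stake.
book <- function(fractional, n.horses = 5, stake = 100) {
  decimal <- fractional + 1
  implied <- rep(1 / decimal, n.horses)
  c(
    over.round = sum(implied) - 1,
    # Stakes kept from the losers minus winnings paid on the winner.
    profit = stake * (n.horses - 1) - stake * fractional
  )
}

rbind(round.book = book(4), over.round = book(3))
```

At 4/1 both the over-round and the profit are zero, while at 3/1 the over-round is 25% and the bookmaker nets 100.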

Enough hypothetical examples, let’s look at something real.

### Example: Champions League

It’s been suggested that football squad prices can influence Sportsbook odds. Often the richer the franchise, the more likely it is that a club will prevail in the sport. This is supposed to be particularly true in European club football. We’ll try to validate this idea by scraping the data provided by Forbes for football club values.

```
> library(rvest)
> library(dplyr)
> # forbes.url: address of the Forbes club valuation page (not given in the source).
> clubs <- read_html(forbes.url) %>%
+   html_nodes("table") %>% .[[1]] %>% html_table() %>% .[, c(2, 3, 4, 7)] %>%
+   setNames(c("team", "country", "value", "revenue")) %>%
+   mutate(
+     value = as.integer(sub(",", "", value)),
+     team = gsub("\\.", "", team)
+     )
> head(clubs)
               team country value revenue
1       Real Madrid   Spain  3650     694
2         Barcelona   Spain  3320     570
3 Manchester United England  3315     645
4     Bayern Munich Germany  2680     675
5           Arsenal England  2020     524
6   Manchester City England  1920     558
```

Well, those tabular data are great, but a visualisation would be helpful to make complete sense of the relationship between team value and revenue.

It’s apparent that Real Madrid, Barcelona, Manchester United and Bayern Munich are the four most expensive teams. There’s a general trend of increasing revenue with increasing value. Two conspicuous exceptions are Schalke 04 and Paris Saint-Germain, which produce revenues far higher than expected based on their values.

Although not reflected in the plot above, there’s a relationship between the value of the team and its performance. With only a few exceptions the previously mentioned four teams have dominated the Champions League in recent years. Does this make sense? The richest teams are able to attract the most talented players. The resulting pool of talent increases their chances of winning. This in turn translates into revenue and the cycle is complete.

We’ll grab the bookmakers’ odds for the Champions League.

```
> library(gambleR)
> champions.league = oddschecker("football/champions-league/winner")
              Ladbrokes Coral William Hill Winner Betfair Sportsbook BetBright Unibet Bwin
Barcelona           3/1   3/1         10/3    3/1                3/1       7/2    3/1 10/3
Bayern Munich       4/1   5/1          4/1    4/1                4/1       4/1    9/2  4/1
Real Madrid         5/1   5/1          4/1    9/2                9/2       9/2    5/1  5/1
Man City           12/1  12/1         11/1   10/1               10/1      12/1   12/1 12/1
Juventus           12/1  14/1         12/1   12/1               10/1      14/1    8/1 12/1
PSG                14/1  14/1         14/1   14/1               14/1      14/1   12/1 14/1
```

According to the selection of bookmakers above, Barcelona, Bayern Munich and Real Madrid are the major contenders in this competition. Betfair Sportsbook has Barcelona edging the current champions Real Madrid as favourites to win the competition. Bayern Munich and Real Madrid have slightly higher odds, with Bayern Munich perceived as the second most likely winner.

The decimal odds on offer at Betfair Sportsbook are

```
> champions.decimal[, 15]
Barcelona     Bayern Munich       Real Madrid          Man City          Juventus
4.0               5.0               5.5              11.0              11.0
PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla
15.0              17.0              26.0              26.0              51.0
Tottenham            Napoli         Leicester              Roma           Benfica
41.0              67.0              67.0             101.0             101.0
Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon
151.0              67.0             101.0             151.0             201.0
PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel
201.0             201.0             251.0             251.0             301.0
Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade
501.0             501.0                NA                NA                NA
Salzburg
NA
```

The corresponding implied probabilities are

```
> champions.probability[, 15]
Barcelona     Bayern Munich       Real Madrid          Man City          Juventus
0.2500000         0.2000000         0.1818182         0.0909091         0.0909091
PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla
0.0666667         0.0588235         0.0384615         0.0384615         0.0196078
Tottenham            Napoli         Leicester              Roma           Benfica
0.0243902         0.0149254         0.0149254         0.0099010         0.0099010
Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon
0.0066225         0.0149254         0.0099010         0.0066225         0.0049751
PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel
0.0049751         0.0049751         0.0039841         0.0039841         0.0033223
Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade
0.0019960         0.0019960                NA                NA                NA
Salzburg
NA
```

These sum to 1.178, giving an over-round of 17.8%.

Let’s focus on a football game between Anderlecht and Rostov. These are not major contenders, but they faced off recently (3 August 2016), so the data are readily available.

### Example: Anderlecht versus Rostov

The odds for the football match between Anderlecht and Rostov are shown below.

The match odds are 2.0 for a win by Anderlecht, 4.1 for a win by Rostov and 3.55 for a draw. Let’s convert those to the corresponding implied probabilities:

```
> decimal.odds = c(anderlecht = 2.0, rostov = 4.1, draw = 3.55)
> 1 / decimal.odds
anderlecht     rostov       draw
   0.50000    0.24390    0.28169
```

According to those odds the implied probabilities of each of the outcomes are 50%, 24.4% and 28.2% respectively.

```
> sum(1 / decimal.odds)
[1] 1.0256
```

Summing those probabilities gives an over-round of 2.6%, which is very competitive. However, including the 5% commission levied by Betfair, this increases to 7.6%.

Although Anderlecht were the favourites to win this game, it turns out that Rostov had a convincing victory.

The same principles apply when there are many possible outcomes for an event.

### Example: Horse Racing (18:20 at Stratford)

I scraped the odds for the 18:20 race at Stratford on 28 June 2016 from oddschecker. Here are the data for nine bookmakers.

```
> odds[, 1:9]
                 Bet Victor Betway Marathon Bet Betdaq Bet 365 Ladbrokes Sky Bet 10Bet 188Bet
Deauville Dancer        6/4   13/8         13/8    7/5     6/4       6/4     6/4   6/4    6/4
Cest Notre Gris         7/4    7/4          7/4    7/4     7/4       7/4     7/4   7/4    7/4
Ross Kitty             15/2    7/1          7/1   41/5     7/1       7/1     7/1   7/1    7/1
Amber Spyglass         12/1   12/1         12/1   68/5    12/1      12/1    11/1  11/1   11/1
Venture Lagertha       20/1   22/1         20/1   89/5    16/1      20/1    20/1  20/1   20/1
Lucky Thirteen         22/1   22/1         20/1   21/1    22/1      20/1    22/1  20/1   20/1
Overrider              25/1   20/1         25/1   22/1    25/1      20/1    22/1  22/1   22/1
Kims Ocean             28/1   25/1         25/1   21/1    22/1      25/1    28/1  25/1   25/1
Rizal Park             80/1   66/1         50/1   82/1    80/1      66/1    50/1  66/1   66/1
Chitas Gamble         250/1  200/1        100/1  387/1   250/1     125/1   125/1 150/1  150/1
Irish Ranger          250/1  200/1        100/1  387/1   250/1     150/1   125/1 150/1  150/1
```

The decimal odds on offer at Bet Victor are

```
> decimal.odds[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha
2.50             2.75             8.50            13.00            21.00
Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble
23.00            26.00            29.00            81.00           251.00
Irish Ranger
251.00
```

The corresponding implied probabilities are

```
> probability[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha
0.4000000        0.3636364        0.1176471        0.0769231        0.0476190
Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble
0.0434783        0.0384615        0.0344828        0.0123457        0.0039841
Irish Ranger
0.0039841
```

The total implied probability per bookmaker is

```
> sort(colSums(probability))
Bet Victor            Betway       Marathon Bet            Betdaq           Bet 365
1.1426            1.1444             1.1581            1.1623            1.1701
Ladbrokes           Sky Bet              10Bet            188Bet         Netbet UK
1.1764            1.1765             1.1773            1.1773            1.1773
Boylesports            Winner       William Hill        Stan James           Betfair
1.1797            1.1861             1.1890            1.1895            1.1935
Coral          RaceBets Betfair Sportsbook         BetBright       Sportingbet
1.1964            1.2003             1.2173            1.2229            1.2288
Betfred         Totesport          32Red Bet          888sport       Paddy Power
1.2303            1.2303             1.2392            1.2392            1.2636
```

It’s obvious that there is a wide range of value being offered by various bookmakers, extending from the competitive Bet Victor and Betway with an over-round of around 14% to the substantial over-round of 26% at Paddy Power.

From a gambler’s point of view, the best value is obtained by finding the bookmaker who is offering the largest odds for a particular outcome. It’s probable that this bookmaker will also have a relatively low over-round. Sites like oddschecker make it a simple matter to check the odds on offer from a range of bookmakers. If you have the time and patience it might even be possible to engage in betting arbitrage.
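To illustrate what an arbitrage check might look like, here is a sketch with made-up decimal odds for a two-outcome event at two hypothetical bookmakers. The idea is to take the best odds on offer for each outcome: if the corresponding implied probabilities sum to less than 1, stakes can be split so that the payout is the same whichever outcome occurs.

```r
# Hypothetical decimal odds for a two-outcome event at two bookmakers.
odds <- matrix(
  c(2.10, 1.80,
    1.85, 2.15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("home", "away"), c("bookmaker.A", "bookmaker.B"))
)

# Each bookmaker on its own has an over-round.
colSums(1 / odds)

# Take the best (largest) odds available for each outcome.
best <- apply(odds, 1, max)
total <- sum(1 / best)

# A total below 1 means that staking in proportion to 1 / best locks in
# the same payout whichever outcome occurs.
stakes <- (1 / best) / total * 100     # split a bankroll of 100
payout <- stakes * best
c(total = total, guaranteed.profit = unname(payout[1]) - 100)
```

In practice such opportunities are rare and short-lived, and commissions or stake limits can erase the margin.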

# Animated Mortality

Kyle Walker’s pyramid plots gave me a serious case of visualisation envy. Here’s something similar using the mortality data from the lifespan package.

The change in the mortality profile from year to year over two decades is evident. There are unmistakable peaks which propagate up the plot, corresponding to babies born in 1943 and 1947, around the start and just after the Second World War.

# Sportsbook Betting (Part 1): Odds

This series of articles was written as support material for Statistics exercises in a course that I’m teaching for iXperience. In the series I’ll be using illustrative examples for wagering on a variety of Sportsbook events including Horse Racing, Rugby and Tennis. The same principles can be applied across essentially all betting markets.

## Odds

To make some sense of gambling we’ll need to understand the relationship between odds and probability. Odds can be expressed as either “odds on” or “odds against”. Whereas the former is the odds in favour of an event taking place, the latter reflects the odds that an event will not happen. Odds against is the form in which gambling odds are normally expressed, so we’ll focus on that. The odds against are defined as the ratio, L/W, of losing outcomes (L) to winning outcomes (W). To link these odds to probabilities we note that the winning probability, p, is W/(L+W). The odds against are thus equivalent to (1-p)/p.

To make this more concrete, consider the odds against rolling a 6 with a single die. The number of losing outcomes is L = 5 (for all of the other numbers on the die: 1, 2, 3, 4 and 5) while the number of winning outcomes is W = 1. The odds against are thus 5/1, while the winning probability is 1/(5+1) = 1/6.

### Fractional Odds

Fractional odds are quoted as L/W, L:W or L-W. From a gambler’s perspective these odds reflect the net winnings relative to the stake. For example, fractional odds of 5/1 imply that the gambler stands to make a profit of 50 on a stake of 10. In addition to the profit, a winning gambler gets the stake back too. So, in the previous scenario, the gambler would receive a total of 60. Conversely, fractional odds of 1/2 would pay out 10 for a stake of 20. Odds of 1/1 are known as “even odds” or “even money”, and will pay out the same amount as was wagered.

The numerator and denominator in fractional odds are always integers.

In a fair game a player who placed a wager at fractional odds of L/W would reasonably expect to win L/W times his wager.

### Decimal Odds

Decimal odds quote the ratio of the full payout (including original stake) to the stake. Using the same symbols as above, this is equivalent to the ratio (L+W)/W or 1+L/W. The decimal odds are numerically equal to the fractional odds plus 1. In a fair game the decimal odds are also the inverse of the probability of a winning outcome. This makes sense because the inverse of the decimal odds is W/(L+W).

From a gambler’s perspective, decimal odds reflect the gross total which will be paid out relative to the stake. For example, decimal odds of 6.0 are equivalent to fractional odds of 5/1 and imply that the gambler stands to get back 60 on a stake of 10. Similarly, decimal odds of 1.5 are the same as fractional odds of 1/2, and a winning gambler would get back 30 on a wager of 20.

Decimal odds are quoted as a positive number greater than 1.
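The relationships between fractional odds, decimal odds and probability (for a fair game) can be captured in a few one-line R functions. The function names are purely illustrative, and fractional odds are represented by their numeric value (so 5/1 becomes 5).

```r
# Odds against, L/W, expressed as a single number (5/1 becomes 5).
fractional.to.decimal <- function(fractional) fractional + 1
decimal.to.probability <- function(decimal) 1 / decimal
probability.to.fractional <- function(p) (1 - p) / p

fractional <- c("5/1" = 5 / 1, "1/2" = 1 / 2, "1/1" = 1 / 1)
decimal <- fractional.to.decimal(fractional)
probability <- decimal.to.probability(decimal)

data.frame(fractional, decimal, probability)
```

Converting the probabilities back with `probability.to.fractional()` recovers the original fractional odds.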

## Odds and Probability

As indicated above, there is a direct relationship between odds and probabilities. For a fair game, this relationship is simple: the probabilities are the reciprocal of the decimal odds. And for a fair game, the sum of the probabilities of all possible outcomes must be 1.

The reciprocal relationship between decimal odds and probabilities implies that outcomes with the lowest odds are the most likely to be realised. This might not tie up with the conventional understanding of odds, but is a consequence of the fact that we are looking at the odds against that outcome.

### Example: Fair Odds on Rugby

The Crusaders are playing the Hurricanes at the AMI Stadium. A bookmaker is offering 1/2 odds on the Crusaders and 2/1 odds on the Hurricanes. These fractional odds translate into decimal odds of 1.5 and 3.0 respectively. Based on these odds, the implied probabilities of either team winning are

```
> (odds = c(Crusaders = 1.5, Hurricanes = 3))
 Crusaders Hurricanes
       1.5        3.0
> (probability = 1 / odds)
 Crusaders Hurricanes
   0.66667    0.33333
```

The Crusaders are perceived as being twice as likely to win. Since they are clearly the favourites for this match it stands to reason that there would be more wagers placed on the Crusaders than on the Hurricanes. In fact, on the basis of the odds we would expect there to be roughly twice as much money placed on the Crusaders.

A successful wager of 10 on the Crusaders would yield a net win of 5, while the same wager on the Hurricanes would stand to yield a net win of 20. If we include the initial stake then we get the corresponding gross payouts of 15 and 30.

```> (odds - 1) * 10                                          # Net win
Crusaders Hurricanes
5         20
> odds * 10                                                # Gross win
Crusaders Hurricanes
15         30
```

In keeping with the reasoning above, suppose that a total of 2000 was wagered on the Crusaders and 1000 was wagered on the Hurricanes. In the event of a win by the Crusaders the bookmaker would keep the 1000 wagered on the Hurricanes, but pay out 1000 on the Crusaders wagers, leaving no net profit. Similarly, if the Hurricanes won then the bookmaker would pocket the 2000 wagered on the Crusaders but pay out 2000 on the Hurricanes wagers, again leaving no net profit. The bookmaker’s expected profit based on either outcome is zero. This does not represent a very lucrative scenario for a bookmaker. But, after all, this is a fair game.
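The bookmaker's position can be checked with a few lines of R. This is a rough sketch using the stakes assumed in the paragraph above (the variable names `stake` and `profit` are mine): for each possible winner, the profit is the total of all stakes taken in less the gross payout on the winning side.

```r
odds  <- c(Crusaders = 1.5, Hurricanes = 3)        # Decimal odds
stake <- c(Crusaders = 2000, Hurricanes = 1000)    # Total wagered on each team

# Bookmaker's profit if the named team wins: everything wagered,
# minus the gross payout (stake plus winnings) on that team.
profit <- sum(stake) - odds * stake
profit
```

Both entries are zero, which is what a fair game looks like from the bookmaker's side.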

From a punter’s perspective, a wager on the Crusaders is more likely to be successful, but is not particularly rewarding. By contrast, the likelihood of a wager on the Hurricanes paying out is lower, but the potential reward is appreciably higher. The choice of a side to bet on would then be dictated by the punter’s appetite for risk and excitement (or perhaps simply their allegiance to one team or the other).

The expected outcome of a wager on either the Crusaders or the Hurricanes, which weights each payout by its likelihood, is zero.

```> (probability = c(win = 2, lose = 1) / 3)                 # Wager on Crusaders
win    lose
0.66667 0.33333
> payout = c(win = 0.5, lose = -1)
> sum(probability * payout)
[1] 0
> (probability = c(win = 1, lose = 2) / 3)                 # Wager on Hurricanes
win    lose
0.33333 0.66667
> payout = c(win = 2, lose = -1)
> sum(probability * payout)
[1] 0
```

Again this is because the odds represent a fair game.

Most games of chance are not fair, so the situation above represents a special (and not very realistic) case. Let’s look at a second example which presents the actual odds being quoted by a real bookmaker.

### Example: Real Odds on Tennis

The odds below are from an online betting website for the tennis match between Madison Keys and Venus Williams. These are real, live odds and the implications for the player and the bookmaker are slightly different.

We’ll focus our attention on the overall winner, for which the decimal odds on Madison Keys are 1.83, while those on Venus Williams are 2.00.

```> (odds = c(Madison = 1.83, Venus = 2.00))
Madison   Venus
1.83    2.00
> (probability = 1 / odds)
Madison   Venus
0.54645 0.50000
```

The first thing you’ll notice is that the implied probabilities do not sum to 1. We’ll return to this point in the next article.

The odds quoted for each player are very similar, which implies that the bookmaker considers these players to be evenly matched. Madison Keys has slightly lower odds, which suggests that she is a slightly stronger contender. A wager on either player will not yield major rewards because of the low odds. However, at the same time, a wager on either player has a similar probability of being successful: both around 50%.
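To put a number on the observation that the implied probabilities do not sum to 1, here is a small sketch. The variable names `total` and `excess` are mine, and no interpretation of the excess is attempted here.

```r
odds <- c(Madison = 1.83, Venus = 2.00)    # Decimal odds quoted above
probability <- 1 / odds                    # Implied probabilities

total  <- sum(probability)                 # Would be exactly 1 in a fair game
excess <- total - 1                        # Amount by which the sum exceeds 1
round(c(total = total, excess = excess), 5)
```

The sum comes to roughly 1.046, so the implied probabilities overshoot 1 by a little under 5%.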

Let’s look at another match. Below are the odds from the same online betting website for the game between Novak Djokovic and Radek Stepanek.

The odds for this game are profoundly different to those for the ladies’ match above.

```> (odds = c(Novak = 1.03, Radek = 16.00))
Novak Radek
1.03 16.00
> (probability = 1 / odds)
Novak   Radek
0.97087 0.06250
```

Novak Djokovic is considered to be the almost certain winner. A wager on him thus has the potential to produce only 3% winnings. Radek Stepanek, on the other hand, is a rank outsider in this match. His perceived chance of winning is low. As a result, the potential returns should he win are large.
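The contrast in potential returns is easy to quantify. As a quick sketch, assuming a stake of 100 on each player (the stake is an arbitrary choice for illustration):

```r
odds  <- c(Novak = 1.03, Radek = 16.00)    # Decimal odds quoted above
stake <- 100                               # Assumed stake, for illustration only

# Net winnings on a successful wager (gross payout less the stake).
net <- (odds - 1) * stake
net
```

A successful wager on Djokovic nets just 3 on a stake of 100, while the same wager on Stepanek would net 1500.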

To find out more about converting between different forms of odds and the corresponding implied probabilities, have a look at this tutorial.

In the next instalment we’ll examine how bookmakers’ odds ensure their profit yet provide a potentially rewarding (and entertaining) experience for gamblers.

# Arthur Benjamin: Teach statistics before calculus!

Arthur Benjamin thinks that the end goal of teaching Mathematics at school should be Statistics rather than Calculus. He has a point: in terms of understanding things in the real world, Statistics is definitely more powerful. These ideas are quite compatible with those of Conrad Wolfram, who thinks that we should be using computers more extensively in Mathematics education.

The mathematics curriculum that we have is based on a foundation of arithmetic and algebra. And everything we learn after that is building up towards one subject. And at top of that pyramid, it’s calculus. And I’m here to say that I think that that is the wrong summit of the pyramid… that the correct summit, that all of our students, every high school graduate should know, should be statistics: probability and statistics.
Arthur Benjamin

# Building a Life Table

After writing my previous post, Mortality by Year and Age, I’ve become progressively more interested in the mortality data. Perhaps those actuaries are onto something? I found this report, which has a wealth of pertinent information. On p. 13 the report gives details on constructing a Life Table, which is one of the fundamental tools in Actuarial Science. The lifespan package has all of the data required to construct a Life Table, so I created a `lifetable` data frame which has those data broken down by gender.

```> library(lifespan)
> subset(lifetable, sex == "M") %>% head
x sex     lx      dx         qx
1 0   M 100000 596.534 0.00596534
2 1   M  99403 256.848 0.00258389
3 2   M  99147 174.077 0.00175575
4 3   M  98973 114.213 0.00115398
5 4   M  98858  83.082 0.00084041
6 5   M  98775  71.536 0.00072423
> subset(lifetable, sex == "F") %>% head
x sex     lx      dx         qx
133 0   F 100000 452.585 0.00452585
134 1   F  99547 203.525 0.00204450
135 2   F  99344 130.223 0.00131083
136 3   F  99214  84.746 0.00085418
137 4   F  99129  62.055 0.00062600
138 5   F  99067  54.475 0.00054988
```

The columns in the data above should be interpreted as follows:

• `lx` is the number of people who have survived to age `x`, based on an initial cohort of 100 000 people;
• `dx` is the expected number of people in the cohort who die aged `x` on their last birthday; and
• `qx` is the probability that someone aged `x` will die before reaching age `x+1`.
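These three columns are tied together: `qx` is `dx` divided by `lx`, and the survivors to age `x+1` are the survivors to age `x` less the deaths at age `x`. The sketch below checks this on the first six rows of the male table, copied from the output above. Because `lx` is printed as a rounded number, the agreement is only expected to hold to within rounding; the column names `qx_check` and `lx_next` are mine.

```r
# First six rows of the male life table, copied from the output above.
males <- data.frame(
  x  = 0:5,
  lx = c(100000, 99403, 99147, 98973, 98858, 98775),
  dx = c(596.534, 256.848, 174.077, 114.213, 83.082, 71.536),
  qx = c(0.00596534, 0.00258389, 0.00175575, 0.00115398, 0.00084041, 0.00072423)
)

# qx is the fraction of those alive at age x who die before age x + 1.
males$qx_check <- males$dx / males$lx
# Survivors to age x + 1 are survivors to age x less deaths at age x.
males$lx_next <- males$lx - males$dx

males
```

The recomputed `qx_check` agrees with the tabulated `qx` to about six decimal places, and each `lx_next` is within one person of the `lx` on the following row.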

A plot gives a high level overview of the data. Below `lx` is plotted as a function of age. Click on the image to access an interactive Plotly version. The cohort size has been renormalised so that `lx` is expressed as a percent. It’s readily apparent that the attrition rate is much higher for males than females, and that very few people survive beyond the age of 105.

Using these data we can also calculate some related conditional probabilities. For example, what is the probability that a person aged 70 will live for at least another 5 years?

```> survival(70, 5)
F       M
0.87916 0.80709
```

As another example, what is the probability that a person aged 70 will live for at least another 5 years but then die in the 10 years after that?

```> survival(70, 5, 10)
F       M
0.37472 0.46714
```
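The implementation of `survival()` is not shown here, but both kinds of probability can be computed directly from the `lx` column. The following is a hedged sketch of one way to do it, not the function used above: `survival_sketch()` is a hypothetical helper, and since only ages 0 to 5 appear in the output earlier, the example uses those male values rather than age 70.

```r
# lx for males aged 0 to 5, copied from the life table output above.
lx <- c(100000, 99403, 99147, 98973, 98858, 98775)
names(lx) <- 0:5

# Hypothetical helper (NOT the survival() function used above).
#
# Probability that someone aged x survives at least n more years and,
# if m is supplied, then dies within the following m years.
survival_sketch <- function(lx, x, n, m = NULL) {
  alive <- function(age) lx[[as.character(age)]]
  if (is.null(m)) {
    alive(x + n) / alive(x)
  } else {
    (alive(x + n) - alive(x + n + m)) / alive(x)
  }
}

survival_sketch(lx, 0, 5)       # Newborn survives to age 5
survival_sketch(lx, 1, 2, 2)    # Aged 1: survives 2 years, then dies within 2 more
```

The idea is simply that everything is conditioned on being alive at age `x`, so each probability is a ratio with `lx` at age `x` in the denominator.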

Interesting stuff! Everything indicates that in terms of longevity, females have the upper hand.

Somebody made the following witty comment on LinkedIn in response to my previous post:

Just good to know that death risk visibly decreases after 100y/o. This helps.

Well, yes and no. In an absolute sense your risk of dying after the age of 100 is relatively low. But the reason for this is that the probability of you actually making it to the age of 100 is low. If, however, you do manage to achieve this monumental age, then your risk of dying is rather high.

```> survival(100)
F       M
0.65813 0.60958
```

So men aged 100 have a 39% probability of dying before reaching the age of 101, while the probability for women is 34%.
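Those percentages are just the complements of the survival probabilities. A one-line check, with the values copied from the output above:

```r
survive <- c(F = 0.65813, M = 0.60958)    # Output of survival(100) above
die <- 1 - survive                        # Probability of dying before age 101
round(100 * die)                          # As whole percentages
```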

Note that there are also life table data in the babynames package.

# Calculating Pi using Buffon’s Needle

I put together this example to illustrate some general R programming principles for my Data Science class at iXperience. The idea is to use Buffon’s Needle to generate a stochastic estimate for pi.

```> #' Exploit symmetry to limit range of centre position and angle.
> #'
> #' @param l needle length.
> #' @param t line spacing.
> #'
> buffon <- function(l, t) {
+   # Sample the location of the needle's centre.
+   #
+   x <- runif(1, min = 0, max = t / 2)
+   #
+   # Sample angle of needle with respect to lines.
+   #
+   theta = runif(1, 0, pi / 2)
+   #
+   # Does the needle cross a line?
+   #
+   x <= l / 2 * sin(theta)
+ }
>
> L = 1
> T = 2
> #
> N = 10000
> #
> cross = replicate(N, buffon(L, T))
>
> library(dplyr)
> #
> estimates = data.frame(
+   n = 1:N,
+   pi = 2 * L / T / cumsum(cross) * (1:N)
+ ) %>% subset(is.finite(pi))
```

Here are the results (click on the image for an interactive version). The orange line is the reference value and the blue line represents the results of the computation.
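For a needle no longer than the line spacing, the probability of crossing a line is 2l / (t π), which is why the proportion of crossings yields an estimate of pi. As an alternative sketch (not the code used for the plot), the same experiment can be vectorised so that all needles are sampled at once. The function name `buffon_pi()` and the seed are arbitrary choices of mine, and, like the code above, it uses `pi` itself to sample the angle.

```r
set.seed(13)                        # Arbitrary seed, for reproducibility only

# Hypothetical vectorised variant: drop n needles in one go.
buffon_pi <- function(n, l = 1, t = 2) {
  stopifnot(l <= t)                 # The formula below assumes a short needle
  x     <- runif(n, 0, t / 2)       # Distance from needle centre to nearest line
  theta <- runif(n, 0, pi / 2)      # Acute angle between needle and lines
  cross <- x <= l / 2 * sin(theta)  # Does each needle cross a line?
  # P(cross) = 2 l / (t pi), so pi is roughly 2 l / (t * proportion crossing).
  2 * l / (t * mean(cross))
}

estimate <- buffon_pi(1e6)
estimate
```

Sampling a million needles in a single call avoids the overhead of `replicate()`, at the cost of only giving the final estimate rather than the running sequence.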