PLOS Subject Keywords: Gathering Data

I’m putting together a couple of articles on Collaborative Filtering and Association Rules. Naturally, the first step is finding suitable data for illustrative purposes.

There are a number of standard data sources for these kinds of analyses:

I’d like to do something different though, so instead of using one of these, I’m going to build a data set based on subject keywords from articles published in PLOS journals. This has the advantage of presenting an additional data construction pipeline and the potential for revealing something new and interesting.

Before we get started, let’s establish some basic nomenclature.


Data used in the context of Collaborative Filtering or Association Rules analyses are normally thought of in the following terms:

Item – a “thing” which is rated.
User – a “person” who either rates one or more Items or consumes ratings for Items.
Rating – the evaluation of an Item by a User (can be a binary, integer or real valued rating or simply whether or not the User has interacted with the Item).
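To make these terms concrete, here is a minimal sketch of how a handful of Ratings can be arranged into a User by Item matrix. The data are made up purely for illustration, and the names `interactions` and `ratings` are hypothetical.

```r
# Toy data (made up for illustration): one row per User/Item pair.
interactions <- data.frame(
  user   = c("alice", "alice", "bob", "carol", "carol"),
  item   = c("A", "B", "A", "B", "C"),
  rating = c(5, 3, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Pivot into a User x Item matrix. Pairs with no interaction show up as 0 here;
# a real analysis would usually treat them as missing rather than as a zero rating.
ratings <- xtabs(rating ~ user + item, data = interactions)
ratings
```

Each row is then a User's profile and each column an Item's profile, which is the shape that most Collaborative Filtering methods work from.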

Sample Article

We’re going to retrieve a load of data from PLOS. But, just to set the scene, let’s start by looking at a specific article, Age and Sex Ratios in a High-Density Wild Red-Legged Partridge Population, recently published in PLOS ONE. You’ll notice that the article is open access, so you can immediately download the PDF (no paywalls here!) and access a wide range of other data pertaining to the article. There’s a list of subject keywords on the right. This is where we will be focusing most of our attention, although we’ll also retrieve DOI, authors, publication date and journal information for good measure.


We’ll be using the rplos package to access data via the PLOS API. A search through the PLOS catalog is initiated using searchplos(). To access the article above we’d just specify the appropriate DOI using the q (query) argument, while the fields in the result are determined by the fl argument.

> library(rplos)
> partridge <- searchplos(q = "id:10.1371/journal.pone.0159765",
+                         fl = 'id,author,publication_date,subject,journal')$data

The journal, publication date and author data are easy to consume.

> partridge$id
[1] "10.1371/journal.pone.0159765"
> partridge[, 3:5]
   journal     publication_date                                       author
1 PLOS ONE 2016-08-10T00:00:00Z Jesús Nadal; Carolina Ponz; Antoni Margalida

The subject keywords are conflated into a single string, making them more difficult to digest.

> partridge$subject %>% cat
/Biology and life sciences/Population biology/Population dynamics;
/Ecology and environmental sciences/Conservation science;
/Earth sciences/Atmospheric science/Meteorology/Weather;
/Earth sciences/Atmospheric science/Meteorology/Rain;
/Ecology and environmental sciences/Ecology/Community ecology/Trophic interactions/Predation;
/Biology and life sciences/Ecology/Community ecology/Trophic interactions/Predation;
/Biology and life sciences/Population biology/Population metrics/Population density;
/Biology and life sciences/Organisms/Animals/Vertebrates/Amniotes/Birds;
/Biology and life sciences/Organisms/Animals/Vertebrates/Amniotes/Birds/Fowl/Gamefowl/Partridges

Here’s an extract from the documentation about subject keywords which helps make sense of that.

The Subject Area terms are related to each other with a system of broader/narrower term relationships. The thesaurus structure is a polyhierarchy, so for example the Subject Area “White blood cells” has two broader terms “Blood cells” and “Immune cells”. At its deepest the hierarchy is ten tiers deep, with all terms tracking back to one or more of the top tier Subject Areas, such as “Biology and life sciences” or “Social sciences.”

We’ll use the most specific terms in each of the subjects. It’d be handy to have a function to extract these systematically from a bunch of articles.

> library(dplyr)
> options(stringsAsFactors = FALSE)
> split.subject <- function(subject) {
+   data.frame(subject = sub(".*/", "", strsplit(subject, "; ")[[1]]),
+              stringsAsFactors = FALSE) %>%
+     group_by(subject) %>%
+     summarise(count = n()) %>%
+     ungroup
+ }

So, for the article above we get the following subjects:

> split.subject(partridge$subject)
# A tibble: 8 x 2
               subject count
                 <chr> <int>
1                Birds     1
2 Conservation science     1
3           Partridges     1
4   Population density     1
5  Population dynamics     1
6            Predation     2
7                 Rain     1
8              Weather     1

Those tie up well with what we saw on the home page for the article. We see that all of the terms except Predation appear only once. There are two entries for Predation, one in category “Ecology and environmental sciences” and the other in “Biology and life sciences”. We can’t really interpret these entries as ratings. They should rather be thought of as interactions. At some stage we might transform them into Boolean values, but for the moment we’ll leave them as interaction counts.
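Should Boolean values be needed later, the transformation is a one-liner. The sketch below rebuilds the counts shown above by hand (the name `partridge_subjects` is hypothetical) and collapses them to a simple presence flag.

```r
# Interaction counts for the sample article, copied from the output above.
partridge_subjects <- data.frame(
  subject = c("Birds", "Conservation science", "Partridges", "Population density",
              "Population dynamics", "Predation", "Rain", "Weather"),
  count = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Any positive count means the subject is associated with the article, so the
# double entry for Predation collapses to a single TRUE.
partridge_subjects$present <- partridge_subjects$count > 0
partridge_subjects
```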

Some data is collected explicitly, perhaps by asking people to rate things, and some is collected casually, for example by watching what people buy.
Toby Segaran, Programming Collective Intelligence

Article Collection

We’ll need a lot more data to do anything meaningful. So let’s use the same infrastructure to grab a much larger collection of articles.

> dim(articles)
[1] 185984      5

Parsing the subject column and aggregating the results we get a data frame with counts of the number of times a particular subject keyword is associated with each article.

> subjects <- lapply(1:nrow(articles), function(n) {
+   cbind(doi = articles$id[n],
+         journal = articles$journal[n],
+         split.subject(articles$subject[n]))
+ }) %>% bind_rows %>% mutate(
+   subject = factor(subject)
+ )
> dim(subjects)
[1] 1433963       4

Here are the specific data for the article above:

> subset(subjects, doi == "10.1371/journal.pone.0159765")
                                 doi  journal              subject count
1425105 10.1371/journal.pone.0159765 PLOS ONE                Birds     1
1425106 10.1371/journal.pone.0159765 PLOS ONE Conservation science     1
1425107 10.1371/journal.pone.0159765 PLOS ONE           Partridges     1
1425108 10.1371/journal.pone.0159765 PLOS ONE   Population density     1
1425109 10.1371/journal.pone.0159765 PLOS ONE  Population dynamics     1
1425110 10.1371/journal.pone.0159765 PLOS ONE            Predation     2
1425111 10.1371/journal.pone.0159765 PLOS ONE                 Rain     1
1425112 10.1371/journal.pone.0159765 PLOS ONE              Weather     1

Note that we’ve delayed the conversion of the subject column into a factor until all of the required levels were known.

In future instalments I’ll be looking at analyses of these data using Association Rules and Collaborative Filtering.

Certificates with Ubuntu Guest on VirtualBox

Just made a sparkling new install of Ubuntu 16.04 in VirtualBox on a Windows machine. Feeling much better about having a decent environment to work in, but somewhat irritated by the certificate issues I encountered when I started to run the browser. The machine in question is sitting behind a gnarly firewall and proxy, which I suspect are the source of the problem.

Even more irritated by the fact that I don’t seem to be able to install R packages from GitHub.

> devtools::install_github("ropensci/RSelenium")
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Peer certificate cannot be authenticated with given CA certificates

Clearly there’s a problem with certificates. I also need to use --ignore-certificate-errors when running chromium-browser, which points to the same issue.

Here’s a work-around. It does not resolve the general issue with certificates, but at least allows installs from GitHub in R.

> httr::set_config(httr::config(ssl_verifypeer = 0L))

Note that this is by no means a fix to the problem, but it’ll allow you to get your work done while you’re getting your hands on the required certificates to properly sort it out.

Sportsbook Betting (Part 3): Evolving Odds


In previous instalments in this series I have not taken into account how odds can change over time. There are two main reasons for such a change:

  1. a significant variation in the distribution of bets being placed on the various outcomes of the event (and the bookmakers thus trying to “balance” their books); and
  2. other occurrences which have a direct effect on the probable outcome of the event.

The first of these is difficult to examine since bookmakers generally do not reveal the required data. The second is more accessible. We’ll consider one particular example.

Olympic Women’s 800 metre Race

We’ll take a look at data for the women’s 800 metre race at the 2016 Olympic Games in Rio de Janeiro. Again the odds were scraped from Oddschecker using the gambleR package. I set up a batch job to grab those odds at 10 minute intervals. In retrospect that was overkill since the odds were static over much longer time scales. However, in principle, the odds for an event might change almost continuously as new information becomes available.

The plot below reflects how the bookmakers’ odds for the various athletes in contention for this event changed from the time that I started logging data on 15 August 2016 through to the final event on 20 August 2016. There were some problems with the scraping job on 15 and 16 August, which accounts for the periods of scarce data. Also there were periods before the heats as well as after the heats and the semi-finals where no odds data were available. Since there was a high degree of overplotting I have jittered the data to make the individual traces visible.


The vertical dashed lines indicate the time of the heats (10:55 on 17 August 2016), the semi-finals (21:15 on 18 August 2016) and the final (21:15 on 20 August 2016). All times are in UTC-3, the local time zone in Rio de Janeiro.

A total of 64 athletes took part in eight heats, after which the field was reduced to 24 athletes. These remaining athletes competed in three semi-finals to leave a field of only 8 athletes for the final. The phenomenal Caster Semenya trounced her competitors to win the final in 1:55.28.


Looking at the odds plotted above it’s clear that Semenya was the favourite to win from the start. A wager on her was almost a sure win, but the rewards were pretty small. There was some variability in the remaining athletes. After the heats and semi-finals odds were no longer quoted for those athletes eliminated from the competition. The odds against Joanna Jóźwik, an outsider prior to the competition, dropped substantially after the heats and semi-finals based on her excellent performance in both. The odds against Margaret Wambui also dropped after the semi-finals based on her comfortable victory. The odds for the remaining athletes who competed in the finals increased somewhat after the heats and semi-finals.

It’s apparent from the stepwise revisions in the odds in this event that they are not being continuously adjusted to take into account changes in the betting preferences of punters. In this case it seems that only the relative performance of the athletes in the races leading up to the final event had any influence on the odds.

Garmin ANT on Ubuntu

I finally got tired of booting up Windows to download data from my Garmin 910XT. I tried to get my old Ubuntu 15.04 system to recognise my ANT stick but failed. Now that I have a stable Ubuntu 16.04 system the time seems ripe.



Install openant, a Python library for downloading and uploading files from ANT-FS compliant devices.

  1. Download the zip file from
  2. Unpack the archive and install using
    sudo python setup.py install


Install antfs-cli, which implements a Command Line Interface to ANT-FS.

  1. Download the zip file from
  2. Unpack the archive and install using
    sudo python setup.py install
  3. This will automatically install pyusb if necessary.

Connect Device

Connect your ANT stick and check that it is recognised by your system.

$ lsusb | grep Dynastream
Bus 003 Device 030: ID 0fcf:1008 Dynastream Innovations, Inc. ANTUSB2 Stick

The two hexadecimal numbers following ID in the output above are then used to load the appropriate kernel module.

$ sudo modprobe usbserial vendor=0x0fcf product=0x1008

You can also check that the corresponding device has been created.

$ ls -l /dev/ttyANT2 
lrwxrwxrwx 1 root root 15 Aug 21 11:33 /dev/ttyANT2 -> bus/usb/003/030

Pair and Enjoy

If the above has gone smoothly then you are ready to grab data from your device. Turn it on and…

$ antfs-cli --pair

You should find the resulting FIT files under a path like ~/.config/antfs-cli/3860872045/activities. The numeric folder name is uniquely linked to your device, so that part of the path will differ.

If you’re like me then you’ll probably have a bunch of FIT files that need to be uploaded to Garmin Connect. Use this link and select the Manual Import tab to upload multiple files at once.

Anthony Goldbloom: The jobs we’ll lose to machines

The future state of any single job lies in the answer to a single question: To what extent is that job reducible to frequent, high-volume tasks, and to what extent does it involve tackling novel situations? On frequent, high-volume tasks, machines are getting smarter and smarter. Today they grade essays. They diagnose certain diseases. Over coming years, they’re going to conduct our audits, and they’re going to read boilerplate from legal contracts. Accountants and lawyers are still needed. They’re going to be needed for complex tax structuring, for pathbreaking litigation. But machines will shrink their ranks and make these jobs harder to come by.
Anthony Goldbloom

Sportsbook Betting (Part 2): Bookmakers’ Odds

In the first instalment of this series we gained an understanding of the various types of odds used in Sportsbook betting and the link between those odds and implied probabilities. We noted that the implied probabilities for all possible outcomes in an event may sum to more than 100%. At first sight this seems a bit odd. It certainly appears to violate the basic principles of statistics. However, this anomaly is the mechanism by which bookmakers assure their profits. A similar principle applies in a casino.

Casino House Edge

Because the true probabilities of each outcome in casino games are well defined, this is a good place to start. In a casino game a winning wager receives a payout which is not quite consistent with the game’s true odds (how this is achieved varies from game to game). As a result, casino games are not “fair” from a gambler’s perspective. If they were, then a casino would not be a very profitable enterprise! Instead every casino game is slightly biased in favour of the house. On each round a gambler still stands a chance of winning. However, over time, the effect of this bias accumulates and the gambler inevitably loses money.

Let’s look at a couple of examples. We’ll start with a super simple game.

Example: Rolling a Dice

Consider a dice game in which the player wins if the dice lands on six. The odds for this game are 5/1 and the player would expect to receive 5 times his wager if he won.

> odds.fractional = c(win = 5/1, lose = 1/5)
> (odds.decimal = odds.fractional + 1)
 win lose 
 6.0  1.2 
> (probability = 1 / odds.decimal)
    win    lose 
0.16667 0.83333 

The probability of winning is 1/6. Would a gambler expect to profit if he played this game many times?

> payout = c(5, -1)
> sum(probability * payout)
[1] 0.00

No! In the long run neither the gambler nor the casino would make money on a game like this. It’s a fair game: neither the house nor the gambler has any statistical advantage or “edge”.

If, however, the house paid out only 4 times the wager then the player’s expected profit would become

> payout = c(4, -1)
> sum(probability * payout)
[1] -0.16667

Now the game is stacked in favour of the house, since on average the player would expect to lose around 17% of his stake. Of course, on any one game the gambler would either win 4 times his stake or lose the entire stake. However, if he played the game many times then on average he would lose 17% of his stake per game.

The game outlined above would not represent a very attractive proposition for a gambler. Obviously a casino could not afford to be this greedy and the usual house edge in any casino game is substantially smaller. Let’s move on to a real casino game.

Example: European Roulette

A European Roulette wheel has one zero and 36 non-zero numbers (18 odd and 18 even; 18 red and 18 black), making a total of 37 positions. Consider a wager on even numbers. The number of losing outcomes is 19 (the zero is treated as neither odd nor even: it’s the “house number”!), while the number of winning outcomes is 18. So the odds against are 19/18.

> odds.fractional = c(win = 19/18, lose = 18/19)
> (odds.decimal = odds.fractional + 1)
   win   lose 
2.0556 1.9474 
> (probability = 1 / odds.decimal)
    win    lose 
0.48649 0.51351 

The probability of winning is 18/(19+18) = 18/37 = 0.48649. So this is almost an even money game.

Based on a wager of 1 coin, a win would result in a net profit of 1 coin, while a loss would forfeit the stake. The player’s expected outcome is then

> payout = c(1, -1)
> sum(probability * payout)
[1] -0.027027

The house edge is 2.70%. On average a gambler would lose 2.7% of his stake per game. Of course, on any one game he would either win or lose, but this is the long term expectation. Another way of looking at this is to say that the Return To Player (RTP) is 97.3%, which means that on average a gambler would get back 97.3% of his stake on every game.
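The link between expected profit, house edge and RTP can be wrapped up in a few lines. This is just a sketch: `expected_profit()` is a hypothetical helper, not something from a package.

```r
# Expected profit per unit staked, given outcome probabilities and net payouts.
expected_profit <- function(probability, payout) {
  stopifnot(length(probability) == length(payout),
            isTRUE(all.equal(sum(probability), 1)))
  sum(probability * payout)
}

# Bet on even numbers on a European wheel: 18 winning and 19 losing positions.
probability <- c(win = 18/37, lose = 19/37)
payout      <- c(win = 1, lose = -1)

house_edge <- -expected_profit(probability, payout)  # average fraction of stake lost
rtp        <- 1 - house_edge                         # Return To Player
round(100 * c(house_edge = house_edge, rtp = rtp), 2)
```

The house edge and the RTP are two views of the same quantity: they always sum to 100%.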

Below are the results of a simulation of 100 gamblers betting on even numbers. Each starts with an initial capital of 100. The red line represents the average for the cohort. After 1000 games two gamblers have lost all of their money. Of the remaining 98 players, only 24 have made money while the rest have lost some portion of their initial capital.


The code for this simulation is available here.
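For a rough idea of how such a simulation might be put together, here is a minimal sketch. It is not the code behind the plot: the seed is arbitrary, every bet is assumed to be a single coin, and a gambler who runs out of money is assumed to stop playing, so the counts it produces will differ from those quoted above.

```r
set.seed(13)  # arbitrary seed, only for reproducibility

n_gamblers <- 100
n_games    <- 1000
capital0   <- 100

# Outcome of each unit bet: +1 with probability 18/37, otherwise -1.
outcomes <- matrix(
  sample(c(1, -1), n_gamblers * n_games, replace = TRUE, prob = c(18, 19) / 37),
  nrow = n_games, ncol = n_gamblers
)

# Running capital after each game (one column per gambler).
capital <- capital0 + apply(outcomes, 2, cumsum)

# Assumption: a gambler who hits zero is ruined, so freeze capital at zero.
for (j in seq_len(n_gamblers)) {
  ruined <- which(capital[, j] <= 0)
  if (length(ruined) > 0) capital[ruined[1]:n_games, j] <- 0
}

final <- capital[n_games, ]
c(ruined = sum(final == 0), ahead = sum(final > capital0), average = mean(final))
```

Plotting the columns of `capital` with `matplot()` and overlaying `rowMeans(capital)` would give a figure along the lines of the one described.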

Over-round, Vigorish and Juice

A bookmaker will aim to achieve an overall profit regardless of the outcome of the event. The general approach to doing this is to offer odds which are less than the true odds. As a result the payout on a successful wager is less than what would be mathematically dictated by the true odds. Because of the reciprocal relationship between odds and implied probabilities, this means that the corresponding implied probabilities are inflated. The margin by which the implied probabilities exceed 100% is known as the “over-round” (also vigorish or juice). The over-round determines the profit margin of the bookmaker. Bookmakers with a lower over-round also have a lower profit margin and hence offer a more equitable proposition to gamblers.

Since sports betting involves humans, there is no deterministic edge to the house or the gambler.

It’s useful to consider what we mean by “true odds” in the context of Sportsbook. Clearly for a casino game these odds can be calculated precisely (though with various degrees of difficulty, depending on the game). However, in Sportsbook the actual odds of each outcome cannot be known with great precision. This is simply a consequence of the fact that the events involve humans, and we are notoriously unpredictable.

Do bookmakers even care about the true odds? Not really. They are mostly just interested in offering odds which will provide them with an assured overall profit on an event.

There are a number of factors which contribute to determining the odds used in Sportsbook. Obviously there’s serious domain knowledge involved in deriving the initial odds on offer. But over time these odds should evolve to take into account the overall distribution of bets placed on the various outcomes (something like the wisdom of the crowd). It has been suggested that, as a result, Sportsbook odds are similar to an efficient market. Specifically, the distribution of wagers affects the odds, with the odds on the favourite getting smaller while those on the underdog(s) get larger. Eventually the odds will settle at values which reflect the market’s perceived probability of the outcome of the event.

Rather, the odds are designed so that equal money is bet on both sides of the game. If more money is bet on one of the teams, the sports book runs the risk of losing money if that team were to win.

Example: Horse Racing with a Round Book

A bookmaker is offering fractional odds of 4/1 (or 5 decimal odds) on each horse in a five horse race. The implied probability of each horse winning is 20%. If the bookmaker accepted the same volume of wagers on each horse then he would not make any money since the implied probabilities sum to 100%. This is known as a “round” book.

From a gambler’s perspective, a wager of 10 on any one of the horses would have an expected return of zero. From the bookmaker’s perspective, if he accepted 100 in wagers on each horse, then he would profit 400 on the losing horses and pay out 400 on the winning horse, yielding zero net profit.

Since the expected return is zero, this represents a fair game. However, such odds would never obtain in practice: the bookmaker always stands to make money. Enter the over-round.

Example: Horse Racing with Over-Round

If the bookmaker offered fractional odds of 3/1 (or 4.0) on each horse, then the implied probabilities would change from 20% to 25%. Summing the implied probabilities gives 125%, which is 25% over-round.

If the bookmaker accepted 100 in wagers on each horse, then he would profit 400 on the losing horses and pay out only 300 on the winning horse, yielding a net profit of 100.
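The arithmetic for both horse racing examples can be checked with a small helper. This is only a sketch, and `book_profit()` is a hypothetical name.

```r
# Bookmaker's net profit for each possible winner: all stakes taken in, minus
# the amount (stake plus winnings) returned to the backers of the winning horse.
book_profit <- function(stakes, decimal_odds) {
  sum(stakes) - stakes * decimal_odds
}

stakes <- rep(100, 5)            # 100 wagered on each of five horses

book_profit(stakes, rep(5, 5))   # round book at 4/1: 0 whichever horse wins
book_profit(stakes, rep(4, 5))   # over-round book at 3/1: 100 whichever horse wins
```

With uneven stakes the profit would depend on which horse wins, which is why a bookmaker prefers a balanced book.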

Enough hypothetical examples, let’s look at something real.

Example: Champions League

It’s been suggested that football squad prices can influence Sportsbook odds. Often the richer the franchise, the more likely it is that a club will prevail in the sport. This is supposed to be particularly true in European club football. We’ll try to validate this idea by scraping the data provided by Forbes for football club values.

> library(rvest)
> library(dplyr)
> clubs <- read_html("") %>%
+   html_nodes("table") %>% .[[1]] %>% html_table() %>% .[, c(2, 3, 4, 7)] %>%
+   setNames(c("team", "country", "value", "revenue")) %>%
+   mutate(
+     value = as.integer(sub(",", "", value)),
+     team = gsub("\\.", "", team)
+     )
> head(clubs)
               team country value revenue
1       Real Madrid   Spain  3650     694
2         Barcelona   Spain  3320     570
3 Manchester United England  3315     645
4     Bayern Munich Germany  2680     675
5           Arsenal England  2020     524
6   Manchester City England  1920     558

Well, those tabular data are great, but a visualisation would be helpful to make complete sense of the relationship between team value and revenue.


It’s apparent that Real Madrid, Barcelona, Manchester United and Bayern Munich are the four most expensive teams. There’s a general trend of increasing revenue with increasing value. Two conspicuous exceptions are Schalke 04 and Paris Saint-Germain, which produce revenues far higher than expected based on their values.

Although not reflected in the plot above, there’s a relationship between the value of the team and its performance. With only a few exceptions the previously mentioned four teams have dominated the Champions League in recent years. Does this make sense? The richest teams are able to attract the most talented players. The resulting pool of talent increases their chances of winning. This in turn translates into revenue and the cycle is complete.

We’ll grab the bookmakers’ odds for the Champions League.

> library(gambleR)
> champions.league = oddschecker("football/champions-league/winner")
> head(champions.league[, 11:18])
              Ladbrokes Coral William Hill Winner Betfair Sportsbook BetBright Unibet Bwin
Barcelona           3/1   3/1         10/3    3/1                3/1       7/2    3/1 10/3
Bayern Munich       4/1   5/1          4/1    4/1                4/1       4/1    9/2  4/1
Real Madrid         5/1   5/1          4/1    9/2                9/2       9/2    5/1  5/1
Man City           12/1  12/1         11/1   10/1               10/1      12/1   12/1 12/1
Juventus           12/1  14/1         12/1   12/1               10/1      14/1    8/1 12/1
PSG                14/1  14/1         14/1   14/1               14/1      14/1   12/1 14/1

According to the selection of bookmakers above, Barcelona, Bayern Munich and Real Madrid are the major contenders in this competition. Betfair Sportsbook has Barcelona edging the current champions Real Madrid as favourites to win the competition. Bayern Munich and Real Madrid have slightly higher odds, with Bayern Munich perceived as the second most likely winner.
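Fractional odds quoted as strings need to be converted before doing any arithmetic on them. A minimal sketch of the conversion, using a hypothetical helper `fractional_to_decimal()` and the Betfair Sportsbook odds for the top three teams in the table above:

```r
# Convert fractional odds given as strings (e.g. "10/3") to decimal odds.
fractional_to_decimal <- function(fractional) {
  parts <- strsplit(fractional, "/", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]) + 1)
}

# Betfair Sportsbook odds for the first three teams in the table above.
betfair <- c(Barcelona = "3/1", `Bayern Munich` = "4/1", `Real Madrid` = "9/2")

decimal <- fractional_to_decimal(betfair)
decimal
1 / decimal   # implied probabilities
```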

The decimal odds on offer at Betfair Sportsbook are

> champions.decimal[, 15]
        Barcelona     Bayern Munich       Real Madrid          Man City          Juventus 
              4.0               5.0               5.5              11.0              11.0 
              PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla 
             15.0              17.0              26.0              26.0              51.0 
        Tottenham            Napoli         Leicester              Roma           Benfica 
             41.0              67.0              67.0             101.0             101.0 
            Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon 
            151.0              67.0             101.0             151.0             201.0 
              PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel 
            201.0             201.0             251.0             251.0             301.0 
      Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade 
            501.0             501.0                NA                NA                NA 

The corresponding implied probabilities are

> champions.probability[, 15]
        Barcelona     Bayern Munich       Real Madrid          Man City          Juventus 
        0.2500000         0.2000000         0.1818182         0.0909091         0.0909091 
              PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla 
        0.0666667         0.0588235         0.0384615         0.0384615         0.0196078 
        Tottenham            Napoli         Leicester              Roma           Benfica 
        0.0243902         0.0149254         0.0149254         0.0099010         0.0099010 
            Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon 
        0.0066225         0.0149254         0.0099010         0.0066225         0.0049751 
              PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel 
        0.0049751         0.0049751         0.0039841         0.0039841         0.0033223 
      Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade 
        0.0019960         0.0019960                NA                NA                NA 

These sum to 1.178, giving an over-round of 17.8%.
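As a quick check, the over-round can be recomputed directly from the decimal odds listed above. `over_round()` is a hypothetical helper; teams without a quote (`NA`) are simply ignored.

```r
# Over-round: the amount by which the implied probabilities exceed 1 (100%).
over_round <- function(decimal_odds) {
  sum(1 / decimal_odds, na.rm = TRUE) - 1
}

# Betfair Sportsbook decimal odds, copied from the output above.
betfair_decimal <- c(
  4, 5, 5.5, 11, 11, 15, 17, 26, 26, 51,
  41, 67, 67, 101, 101, 151, 67, 101, 151, 201,
  201, 201, 251, 251, 301, 501, 501, NA, NA, NA
)

round(over_round(betfair_decimal), 3)
```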

Let’s focus on a football game between Anderlecht and Rostov. These are not major contenders, but they faced off last Wednesday (3 August 2016), so the data are readily available.

Example: Anderlecht versus Rostov

The odds for the football match between Anderlecht and Rostov are shown below.


The match odds are 2.0 for a win by Anderlecht, 4.1 for a win by Rostov and 3.55 for a draw. Let’s convert those to the corresponding implied probabilities:

> decimal.odds = c(anderlecht = 2.0, rostov = 4.1, draw = 3.55)
> 1 / decimal.odds
anderlecht     rostov       draw 
   0.50000    0.24390    0.28169 

According to those odds the implied probabilities of each of the outcomes are 50%, 24.4% and 28.2% respectively.

> sum(1 / decimal.odds)
[1] 1.0256

Summing those probabilities gives an over-round of 2.6%, which is very competitive. However, including the 5% commission levied by Betfair, this increases to 7.6%.

Although Anderlecht were the favourites to win this game, it turns out that Rostov had a convincing victory.


The same principles apply when there are many possible outcomes for an event.

Example: Horse Racing (18:20 at Stratford)

I scraped the odds for the 18:20 race at Stratford on 28 June 2016 from oddschecker. Here are the data for nine bookmakers.

> odds[, 1:9]
                 Bet Victor Betway Marathon Bet Betdaq Bet 365 Ladbrokes Sky Bet 10Bet 188Bet
Deauville Dancer        6/4   13/8         13/8    7/5     6/4       6/4     6/4   6/4    6/4
Cest Notre Gris         7/4    7/4          7/4    7/4     7/4       7/4     7/4   7/4    7/4
Ross Kitty             15/2    7/1          7/1   41/5     7/1       7/1     7/1   7/1    7/1
Amber Spyglass         12/1   12/1         12/1   68/5    12/1      12/1    11/1  11/1   11/1
Venture Lagertha       20/1   22/1         20/1   89/5    16/1      20/1    20/1  20/1   20/1
Lucky Thirteen         22/1   22/1         20/1   21/1    22/1      20/1    22/1  20/1   20/1
Overrider              25/1   20/1         25/1   22/1    25/1      20/1    22/1  22/1   22/1
Kims Ocean             28/1   25/1         25/1   21/1    22/1      25/1    28/1  25/1   25/1
Rizal Park             80/1   66/1         50/1   82/1    80/1      66/1    50/1  66/1   66/1
Chitas Gamble         250/1  200/1        100/1  387/1   250/1     125/1   125/1 150/1  150/1
Irish Ranger          250/1  200/1        100/1  387/1   250/1     150/1   125/1 150/1  150/1

The decimal odds on offer at Bet Victor are

> decimal.odds[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha 
            2.50             2.75             8.50            13.00            21.00 
  Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble 
           23.00            26.00            29.00            81.00           251.00 
    Irish Ranger 
          251.00 

The corresponding implied probabilities are

> probability[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha 
       0.4000000        0.3636364        0.1176471        0.0769231        0.0476190 
  Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble 
       0.0434783        0.0384615        0.0344828        0.0123457        0.0039841 
    Irish Ranger 
       0.0039841 

The total implied probability per bookmaker is

> sort(colSums(probability))
       Bet Victor            Betway       Marathon Bet            Betdaq           Bet 365 
           1.1426            1.1444             1.1581            1.1623            1.1701 
        Ladbrokes           Sky Bet              10Bet            188Bet         Netbet UK 
           1.1764            1.1765             1.1773            1.1773            1.1773 
      Boylesports            Winner       William Hill        Stan James           Betfair 
           1.1797            1.1861             1.1890            1.1895            1.1935 
            Coral          RaceBets Betfair Sportsbook         BetBright       Sportingbet 
           1.1964            1.2003             1.2173            1.2229            1.2288 
          Betfred         Totesport          32Red Bet          888sport       Paddy Power 
           1.2303            1.2303             1.2392            1.2392            1.2636 

It’s obvious that there is a wide range of value being offered by various bookmakers, extending from the competitive Bet Victor and Betway with an over-round of around 14% to the substantial over-round of 26% at Paddy Power.

From a gambler’s point of view, the best value is obtained by finding the bookmaker who is offering the largest odds for a particular outcome. It’s probable that this bookmaker will also have a relatively low over-round. Sites like oddschecker make it a simple matter to check the odds on offer from a range of bookmakers. If you have the time and patience it might even be possible to engage in betting arbitrage.
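A simple way to test for an arbitrage opportunity is to take the best odds on offer for each horse and sum the corresponding implied probabilities: a total below 1 would mean that backing every horse at its best price locks in a profit. The sketch below does this for just three of the bookmakers in the table above, so it is an illustration of the idea rather than a full analysis; `to_decimal()` is a hypothetical helper.

```r
# Fractional odds for three bookmakers, copied from the table above
# (rows are the eleven horses, in the same order).
odds <- cbind(
  "Bet Victor" = c("6/4", "7/4", "15/2", "12/1", "20/1", "22/1",
                   "25/1", "28/1", "80/1", "250/1", "250/1"),
  "Betway"     = c("13/8", "7/4", "7/1", "12/1", "22/1", "22/1",
                   "20/1", "25/1", "66/1", "200/1", "200/1"),
  "Betdaq"     = c("7/5", "7/4", "41/5", "68/5", "89/5", "21/1",
                   "22/1", "21/1", "82/1", "387/1", "387/1")
)

# Convert "a/b" strings to decimal odds.
to_decimal <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]) + 1, numeric(1))
}
decimal <- apply(odds, 2, to_decimal)

book_totals <- colSums(1 / decimal)   # total implied probability per bookmaker
best        <- apply(decimal, 1, max) # best available odds for each horse
best_total  <- sum(1 / best)          # below 1 would signal an arbitrage

book_totals
best_total
```

Shopping around can never do worse than the single most competitive bookmaker, but for these three the combined book still sums to more than 1, so there is no arbitrage here.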

Animated Mortality

Kyle Walker’s pyramid plots gave me a serious case of visualisation envy. Here’s something similar using the mortality data from the lifespan package.

Animated mortality pyramid plot.

The change in the mortality profile from year to year over two decades is evident. There are unmistakable peaks which propagate up the plot, corresponding to babies born in 1943 and 1947, around the start and just after the Second World War.

feedeR: Reading RSS and Atom Feeds from R

I’m working on a project in which I need to systematically parse a number of RSS and Atom feeds from within R. I was somewhat surprised to find that no package currently exists on CRAN to handle this task. So this presented the opportunity for a bit of DIY.

You can find the fruits of my morning’s labour here.

Installing and Loading

The package is currently hosted on GitHub.

> devtools::install_github("DataWookie/feedeR")
> library(feedeR)

Reading an RSS Feed

Although Atom is supposed to be a better format from a technical perspective, RSS is far more widespread. The vast majority of blogs provide an RSS feed. We’ll look at the feed exposed by R-bloggers.

> rbloggers <- feed.extract("")
> names(rbloggers)
[1] "title"   "link"    "updated" "items"

There are three metadata elements pertaining to the feed.

> rbloggers[1:3]
$title
[1] "R-bloggers"

$link
[1] ""

$updated
[1] "2016-08-06 09:15:54 UTC"

The actual entries on the feed are captured in the items element. For each entry the title, publication date and link are captured. There are often more fields available for each entry, but these three are generally present.

> nrow(rbloggers$items)
[1] 8
> head(rbloggers$items, 3)
                                                              title                date
1                                                       readr 1.0.0 2016-08-05 20:25:05
2 Map the Life Expectancy in United States with data from Wikipedia 2016-08-05 19:48:53
3 Creating Annotated Data Frames from GEO with the GEOquery package 2016-08-05 19:35:45

Reading an Atom Feed

Atom feeds are definitely in the minority, but this format is still used by a number of popular sites. We’ll look at the feed from The R Journal.

> rjournal <- feed.extract("")

The same three elements of metadata are present.

> rjournal[1:3]
$title
[1] "The R Journal"

$link
[1] ""

$updated
[1] "2016-07-23 13:16:08 UTC"

Atom feeds do not appear to consistently provide the date on which each of the entries was originally published. The title and link fields are always present though!

> head(rjournal$items, 3)
                                                                                title date
1                         Heteroscedastic Censored and Truncated Regression with crch   NA
2 An Interactive Survey Application for Validating Social Network Analysis Techniques   NA
3            quickpsy: An R Package to Fit Psychometric Functions for Multiple Groups   NA
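
Because the date column can be entirely NA for an Atom feed, downstream code should not assume it is populated. The sketch below uses small hypothetical data frames standing in for the items element of two feeds (the titles and links are made up, and I’m assuming the date column is stored as POSIXct) to show one way of combining entries while keeping undated ones at the end.

```r
# Hypothetical stand-ins for the items of an RSS feed and an Atom feed.
rss_items <- data.frame(
  title = c("Post A", "Post B"),
  date  = as.POSIXct(c("2016-08-05 20:25:05", "2016-08-05 19:48:53"), tz = "UTC"),
  link  = c("https://example.com/a", "https://example.com/b"),
  stringsAsFactors = FALSE
)
atom_items <- data.frame(
  title = "Paper X",
  date  = as.POSIXct(NA, tz = "UTC"),   # Atom entry without a publication date
  link  = "https://example.com/x",
  stringsAsFactors = FALSE
)

# Combine the entries, newest first, with undated entries placed last.
items <- rbind(rss_items, atom_items)
items <- items[order(items$date, decreasing = TRUE, na.last = TRUE), ]

# Entries which can safely be used in date-based filtering.
dated <- items[!is.na(items$date), ]
```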


I’m still testing this across a selection of feeds. If you find a feed that breaks the package, please let me know and I’ll debug as necessary.

Web Scraping and “invalid multibyte string”

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page. Specifically they have been getting errors like this:

Error in utils::type.convert(out[, i], as.is = TRUE, dec = dec) :
  invalid multibyte string at '<e2>€<94>'
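
The bytes quoted in that message are worth decoding. As a minimal sketch in base R (not part of the original debugging session), and assuming that the € in the message stands for the byte 0x80, the three bytes e2 80 94 form a single UTF-8 character:

```r
# The three bytes from the error message, assuming the middle one is 0x80.
bytes <- as.raw(c(0xe2, 0x80, 0x94))

# Interpret the bytes as a UTF-8 encoded string.
dash <- rawToChar(bytes)
Encoding(dash) <- "UTF-8"

validUTF8(dash)                  # TRUE: the bytes are well-formed UTF-8
code_point <- utf8ToInt(dash)    # 8212, which is U+2014 (an em dash)
format(as.hexmode(code_point))   # "2014"
```

So the page contains a perfectly ordinary UTF-8 dash, which points at the way the bytes are being interpreted rather than at the document itself.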

Since I couldn’t reproduce these errors on my machine it appeared to be something relating to their particular machine setup. Looking at their locale provided a clue:

> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;

whereas on my machine I have:

> Sys.getlocale()

The document that they were trying to scrape is encoded in UTF-8, which I see in my locale but not in theirs. Perhaps changing locale will sort out the problem? Since the en_ZA locale is a bit of an acquired taste (unless you’re South African, in which case it’s de rigueur!), the following should resolve the problem:

> Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

It’s possible that this command might produce an error stating that it cannot be honoured by your system. Do not fear. Try the following (which seems to work almost universally!):

> Sys.setlocale("LC_ALL", "English")

Try scraping again. Your issues should be resolved.
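
If you’d rather not try the two commands by hand, they can be wrapped in a small helper. This is only a sketch: set_first_locale() is a hypothetical function of my own, and it relies on Sys.setlocale() returning an empty string (with a warning) when a request cannot be honoured.

```r
# Try each locale request in turn and stop at the first one the OS accepts.
set_first_locale <- function(requests) {
  for (request in requests) {
    # A failed request returns "" and raises a warning, which is silenced here.
    result <- suppressWarnings(Sys.setlocale(request$category, request$locale))
    if (nzchar(result)) {
      return(invisible(result))
    }
  }
  message("None of the requested locales could be set.")
  invisible("")
}

# The two settings suggested above, in the same order.
set_first_locale(list(
  list(category = "LC_CTYPE", locale = "en_US.UTF-8"),
  list(category = "LC_ALL",   locale = "English")
))

# Report whether the session character type is now UTF-8. This may well be
# FALSE, for example if only the "English" fallback was accepted.
l10n_info()[["UTF-8"]]
```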