satRday Cape Town: Call for Submissions

satrday-cape-town-banner

satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa.

Keynotes and Workshops

We have a trio of fantastic keynote speakers: Hilary Parker, Jennifer Bryan and Julia Silge, who’ll be dazzling you on the day as well as presenting workshops on the two days prior to the satRday.

Call for Submissions

We’re accepting submissions in four categories:

  • Workshop [90 min],
  • Lightning Talk [5 min],
  • Standard Talk [20 min] and
  • Poster.

Submit your proposals here. The deadline is 16 December, but there’s no reason to wait until the last minute: send us your ideas right now so that we can add you to the killer programme.

Registration

Register for the conference and workshops here. The tickets are affordable and you’re going to get extraordinary value for money.

PLOS Subject Keywords: Association Rules

In a previous post I detailed the process of compiling data on subject keywords used in articles published in PLOS journals. In this instalment I’ll be using those data to mine Association Rules with the arules package. A good reference on the topic is the paper by Hahsler et al. describing the arules package.

association-rule-image

Terminology

Data suitable for mining Association Rules should consist of:

  • a set of uniquely identified transactions, where
  • each transaction contains one or more items and
  • items are binary (present/absent) attributes.

The derived rules take the form X → Y, where X and Y are disjoint itemsets, each consisting of one or more items. The itemsets X and Y are known as the antecedent (lhs or left hand side) and consequent (rhs or right hand side). The rules should be interpreted as “If X is present then so too is Y”.

An Association Rules analysis aims to identify pairs (or groups) of items that are commonly found together. A shopping analogy would be that bread is often purchased with milk, while peanuts are often purchased with beer. This kind of analysis is not only confined to this sort of “consumption” scenario: it can be applied in any situation where a discrete set of items is associated with individual transactions.

In the context of the data we previously gathered from PLOS, where every article is tagged with one or more subjects, each of the articles is a “transaction” and the subjects are then the “items”. We’ll be deriving rules for which subjects commonly occur together. Or, more specifically, we’ll be generating rules like “If an article is tagged with subject X then it is probably also tagged with subject Y”.

Transaction Matrix

The arules package derives Association Rules from a Transaction Matrix. The form in which we have the subjects data is particularly convenient for building a sparse matrix (class ngCMatrix from the Matrix package).

> head(subjects)
                           doi      journal              subject count
1 10.1371/journal.pbio.0000007 PLOS Biology               Borneo     1
2 10.1371/journal.pbio.0000007 PLOS Biology Conservation science     1
3 10.1371/journal.pbio.0000007 PLOS Biology            Elephants     1
4 10.1371/journal.pbio.0000007 PLOS Biology   Endangered species     2
5 10.1371/journal.pbio.0000007 PLOS Biology        Plant fossils     2
6 10.1371/journal.pbio.0000007 PLOS Biology    Pleistocene epoch     1

For the purposes of the arules package the ngCMatrix needs to have items (in this case subjects) as rows and transactions (in this case articles) as columns.

> library(Matrix)
> 
> subjects.matrix <- with(subjects, sparseMatrix(i = as.integer(subject),
+                                                j = as.integer(doi),
+                                                dimnames = list(levels(subject),
+                                                                      levels(doi))))
> dim(subjects.matrix)
[1]   9357 185984
> class(subjects.matrix)
[1] "ngCMatrix"
attr(,"package")
[1] "Matrix"

There are 185984 articles and 9357 subjects. Next we coerce this into a transactions object.

> library(arules)
> 
> subjects.matrix <- as(subjects.matrix, "transactions")
> class(subjects.matrix)
[1] "transactions"
attr(,"package")
[1] "arules"

Here’s some summary information. We see that the vast majority of articles are associated with eight subjects.

> summary(subjects.matrix)
transactions as itemMatrix in sparse format with
 185984 rows (elements/itemsets/transactions) and
 9357 columns (items) and a density of 0.000824 

most frequent items:
          Gene expression Polymerase chain reaction          Mouse models             Apoptosis                none 
                    17819                      9458                  8773                  7630                7593 
                  (Other) 
                  1382690 

element (itemset/transaction) length distribution:
sizes
     1      2      3      4      5      6      7      8 
  7625     16     22     34     26     30     54 178177 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00    8.00    7.71    8.00    8.00 

includes extended item information - examples:
          labels
1     293T cells
2 3D bioprinting
3    3D printing

includes extended transaction information - examples:
                                                itemsetID
1 10.1371/annotation/008b05a8-229b-4aca-94ae-91e6dd5ca5ba
2 10.1371/annotation/00a3b22e-36a9-4d51-89e5-1e6561e7a1e9
3 10.1371/annotation/00d17a45-7b78-4fd5-9a9a-0f2e49b04eee
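
As a quick aside (not part of the original analysis), the most frequent subjects can also be visualised directly from the transaction matrix.

> # Bar plot of the 20 most frequent subjects by support.
> itemFrequencyPlot(subjects.matrix, topN = 20)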

Generating Rules (Default Settings)

There are two major algorithms for generating Association Rules: Apriori and Eclat. We’ll be using the former here. We’ll try to derive some rules using apriori() with default parameters.

> rules <- apriori(subjects.matrix)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 18598 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.13s].
sorting and recoding items ... [0 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].

Zero rules! Well, that’s a little disappointing. But not entirely surprising: the default minimum thresholds on support (0.1) and confidence (0.8) are rather conservative. (I’ll explain what support and confidence mean shortly.) We’ll relax them in order to at least generate a decent selection of rules.

Generating Rules (Relaxed Settings)

Reducing the thresholds on support and confidence to 0.002 and 0.75 respectively results in 35 rules. Lower values for these thresholds are necessary because there is a relatively small degree of subject overlap between the articles in the collection. Not surprising since they are derived from a wide range of disciplines!

> rules <- apriori(subjects.matrix, parameter = list(support = 0.002, confidence = 0.75))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.75    0.1    1 none FALSE            TRUE   0.002      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 371 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.14s].
sorting and recoding items ... [838 item(s)] done [0.01s].
creating transaction tree ... done [0.10s].
checking subsets of size 1 2 3 4 done [0.04s].
writing ... [35 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

Below is some summary information on those rules. We see that the largest rule length (total number of items on lhs and rhs of rule) is only 3.

> summary(rules)
set of 35 rules

rule length distribution (lhs + rhs):sizes
 2  3 
25 10 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.00    2.00    2.29    3.00    3.00 

summary of quality measures:
    support          confidence         lift       
 Min.   :0.00203   Min.   :0.756   Min.   :  9.34  
 1st Qu.:0.00233   1st Qu.:0.795   1st Qu.: 27.89  
 Median :0.00290   Median :0.858   Median : 39.60  
 Mean   :0.00337   Mean   :0.859   Mean   : 66.18  
 3rd Qu.:0.00365   3rd Qu.:0.923   3rd Qu.: 90.97  
 Max.   :0.01154   Max.   :1.000   Max.   :233.23  

mining info:
            data ntransactions support confidence
 subjects.matrix        185984   0.002       0.75

This seems like a good time to peruse the rules themselves. To see the details we need to use inspect().

> inspect(rules)
   lhs                                        rhs                                        support   confidence lift    
1  {Memory T cells}                        => {T cells}                                  0.0021292 0.98020     28.1633
2  {Circadian oscillators}                 => {Circadian rhythms}                        0.0021292 0.85900    233.2272
3  {HbA1c}                                 => {Diabetes mellitus}                        0.0023550 0.90123     38.5500
4  {Secondary lung tumors}                 => {Lung and intrathoracic tumors}            0.0026346 1.00000    123.9067
5  {H1N1}                                  => {Influenza}                                0.0020432 0.76923     90.7770
6  {Corals}                                => {Coral reefs}                              0.0020271 0.75551    199.5923
7  {T helper cells}                        => {T cells}                                  0.0022314 0.82341     23.6585
8  {Face recognition}                      => {Face}                                     0.0021400 0.77282    191.3866
9  {T cell receptors}                      => {T cells}                                  0.0026669 0.91176     26.1971
10 {Breast tumors}                         => {Breast cancer}                            0.0024142 0.79610     57.2330
11 {Forest ecology}                        => {Forests}                                  0.0030271 0.94147     93.3362
12 {Antibody response}                     => {Antibodies}                               0.0028336 0.87542     39.0439
13 {Surgical oncology}                     => {Surgical and invasive medical procedures} 0.0024626 0.75702     50.3737
14 {Prostate gland}                        => {Prostate cancer}                          0.0030218 0.79603    130.7859
15 {HIV prevention}                        => {HIV}                                      0.0035003 0.90669     42.3372
16 {Tuberculosis diagnosis and management} => {Tuberculosis}                             0.0036885 0.93333     91.1686
17 {Geriatrics}                            => {Elderly}                                  0.0033229 0.80469    125.9756
18 {HIV diagnosis and management}          => {HIV}                                      0.0036186 0.81675     38.1376
19 {HIV epidemiology}                      => {HIV}                                      0.0038390 0.85817     40.0719
20 {Regulatory T cells}                    => {T cells}                                  0.0040971 0.87687     25.1945
21 {Chemotherapy}                          => {Cancer treatment}                         0.0040917 0.77259     22.3432
22 {Multiple alignment calculation}        => {Sequence alignment}                       0.0050811 0.85987     30.8255
23 {Malarial parasites}                    => {Malaria}                                  0.0051617 0.79668     74.8332
24 {HIV infections}                        => {HIV}                                      0.0076888 0.84816     39.6044
25 {Cytotoxic T cells}                     => {T cells}                                  0.0115386 0.95846     27.5388
26 {HIV epidemiology,HIV infections}       => {HIV}                                      0.0020432 0.95000     44.3597
27 {Gene expression,Regulator genes}       => {Gene regulation}                          0.0023174 0.77798     26.6075
28 {Malarial parasites,Parasitic diseases} => {Malaria}                                  0.0030863 0.78415     73.6565
29 {Malaria,Parasitic diseases}            => {Malarial parasites}                       0.0030863 0.82117    126.7428
30 {Phylogenetics,Sequence alignment}      => {Phylogenetic analysis}                    0.0022260 0.79310     33.2517
31 {Cytotoxic T cells,Flow cytometry}      => {T cells}                                  0.0023497 0.98202     28.2157
32 {Cytotoxic T cells,Immune response}     => {T cells}                                  0.0033121 0.96100     27.6117
33 {Cytokines,Cytotoxic T cells}           => {T cells}                                  0.0028981 0.96078     27.6055
34 {Enzyme-linked immunoassays,Vaccines}   => {Antibodies}                               0.0026293 0.77619     34.6185
35 {Gene regulation,Microarrays}           => {Gene expression}                          0.0041885 0.89437      9.3349

Each of the rules consists of two itemsets, a lhs and a rhs, with the implication that if the lhs itemset is selected then so too is the rhs itemset.

Rule Metrics

Naturally some rules are stronger than others. Their relative quality is measured by a set of metrics: support, confidence and lift.

Support

The support for an itemset is the proportion of transactions which contain that itemset. The support for a rule is the proportion of transactions which contain both the antecedent and consequent itemsets.

The five rules below are those with the highest support. The rule {Cytotoxic T cells} → {T cells} has support of 0.0115386, which means that “Cytotoxic T cells” and “T cells” are both present in 1.15% of transactions. Likewise, the rule {Gene regulation,Microarrays} → {Gene expression} has support of 0.0041885, indicating that “Gene regulation”, “Microarrays” and “Gene expression” appear together in 0.4% of transactions.

> inspect(head(sort(rules, by = "support"), n = 5))
   lhs                                 rhs                  support   confidence lift   
25 {Cytotoxic T cells}              => {T cells}            0.0115386 0.95846    27.5388
24 {HIV infections}                 => {HIV}                0.0076888 0.84816    39.6044
23 {Malarial parasites}             => {Malaria}            0.0051617 0.79668    74.8332
22 {Multiple alignment calculation} => {Sequence alignment} 0.0050811 0.85987    30.8255
35 {Gene regulation,Microarrays}    => {Gene expression}    0.0041885 0.89437     9.3349

Support does not directly indicate the strength of the rule, just how often the components of the rule are present in the data. However, having a decent level of support for a rule is important because it indicates what proportion of the data contributed to deriving that rule. Obviously if a rule is based on only a few transactions then it is not likely to be very robust.
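
To tie rule support back to individual subjects, itemFrequency() gives the support of each item directly (a quick check, not from the original post).

> # Support (relative frequency) of a single subject across all articles.
> itemFrequency(subjects.matrix)["T cells"]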

Confidence

The confidence assigned to a rule is the proportion of transactions which contain both the antecedent and consequent relative to those which contain the antecedent. Equivalently, this is the ratio of the support for the rule to the support for the antecedent. Alternatively confidence is the probability of the rhs being present in a transaction conditional on the lhs also being present.
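
As a sanity check (an aside, not from the original post), the “coverage” measure in arules is the support of the antecedent, so the ratio below should reproduce the confidence column exactly.

> # confidence(X -> Y) = support(X and Y) / support(X).
> all.equal(quality(rules)$confidence,
+           quality(rules)$support / interestMeasure(rules, "coverage", subjects.matrix))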

The five rules below are those with the highest confidence. For example, we see that articles with subjects which include “Secondary lung tumors” will certainly also contain “Lung and intrathoracic tumors”. Similarly, articles which have both “Cytokines” and “Cytotoxic T cells” as subjects will very likely also have “T cells”.

> inspect(head(sort(rules, by = "confidence"), n = 5))
   lhs                                    rhs                             support   confidence lift   
4  {Secondary lung tumors}             => {Lung and intrathoracic tumors} 0.0026346 1.00000    123.907
31 {Cytotoxic T cells,Flow cytometry}  => {T cells}                       0.0023497 0.98202     28.216
1  {Memory T cells}                    => {T cells}                       0.0021292 0.98020     28.163
32 {Cytotoxic T cells,Immune response} => {T cells}                       0.0033121 0.96100     27.612
33 {Cytokines,Cytotoxic T cells}       => {T cells}                       0.0028981 0.96078     27.606

Lift

The lift of a rule indicates how much greater the support for the rule is relative to what would be expected if the antecedent and consequent were independent. Equivalently, lift is the ratio of the confidence in a rule to the confidence expected were the antecedent and consequent independent. A useful interpretation of lift is the factor by which the probability of the consequent increases when the antecedent is present: it’s the ratio of the conditional probability of the consequent given the antecedent to the marginal probability of the consequent. Write out the expression and it’ll make more sense.
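
Written out (an aside, not from the original post), along with a quick numerical check against the rules above (this assumes a single-item consequent, which is always true for rules produced by apriori()):

> # lift(X -> Y) = confidence(X -> Y) / support(Y)
> #              = support(X and Y) / (support(X) * support(Y)).
> #
> # For example, {Cytotoxic T cells} -> {T cells} has confidence 0.95846 and
> # lift 27.5388, implying support({T cells}) = 0.95846 / 27.5388 ≈ 0.0348.
> rhs.support <- itemFrequency(subjects.matrix)[unlist(as(rhs(rules), "list"))]
> all.equal(quality(rules)$lift, quality(rules)$confidence / unname(rhs.support))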

A lift of 1 indicates that the antecedent and consequent are independent. In the rules below we see that the presence of “Circadian oscillators” results in a massive increase in the likelihood of the presence of “Circadian rhythms”. Similarly, if both “Malaria” and “Parasitic diseases” are present then the probability of “Malarial parasites” being used increases by over one hundred fold.

> inspect(head(sort(rules, by = "lift"), n = 5))
   lhs                             rhs                  support   confidence lift  
2  {Circadian oscillators}      => {Circadian rhythms}  0.0021292 0.85900    233.23
6  {Corals}                     => {Coral reefs}        0.0020271 0.75551    199.59
8  {Face recognition}           => {Face}               0.0021400 0.77282    191.39
14 {Prostate gland}             => {Prostate cancer}    0.0030218 0.79603    130.79
29 {Malaria,Parasitic diseases} => {Malarial parasites} 0.0030863 0.82117    126.74

Rule Selection

Before we look at rule selection we’ll generate a more extensive assortment of rules by further lowering the thresholds for support and confidence.

> rules <- apriori(subjects.matrix, parameter = list(support = 0.001, confidence = 0.25))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 185 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.15s].
sorting and recoding items ... [1600 item(s)] done [0.02s].
creating transaction tree ... done [0.13s].
checking subsets of size 1 2 3 4 done [0.07s].
writing ... [984 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

You can use the subset() function to focus on particular rules of interest. What other subjects are commonly associated with “HIV”?

> subset(rules, subset = lhs %in% "HIV")
set of 19 rules 

So there is quite a selection of them. We can narrow that down by focusing on those which have a relatively high level of confidence.

> inspect(subset(rules, subset = lhs %in% "HIV" & confidence > 0.5))
    lhs                                            rhs              support   confidence lift  
677 {HIV,HIV prevention}                        => {HIV infections} 0.0018174 0.51920    57.274
685 {HIV,Tuberculosis diagnosis and management} => {Tuberculosis}   0.0010646 0.92958    90.802
694 {HIV,HIV epidemiology}                      => {HIV infections} 0.0020432 0.53221    58.709
844 {Cytotoxic T cells,HIV}                     => {T cells}        0.0010108 0.97917    28.134

Selection criteria are applied in subset() using the operators %in% (does item appear in itemset?), %pin% (like %in% but with partial matching) and %ain% (match all items specified) to operate on lhs and rhs. Arithmetic comparisons are used on the rule metrics.
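
For example (illustrative calls only, output omitted), %ain% will find rules whose antecedent contains both “Malaria” and “Parasitic diseases”, while %pin% will match any item label containing a given substring.

> # Rules whose lhs contains both of these subjects.
> inspect(subset(rules, subset = lhs %ain% c("Malaria", "Parasitic diseases")))
> # Rules whose rhs mentions "cancer" anywhere in the item label.
> inspect(subset(rules, subset = rhs %pin% "cancer"))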

Here’s another example which indicates that articles with subject “Dementia” will also have either “Alzheimer disease” or “Cognitive impairment” as a subject roughly 50% of the time.

> inspect(subset(rules, subset = lhs %in% "Dementia"))
    lhs           rhs                    support   confidence lift  
101 {Dementia} => {Alzheimer disease}    0.0011022 0.53385    67.681
102 {Dementia} => {Cognitive impairment} 0.0011399 0.55208    55.502

Symmetry

It might have occurred to you that these rules should be symmetric: if X → Y then surely Y → X too? This is certainly the case. Consider the four rules below, which consist of two symmetric pairs.

> inspect(subset(rules, subset = lhs %in% "Fungal genetics" | rhs %in% "Fungal genetics"))
    lhs                  rhs               support   confidence lift   
165 {Fungal genetics} => {Fungal genomics} 0.0010646 0.38900    135.482
166 {Fungal genomics} => {Fungal genetics} 0.0010646 0.37079    135.482
167 {Fungal genetics} => {Fungi}           0.0016453 0.60118     94.195
168 {Fungi}           => {Fungal genetics} 0.0016453 0.25779     94.195

The rules {Fungal genetics} → {Fungi} and {Fungi} → {Fungal genetics} form one symmetric pair. Note that the support and lift are the same for both rules, but that the first rule has a higher confidence than the second. This is due to different supports for the antecedent in each case. Whereas “Fungal genetics” is a subject for 509 articles, “Fungi” appears as a subject for 1187 articles. The corresponding values of support are 0.0027368 and 0.0063823 respectively. Since “Fungi” is more than twice as common, the confidence of the second rule is diminished significantly. This emphasises the fact that the rules with the highest confidence are those for which the support for the antecedent is almost as high as the support for the rule itself.
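
Those counts and supports can be read straight off the transaction matrix (a quick check, not in the original post).

> # Absolute counts and relative supports for the two subjects.
> itemFrequency(subjects.matrix, type = "absolute")[c("Fungal genetics", "Fungi")]
> itemFrequency(subjects.matrix)[c("Fungal genetics", "Fungi")]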

Conclusion

Although the example presented here has been confined to binary (presence/absence) data, Association Rules can also be applied effectively to categorical data, as illustrated in Example 2 of the paper by Hahsler et al. cited above.

Association Rules are a powerful unsupervised learning technique which can be fruitfully applied in data mining. The resulting rules are often instrumental in identifying actionable insights from the data.

ubeR: A Package for the Uber API

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API which Arthur Wu and I put together during a Hackathon at iXperience.

Installation

The package is currently hosted on GitHub. Installation is simple using the devtools package.

> devtools::install_github("DataWookie/ubeR")
> library(ubeR)

Authentication

To work with the API you’ll need to create a new application for the Rides API.

  • Set Redirect URL to http://localhost:1410/.
  • Enable the profile, places, ride_widgets, history_lite and history scopes.

With the resulting Client ID and Client Secret you’ll be ready to authenticate. I’ve stored mine as environment variables but you can just hard code them into the script for starters.

> UBER_CLIENTID = Sys.getenv("UBER_CLIENTID")
> UBER_CLIENTSECRET = Sys.getenv("UBER_CLIENTSECRET")
> 
> uber_oauth(UBER_CLIENTID, UBER_CLIENTSECRET)

Identity

We can immediately use uber_me() to retrieve information about the authenticated user.

> identity <- uber_me()
> names(identity)
[1] "picture"         "first_name"      "last_name"       "uuid"            "rider_id"       
[6] "email"           "mobile_verified" "promo_code"   
> identity$first_name
[1] "Andrew"
> identity$picture
[1] "https://d1w2poirtb3as9.cloudfront.net/default.jpeg"

Clearly I haven’t made enough effort in personalising my Uber account.

Designated Places

Uber allows you to specify predefined locations for “home” and “work”. These are accessible via uber_places_get().

> uber_places_get("home")
$address
[1] "St Andrews Dr, Durban North, 4051, South Africa"

> uber_places_get("work")
$address
[1] "Dock Rd, V & A Waterfront, Cape Town, 8002, South Africa"

These addresses can be modified using uber_places_put().

History

You can access data for recent rides using uber_history().

> history <- uber_history(50, 0)
> names(history)
 [1] "status"       "distance"     "request_time" "start_time"   "end_time"     "request_id"  
 [7] "product_id"   "latitude"     "display_name" "longitude"

The response includes a wide range of fields; we’ll pick out just a few of them for closer inspection.

> head(history)[, c(2, 4:5, 9)]
  distance          start_time            end_time  display_name
1   1.3140 2016-08-15 17:35:24 2016-08-15 17:48:54 New York City
2  13.6831 2016-08-11 15:29:58 2016-08-11 16:04:22     Cape Town
3   2.7314 2016-08-11 09:09:25 2016-08-11 09:23:51     Cape Town
4   3.2354 2016-08-10 19:28:41 2016-08-10 19:38:07     Cape Town
5   7.3413 2016-08-10 16:37:30 2016-08-10 17:21:16     Cape Town
6   4.3294 2016-08-10 13:38:49 2016-08-10 13:59:00     Cape Town

Product Descriptions

We can get a list of cars near to a specified location using uber_products().

> cars <- uber_products(latitude = -33.925278, longitude = 18.423889)
> names(cars)
[1] "capacity"          "product_id"        "price_details"     "image"            
[5] "cash_enabled"      "shared"            "short_description" "display_name"     
[9] "description"  
> cars[, c(1, 2, 7)]
  capacity                           product_id short_description
1        4 91901472-f30d-4614-8ba7-9fcc937cebf5             uberX
2        6 419f6bdc-7307-4ea8-9bb0-2c7d852b616a            uberXL
3        4 1dd39914-a689-4b27-a59d-a74e9be559a4         UberBLACK

Information for a particular car can also be accessed.

> product <- uber_products(product_id = "91901472-f30d-4614-8ba7-9fcc937cebf5")
> names(product)
[1] "capacity"          "product_id"        "price_details"     "image"            
[5] "cash_enabled"      "shared"            "short_description" "display_name"     
[9] "description"      
> product$price_details
$service_fees
list()

$cost_per_minute
[1] 0.7

$distance_unit
[1] "km"

$minimum
[1] 20

$cost_per_distance
[1] 7

$base
[1] 5

$cancellation_fee
[1] 25

$currency_code
[1] "ZAR"

Estimates

It’s good to have a rough idea of how much a ride is going to cost you. What about a trip from Mouille Point to the Old Biscuit Mill?

old-biscuit-mill

> estimate <- uber_requests_estimate(start_latitude = -33.899656, start_longitude = 18.407663,
+                                    end_latitude = -33.927443, end_longitude = 18.457557)
> estimate$trip
$distance_unit
[1] "mile"

$duration_estimate
[1] 600

$distance_estimate
[1] 4.15

> estimate$pickup_estimate
[1] 4
> estimate$price
  high_amount display_amount display_name low_amount surge_multiplier currency_code
1        5.00           5.00    Base Fare       5.00                1           ZAR
2       56.12    42.15-56.12     Distance      42.15                1           ZAR
3        8.30      6.23-8.30         Time       6.23                1           ZAR

Not quite sure why the API is returning the distance in such obscure units. (Note to self: convert those to metric equivalent in next release!) The data above are based on the car nearest to the start location. What about prices for a selection of other cars?

> estimate <- uber_estimate_price(start_latitude = -33.899656, start_longitude = 18.407663,
+                     end_latitude = -33.927443, end_longitude = 18.457557)
> names(estimate)
 [1] "localized_display_name" "high_estimate"          "minimum"                "duration"
 [5] "estimate"               "distance"               "display_name"           "product_id"
 [9] "low_estimate"           "surge_multiplier"       "currency_code"         
> estimate[, c(1, 5)]
  localized_display_name  estimate
1                  uberX  ZAR53-69
2                 uberXL  ZAR68-84
3              uberBLACK ZAR97-125

The time of arrival for each of those cars can be accessed via uber_estimate_time().

> uber_estimate_time(start_latitude = -33.899656, start_longitude = 18.407663)
  localized_display_name estimate display_name                           product_id
1                  uberX      180        uberX 91901472-f30d-4614-8ba7-9fcc937cebf5
2                 uberXL      420       uberXL 419f6bdc-7307-4ea8-9bb0-2c7d852b616a
3              uberBLACK      300    uberBLACK 1dd39914-a689-4b27-a59d-a74e9be559a4

So, for example, the uberXL would be expected to arrive in 7 minutes, while the uberX would pick you up in only 3 minutes.

Requesting a Ride

It’s also possible to request a ride. At present these requests are directed to the Uber API Sandbox. After we have done further testing we’ll retarget the requests to the API proper.

A new ride is requested using uber_requests().

> ride <- uber_requests(start_address = "37 Beach Road, Mouille Point, Cape Town",
+                       end_address = "100 St Georges Mall, Cape Town City Centre, Cape Town")

Let’s find out the details of the result.

> names(ride)
 [1] "status"           "destination"      "product_id"       "request_id"
 [5] "driver"           "pickup"           "eta"              "location"
 [9] "vehicle"          "surge_multiplier" "shared"     
> ride$pickup
$latitude
[1] -33.9

$longitude
[1] 18.406
> ride$destination
$latitude
[1] -33.924

$longitude
[1] 18.42

Information about the currently requested ride can be accessed using uber_requests_current(). If we decide to walk instead, then it’s also possible to cancel the pickup.

> uber_requests_current_delete()

Future

For more information about units of measurement, limits and parameters of the Uber API, have a look at the API Overview.

We’ll be extending the package to cover the remaining API endpoints. But, for the moment, most of the core functionality is already covered.

Also Relevant

A nice blog post by Simon Jackson, who used ubeR to plot his recent trips.

PLOS Subject Keywords: Gathering Data

I’m putting together a couple of articles on Collaborative Filtering and Association Rules. Naturally, the first step is finding suitable data for illustrative purposes.

There are a number of standard data sources for these kinds of analyses.

I’d like to do something different though, so instead of using one of these, I’m going to build a data set based on subject keywords from articles published in PLOS journals. This has the advantage of presenting an additional data construction pipeline and the potential for revealing something new and interesting.

Before we get started, let’s establish some basic nomenclature.

Terms

Data used in the context of Collaborative Filtering or Association Rules analyses are normally thought of in the following terms:

  • Item – a “thing” which is rated.
  • User – a “person” who either rates one or more Items or consumes ratings for Items.
  • Rating – the evaluation of an Item by a User (can be a binary, integer or real valued rating or simply whether or not the User has interacted with the Item).

Sample Article

We’re going to retrieve a load of data from PLOS. But, just to set the scene, let’s start by looking at a specific article, Age and Sex Ratios in a High-Density Wild Red-Legged Partridge Population, recently published in PLOS ONE. You’ll notice that the article is in the public domain, so you can immediately download the PDF (no paywalls here!) and access a wide range of other data pertaining to the article. There’s a list of subject keywords on the right. This is where we will be focusing most of our attention, although we’ll also retrieve DOI, authors, publication date and journal information for good measure.

journal.pone.0159765

We’ll be using the rplos package to access data via the PLOS API. A search through the PLOS catalog is initiated using searchplos(). To access the article above we’d just specify the appropriate DOI using the q (query) argument, while the fields in the result are determined by the fl argument.

> library(rplos)
> partridge <- searchplos(q = "id:10.1371/journal.pone.0159765",
+                         fl = 'id,author,publication_date,subject,journal')$data

The journal, publication date and author data are easy to consume.

> partridge$id
[1] "10.1371/journal.pone.0159765"
> partridge[, 3:5]
   journal     publication_date                                       author
1 PLOS ONE 2016-08-10T00:00:00Z Jesús Nadal; Carolina Ponz; Antoni Margalida

The subject keywords are conflated into a single string, making them more difficult to digest.

> partridge$subject %>% cat
/Biology and life sciences/Population biology/Population dynamics;
/Ecology and environmental sciences/Conservation science;
/Earth sciences/Atmospheric science/Meteorology/Weather;
/Earth sciences/Atmospheric science/Meteorology/Rain;
/Ecology and environmental sciences/Ecology/Community ecology/Trophic interactions/Predation;
/Biology and life sciences/Ecology/Community ecology/Trophic interactions/Predation;
/Biology and life sciences/Population biology/Population metrics/Population density;
/Biology and life sciences/Organisms/Animals/Vertebrates/Amniotes/Birds;
/Biology and life sciences/Organisms/Animals/Vertebrates/Amniotes/Birds/Fowl/Gamefowl/Partridges

Here’s an extract from the documentation about subject keywords which helps make sense of that.

The Subject Area terms are related to each other with a system of broader/narrower term relationships. The thesaurus structure is a polyhierarchy, so for example the Subject Area “White blood cells” has two broader terms “Blood cells” and “Immune cells”. At its deepest the hierarchy is ten tiers deep, with all terms tracking back to one or more of the top tier Subject Areas, such as “Biology and life sciences” or “Social sciences.”

We’ll use the most specific terms in each of the subjects. It’d be handy to have a function to extract these systematically from a bunch of articles.

> library(dplyr)
> 
> options(stringsAsFactors = FALSE)
> split.subject <- function(subject) {
+   data.frame(subject = sub(".*/", "", strsplit(subject, "; ")[[1]]),
+              stringsAsFactors = FALSE) %>%
+     group_by(subject) %>%
+     summarise(count = n()) %>%
+     ungroup
+ }

So, for the article above we get the following subjects:

> split.subject(partridge$subject)
# A tibble: 8 x 2
               subject count
                 <chr> <int>
1                Birds     1
2 Conservation science     1
3           Partridges     1
4   Population density     1
5  Population dynamics     1
6            Predation     2
7                 Rain     1
8              Weather     1

Those tie up well with what we saw on the home page for the article. We see that all of the terms except Predation appear only once. There are two entries for Predation, one in category “Ecology and environmental sciences” and the other in “Biology and life sciences”. We can’t really interpret these entries as ratings. They should rather be thought of as interactions. At some stage we might transform them into Boolean values, but for the moment we’ll leave them as interaction counts.

Some data is collected explicitly, perhaps by asking people to rate things, and some is collected casually, for example by watching what people buy.
Toby Segaran, Programming Collective Intelligence

Article Collection

We’ll need a lot more data to do anything meaningful. So let’s use the same infrastructure to grab a much larger collection of articles.
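
The retrieval code isn't shown here; a minimal sketch of the sort of loop involved, assuming a wildcard query, the doc_type:full filter and paging via the start and limit arguments of searchplos() (the actual query may well have differed), looks like this:

> articles <- lapply(seq(0, 185500, by = 500), function(start) {
+   searchplos(q = "*:*",
+              fq = "doc_type:full",
+              fl = "id,author,publication_date,subject,journal",
+              start = start, limit = 500)$data
+ }) %>% bind_rows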

> dim(articles)
[1] 185984      5

Parsing the subject column and aggregating the results we get a data frame with counts of the number of times a particular subject keyword is associated with each article.

> subjects <- lapply(1:nrow(articles), function(n) {
+   cbind(doi = articles$id[n],
+         journal = articles$journal[n],
+         split.subject(articles$subject[n]))
+ }) %>% bind_rows %>% mutate(
+   subject = factor(subject)
+ )
> dim(subjects)
[1] 1433963       4

Here are the specific data for the article above:

> subset(subjects, doi == "10.1371/journal.pone.0159765")
                                 doi  journal              subject count
1425105 10.1371/journal.pone.0159765 PLOS ONE                Birds     1
1425106 10.1371/journal.pone.0159765 PLOS ONE Conservation science     1
1425107 10.1371/journal.pone.0159765 PLOS ONE           Partridges     1
1425108 10.1371/journal.pone.0159765 PLOS ONE   Population density     1
1425109 10.1371/journal.pone.0159765 PLOS ONE  Population dynamics     1
1425110 10.1371/journal.pone.0159765 PLOS ONE            Predation     2
1425111 10.1371/journal.pone.0159765 PLOS ONE                 Rain     1
1425112 10.1371/journal.pone.0159765 PLOS ONE              Weather     1

Note that we’ve delayed the conversion of the subject column into a factor until all of the required levels were known.

In future instalments I’ll be looking at analyses of these data using Association Rules and Collaborative Filtering.

Sportsbook Betting (Part 2): Bookmakers’ Odds

In the first instalment of this series we gained an understanding of the various types of odds used in Sportsbook betting and the link between those odds and implied probabilities. We noted that the implied probabilities for all possible outcomes in an event may sum to more than 100%. At first sight this seems a bit odd. It certainly appears to violate the basic principles of probability. However, this anomaly is the mechanism by which bookmakers assure their profits. A similar principle applies in a casino.

Casino House Edge

Because the true probabilities of each outcome in casino games are well defined, this is a good place to start. In a casino game a winning wager receives a payout which is not quite consistent with the game’s true odds (how this is achieved varies from game to game). As a result, casino games are not “fair” from a gambler’s perspective. If they were, then a casino would not be a very profitable enterprise! Instead every casino game is slightly biased in favour of the house. On each round a gambler still stands a chance of winning. However, over time, the effect of this bias accumulates and the gambler inevitably loses money.

Let’s look at a couple of examples. We’ll start with a super simple game.

Example: Rolling a Dice

Consider a dice game in which the player wins if the die lands on a six. The odds against winning are 5/1, so the player would receive 5 times his wager if he won.

> odds.fractional = c(win = 5/1, lose = 1/5)
> (odds.decimal = odds.fractional + 1)
 win lose 
 6.0  1.2 
> (probability = 1 / odds.decimal)
    win    lose 
0.16667 0.83333 

The probability of winning is 1/6. Would a gambler expect to profit if he played this game many times?

> payout = c(5, -1)
> sum(probability * payout)
[1] 0.00

No! In the long run neither the gambler nor the casino would make money on a game like this. It’s a fair game: neither the house nor the gambler has any statistical advantage or “edge”.

If, however, the house paid out only 4 times the wager then the player’s expected profit would become

> payout = c(4, -1)
> sum(probability * payout)
[1] -0.16667

Now the game is stacked in favour of the house, since on average the player would expect to lose around 17% of his stake. Of course, on any one game the gambler would either win 4 times his stake or lose the entire stake. However, if he played the game many times then on average he would lose 17% of his stake per game.

The game outlined above would not represent a very attractive proposition for a gambler. Obviously a casino could not afford to be this greedy and the usual house edge in any casino game is substantially smaller. Let’s move on to a real casino game.

Example: European Roulette

A European Roulette wheel has one zero and 36 non-zero numbers (18 odd and 18 even; 18 red and 18 black), making a total of 37 positions. Consider a wager on even numbers. The number of losing outcomes is 19 (the zero is treated as neither odd nor even: it’s the “house number”!), while the number of winning outcomes is 18. So the odds against are 19/18.

> odds.fractional = c(win = 19/18, lose = 18/19)
> (odds.decimal = odds.fractional + 1)
   win   lose 
2.0556 1.9474 
> (probability = 1 / odds.decimal)
    win    lose 
0.48649 0.51351 

The probability of winning is 18/(19+18) = 18/37 = 0.48649. So this is almost an even money game.

Based on a wager of 1 coin, a win would result in a net profit of 1 coin, while a loss would forfeit the stake. The player’s expected outcome is then

> payout = c(1, -1)
> sum(probability * payout)
[1] -0.027027

The house edge is 2.70%. On average a gambler would lose 2.7% of his stake per game. Of course, on any one game he would either win or lose, but this is the long term expectation. Another way of looking at this is to say that the Return To Player (RTP) is 97.3%, which means that on average a gambler would get back 97.3% of his stake on every game.

Below are the results of a simulation of 100 gamblers betting on even numbers. Each starts with an initial capital of 100. The red line represents the average for the cohort. After 1000 games two gamblers have lost all of their money. Of the remaining 98 players, only 24 have made money while the rest have lost some portion of their initial capital.

roulette-odd-simulation

The code for this simulation is available here.
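
For anyone who wants to reproduce something similar, here's a minimal sketch of such a simulation (simplified relative to the linked code: players are not stopped when their capital runs out).

> # 100 gamblers each start with 100 coins and bet 1 coin on "even" at every spin.
> set.seed(13)
> capital <- replicate(100, {
+   spins <- sample(0:36, 1000, replace = TRUE)
+   win <- spins != 0 & spins %% 2 == 0
+   100 + cumsum(ifelse(win, 1, -1))
+ })
> matplot(capital, type = "l", lty = 1, col = "grey", xlab = "Game", ylab = "Capital")
> lines(rowMeans(capital), col = "red", lwd = 2)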

Over-round, Vigorish and Juice

A bookmaker will aim to achieve an overall profit regardless of the outcome of the event. The general approach to doing this is to offer odds which are less than the true odds. As a result the payout on a successful wager is less than what would be mathematically dictated by the true odds. Because of the reciprocal relationship between odds and implied probabilities, this means that the corresponding implied probabilities are inflated. The margin by which the implied probabilities exceed 100% is known as the “over-round” (also vigorish or juice). The over-round determines the profit margin of the bookmaker. Bookmakers with a lower over-round also have a lower profit margin and hence offer a more equitable proposition to gamblers.

Since sports betting involves humans, there is no deterministic edge to the house or the gambler.

It’s useful to consider what we mean by “true odds” in the context of Sportsbook. Clearly for a casino game these odds can be calculated precisely (though with various degrees of difficulty, depending on the game). However, in Sportsbook the actual odds of each outcome cannot be known with great precision. This is simply a consequence of the fact that the events involve humans, and we are notoriously unpredictable.

Do bookmakers even care about the true odds? Not really. They are mostly just interested in offering odds which will provide them with an assured overall profit on an event.

There are a number of factors which contribute to determining the odds used in Sportsbook. Obviously there’s serious domain knowledge involved in deriving the initial odds on offer. But over time these odds should evolve to take into account the overall distribution of bets placed on the various outcomes (something like the wisdom of the crowd). It has been suggested that, as a result, Sportsbook odds are similar to an efficient market. Specifically, the distribution of wagers affects the odds, with the odds on the favourite getting smaller while those on the underdog(s) get larger. Eventually the odds will settle at values which reflect the market’s perceived probability of the outcome of the event.

Rather, the odds are designed so that equal money is bet on both sides of the game. If more money is bet on one of the teams, the sports book runs the risk of losing money if that team were to win.

Example: Horse Racing with a Round Book

A bookmaker is offering fractional odds of 4/1 (or decimal odds of 5.0) on each horse in a five horse race. The implied probability of each horse winning is 20%. If the bookmaker accepted the same volume of wagers on each horse then he would not make any money, since the implied probabilities sum to 100%. This is known as a “round” book.

From a gambler’s perspective, a wager of 10 on any one of the horses would have an expected return of zero. From the bookmaker’s perspective, if he accepted 100 in wagers on each horse, then he would profit 400 on the losing horses and pay out 400 on the winning horse, yielding zero net profit.

Since the expected return is zero, this represents a fair game. However, such odds would never obtain in practice: the bookmaker always stands to make money. Enter the over-round.

Example: Horse Racing with Over-Round

If the bookmaker offered fractional odds of 3/1 (or 4.0) on each horse, then the implied probabilities would change from 20% to 25%. Summing the implied probabilities gives 125%, which is 25% over-round.

Suppose that the bookmaker again accepted 100 in wagers on each horse. He would then profit 400 on the losing horses and pay out only 300 on the winning horse, yielding a net profit of 100.
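
The arithmetic of both examples can be checked in a couple of lines.

> # Implied probabilities for five horses at 4/1 (round book) versus 3/1 (over-round).
> sum(1 / (rep(4/1, 5) + 1))
[1] 1
> sum(1 / (rep(3/1, 5) + 1))
[1] 1.25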

Enough hypothetical examples, let’s look at something real.

Example: Champions League

It’s been suggested that football squad prices can influence Sportsbook odds. Often the richer the franchise, the more likely it is that a club will prevail in the sport. This is supposed to be particularly true in European club football. We’ll try to validate this idea by scraping the data provided by Forbes for football club values.

> library(rvest)
> library(dplyr)
> clubs <- read_html("http://bit.ly/2aDa3ad") %>%
+   html_nodes("table") %>% .[[1]] %>% html_table() %>% .[, c(2, 3, 4, 7)] %>%
+   setNames(c("team", "country", "value", "revenue")) %>%
+   mutate(
+     value = as.integer(sub(",", "", value)),
+     team = gsub("\\.", "", team)
+   )
> head(clubs)
               team country value revenue
1       Real Madrid   Spain  3650     694
2         Barcelona   Spain  3320     570
3 Manchester United England  3315     645
4     Bayern Munich Germany  2680     675
5           Arsenal England  2020     524
6   Manchester City England  1920     558

Well, those tabular data are great, but a visualisation would be helpful to make complete sense of the relationship between team value and revenue.

football-club-values

It’s apparent that Real Madrid, Barcelona, Manchester United and Bayern Munich are the four most expensive teams. There’s a general trend of increasing revenue with increasing value. Two conspicuous exceptions are Schalke 04 and Paris Saint-Germain, which produce revenues far higher than expected based on their values.

Although not reflected in the plot above, there’s a relationship between the value of the team and its performance. With only a few exceptions the previously mentioned four teams have dominated the Champions League in recent years. Does this make sense? The richest teams are able to attract the most talented players. The resulting pool of talent increases their chances of winning. This in turn translates into revenue and the cycle is complete.

We’ll grab the bookmakers’ odds for the Champions League.

> library(gambleR)
> champions.league = oddschecker("football/champions-league/winner")
> head(champions.league[, 11:18])
              Ladbrokes Coral William Hill Winner Betfair Sportsbook BetBright Unibet Bwin
Barcelona           3/1   3/1         10/3    3/1                3/1       7/2    3/1 10/3
Bayern Munich       4/1   5/1          4/1    4/1                4/1       4/1    9/2  4/1
Real Madrid         5/1   5/1          4/1    9/2                9/2       9/2    5/1  5/1
Man City           12/1  12/1         11/1   10/1               10/1      12/1   12/1 12/1
Juventus           12/1  14/1         12/1   12/1               10/1      14/1    8/1 12/1
PSG                14/1  14/1         14/1   14/1               14/1      14/1   12/1 14/1

According to the selection of bookmakers above, Barcelona, Bayern Munich and Real Madrid are the major contenders in this competition. Betfair Sportsbook has Barcelona edging the current champions Real Madrid as favourites to win the competition. Bayern Munich and Real Madrid have slightly higher odds, with Bayern Munich perceived as the second most likely winner.
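
The conversion from fractional to decimal odds isn't shown above; a sketch with an assumed helper, fractional.to.decimal(), would produce the champions.decimal and champions.probability objects used below.

> # Convert fractional odds like "9/2" to decimal odds: evaluate the fraction and add 1.
> fractional.to.decimal <- function(odds) {
+   sapply(strsplit(as.character(odds), "/"),
+          function(f) as.numeric(f[1]) / as.numeric(f[2])) + 1
+ }
> champions.decimal <- apply(champions.league, 2, fractional.to.decimal)
> rownames(champions.decimal) <- rownames(champions.league)
> champions.probability <- 1 / champions.decimal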

The decimal odds on offer at Betfair Sportsbook are

> champions.decimal[, 15]
        Barcelona     Bayern Munich       Real Madrid          Man City          Juventus 
              4.0               5.0               5.5              11.0              11.0 
              PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla 
             15.0              17.0              26.0              26.0              51.0 
        Tottenham            Napoli         Leicester              Roma           Benfica 
             41.0              67.0              67.0             101.0             101.0 
            Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon 
            151.0              67.0             101.0             151.0             201.0 
              PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel 
            201.0             201.0             251.0             251.0             301.0 
      Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade 
            501.0             501.0                NA                NA                NA 
         Salzburg 
               NA 

The corresponding implied probabilities are

> champions.probability[, 15]
        Barcelona     Bayern Munich       Real Madrid          Man City          Juventus 
        0.2500000         0.2000000         0.1818182         0.0909091         0.0909091 
              PSG   Atletico Madrid          Dortmund           Arsenal           Sevilla 
        0.0666667         0.0588235         0.0384615         0.0384615         0.0196078 
        Tottenham            Napoli         Leicester              Roma           Benfica 
        0.0243902         0.0149254         0.0149254         0.0099010         0.0099010 
            Porto  Bayer Leverkusen        Villarreal   Monchengladbach              Lyon 
        0.0066225         0.0149254         0.0099010         0.0066225         0.0049751 
              PSV   Sporting Lisbon       Dynamo Kiev          Besiktas             Basel 
        0.0049751         0.0049751         0.0039841         0.0039841         0.0033223 
      Club Brugge            Celtic     FC Copenhagen     PAOK Saloniki Red Star Belgrade 
        0.0019960         0.0019960                NA                NA                NA 
         Salzburg 
               NA

These sum to 1.178, giving an over-round of 17.8%.
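
A quick check of that figure (excluding the teams with missing odds):

> # Total implied probability for Betfair Sportsbook.
> sum(champions.probability[, 15], na.rm = TRUE)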

Let’s focus on a football game between Anderlecht and Rostov. These are not major contenders, but they faced off last Wednesday (3 August 2016), so the data are readily available.

Example: Anderlecht versus Rostov

The odds for the football match between Anderlecht and Rostov are shown below.

betfair-anderlecht-rostov

The match odds are 2.0 for a win by Anderlecht, 4.1 for a win by Rostov and 3.55 for a draw. Let’s convert those to the corresponding implied probabilities:

> decimal.odds = c(anderlecht = 2.0, rostov = 4.1, draw = 3.55)
> 1 / decimal.odds
anderlecht     rostov       draw 
   0.50000    0.24390    0.28169 

According to those odds the implied probabilities of each of the outcomes are 50%, 24.4% and 28.2% respectively.

> sum(1 / decimal.odds)
[1] 1.0256

Summing those probabilities gives an over-round of 2.6%, which is very competitive. However, including the 5% commission levied by Betfair, this increases to 7.6%.

Although Anderlecht were the favourites to win this game, it turns out that Rostov had a convincing victory.

anderlecht-rostov-result

The same principles apply when there are many possible outcomes for an event.

Example: Horse Racing (18:20 at Stratford)

I scraped the odds for the 18:20 race at Stratford on 28 June 2016 from oddschecker. Here are the data for nine bookmakers.

> odds[, 1:9]
                 Bet Victor Betway Marathon Bet Betdaq Bet 365 Ladbrokes Sky Bet 10Bet 188Bet
Deauville Dancer        6/4   13/8         13/8    7/5     6/4       6/4     6/4   6/4    6/4
Cest Notre Gris         7/4    7/4          7/4    7/4     7/4       7/4     7/4   7/4    7/4
Ross Kitty             15/2    7/1          7/1   41/5     7/1       7/1     7/1   7/1    7/1
Amber Spyglass         12/1   12/1         12/1   68/5    12/1      12/1    11/1  11/1   11/1
Venture Lagertha       20/1   22/1         20/1   89/5    16/1      20/1    20/1  20/1   20/1
Lucky Thirteen         22/1   22/1         20/1   21/1    22/1      20/1    22/1  20/1   20/1
Overrider              25/1   20/1         25/1   22/1    25/1      20/1    22/1  22/1   22/1
Kims Ocean             28/1   25/1         25/1   21/1    22/1      25/1    28/1  25/1   25/1
Rizal Park             80/1   66/1         50/1   82/1    80/1      66/1    50/1  66/1   66/1
Chitas Gamble         250/1  200/1        100/1  387/1   250/1     125/1   125/1 150/1  150/1
Irish Ranger          250/1  200/1        100/1  387/1   250/1     150/1   125/1 150/1  150/1

The decimal odds on offer at Bet Victor are

> decimal.odds[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha 
            2.50             2.75             8.50            13.00            21.00 
  Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble 
           23.00            26.00            29.00            81.00           251.00 
    Irish Ranger 
          251.00 

The corresponding implied probabilities are

> probability[,1]
Deauville Dancer  Cest Notre Gris       Ross Kitty   Amber Spyglass Venture Lagertha 
       0.4000000        0.3636364        0.1176471        0.0769231        0.0476190 
  Lucky Thirteen        Overrider       Kims Ocean       Rizal Park    Chitas Gamble 
       0.0434783        0.0384615        0.0344828        0.0123457        0.0039841 
    Irish Ranger 
       0.0039841 

The total implied probability per bookmaker is

> sort(colSums(probability))
       Bet Victor            Betway       Marathon Bet            Betdaq           Bet 365 
           1.1426            1.1444             1.1581            1.1623            1.1701 
        Ladbrokes           Sky Bet              10Bet            188Bet         Netbet UK 
           1.1764            1.1765             1.1773            1.1773            1.1773 
      Boylesports            Winner       William Hill        Stan James           Betfair 
           1.1797            1.1861             1.1890            1.1895            1.1935 
            Coral          RaceBets Betfair Sportsbook         BetBright       Sportingbet 
           1.1964            1.2003             1.2173            1.2229            1.2288 
          Betfred         Totesport          32Red Bet          888sport       Paddy Power 
           1.2303            1.2303             1.2392            1.2392            1.2636 

It’s obvious that there is a wide range of value being offered by various bookmakers, extending from the competitive Bet Victor and Betway with an over-round of around 14% to the substantial over-round of 26% at Paddy Power.

From a gambler’s point of view, the best value is obtained by finding the bookmaker who is offering the largest odds for a particular outcome. It’s probable that this bookmaker will also have a relatively low over-round. Sites like oddschecker make it a simple matter to check the odds on offer from a range of bookmakers. If you have the time and patience it might even be possible to engage in betting arbitrage.

Animated Mortality

Kyle Walker’s pyramid plots gave me a serious case of visualisation envy. Here’s something similar using the mortality data from the lifespan package.

Animated mortality pyramid plot.

The change in the mortality profile from year to year over two decades is evident. There are unmistakable peaks which propagate up the plot, corresponding to babies born in 1943 and 1947, during and just after the Second World War.

feedeR: Reading RSS and Atom Feeds from R

I’m working on a project in which I need to systematically parse a number of RSS and Atom feeds from within R. I was somewhat surprised to find that no package currently exists on CRAN to handle this task. So this presented the opportunity for a bit of DIY.

You can find the fruits of my morning’s labour here.

Installing and Loading

The package is currently hosted on GitHub.

> devtools::install_github("DataWookie/feedeR")
> library(feedeR)

Reading an RSS Feed

Although Atom is supposed to be a better format from a technical perspective, RSS is relatively ubiquitous. The vast majority of blogs provide an RSS feed. We’ll look at the feed exposed by R-bloggers.

> rbloggers <- feed.extract("https://feeds.feedburner.com/RBloggers")
> names(rbloggers)
[1] "title"   "link"    "updated" "items"

There are three metadata elements pertaining to the feed.

> rbloggers[1:3]
$title
[1] "R-bloggers"

$link
[1] "https://www.r-bloggers.com"

$updated
[1] "2016-08-06 09:15:54 UTC"

The actual entries in the feed are captured in the items element. For each entry the title, publication date and link are recorded. Feeds often expose additional fields per entry, but these three are generally present.

> nrow(rbloggers$items)
[1] 8
> head(rbloggers$items, 3)
                                                              title                date
1                                                       readr 1.0.0 2016-08-05 20:25:05
2 Map the Life Expectancy in United States with data from Wikipedia 2016-08-05 19:48:53
3 Creating Annotated Data Frames from GEO with the GEOquery package 2016-08-05 19:35:45
                                                                                           link
1                                                       https://www.r-bloggers.com/readr-1-0-0/
2 https://www.r-bloggers.com/map-the-life-expectancy-in-united-states-with-data-from-wikipedia/
3 https://www.r-bloggers.com/creating-annotated-data-frames-from-geo-with-the-geoquery-package/

Reading an Atom Feed

Atom feeds are definitely in the minority, but this format is still used by a number of popular sites. We’ll look at the feed from The R Journal.

> rjournal <- feed.extract("http://journal.r-project.org/rss.atom")

The same three elements of metadata are present.

> rjournal[1:3]
$title
[1] "The R Journal"

$link
[1] "http://journal.r-project.org"

$updated
[1] "2016-07-23 13:16:08 UTC"

Atom feeds do not appear to consistently provide the date on which each of the entries was originally published. The title and link fields are always present though!

> head(rjournal$items, 3)
                                                                                title date
1                         Heteroscedastic Censored and Truncated Regression with crch   NA
2 An Interactive Survey Application for Validating Social Network Analysis Techniques   NA
3            quickpsy: An R Package to Fit Psychometric Functions for Multiple Groups   NA
                                                                     link
1  http://journal.r-project.org/archive/accepted/messner-mayr-zeileis.pdf
2        http://journal.r-project.org/archive/accepted/joblin-mauerer.pdf
3 http://journal.r-project.org/archive/accepted/linares-lopez-moliner.pdf
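
Since feed.extract() returns the entries as a data frame with consistent columns, it’s easy to consolidate items from several feeds. A quick sketch using the two feeds above:

> feeds <- c("https://feeds.feedburner.com/RBloggers", "http://journal.r-project.org/rss.atom")
> items <- do.call(rbind, lapply(feeds, function(url) feed.extract(url)$items))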

Outlook

I’m still testing this across a selection of feeds. If you find a feed that breaks the package, please let me know and I’ll debug as necessary.

Web Scraping and “invalid multibyte string”

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page. Specifically, they have been getting errors like this:

Error in utils::type.convert(out[, i], as.is = TRUE, dec = dec) :
  invalid multibyte string at '<e2>€<94>'

Since I couldn’t reproduce these errors on my machine, the problem appeared to be related to their particular setup. Looking at their locale provided a clue:

> Sys.getlocale()
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;
LC_NUMERIC=C;LC_TIME=Korean_Korea.949"

whereas on my machine I have:

> Sys.getlocale()
[1] "LC_CTYPE=en_ZA.UTF-8;LC_NUMERIC=C;LC_TIME=en_ZA.UTF-8;LC_COLLATE=en_ZA.UTF-8;
LC_MONETARY=en_ZA.UTF-8;LC_MESSAGES=en_ZA.UTF-8;LC_PAPER=en_ZA.UTF-8;LC_NAME=C;LC_ADDRESS=C;
LC_TELEPHONE=C;LC_MEASUREMENT=en_ZA.UTF-8;LC_IDENTIFICATION=C"

The document that they were trying to scrape is encoded in UTF-8, which appears in my locale but not in theirs. Perhaps changing the locale will sort out the problem. Since the en_ZA locale is a bit of an acquired taste (unless you’re South African, in which case it’s de rigueur!), the following should resolve the problem:

> Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

It’s possible that this command will produce an error stating that it cannot be honoured by your system. Do not fear. Try the following instead (which seems to work almost universally):

> Sys.setlocale("LC_ALL", "English")

Try scraping again. Your issues should be resolved.
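
Another option which may be worth trying (I haven’t been able to test it on an affected machine) is to specify the encoding explicitly when reading the page, since read_html() accepts an encoding argument:

> library(xml2)
> # 'url' is the address of the Wikipedia page in question.
> page <- read_html(url, encoding = "UTF-8")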

Sportsbook Betting (Part 1): Odds


This series of articles was written as support material for Statistics exercises in a course that I’m teaching for iXperience. In the series I’ll be using illustrative examples for wagering on a variety of Sportsbook events including Horse Racing, Rugby and Tennis. The same principles can be applied across essentially all betting markets.

Odds

To make some sense of gambling we’ll need to understand the relationship between odds and probability. Odds can be expressed as either “odds on” or “odds against”. Whereas the former is the odds in favour of an event taking place, the latter reflects the odds that an event will not happen. Odds against is the form in which gambling odds are normally expressed, so we’ll focus on that. The odds against are defined as the ratio, L/W, of losing outcomes (L) to winning outcomes (W). To link these odds to probabilities we note that the winning probability, p, is W/(L+W). The odds against are thus equivalent to (1-p)/p.

To make this more concrete, consider the odds against rolling a 6 with a single die. The number of losing outcomes is L = 5 (for all of the other numbers on the die: 1, 2, 3, 4 and 5) while the number of winning outcomes is W = 1. The odds against are thus 5/1, while the winning probability is 1/(5+1) = 1/6.
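
In R the dice example works out as follows:

> L <- 5                         # Losing outcomes: 1, 2, 3, 4 and 5
> W <- 1                         # Winning outcome: 6
> L / W                          # Odds against
[1] 5
> (p <- W / (L + W))             # Winning probability
[1] 0.1666667
> (1 - p) / p                    # Recovers the odds against
[1] 5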

Fractional Odds

Fractional odds are quoted as L/W, L:W or L-W. From a gambler’s perspective these odds reflect the net winnings relative to the stake. For example, fractional odds of 5/1 imply that the gambler stands to make a profit of 50 on a stake of 10. In addition to the profit, a winning gambler gets the stake back too. So, in the previous scenario, the gambler would receive a total of 60. Conversely, fractional odds of 1/2 would pay out 10 for a stake of 20. Odds of 1/1 are known as “even odds” or “even money”, and pay out the same amount as was wagered.

The numerator and denominator in fractional odds are always integers.

In a fair game the quoted odds match the true probabilities, so a player who places a wager at fractional odds of L/W stands to win L/W times the stake and has an expected profit of zero.

Decimal Odds

Decimal odds quote the ratio of the full payout (including the original stake) to the stake. Using the same symbols as above, this is equivalent to the ratio (L+W)/W or 1+L/W, so the decimal odds are numerically equal to the fractional odds plus 1. In a fair game the decimal odds are also the inverse of the probability of a winning outcome, since the inverse of the decimal odds is W/(L+W), which is precisely the winning probability p.

From a gambler’s perspective, decimal odds reflect the gross total which will be paid out relative to the stake. For example, decimal odds of 6.0 are equivalent to fractional odds of 5/1 and imply that the gambler stands to get back 60 on a stake of 10. Similarly, decimal odds of 1.5 are the same as fractional odds of 1/2, and a winning gambler would get back 30 on a wager of 20.

Decimal odds are quoted as a positive number greater than 1.
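
These relationships are easy to verify in R using the 5/1 and 1/2 examples from above:

> fractional <- c(5 / 1, 1 / 2)
> (decimal <- fractional + 1)    # Decimal odds
[1] 6.0 1.5
> 1 / decimal                    # Implied winning probability (fair game)
[1] 0.1666667 0.6666667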

Odds and Probability

As indicated above, there is a direct relationship between odds and probabilities. For a fair game this relationship is simple: each probability is the reciprocal of the corresponding decimal odds, and the probabilities of all possible outcomes sum to 1.

The reciprocal relationship between decimal odds and probabilities implies that the outcomes with the lowest odds are the most likely to be realised. This might not tie in with the conventional understanding of odds, but it is a consequence of the fact that we are quoting the odds against each outcome.

Example: Fair Odds on Rugby

The Crusaders are playing the Hurricanes at the AMI Stadium. A bookmaker is offering 1/2 odds on the Crusaders and 2/1 odds on the Hurricanes. These fractional odds translate into decimal odds of 1.5 and 3.0 respectively. Based on these odds, the implied probabilities of either team winning are

> (odds = c(Crusaders = 1.5, Hurricanes = 3))
 Crusaders Hurricanes 
       1.5        3.0 
> (probability = 1 / odds)
 Crusaders Hurricanes 
   0.66667    0.33333 

The Crusaders are perceived as being twice as likely to win. Since they are clearly the favourites for this match it stands to reason that there would be more wagers placed on the Crusaders than on the Hurricanes. In fact, on the basis of the odds we would expect there to be roughly twice as much money placed on the Crusaders.

A successful wager of 10 on the Crusaders would yield a net win of 5, while the same wager on the Hurricanes would stand to yield a net win of 20. If we include the initial stake then we get the corresponding gross payouts of 15 and 30.

> (odds - 1) * 10                                          # Net win
 Crusaders Hurricanes 
         5         20 
> odds * 10                                                # Gross Win
 Crusaders Hurricanes 
        15         30 

In keeping with the reasoning above, suppose that a total of 2000 was wagered on the Crusaders and 1000 was wagered on the Hurricanes. In the event of a win by the Crusaders the bookmaker would keep the 1000 wagered on the Hurricanes, but pay out 1000 on the Crusaders wagers, leaving no net profit. Similarly, if the Hurricanes won then the bookmaker would pocket the 2000 wagered on the Crusaders but pay out 2000 on the Hurricanes wagers, again leaving no net profit. The bookmaker’s expected profit based on either outcome is zero. This does not represent a very lucrative scenario for a bookmaker. But, after all, this is a fair game.
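
A quick check of the bookmaker’s book under this fair scenario confirms the reasoning:

> wagers <- c(Crusaders = 2000, Hurricanes = 1000)
> # Bookmaker's profit if each side wins: total stakes received minus the
> # gross payout (stake plus winnings) to backers of the winning side.
> sum(wagers) - wagers * odds
 Crusaders Hurricanes 
         0          0 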

From a punter’s perspective, a wager on the Crusaders is more likely to be successful, but is not particularly rewarding. By contrast, the likelihood of a wager on the Hurricanes paying out is lower, but the potential reward is appreciably higher. The choice of a side to bet on would then be dictated by the punter’s appetite for risk and excitement (or perhaps simply their allegiance to one team or the other).

The expected outcome of a wager on either the Crusaders or the Hurricanes, which weights each payout by its probability, is zero.

> (probability = c(win = 2, lose = 1) / 3)                 # Wager on Crusaders
    win    lose 
0.66667 0.33333 
> payout = c(win = 0.5, lose = -1)
> sum(probability * payout)
[1] 0
> (probability = c(win = 1, lose = 2) / 3)                 # Wager on Hurricanes
    win    lose 
0.33333 0.66667 
> payout = c(win = 2, lose = -1)
> sum(probability * payout)
[1] 0

Again this is because the odds represent a fair game.

Most games of chance are not fair, so the situation above represents a special (and not very realistic) case. Let’s look at a second example which presents the actual odds being quoted by a real bookmaker.

Example: Real Odds on Tennis

The odds below are from an online betting website for the tennis match between Madison Keys and Venus Williams. These are real, live odds, so the implications for both the punter and the bookmaker differ from the fair game above.

odds-tennis

We’ll focus our attention on the overall winner, for which the decimal odds on Madison Keys are 1.83, while those on Venus Williams are 2.00.

> (odds = c(Madison = 1.83, Venus = 2.00))
Madison   Venus 
   1.83    2.00 
> (probability = 1 / odds)
Madison   Venus 
0.54645 0.50000 

The first thing to observe is that the implied probabilities do not sum to 1. We’ll return to this point in the next article.

The odds quoted for the two players are very similar, which implies that the bookmaker considers them to be evenly matched. Madison Keys has slightly lower odds, which suggests that she is regarded as the slightly stronger contender. A wager on either player will not yield major rewards because of the low odds, but at the same time a wager on either player has a similar probability of being successful: both around 50%.

Let’s look at another match. Below are the odds from the same online betting website for the game between Novak Djokovic and Radek Stepanek.

odds-tennis-men

The odds for this match are profoundly different to those for the ladies’ match above.

> (odds = c(Novak = 1.03, Radek = 16.00))
Novak Radek 
 1.03 16.00 
> (probability = 1 / odds)
  Novak   Radek 
0.97087 0.06250 

Novak Djokovic is considered to be the almost certain winner. A wager on him thus has the potential to yield a net return of only 3%. Radek Stepanek, on the other hand, is a rank outsider in this match. His perceived chance of winning is low and, as a result, the potential returns should he win are large.
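
The contrast is stark when expressed as the net win on a stake of 100:

> (odds - 1) * 100                                         # Net win on a stake of 100
Novak Radek 
    3  1500 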

To find out more about converting between different forms of odds and the corresponding implied probabilities, have a look at this tutorial.

In the next instalment we’ll examine how bookmakers’ odds ensure their profit yet provide a potentially rewarding (and entertaining) experience for gamblers.

Building a Life Table

After writing my previous post, Mortality by Year and Age, I’ve become progressively more interested in the mortality data. Perhaps those actuaries are onto something? I found this report, which has a wealth of pertinent information. On p. 13 the report gives details on constructing a Life Table, which is one of the fundamental tools in Actuarial Science. The lifespan package has all of the data required to construct a Life Table, so I created a lifetable data frame which has those data broken down by gender.

> library(lifespan)
> library(dplyr)                 # Provides the %>% pipe used below (magrittr would work too).
> subset(lifetable, sex == "M") %>% head
  x sex     lx      dx         qx
1 0   M 100000 596.534 0.00596534
2 1   M  99403 256.848 0.00258389
3 2   M  99147 174.077 0.00175575
4 3   M  98973 114.213 0.00115398
5 4   M  98858  83.082 0.00084041
6 5   M  98775  71.536 0.00072423
> subset(lifetable, sex == "F") %>% head
    x sex     lx      dx         qx
133 0   F 100000 452.585 0.00452585
134 1   F  99547 203.525 0.00204450
135 2   F  99344 130.223 0.00131083
136 3   F  99214  84.746 0.00085418
137 4   F  99129  62.055 0.00062600
138 5   F  99067  54.475 0.00054988

The columns in the data above should be interpreted as follows:

  • lx is the number of people who have survived to age x, based on an initial cohort of 100 000 people;
  • dx is the expected number of people in the cohort who die aged x on their last birthday; and
  • qx is the probability that someone aged x will die before reaching age x+1 (a quick consistency check of these columns follows below).
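
These columns are internally consistent: qx should be dx / lx, and the lx for the next age should be lx - dx (up to the rounding of lx in the printed table). A quick check on the male data:

> # qx should match dx / lx, up to rounding of the printed lx values.
> with(subset(lifetable, sex == "M"), head(dx / lx))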

A plot gives a high level overview of the data. Below lx is plotted as a function of age. Click on the image to access an interactive Plotly version. The cohort size has been renormalised so that lx is expressed as a percent. It’s readily apparent that the attrition rate is much higher for males than females, and that very few people survive beyond the age of 105.

life-table

Using these data we can also calculate some related conditional probabilities. For example, what is the probability that a person aged 70 will live for at least another 5 years?

> survival(70, 5)
      F       M 
0.87916 0.80709 
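
Presumably this is just a ratio of survivors in the table above: the number of people still alive at 75 divided by the number alive at 70. A quick sketch of that check, which should come out close to the values reported by survival():

> # Probability that a 70 year old male survives to at least 75.
> with(subset(lifetable, sex == "M"), lx[x == 75] / lx[x == 70])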

Another example: what is the probability that a person aged 70 will live for at least another 5 years but then die within the 10 years after that?

> survival(70, 5, 10)
      F       M 
0.37472 0.46714

Interesting stuff! Everything indicates that in terms of longevity, females have the upper hand.

Somebody made the following witty comment on LinkedIn in response to my previous post:

Just good to know that death risk visibly decreases after 100y/o. This helps.

Well, yes and no. In an absolute sense your risk of dying after the age of 100 is relatively low. But the reason for this is that the probability of you actually making it to the age of 100 is low. If, however, you do manage to achieve this monumental age, then your risk of dying is rather high.

> survival(100)
      F       M 
0.65813 0.60958
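
Taking the complement of these values gives the probability of dying within the year at age 100:

> 1 - survival(100)              # Roughly 34% for women and 39% for men.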

So men aged 100 have a 39% probability of dying before reaching the age of 101, while for women the probability is 34%.

Note that there are also life table data in the babynames package.