Neha Narula: The future of money

There are actually people all over the world running this software, and we call them Bitcoin miners. Anyone can become a Bitcoin miner. You can go download the software right now and run it on your computer and try to collect some bitcoins. I can’t say that I would recommend it, because right now, the puzzle is so hard and the network is so powerful, that if I tried to mine Bitcoin on my laptop, I probably wouldn’t see any for about two million years. The miners, professional miners, use this special hardware that’s designed to solve the puzzle really fast. Now, the Bitcoin network and all of this special hardware, there are estimates that the amount of energy it uses is equivalent to that of a small country. So, the first set of cryptocurrencies are a little bit slow and a little bit cumbersome. But the next generation is going to be so much better and so much faster.
Neha Narula

View POST Data using Chrome Developer Tools

When figuring out how to formulate the contents of a POST request, it’s often useful to see the “typical” fields submitted directly from a web form.

  1. Open Developer Tools in Chrome. Select the Network tab (at the top).
  2. Submit the form. Watch the magic happening in the Network tab of Developer Tools.

    The Network tab in Chrome Developer Tools.

  3. Click on the first document listed in the Network tab, then select the Headers tab.

    The Headers tab showing the details of the POST payload.

That’s just scratching the surface of the wealth of information available on the Network tab. Read this to find out more.
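
Once you’ve identified the fields being submitted, it’s straightforward to replicate the request from a script. Here’s a minimal sketch using the httr package in R, where the URL and field names are hypothetical stand-ins for whatever your form actually submits:

> library(httr)
> 
> # The URL and field names below are hypothetical: copy the real ones from
> # the Form Data section of the Headers tab.
> response <- POST("http://www.example.com/login",
+                  body = list(username = "wookie", password = "secret"),
+                  encode = "form")
> status_code(response)                        # 200 indicates success.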

Deleting All Nodes and Relationships

Seems that I am doing this a lot: deleting my entire graph (all nodes and relationships) and rebuilding from scratch. I guess that this is part of the learning process.

sample-graph

Route 1: Delete Relationships then Nodes

A relationship is constrained to join a start node to an end node (a relationship may begin and end on the same node), so every relationship must be attached to nodes. No such constraint exists for nodes themselves. As a result relationships must be deleted before nodes are deleted.

Delete all relationships using either the (now deprecated) START syntax

$ START r = RELATIONSHIP(*) DELETE r;

or

$ MATCH ()-[r]-() DELETE r;

Then delete the nodes with

$ MATCH (n) DELETE n;

Route 2: Detach and Delete

Using DETACH DELETE, introduced in Neo4j 2.3, it’s possible to delete relationships and nodes at once.

$ MATCH (n) DETACH DELETE n;

Check

Confirm that all nodes and relationships have gone.

$ MATCH (n) RETURN COUNT(n);
$ MATCH ()-[r]->() RETURN COUNT(r);

Or, alternatively, using the deprecated START syntax:

$ START n = NODE(*) return COUNT(n);
$ START r = RELATIONSHIP(*) return COUNT(r);

Running Cypher Queries from File on Windows

lazywebcypher

Recent packages of Neo4j for Windows do not include neo4j-shell. The Neo4j browser will only accept one statement at a time, making scripts consisting of multiple Cypher commands a problem.

Have a look at LazyWebCypher, which will allow you to execute an arbitrary volume of Cypher code on your local Neo4j server.

There are multiple avenues to upload your script: provide a URL, select a file or simply paste the code. Then move to the other tab and hit the Import button.

Voila!

Remote Access to Neo4j on Windows

Accessing the Neo4j server running on your local machine is simple: just point your browser to http://localhost:7474/. But with the default configuration the server is not accessible from other machines, which means that other folk can’t share in the wonder of your nodes and edges.

Enabling remote access is simple.

  1. Shut down your running Neo4j server.
  2. Press the Options button, which will bring up a dialog like this:
    neo4j-options

  3. Press the top Edit button, which will open the neo4j.conf file in an editor.
  4. Browse to the HTTP Connector section and add the uncommented address line shown at the bottom of the snippet below.
    # HTTP Connector
    #
    dbms.connector.http.type=HTTP
    dbms.connector.http.enabled=true
    #
    # dbms.connector.http.address=0.0.0.0:#{default.http.port}
    dbms.connector.http.address=0.0.0.0:7474
    

Your server will now be available to other machines via port 7474. From the local machine it’ll still be at http://localhost:7474/.

The above solution works (at least) for Neo4j 3.0.4.
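
If you want to verify from code that the server is reachable from another machine, something like this sketch (in R, using the httr package, with a placeholder IP address) will do:

> library(httr)
> 
> # 192.168.0.10 is a placeholder for the server's actual address.
> status_code(GET("http://192.168.0.10:7474/"))    # Expect 200.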

Installing Neo4j on Ubuntu 16.04

Some instructions for installing Neo4j on Ubuntu 16.04. More for my own benefit than anything else.

neo4j-logo

First you need to become root.

$ sudo /bin/bash
#

Installing Java

Neo4j is implemented in Java, so you’ll need to have the Java Runtime Environment (JRE) installed. If you already have this up and running, go ahead and skip this step.

# apt install default-jre default-jre-headless

Check whether you can now run the java executable.

# java -version

If that works for you, great! It didn’t immediately work on one of my machines. Strangely there were some dangling links in the alternatives system (which, to be honest, I was not even aware of until then!). It took a bit of Googling to figure this out, but the issue was resolved with the following:

# update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/bin/java
# update-alternatives --set javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac

Installing Neo4j

First we’ll add the repository signing key to APT’s keyring.

# wget --no-check-certificate -O - https://debian.neo4j.org/neotechnology.gpg.key | apt-key add -

Then add the repository to the list of apt sources.

# echo 'deb http://debian.neo4j.org/repo stable/' >/etc/apt/sources.list.d/neo4j.list

Finally update the repository information and install Neo4j.

# apt update
# apt install neo4j

The server should have started automatically and will also be started at boot. If necessary the server can be stopped with

# service neo4j stop

and restarted with

# service neo4j start

At this stage you can drop root privileges.

# exit
$

Accessing Neo4j

You should now be able to access the database server via http://localhost:7474/browser/.

I had some problems logging in with the default username and password (neo4j and neo4j), but this was easily resolved by deleting the file /var/lib/neo4j/data/dbms/auth and restarting the server.
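
If you’d prefer to check the installation from code rather than the browser, you can run a Cypher statement against the HTTP transactional endpoint exposed by Neo4j 3.x. A rough sketch using the httr package in R (substitute whatever password you’ve set for the neo4j user):

> library(httr)
> 
> response <- POST("http://localhost:7474/db/data/transaction/commit",
+                  authenticate("neo4j", "neo4j"),
+                  content_type_json(),
+                  body = '{"statements": [{"statement": "MATCH (n) RETURN COUNT(n)"}]}')
> content(response)$results[[1]]$data[[1]]$row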

PLOS Subject Keywords: Association Rules

In a previous post I detailed the process of compiling data on subject keywords used in articles published in PLOS journals. In this instalment I’ll be using those data to mine Association Rules with the arules package. A good reference on the topic of Association Rules is the paper by Hahsler, Grün and Hornik, “arules – A Computational Environment for Mining Association Rules and Frequent Item Sets” (Journal of Statistical Software, 2005).

association-rule-image

Terminology

Data suitable for mining Association Rules should consist of:

  • a set of uniquely identified transactions, where
  • each transaction should have one or more items, where
  • items are binary attributes.

The derived rules take the form X → Y, where X and Y are disjoint itemsets, each consisting of one or more items. The itemsets X and Y are known as the antecedent (lhs or left hand side) and consequent (rhs or right hand side). The rules should be interpreted as “If X is present then so too is Y”.

An Association Rules analysis aims to identify pairs (or groups) of items that are commonly found together. A shopping analogy would be that bread is often purchased with milk, while peanuts are often purchased with beer. This kind of analysis is not confined to such “consumption” scenarios: it can be applied in any situation where a discrete set of items is associated with individual transactions.

In the context of the data we previously gathered from PLOS, where every article is tagged with one or more subjects, each of the articles is a “transaction” and the subjects are then the “items”. We’ll be deriving rules for which subjects commonly occur together. Or, more specifically, we’ll be generating rules like “If an article is tagged with subject X then it is probably also tagged with subject Y”.
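
To make the terminology concrete, here’s a toy version of what we’re about to do at scale, with two hypothetical “articles” whose items are their subject tags (a sketch; the real data follow below):

> library(arules)
> 
> # Each list element is a transaction; its entries are the items.
> toy <- list(c("Gene expression", "Mouse models"),
+             c("Gene expression", "Apoptosis"))
> toy <- as(toy, "transactions")
> summary(toy)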

Transaction Matrix

The arules package derives Association Rules from a Transaction Matrix. The form in which we have the subjects data is particularly convenient for building a sparse matrix (class ngCMatrix from the Matrix package).

> head(subjects)
                           doi      journal              subject count
1 10.1371/journal.pbio.0000007 PLOS Biology               Borneo     1
2 10.1371/journal.pbio.0000007 PLOS Biology Conservation science     1
3 10.1371/journal.pbio.0000007 PLOS Biology            Elephants     1
4 10.1371/journal.pbio.0000007 PLOS Biology   Endangered species     2
5 10.1371/journal.pbio.0000007 PLOS Biology        Plant fossils     2
6 10.1371/journal.pbio.0000007 PLOS Biology    Pleistocene epoch     1

For the purposes of the arules package the ngCMatrix needs to have items (in this case subjects) as rows and transactions (in this case articles) as columns.

> library(Matrix)
> 
> subjects.matrix <- with(subjects, sparseMatrix(i = as.integer(subject),
+                                                j = as.integer(doi),
+                                                dimnames = list(levels(subject),
+                                                                      levels(doi))))
> dim(subjects.matrix)
[1]   9357 185984
> class(subjects.matrix)
[1] "ngCMatrix"
attr(,"package")
[1] "Matrix"

There are 185984 articles and 9357 subjects. Next we coerce this into a transactions object.

> library(arules)
> 
> subjects.matrix <- as(subjects.matrix, "transactions")
> class(subjects.matrix)
[1] "transactions"
attr(,"package")
[1] "arules"

Here’s some summary information. We see that the vast majority of articles are associated with eight subjects.

> summary(subjects.matrix)
transactions as itemMatrix in sparse format with
 185984 rows (elements/itemsets/transactions) and
 9357 columns (items) and a density of 0.000824 

most frequent items:
          Gene expression Polymerase chain reaction          Mouse models             Apoptosis                none 
                    17819                      9458                  8773                  7630                7593 
                  (Other) 
                  1382690 

element (itemset/transaction) length distribution:
sizes
     1      2      3      4      5      6      7      8 
  7625     16     22     34     26     30     54 178177 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00    8.00    7.71    8.00    8.00 

includes extended item information - examples:
          labels
1     293T cells
2 3D bioprinting
3    3D printing

includes extended transaction information - examples:
                                                itemsetID
1 10.1371/annotation/008b05a8-229b-4aca-94ae-91e6dd5ca5ba
2 10.1371/annotation/00a3b22e-36a9-4d51-89e5-1e6561e7a1e9
3 10.1371/annotation/00d17a45-7b78-4fd5-9a9a-0f2e49b04eee

Generating Rules (Default Settings)

There are two major algorithms for generating Association Rules: Apriori and Eclat. We’ll be using the former here. We’ll try to derive some rules using apriori() with default parameters.

> rules <- apriori(subjects.matrix)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 18598 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.13s].
sorting and recoding items ... [0 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].

Zero rules! Well, that’s a little disappointing. But not entirely surprising: the default minimum thresholds on support (0.1) and confidence (0.8) are rather conservative. (I’ll explain what support and confidence mean shortly.) We’ll relax them in order to at least generate a decent selection of rules.

Generating Rules (Relaxed Settings)

Reducing the thresholds on support and confidence to 0.002 and 0.75 respectively results in 35 rules. Lower values for these thresholds are necessary because there is a relatively small degree of subject overlap between the articles in the collection. Not surprising since they are derived from a wide range of disciplines!

> rules <- apriori(subjects.matrix, parameter = list(support = 0.002, confidence = 0.75))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.75    0.1    1 none FALSE            TRUE   0.002      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 371 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.14s].
sorting and recoding items ... [838 item(s)] done [0.01s].
creating transaction tree ... done [0.10s].
checking subsets of size 1 2 3 4 done [0.04s].
writing ... [35 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

Below is some summary information on those rules. We see that the largest rule length (total number of items on lhs and rhs of rule) is only 3.

> summary(rules)
set of 35 rules

rule length distribution (lhs + rhs):sizes
 2  3 
25 10 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.00    2.00    2.29    3.00    3.00 

summary of quality measures:
    support          confidence         lift       
 Min.   :0.00203   Min.   :0.756   Min.   :  9.34  
 1st Qu.:0.00233   1st Qu.:0.795   1st Qu.: 27.89  
 Median :0.00290   Median :0.858   Median : 39.60  
 Mean   :0.00337   Mean   :0.859   Mean   : 66.18  
 3rd Qu.:0.00365   3rd Qu.:0.923   3rd Qu.: 90.97  
 Max.   :0.01154   Max.   :1.000   Max.   :233.23  

mining info:
            data ntransactions support confidence
 subjects.matrix        185984   0.002       0.75

This seems like a good time to peruse the rules themselves. To see the details we need to use inspect().

> inspect(rules)
   lhs                                        rhs                                        support   confidence lift    
1  {Memory T cells}                        => {T cells}                                  0.0021292 0.98020     28.1633
2  {Circadian oscillators}                 => {Circadian rhythms}                        0.0021292 0.85900    233.2272
3  {HbA1c}                                 => {Diabetes mellitus}                        0.0023550 0.90123     38.5500
4  {Secondary lung tumors}                 => {Lung and intrathoracic tumors}            0.0026346 1.00000    123.9067
5  {H1N1}                                  => {Influenza}                                0.0020432 0.76923     90.7770
6  {Corals}                                => {Coral reefs}                              0.0020271 0.75551    199.5923
7  {T helper cells}                        => {T cells}                                  0.0022314 0.82341     23.6585
8  {Face recognition}                      => {Face}                                     0.0021400 0.77282    191.3866
9  {T cell receptors}                      => {T cells}                                  0.0026669 0.91176     26.1971
10 {Breast tumors}                         => {Breast cancer}                            0.0024142 0.79610     57.2330
11 {Forest ecology}                        => {Forests}                                  0.0030271 0.94147     93.3362
12 {Antibody response}                     => {Antibodies}                               0.0028336 0.87542     39.0439
13 {Surgical oncology}                     => {Surgical and invasive medical procedures} 0.0024626 0.75702     50.3737
14 {Prostate gland}                        => {Prostate cancer}                          0.0030218 0.79603    130.7859
15 {HIV prevention}                        => {HIV}                                      0.0035003 0.90669     42.3372
16 {Tuberculosis diagnosis and management} => {Tuberculosis}                             0.0036885 0.93333     91.1686
17 {Geriatrics}                            => {Elderly}                                  0.0033229 0.80469    125.9756
18 {HIV diagnosis and management}          => {HIV}                                      0.0036186 0.81675     38.1376
19 {HIV epidemiology}                      => {HIV}                                      0.0038390 0.85817     40.0719
20 {Regulatory T cells}                    => {T cells}                                  0.0040971 0.87687     25.1945
21 {Chemotherapy}                          => {Cancer treatment}                         0.0040917 0.77259     22.3432
22 {Multiple alignment calculation}        => {Sequence alignment}                       0.0050811 0.85987     30.8255
23 {Malarial parasites}                    => {Malaria}                                  0.0051617 0.79668     74.8332
24 {HIV infections}                        => {HIV}                                      0.0076888 0.84816     39.6044
25 {Cytotoxic T cells}                     => {T cells}                                  0.0115386 0.95846     27.5388
26 {HIV epidemiology,HIV infections}       => {HIV}                                      0.0020432 0.95000     44.3597
27 {Gene expression,Regulator genes}       => {Gene regulation}                          0.0023174 0.77798     26.6075
28 {Malarial parasites,Parasitic diseases} => {Malaria}                                  0.0030863 0.78415     73.6565
29 {Malaria,Parasitic diseases}            => {Malarial parasites}                       0.0030863 0.82117    126.7428
30 {Phylogenetics,Sequence alignment}      => {Phylogenetic analysis}                    0.0022260 0.79310     33.2517
31 {Cytotoxic T cells,Flow cytometry}      => {T cells}                                  0.0023497 0.98202     28.2157
32 {Cytotoxic T cells,Immune response}     => {T cells}                                  0.0033121 0.96100     27.6117
33 {Cytokines,Cytotoxic T cells}           => {T cells}                                  0.0028981 0.96078     27.6055
34 {Enzyme-linked immunoassays,Vaccines}   => {Antibodies}                               0.0026293 0.77619     34.6185
35 {Gene regulation,Microarrays}           => {Gene expression}                          0.0041885 0.89437      9.3349

Each of the rules consists of two itemsets, a lhs and a rhs, with the implication that if the lhs itemset is selected then so too is the rhs itemset.

Rule Metrics

Naturally some rules are stronger than others. Their relative quality is measured by a set of metrics: support, confidence and lift.

Support

The support for an itemset is the proportion of transactions which contain that itemset. The support for a rule is the proportion of transactions which contain both the antecedent and consequent itemsets.
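
In symbols, for a collection of transactions T (here, the articles):

\[ \mathrm{supp}(X \rightarrow Y) = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|T|} \]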

The five rules below are those with the highest support. The rule {Cytotoxic T cells} → {T cells} has support of 0.0115386, which means that “Cytotoxic T cells” and “T cells” are present together in 1.15% of transactions. Likewise, the rule {Gene regulation,Microarrays} → {Gene expression} has support of 0.0041885, indicating that “Gene regulation”, “Microarrays” and “Gene expression” appear together in 0.42% of transactions.

> inspect(head(sort(rules, by = "support"), n = 5))
   lhs                                 rhs                  support   confidence lift   
25 {Cytotoxic T cells}              => {T cells}            0.0115386 0.95846    27.5388
24 {HIV infections}                 => {HIV}                0.0076888 0.84816    39.6044
23 {Malarial parasites}             => {Malaria}            0.0051617 0.79668    74.8332
22 {Multiple alignment calculation} => {Sequence alignment} 0.0050811 0.85987    30.8255
35 {Gene regulation,Microarrays}    => {Gene expression}    0.0041885 0.89437     9.3349

Support does not directly indicate the strength of the rule, just how often the components of the rule are present in the data. However, having a decent level of support for a rule is important because it indicates what proportion of the data contributed to deriving that rule. Obviously if a rule is based on only a few transactions then it is not likely to be very robust.

Confidence

The confidence assigned to a rule is the proportion of transactions which contain both the antecedent and consequent relative to those which contain the antecedent. Equivalently, this is the ratio of the support for the rule to the support for the antecedent. Alternatively confidence is the probability of the rhs being present in a transaction conditional on the lhs also being present.
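
As a formula:

\[ \mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \rightarrow Y)}{\mathrm{supp}(X)} = P(Y \mid X) \]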

The five rules below are those with the highest confidence. For example, we see that articles with subjects which include “Secondary lung tumors” will certainly also contain “Lung and intrathoracic tumors”. Similarly, articles which have both “Cytokines” and “Cytotoxic T cells” as subjects will very likely also have “T cells”.

> inspect(head(sort(rules, by = "confidence"), n = 5))
   lhs                                    rhs                             support   confidence lift   
4  {Secondary lung tumors}             => {Lung and intrathoracic tumors} 0.0026346 1.00000    123.907
31 {Cytotoxic T cells,Flow cytometry}  => {T cells}                       0.0023497 0.98202     28.216
1  {Memory T cells}                    => {T cells}                       0.0021292 0.98020     28.163
32 {Cytotoxic T cells,Immune response} => {T cells}                       0.0033121 0.96100     27.612
33 {Cytokines,Cytotoxic T cells}       => {T cells}                       0.0028981 0.96078     27.606

Lift

The lift of a rule indicates how much greater the support for the rule is than would be expected if the antecedent and consequent were independent. Equivalently, lift is the ratio of the confidence in a rule to the expected confidence were the antecedent and consequent independent. A useful interpretation of lift is the factor by which the probability of the consequent increases when the antecedent is present: it’s the ratio of the conditional probability of the consequent given the antecedent to the marginal probability of the consequent. Written out as an expression it makes more sense.
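
Using the support and confidence defined above:

\[ \mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{supp}(X \rightarrow Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)} = \frac{\mathrm{conf}(X \rightarrow Y)}{\mathrm{supp}(Y)} = \frac{P(Y \mid X)}{P(Y)} \]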

A lift of 1 indicates that the antecedent and consequent are independent. In the rules below we see that the presence of “Circadian oscillators” results in a massive increase in the likelihood of the presence of “Circadian rhythms”. Similarly, if both “Malaria” and “Parasitic diseases” are present then the probability of “Malarial parasites” being present increases more than a hundredfold.

> inspect(head(sort(rules, by = "lift"), n = 5))
   lhs                             rhs                  support   confidence lift  
2  {Circadian oscillators}      => {Circadian rhythms}  0.0021292 0.85900    233.23
6  {Corals}                     => {Coral reefs}        0.0020271 0.75551    199.59
8  {Face recognition}           => {Face}               0.0021400 0.77282    191.39
14 {Prostate gland}             => {Prostate cancer}    0.0030218 0.79603    130.79
29 {Malaria,Parasitic diseases} => {Malarial parasites} 0.0030863 0.82117    126.74

Rule Selection

Before we look at rule selection we’ll generate a more extensive assortment of rules by further lowering the thresholds for support and confidence.

> rules <- apriori(subjects.matrix, parameter = list(support = 0.001, confidence = 0.25))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE   0.001      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 185 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[9357 item(s), 185984 transaction(s)] done [0.15s].
sorting and recoding items ... [1600 item(s)] done [0.02s].
creating transaction tree ... done [0.13s].
checking subsets of size 1 2 3 4 done [0.07s].
writing ... [984 rule(s)] done [0.00s].
creating S4 object  ... done [0.04s].

You can use the subset() function to focus on particular rules of interest. What other subjects are commonly associated with “HIV”?

> subset(rules, subset = lhs %in% "HIV")
set of 19 rules 

So there is quite a selection of them. We can narrow that down by focusing on those which have a relatively high level of confidence.

> inspect(subset(rules, subset = lhs %in% "HIV" & confidence > 0.5))
    lhs                                            rhs              support   confidence lift  
677 {HIV,HIV prevention}                        => {HIV infections} 0.0018174 0.51920    57.274
685 {HIV,Tuberculosis diagnosis and management} => {Tuberculosis}   0.0010646 0.92958    90.802
694 {HIV,HIV epidemiology}                      => {HIV infections} 0.0020432 0.53221    58.709
844 {Cytotoxic T cells,HIV}                     => {T cells}        0.0010108 0.97917    28.134

Selection criteria are applied in subset() using the operators %in% (does item appear in itemset?), %pin% (like %in% but with partial matching) and %ain% (match all items specified) to operate on lhs and rhs. Arithmetic comparisons are used on the rule metrics.
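
For example, the first query below picks out rules whose antecedent contains both “Malaria” and “Parasitic diseases”, while the second matches rules whose consequent includes any subject containing the string “T cell” (output omitted):

> inspect(subset(rules, subset = lhs %ain% c("Malaria", "Parasitic diseases")))
> inspect(subset(rules, subset = rhs %pin% "T cell"))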

Here’s another example which indicates that articles with subject “Dementia” will also have either “Alzheimer disease” or “Cognitive impairment” as a subject roughly 50% of the time.

> inspect(subset(rules, subset = lhs %in% "Dementia"))
    lhs           rhs                    support   confidence lift  
101 {Dementia} => {Alzheimer disease}    0.0011022 0.53385    67.681
102 {Dementia} => {Cognitive impairment} 0.0011399 0.55208    55.502

Symmetry

It might have occurred to you that these rules should be symmetric: if X → Y then surely Y → X too? Both directions are indeed generated (provided that each clears the support and confidence thresholds), and the support and lift of such a pair are identical, but the confidence generally differs. Consider the four rules below, which consist of two symmetric pairs.

> inspect(subset(rules, subset = lhs %in% "Fungal genetics" | rhs %in% "Fungal genetics"))
    lhs                  rhs               support   confidence lift   
165 {Fungal genetics} => {Fungal genomics} 0.0010646 0.38900    135.482
166 {Fungal genomics} => {Fungal genetics} 0.0010646 0.37079    135.482
167 {Fungal genetics} => {Fungi}           0.0016453 0.60118     94.195
168 {Fungi}           => {Fungal genetics} 0.0016453 0.25779     94.195

The rules {Fungal genetics} → {Fungi} and {Fungi} → {Fungal genetics} form one symmetric pair. Note that the support and lift are the same for both rules, but that the first rule has a higher confidence than the second. This is due to different supports for the antecedent in each case. Whereas “Fungal genetics” is a subject for 509 articles, “Fungi” appears as a subject for 1187 articles. The corresponding values of support are 0.0027368 and 0.0063823 respectively. Since “Fungi” is more than twice as common, the confidence in the second rule is diminished significantly. This emphasises the fact that the rules with the highest confidence are those for which the support for the antecedent is only marginally higher than the support for the rule itself.
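
Working through the arithmetic confirms the confidence values reported for this pair:

\[ \mathrm{conf}(\{\text{Fungal genetics}\} \rightarrow \{\text{Fungi}\}) = \frac{0.0016453}{0.0027368} \approx 0.60118 \]
\[ \mathrm{conf}(\{\text{Fungi}\} \rightarrow \{\text{Fungal genetics}\}) = \frac{0.0016453}{0.0063823} \approx 0.25779 \]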

Conclusion

Although the example presented here has been confined to binary (presence/absence) data, Association Rules can also be applied effectively to categorical data, as illustrated in Example 2 of the paper by Hahsler et al. cited above.

Association Rules are a powerful unsupervised learning technique which can be fruitfully applied in data mining. The resulting rules are often instrumental in identifying actionable insights from the data.

ubeR: A Package for the Uber API

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API, which Arthur Wu and I put together during a Hackathon at iXperience.


Installation

The package is currently hosted on GitHub. Installation is simple using the devtools package.

> devtools::install_github("DataWookie/ubeR")
> library(ubeR)

Authentication

To work with the API you’ll need to create a new application for the Rides API.

  • Set Redirect URL to http://localhost:1410/.
  • Enable the profile, places, ride_widgets, history_lite and history scopes.

With the resulting Client ID and Client Secret you’ll be ready to authenticate. I’ve stored mine as environment variables but you can just hard code them into the script for starters.

> UBER_CLIENTID <- Sys.getenv("UBER_CLIENTID")
> UBER_CLIENTSECRET <- Sys.getenv("UBER_CLIENTSECRET")
> 
> uber_oauth(UBER_CLIENTID, UBER_CLIENTSECRET)

Identity

We can immediately use uber_me() to retrieve information about the authenticated user.

> identity <- uber_me()
> names(identity)
[1] "picture"         "first_name"      "last_name"       "uuid"            "rider_id"       
[6] "email"           "mobile_verified" "promo_code"   
> identity$first_name
[1] "Andrew"
> identity$picture
[1] "https://d1w2poirtb3as9.cloudfront.net/default.jpeg"

Clearly I haven’t made enough effort in personalising my Uber account.

Designated Places

Uber allows you to specify predefined locations for “home” and “work”. These are accessible via uber_places_get().

> uber_places_get("home")
$address
[1] "St Andrews Dr, Durban North, 4051, South Africa"

> uber_places_get("work")
$address
[1] "Dock Rd, V & A Waterfront, Cape Town, 8002, South Africa"

These addresses can be modified using uber_places_put().
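
For example, something like the following should update the “home” address (a sketch: I’m assuming uber_places_put() takes the place identifier and the new address, so check the package documentation before relying on it):

> # Assumed signature: place identifier followed by the new address.
> uber_places_put("home", "12 Another Road, Durban North, 4051, South Africa")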

History

You can access data for recent rides using uber_history().

> history <- uber_history(50, 0)
> names(history)
 [1] "status"       "distance"     "request_time" "start_time"   "end_time"     "request_id"  
 [7] "product_id"   "latitude"     "display_name" "longitude"

The response includes a wide range of fields; we’ll pick out just a few of them for closer inspection.

> head(history)[, c(2, 4:5, 9)]
  distance          start_time            end_time  display_name
1   1.3140 2016-08-15 17:35:24 2016-08-15 17:48:54 New York City
2  13.6831 2016-08-11 15:29:58 2016-08-11 16:04:22     Cape Town
3   2.7314 2016-08-11 09:09:25 2016-08-11 09:23:51     Cape Town
4   3.2354 2016-08-10 19:28:41 2016-08-10 19:38:07     Cape Town
5   7.3413 2016-08-10 16:37:30 2016-08-10 17:21:16     Cape Town
6   4.3294 2016-08-10 13:38:49 2016-08-10 13:59:00     Cape Town

Product Descriptions

We can get a list of cars near a specified location using uber_products().

> cars <- uber_products(latitude = -33.925278, longitude = 18.423889)
> names(cars)
[1] "capacity"          "product_id"        "price_details"     "image"            
[5] "cash_enabled"      "shared"            "short_description" "display_name"     
[9] "description"  
> cars[, c(1, 2, 7)]
  capacity                           product_id short_description
1        4 91901472-f30d-4614-8ba7-9fcc937cebf5             uberX
2        6 419f6bdc-7307-4ea8-9bb0-2c7d852b616a            uberXL
3        4 1dd39914-a689-4b27-a59d-a74e9be559a4         UberBLACK

Information for a particular car can also be accessed.

> product <- uber_products(product_id = "91901472-f30d-4614-8ba7-9fcc937cebf5")
> names(product)
[1] "capacity"          "product_id"        "price_details"     "image"            
[5] "cash_enabled"      "shared"            "short_description" "display_name"     
[9] "description"      
> product$price_details
$service_fees
list()

$cost_per_minute
[1] 0.7

$distance_unit
[1] "km"

$minimum
[1] 20

$cost_per_distance
[1] 7

$base
[1] 5

$cancellation_fee
[1] 25

$currency_code
[1] "ZAR"

Estimates

It’s good to have a rough idea of how much a ride is going to cost you. What about a trip from Mouille Point to the Old Biscuit Mill?

old-biscuit-mill

> estimate <- uber_requests_estimate(start_latitude = -33.899656, start_longitude = 18.407663,
+                                    end_latitude = -33.927443, end_longitude = 18.457557)
> estimate$trip
$distance_unit
[1] "mile"

$duration_estimate
[1] 600

$distance_estimate
[1] 4.15

> estimate$pickup_estimate
[1] 4
> estimate$price
  high_amount display_amount display_name low_amount surge_multiplier currency_code
1        5.00           5.00    Base Fare       5.00                1           ZAR
2       56.12    42.15-56.12     Distance      42.15                1           ZAR
3        8.30      6.23-8.30         Time       6.23                1           ZAR

Not quite sure why the API is returning the distance in such obscure units. (Note to self: convert those to metric equivalents in the next release!)
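
In the meantime the conversion is easy enough to do manually (assuming the unit really is statute miles):

> # Convert the distance estimate from miles to kilometres.
> estimate$trip$distance_estimate * 1.609344
[1] 6.678778

The data above are based on the car nearest to the start location. What about prices for a selection of other cars?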

> estimate <- uber_estimate_price(start_latitude = -33.899656, start_longitude = 18.407663,
+                     end_latitude = -33.927443, end_longitude = 18.457557)
> names(estimate)
 [1] "localized_display_name" "high_estimate"          "minimum"                "duration"
 [5] "estimate"               "distance"               "display_name"           "product_id"
 [9] "low_estimate"           "surge_multiplier"       "currency_code"         
> estimate[, c(1, 5)]
  localized_display_name  estimate
1                  uberX  ZAR53-69
2                 uberXL  ZAR68-84
3              uberBLACK ZAR97-125

The time of arrival for each of those cars can be accessed via uber_estimate_time().

> uber_estimate_time(start_latitude = -33.899656, start_longitude = 18.407663)
  localized_display_name estimate display_name                           product_id
1                  uberX      180        uberX 91901472-f30d-4614-8ba7-9fcc937cebf5
2                 uberXL      420       uberXL 419f6bdc-7307-4ea8-9bb0-2c7d852b616a
3              uberBLACK      300    uberBLACK 1dd39914-a689-4b27-a59d-a74e9be559a4

So, for example, the uberXL would be expected to arrive in 7 minutes, while the uberX would pick you up in only 3 minutes.

Requesting a Ride

It’s also possible to request a ride. At present these requests are directed to the Uber API Sandbox. After we have done further testing we’ll retarget the requests to the API proper.

A new ride is requested using uber_requests().

> ride <- uber_requests(start_address = "37 Beach Road, Mouille Point, Cape Town",
+                       end_address = "100 St Georges Mall, Cape Town City Centre, Cape Town")

Let’s find out the details of the result.

> names(ride)
 [1] "status"           "destination"      "product_id"       "request_id"
 [5] "driver"           "pickup"           "eta"              "location"
 [9] "vehicle"          "surge_multiplier" "shared"     
> ride$pickup
$latitude
[1] -33.9

$longitude
[1] 18.406
> ride$destination
$latitude
[1] -33.924

$longitude
[1] 18.42

Information about the currently requested ride can be accessed using uber_requests_current(). If we decide to walk instead, then it’s also possible to cancel the pickup.

> uber_requests_current_delete()

Future

For more information about units of measurement, limits and parameters of the Uber API, have a look at the API Overview.

We’ll be extending the package to cover the remaining API endpoints. But, for the moment, most of the core functionality is already covered.

Also Relevant

A nice blog post by Simon Jackson, who used ubeR to plot his recent trips.