Zacks Data on Quandl


Data from Zacks Research have just been made available on Quandl. Registered Quandl users have free preview access to these data, which cover the following:

- earnings estimates;
- earnings surprises (actual versus estimated earnings); and
- dividends.

These data describe over 5000 publicly traded US and Canadian companies and are updated daily.

Finding the Data

If you are not already a registered Quandl user, now is the time to sign up. You will find links to all of the data sets mentioned above on the Quandl vendors page. From the Earnings Estimates page, for example, you can search for a particular company. I selected Hewlett Packard, which links to a page giving summary data on Earnings per Share (EPS) estimates for the next three years. These data are presented both in tabular format and as an interactive plot.

[Figure: Zacks EPS estimates for Hewlett Packard on Quandl]

Browsing the data via the Quandl web site gives you a good appreciation of what is available and the general characteristics of the data. However, to do something meaningful you would probably want to download data into an offline analysis package.

Getting the Data into R

I am going to focus on accessing the data through R using the Quandl package.

> library(Quandl)

Obtaining the data is remarkably simple. First you need to authenticate yourself.

> Quandl.auth("ZxixTxUxTxzxyxwxFxyx")

You will find your authorisation token under the Account Settings on the Quandl web site.

Grabbing the data is done via the Quandl() function, to which you need to provide the appropriate data set code.

[Figure: Quandl data set code with download links]

Beneath the data set code you will also find a number of links which pop up the precise code fragment required for downloading the data in a variety of formats and on a selection of platforms (notably R, Python and Matlab, although there are interfaces for a variety of other platforms too).

> # Annual estimates
> #
> Quandl("ZEE/HPQ_A", trim_start="2014-10-31", trim_end="2017-10-31")[,1:5]
        DATE EPS_MEAN_EST EPS_MEDIAN_EST EPS_HIGH_EST EPS_LOW_EST
1 2017-10-31         3.90          3.900         3.90        3.90
2 2016-10-31         4.09          4.100         4.31        3.87
3 2015-10-31         3.94          3.945         4.01        3.88
4 2014-10-31         3.73          3.730         3.74        3.70

> # Quarterly estimates
> #
> Quandl("ZEE/HPQ_Q", trim_start="2014-10-31", trim_end="2017-10-31")[,1:5]
> HP[,1:5]
        DATE EPS_MEAN_EST EPS_MEDIAN_EST EPS_HIGH_EST EPS_LOW_EST
1 2015-10-31         1.10           1.10         1.14        1.04
2 2015-07-31         0.97           0.98         1.00        0.94
3 2015-04-30         0.95           0.95         1.00        0.91
4 2015-01-31         0.92           0.92         1.00        0.85
5 2014-10-31         1.05           1.05         1.07        1.03

Here we see a subset of the EPS data available for Hewlett Packard, giving the maximum and minimum as well as the mean and median projections of EPS at both annual and quarterly resolution.
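
These estimates lend themselves readily to visualisation too. Here is a minimal sketch of my own (not one of Quandl's generated snippets), plotting the mean quarterly estimate with a ribbon spanning the low and high estimates, reusing the HP data frame from above:

> library(ggplot2)
>
> # Mean EPS estimate with a ribbon spanning the low and high estimates.
> ggplot(HP, aes(x = DATE)) +
+   geom_ribbon(aes(ymin = EPS_LOW_EST, ymax = EPS_HIGH_EST), alpha = 0.3) +
+   geom_line(aes(y = EPS_MEAN_EST)) +
+   geom_point(aes(y = EPS_MEAN_EST)) +
+   labs(x = "", y = "EPS estimate")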

Next we'll look at a comparison of historical actual and estimated earnings.

> Quandl("ZES/HPQ", trim_start="2011-11-21", trim_end="2014-08-20")[,1:6]
         DATE EPS_MEAN_EST EPS_ACT EPS_ACT_ADJ EPS_AMT_DIFF_SURP EPS_PCT_DIFF_SURP
1  2014-08-20         0.89    0.89       -0.37              0.00              0.00
2  2014-05-22         0.88    0.88       -0.22              0.00              0.00
3  2014-02-20         0.85    0.90       -0.16              0.05              5.88
4  2013-11-26         1.00    1.01       -0.28              0.01              1.00
5  2013-08-21         0.87    0.86       -0.15             -0.01             -1.15
6  2013-05-22         0.81    0.87       -0.32              0.06              7.41
7  2013-02-21         0.71    0.82       -0.19              0.11             15.49
8  2012-11-20         1.14    1.16       -4.65              0.02              1.75
9  2012-08-22         0.99    1.00       -5.50              0.01              1.01
10 2012-05-23         0.91    0.98       -0.18              0.07              7.69
11 2012-02-22         0.87    0.92       -0.18              0.05              5.75
12 2011-11-21         1.13    1.17       -1.05              0.04              3.54

The second-to-last column gives the EPS surprise amount (the difference between the actual and estimated EPS), while the last column expresses the surprise as a percentage of the estimate. It's clear that the estimates are generally rather good.
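
To put a rough number on that impression, one could summarise the absolute percentage surprises. A quick sketch (my own addition), storing the query above in a data frame first:

> surprises <- Quandl("ZES/HPQ", trim_start="2011-11-21", trim_end="2014-08-20")
> # Median absolute surprise over this window: roughly 2.6%.
> median(abs(surprises$EPS_PCT_DIFF_SURP))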

The last thing that we are going to look at is dividend data.

> Quandl("ZDIV/HPQ", trim_start="2014-11-07", trim_end="2014-11-07")[,1:6]
       AS_OF DIV_ANNOUNCE_DATE DIV_EX_DATE DIV_PAY_DATE DIV_REC_DATE DIV_AMT
1 2014-11-07          20140717    20140908     20141001     20140910    0.16

Here we see that a $0.16 per share dividend was announced on 17 July 2014 and paid on 1 October 2014.
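
One wrinkle: the dividend dates are returned as yyyymmdd integers rather than Date objects. Converting them is simple enough; a sketch for a single column (the dividends name is my own):

> dividends <- Quandl("ZDIV/HPQ", trim_start="2014-11-07", trim_end="2014-11-07")
> as.Date(as.character(dividends$DIV_ANNOUNCE_DATE), format = "%Y%m%d")
[1] "2014-07-17"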

Having access to these data for a wide range of companies promises to be an enormously useful resource. Unfortunately access to the preview data is fairly limited, but if you plan on making full use of the data, then the premium access starting at $100 per month seems like a reasonable deal.

Creating More Effective Graphs

A few years ago I ordered a copy of the 2005 edition of Creating More Effective Graphs by Naomi Robbins. Somewhat shamefully I admit that the book got buried beneath a deluge of papers and other books and never received the attention it was due. Having recently discovered the R Graph Catalog, which implements many of the plots from the book using ggplot2, I had to dig it out and give it some serious attention.

Both the book and web site are excellent resources if you are looking for informative ways to present your data.

Being a big fan of xkcd, I rather enjoyed the example plot in xkcd style (which I don't think is covered in the book...). The code provided on the web site is used as the basis for the plot below.

[Figure: xkcd-style plot of life expectancy]

This plot is broadly consistent with the data from the Public Data archive on Google, but the effects of smoothing in the xkcd-style plot can clearly be seen. Is this really important? Well, I suppose that depends on the objective of the plot. If it's just to inform (and look funky in the process), then the xkcd plot is perfectly fine. If you are looking for something more precise, then a more conventional plot without smoothing would be more appropriate.

[Figure: life expectancy from Google's Public Data archive]

I like the xkcd-style plot though, and here's the code for generating it, loosely derived from the code on the web site.

> library(ggplot2)
> library(xkcd)
> 
> countries <- c("Rwanda", "South Africa", "Norway", "Swaziland", "Brazil")
> 
> hdf <- droplevels(subset(read.delim(file = "http://tiny.cc/gapminder"), country %in% countries))
> 
> direct_label <- data.frame(year = 2009,
+ 	lifeExp = hdf$lifeExp[hdf$year == 2007],
+ 	country = hdf$country[hdf$year == 2007])
> 
> set.seed(123)
> 
> ggplot() +
+ 	geom_smooth(data = hdf,
+ 		aes(x = year, y = lifeExp, group = country, linetype = country),
+ 		se = FALSE, color = "black") +
+ 	geom_text(aes(x = year + 2.5, y = lifeExp + 3, label = country), data = direct_label,
+ 		hjust = 1, vjust = 1, family = "xkcd", size = 7) +
+ 	theme(legend.position = "none") +
+ 	ylab("Life Expectancy") +
+ 	xkcdaxis(c(1952, 2010), c(20, 83))
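
For the more conventional, unsmoothed version mentioned above, it's enough to swap geom_smooth() for geom_line() and drop the xkcd trimmings. A sketch of the relevant change:

> ggplot() +
+ 	geom_line(data = hdf,
+ 		aes(x = year, y = lifeExp, group = country, linetype = country),
+ 		color = "black") +
+ 	geom_text(aes(x = year + 2.5, y = lifeExp + 3, label = country), data = direct_label,
+ 		hjust = 1, vjust = 1, size = 7) +
+ 	theme(legend.position = "none") +
+ 	ylab("Life Expectancy")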

Standard Bank: Striving for Mediocrity

Recently I was in my local Standard Bank branch. After finally reaching the front of the queue and being helped by a reasonably courteous young man, I was asked if I would mind filling out a survey. Sure. No problem. I had been in the bank for 30 minutes; I could probably afford another 30 seconds.

And then I was handed this abomination:

[Image: Standard Bank survey form]

So, if I was deliriously satisfied with the service that I had received, then I would award them a 10. If I was neither impressed nor dismayed, I would give them a 9. But if I was not happy at all, then I would give them an 8.

Let me repeat that so that the horror sinks in: if I was completely dissatisfied with their service then I would give them an 8! Out of 10. That's 80%.

80% for shoddy service!

Whoever is managing this little piece of supposed market research should be ashamed. What a load of rubbish.

Plotting Flows with riverplot

I have been looking for an intuitive way to plot flows or connections between states in a process. An obvious choice is a Sankey plot, but I could not find a satisfactory implementation in R... until I read the riverplot post by January Weiner. His riverplot package does precisely what I need.

Getting your data into the right format is a slightly clunky procedure. However, my impression is that the package is still a work in progress and it's likely that this process will change in the future. For now though, here is an illustration of how a multi-level plot can be constructed.

Assembling the Data

The plan for this example is to have four nodes at each of six layers, with flows between layers. The data are a little contrived, but they illustrate the procedure quite nicely and they produce a result which is not dissimilar to the final plot I was after. We have to create data structures for both nodes and edges. I will start with the edges and then use these data to extract the nodes.

The edges data frame consists of records with a "from" node (N1) and a "to" node (N2) as well as a value for the flow between them. Here I systematically construct a grid of random flows and remove some records to break the symmetry.

> edges = data.frame(N1 = paste0(rep(LETTERS[1:4], each = 4), rep(1:5, each = 16)),
+                    N2 = paste0(rep(LETTERS[1:4], 4), rep(2:6, each = 16)),
+                    Value = runif(80, min = 2, max = 5) * rep(c(1, 0.8, 0.6, 0.4, 0.3), each = 16),
+                    stringsAsFactors = FALSE)
> 
> edges = edges[sample(c(TRUE, FALSE), nrow(edges), replace = TRUE, prob = c(0.8, 0.2)),]
> head(edges)
   N1 N2  Value
1  A1 A2 2.3514
4  A1 D2 2.2052
5  B1 A2 3.0959
7  B1 C2 2.8756
9  C1 A2 4.5099
10 C1 B2 4.1782

The names of the nodes are then extracted from the edge data frame. Horizontal and vertical locations for the nodes are calculated based on the labels. These locations are not strictly necessary because the package will work out sensible default values for you.

> nodes = data.frame(ID = unique(c(edges$N1, edges$N2)), stringsAsFactors = FALSE)
> #
> nodes$x = as.integer(substr(nodes$ID, 2, 2))
> nodes$y = as.integer(sapply(substr(nodes$ID, 1, 1), charToRaw)) - 65
> #
> rownames(nodes) = nodes$ID
> head(nodes)
   ID x y
A1 A1 1 0
B1 B1 1 1
C1 C1 1 2
D1 D1 1 3
A2 A2 2 0
B2 B2 2 1

Finally we construct a list of styles which will be applied to each node. It's important to choose suitable colours and to introduce transparency for overlaps (done here by appending the hexadecimal alpha value "60" to the RGB colour strings).

> library(RColorBrewer)
> #
> palette = paste0(brewer.pal(4, "Set1"), "60")
> #
> styles = lapply(nodes$y, function(n) {
+   list(col = palette[n+1], lty = 0, textcol = "black")
+ })
> names(styles) = nodes$ID

Constructing the riverplot Object

Now we are in a position to construct the riverplot object. We do this by joining the node, edge and style data structures into a list and then adding "riverplot" to the list of class attributes.

> library(riverplot)
> 
> rp <- list(nodes = nodes, edges = edges, styles = styles)
> #
> class(rp) <- c(class(rp), "riverplot")
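
As an aside, I believe the package also provides a makeRiver() function which assembles and validates this object in a single step; if that's the case, something like the following should be equivalent:

> rp <- makeRiver(nodes, edges, node_styles = styles)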

Producing the plot is then simple.

> plot(rp, plot_area = 0.95, yscale = 0.06)

[Figure: example riverplot]

Conclusion

I can think of a whole host of applications for figures like this, so I am very excited about the prospects. I know that I am going to have to figure out how to add additional labels to the figures, but I'm pretty sure that will not be too much of an obstacle.

The current version of riverplot is v0.3. Incidentally, when I stumbled on a small bug in v0.2 of riverplot, January was very quick to respond with a fix.

Commitments of Traders: Moves in the Last Week

In my previous post I gave some background information on the Commitments of Traders report along with a selection of summary plots.

One of the more interesting pieces of information that one can glean from these reports is the shift in trading sentiment from week to week. Below is a plot reflecting the relative change in the number of long and short positions held by traders in each of the sectors (Commercial, Non-Commercial and Non-Reportable).

The changes are normalised to the total number of positions (both long and short) held in the previous week. To illustrate how this works, consider the JPY.

> tail(subset(OP, name == "JPY" & sector == "Commercial"), 2)
      name       date     sector   long   shrt
11842  JPY 2014-05-13 Commercial 125523 -35537
11845  JPY 2014-05-20 Commercial 117310 -48851

This indicates that the number of long positions on the JPY has decreased, while the number of short positions (recorded as a negative number) has increased in magnitude. Both of these changes are consistent with traders selling the JPY in favour of other currencies.
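
Plugging in the numbers above makes the normalisation concrete:

> # Previous week's total open positions (long plus absolute short).
> total <- 125523 + 35537
> # Relative change in long positions: roughly -5.1%.
> (117310 - 125523) / total
> # Relative change in the magnitude of short positions: roughly +8.3%.
> (48851 - 35537) / total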

A synopsis of these data for a range of currencies is given in the plot below. Here is how the plot works. Again we will consider Commercial trades involving the JPY, so we are looking at the second-to-last row and first column. Each cell contains two bars: long trades on the left and short trades on the right. The coloured bars indicate the relative change for long (blue) and short (orange) trades, normalised to the total number of trades for that currency and sector in the previous week. We can see here that the orange bar is broader than the blue bar, indicating that the change in short trades is larger than the change in long trades.

The grey boxes show the 95% confidence interval for the expected range of these changes: the closer a bar comes to the edge of its box, the more significant the change. So, in this case, the change in the number of short trades is significant and is probably an indication of a change in sentiment regarding the JPY.
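
For what it's worth, here is a sketch of how these normalised weekly changes might be computed from the OP data frame (assuming only the column layout shown in the tail() output above; the actual code behind the plot may differ):

> library(dplyr)
>
> weekly <- OP %>%
+   group_by(name, sector) %>%
+   arrange(date) %>%
+   mutate(
+     # Previous week's total open positions (long plus absolute short).
+     total_prev = lag(long) + abs(lag(shrt)),
+     # Weekly changes normalised to the previous week's total.
+     long_chg = (long - lag(long)) / total_prev,
+     shrt_chg = (abs(shrt) - abs(lag(shrt))) / total_prev
+   ) %>%
+   ungroup()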

If anyone is interested in updated charts like this, please just let me know.

[Figure: weekly changes in trader positions, week ending 20 May 2014]