Anthony Goldbloom: The jobs we’ll lose to machines

The future state of any single job lies in the answer to a single question: To what extent is that job reducible to frequent, high-volume tasks, and to what extent does it involve tackling novel situations? On frequent, high-volume tasks, machines are getting smarter and smarter. Today they grade essays. They diagnose certain diseases. Over coming years, they’re going to conduct our audits, and they’re going to read boilerplate from legal contracts. Accountants and lawyers are still needed. They’re going to be needed for complex tax structuring, for pathbreaking litigation. But machines will shrink their ranks and make these jobs harder to come by.
Anthony Goldbloom

Making Sense of Logarithmic Loss

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in Kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted.

Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is basically equivalent to maximising the accuracy of the classifier, but there is a subtle twist which we’ll get to in a moment.

In order to calculate Log Loss the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as

- \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{ij} \log \, p_{ij}

where N is the number of samples or instances, M is the number of possible labels, y_{ij} is a binary indicator of whether or not label j is the correct classification for instance i, and p_{ij} is the model probability of assigning label j to instance i. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss. If there are only two classes then the expression above simplifies to

- \frac{1}{N} \sum_{i=1}^N [y_{i} \log \, p_{i} + (1 - y_{i}) \log \, (1 - p_{i})].

Note that for each instance only the term for the correct class actually contributes to the sum.

Log Loss Function

Let’s consider a simple implementation of a Log Loss function:

> LogLossBinary = function(actual, predicted, eps = 1e-15) {
+   # Clamp predicted probabilities away from 0 and 1 to avoid log(0).
+   predicted = pmin(pmax(predicted, eps), 1 - eps)
+   - (sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))) / length(actual)
+ }

Suppose that we are training a binary classifier and consider an instance which is known to belong to the target class. We’ll have a look at the effect of various predictions for class membership probability.

> LogLossBinary(1, c(0.5))
[1] 0.69315
> LogLossBinary(1, c(0.9))
[1] 0.10536
> LogLossBinary(1, c(0.1))
[1] 2.3026

In the first case the classification is neutral: it assigns equal probability to both classes, resulting in a Log Loss of 0.69315. In the second case the classifier is relatively confident in the correct class, so the Log Loss drops to 0.10536. The third case is an equally confident classification, but this time for the wrong class, and the Log Loss escalates to 2.3026. Relative to the neutral classification, being confident in the wrong class resulted in a far greater change in Log Loss. This asymmetry arises because the amount by which Log Loss can decrease is bounded (it cannot fall below zero), while increases are unbounded.

Looking Closer

Let’s take a closer look at this relationship. The plot below shows the Log Loss contribution from a single positive instance where the predicted probability ranges from 0 (the completely wrong prediction) to 1 (the correct prediction). It’s apparent from the gentle downward slope towards the right that the Log Loss gradually declines as the predicted probability improves. Moving in the opposite direction though, the Log Loss ramps up very rapidly as the predicted probability approaches 0. That’s the twist I mentioned earlier.

[Image: Log Loss for a single positive instance as a function of the predicted probability]
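The shape of this curve is easy to reproduce. Something along the following lines (a minimal sketch using base R graphics; the original plot may have been generated differently) gives a similar picture:

# Log Loss contribution from a single positive instance as a function of the
# predicted probability assigned to the correct class.
curve(-log(x), from = 0.001, to = 1, n = 500,
      xlab = "Predicted Probability", ylab = "Log Loss")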

Log Loss heavily penalises classifiers that are confident about an incorrect classification. For example, if for a particular observation, the classifier assigns a very small probability to the correct class then the corresponding contribution to the Log Loss will be very large indeed. Naturally this is going to have a significant impact on the overall Log Loss for the classifier. The bottom line is that it’s better to be somewhat wrong than emphatically wrong. Of course it’s always better to be completely right, but that is seldom achievable in practice! There are at least two approaches to dealing with poor classifications:

  1. Examine the problematic observations relative to the full data set. Are they simply outliers? In this case, remove them from the data and re-train the classifier.
  2. Consider smoothing the predicted probabilities using, for example, Laplace Smoothing (a sketch is given below). This will result in a less “certain” classifier and might improve the overall Log Loss.
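To illustrate the second of these approaches, here’s a minimal sketch of additive (Laplace) smoothing for a binary classifier. Both the function and its alpha parameter are purely illustrative; the appropriate degree of smoothing would need to be tuned for a particular problem.

# Additive (Laplace) smoothing: pull predicted probabilities towards 0.5.
# 'alpha' is a hypothetical smoothing parameter chosen for illustration.
LaplaceSmooth = function(predicted, alpha = 0.01) {
  (predicted + alpha) / (1 + 2 * alpha)
}

LogLossBinary(1, 0.001)                  # emphatically (and wrongly) confident
LogLossBinary(1, LaplaceSmooth(0.001))   # the smoothed prediction is penalised less severely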

Code Support for Log Loss

Using Log Loss in your models is relatively simple. XGBoost has logloss and mlogloss options for the eval_metric parameter, which allow you to optimise your model with respect to binary and multiclass Log Loss respectively. Both metrics are available in caret’s train() function as well. The Metrics package also implements a number of Machine Learning metrics, including Log Loss.
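For example, a quick comparison with the Metrics package, followed by training an XGBoost model which uses Log Loss as its evaluation metric, might look something like the sketch below (the data are invented and the argument names should be checked against the package documentation):

library(Metrics)
library(xgboost)

# Binary Log Loss from the Metrics package, compared with our implementation.
actual    = c(1, 0, 1, 1, 0)
predicted = c(0.9, 0.3, 0.8, 0.6, 0.1)
logLoss(actual, predicted)
LogLossBinary(actual, predicted)

# Train an XGBoost classifier, using binary Log Loss as the evaluation metric.
X = matrix(rnorm(200), ncol = 2)
y = as.numeric(X[, 1] + rnorm(100) > 0)
model = xgboost(data = X, label = y, nrounds = 20,
                objective = "binary:logistic", eval_metric = "logloss")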

Review: Data Mining with Rattle and R

[Image: Data Mining with Rattle and R book cover]

I read Data Mining with Rattle and R by Graham Williams over a year ago. It’s not a new book and I’ve just been tardy in writing up a review. That’s not to say that I have not used the book in the interim: it’s been on my desk at work ever since and I’ve dipped into it from time to time. As a reference for ongoing analyses it’s an extremely helpful resource.

This book focuses on the use of Rattle, an interface to R, for Data Mining. Before I take a look at its contents, here are some specifics: it was published in 2011 by Springer-Verlag as part of the Use R! series; it’s 374 pages in length; DOI 10.1007/978-1-4419-9890-3.

Part I: Explorations

  1. Introduction
    Some general background information on Data Mining, outlining the typical path of a Data Mining project. Also discusses appropriate tools and why R+Rattle is a good combination.
  2. Getting Started
    Instructions on getting both R and Rattle up and running. Breezes through an introductory example, from loading data and exploratory analysis to building models and evaluating results. Finishes with a more in-depth review of some aspects of the R language, including a useful section on using environments for encapsulation.
  3. Working with Data
    Looking at various issues pertaining to data: where it comes from, quality and documentation.
  4. Loading Data
    Getting data into Rattle from various sources (including the clipboard, directly from the WWW, CSV or tab delimited files and ODBC). Then partitioning those data into training and testing sets as well as assigning roles to variables.
  5. Exploring Data
    A deeper look at exploratory data analysis on single variables using techniques ranging from numeric and textual summaries to visualisation with histograms, scatter, bar, dot, box and mosaic plots. Relationships between variables are considered using correlation matrices and hierarchical clustering.
  6. Interactive Graphics
    What extra information can be extracted from the data using interactive tools like latticist and GGobi?
  7. Transforming Data
    Data transformation is often the largest task in an analysis project. The issues of cleaning and dealing with missing data and outliers are dealt with. Scaling and non-linear transformation of the data are considered. Finally the chapter takes a look at variable recoding via binning of continuous variables or creating indicator variables.

Part II: Building Models

  1. Descriptive and Predictive Analytics
  2. Cluster Analysis
  3. Association Analysis
  4. Decision Trees
  5. Random Forests
  6. Boosting
  7. Support Vector Machines

The first chapter in this part of the book describes the distinction between descriptive and predictive analytics, and acts as an introduction to building models in Rattle. Each of the following six chapters considers a different type of model. The chapter on association analysis uses data on DVD purchases while the remaining chapters all use the weather data set for illustration. Each chapter starts with a high level (mostly) qualitative description of how the model works and an overview of the algorithm. This is followed by a detailed tutorial showing how the model is applied in Rattle. The details of achieving the same analysis via the R console are then presented. Having both of these perspectives is extremely helpful. Details of various parameters used to fine tune the model round out each chapter.

Part III: Delivering Performance

  1. Model Performance Evaluation
    There’s a distinct part of the Rattle interface for dealing with the important topic of model evaluation. It provides access to a range of evaluation methods extending from simple metrics like precision and recall based on a confusion matrix through to risk charts and ROC curves.
  2. Deployment
    Various options for deploying a Rattle model, ranging from a deployment in R to exporting PMML and converting to a stand alone model.

Part IV: Appendices

  1. Installing Rattle
    How to install Rattle (predicated on an existing R installation).
  2. Sample Datasets
    The datasets which come along with the Rattle package and a discussion of how these data were acquired and processed. This is a valuable portion of the book since it documents the entire data preparation pipeline.

There’s an extensive set of references and a complete index.

Since I enjoy being close to the raw, bleeding edge I re-installed the beta version of Rattle today and gave it a spin.

> library(devtools)
> devtools::install_bitbucket("kayontoga/rattle")
> library(rattle)
Rattle: A free graphical interface for data mining with R.
Version 4.0.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()

The GUI looks pretty sleek. I did have hassles with some of the functionality in the beta version, so if you are going to give this a try I’d advise you to just install the version available at CRAN.
[Image: Rattle GUI]
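For the record, installing the stable release from CRAN and firing up the GUI is straightforward (although Rattle has some system level dependencies, like GTK, which might require additional steps on some platforms):

# Install the released version of Rattle from CRAN and launch the GUI.
install.packages("rattle")
library(rattle)
rattle()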

Wrapping things up, Data Mining with Rattle and R is not just about how to use Rattle to solve Data Mining problems. It also digs quite deep into a number of Data Mining and Machine Learning algorithms, so it’s a pretty handy reference as well. If you are looking for a way of transitioning from a point-and-click style of analysis to R, then I think that installing Rattle and getting hold of a copy of this book would be a good place to start. The key thing to bear in mind is that the book does not set out to be an encyclopaedia on Machine Learning. Its focus is on using Rattle (and in the process it provides the necessary background information).

Constructing a Word Cloud for ICML 2015

Word clouds have become a bit clichéd, but I still think that they have a place in giving a high-level overview of the content of a corpus. Here are the steps I took in putting together the word cloud for the International Conference on Machine Learning (ICML 2015).

[Image: ICML 2015 word cloud (top 300 words)]

  1. Extract the hyperlinks to the PDFs of all of the papers from the Conference Programme web site using a pipeline of grep and uniq.
  2. In R, extract the text from each of these PDFs using the Rpoppler package.
  3. Split the text for each page into lines and remove the first line (corresponding to the running title) from every page except the first.
  4. Intelligently handle word breaks and then concatenate lines within each document.
  5. Transform text to lower case then remove punctuation, digits and stop words using the tm package.
  6. Compile the words for all of the documents into a single data.frame.
  7. Using the dplyr package, count the number of times that each word occurs across the entire corpus as well as the number of documents which contain that word. This is what the top end of the resulting data.frame looks like:
    > head(word.counts)
           word count doc
    1       can  6496 270
    2  learning  5818 270
    3 algorithm  5762 252
    4      data  5687 254
    5     model  4956 242
    6       set  4040 269
    
  8. Finally, construct a word cloud with the tagcloud package, using the word count to weight the word size and the document count to determine the grey scale. A condensed sketch of steps 5 to 8 is given below.
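Here’s that condensed sketch, assuming that the text of each paper has already been extracted into a character vector called papers (the cleaning steps and the mapping from document counts to grey levels are illustrative rather than a faithful reproduction of the original script):

library(tm)
library(dplyr)
library(tagcloud)

# Build and clean a corpus from the extracted text (one element per paper).
corpus = VCorpus(VectorSource(papers))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removeWords, stopwords("english"))

# Rows of the term-document matrix are words, columns are documents.
tdm = as.matrix(TermDocumentMatrix(corpus))

word.counts = data.frame(word  = rownames(tdm),
                         count = rowSums(tdm),
                         doc   = rowSums(tdm > 0)) %>%
  arrange(desc(count))

# Word size weighted by corpus count; grey scale determined by document count.
top = head(word.counts, 300)
tagcloud(top$word, weights = top$count, col = grey(1 - top$doc / max(top$doc)))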

The first cloud above contains the top 300 words. The larger cloud below is the top 1000 words.

[Image: ICML 2015 word cloud (top 1000 words)]

ICML 2015 (Lille, France): Day 5 (Workshops)

On my final day at ICML I attended the Workshop on Machine Learning Open Source Software: Open Ecosystems. The topics included not only Open Source Software (OSS) for doing Machine Learning but also researchers publishing their work under an Open Source model. The latter is obviously good for Science!

Extending the Numeric Python ecosystem beyond in-memory computing (Matthew Rocklin, @mrocklin)

Python has issues. One of the biggest is the Global Interpreter Lock (GIL), which prevents simultaneous manipulation of Python objects by different threads.

Dask falls around the middle of the implementation continuum for parallel programming, somewhere between high and low levels of abstraction. It has an interface which mimics that of Numpy. Dask arrays are like Numpy arrays but are divided into chunks to facilitate parallel computation. Any operations on the Dask array are automatically done in parallel (across all available cores). The image below is an animated GIF produced by Dask illustrating the execution of a parallel job.

Collaborative Filtering via Matrix Decomposition in mlpack (Ryan R. Curtin)

mlpack is a machine learning library written in C++. The design goals for mlpack are

  • scalable and fast implementation of a wide range of ML algorithms;
  • an interface that is manageable for non-experts.

The code is template based (similar to the Armadillo C++ Linear Algebra Library).

mlpack comes with a simple command line tool which does Collaborative Filtering.

$ cf -i MovieLens-100k.csv -R 10 -q query.csv -a RegSVD -n 10

Of course, under the hood this tool is implemented using the mlpack library.

BLOG: A Probabilistic Programming Language

The talk described BLOG, a probabilistic programming language for specifying ML problems. The examples presented were

  • coin flipping;
  • Bayesian Linear Regression;
  • Kalman Filter;
  • LDA.

Blitz Talks for Posters

A series of short talks introducing the various posters on display:

  • nilearn: Machine learning for Neuro-Imaging in Python;
  • KeLP: Kernel-based Learning Platform in Java;
  • Automatic Differentiation;
  • FAST toolkit for Hidden Markov Models;
  • OpenML: An online Machine Learning Community.

Julia’s Approach to Open Source Machine Learning (John Myles White)

Julia is a high-level language for high-performance technical computing. It is not statically typed; instead it has a rich, expressive type system, including features like immutable and parametric types. It is fully Unicode compliant, which allows you to have variable names like Θ, Φ or even 🎩.

Although it is not necessary to explicitly declare types, it is important to think about types because they can have a large impact on performance. You have the freedom to do whatever you want (at least, as far as types are concerned), but Julia will not guarantee that the resulting code will be any good. “Julia is a compiler that should never yell at you.” So, designing code for Julia requires careful thought.

Julia does Just in Time (JIT) compilation. Function code_native will display the machine code generated by the Julia JIT compiler for any function.

All of the code, including packages, is on GitHub. Testing is done with Travis and code coverage is measured with Coveralls.

More information on the language can be found at Julia: A fresh approach to numerical computing and Julia: A Fast Dynamic Language for Technical Computing.

Caffe: Community Architecture for Fast Feature Embedding

Caffe is an Open Source framework for Deep Learning. Somewhat ironically the “hard” problem in the xkcd comic below was solved quite rapidly using Caffe.

[Image: xkcd comic “Tasks”]

Caffe is written in C++ but has command line, MATLAB and Python interfaces. It runs on both CPU and GPU.

Caffe has a Model Zoo where trained models can be posted.

From flop to success in academic software development (Gaël Varoquaux)

“Most lines of code written by programmers in academia never reach an audience.”

Gaël has been involved in a number of relatively high-profile software projects including scikit-learn, Mayavi and joblib.

Computational Science (versus Computer Science) is the use of computers and mathematical models to address scientific research in a range of domains. But how to choose a good computational science problem? Tricky question. Find a need. Then define scope and vision. Target a problem that will yield other contributors and then share the common code. The contributions of others will make your code better over time.

Six steps:

  1. Focus on quality;
  2. Make good documentation and examples;
  3. Use GitHub;
  4. Limit technicality and focus on the key issues (refer to scope and vision);
  5. Packaging and releases are important;
  6. Acknowledge contributors and give credit.

Other ruminations:

  • Choose your weapons. Python is a good one.
  • Scriptability is important.
  • Strive for Quality. Do less but do it better. Be consistent.

Pareto principle: 80% of your use cases can be solved with 20% of the code.

And that was the end of my conference!

Below is a word cloud that I put together from the PDFs of the conference papers. The details of its construction can be found here.

[Image: ICML 2015 word cloud]

ICML 2015 (Lille, France): Day 3

Selected scribblings from the third day at the International Conference for Machine Learning (ICML 2015) in Lille, France. I’m going out on a limb with some of this, since the more talks I attend, the more acutely aware I become of my limited knowledge of the cutting edge of Machine Learning. Caveat emptor.

Adaptive Belief Propagation (Georgios Papachristoudis, John Fisher)

Belief Propagation describes the passage of messages across a network. The focus of this talk was Belief Propagation within a tree. The authors considered an adaptive algorithm and found that their technique, AdaMP, was significantly better than the current state-of-the-art algorithm, RCTreeBP.

The Hedge Algorithm on a Continuum (Walid Krichene, Maximilian Balandat, Claire Tomlin, Alexandre Bayen)

The content of this talk went a little over my head. I did learn something about regret though, which in the context of decision theory quantifies the negative impact of finding out that a different action would have led to a preferable result. The importance of this concept is that a regret averse individual might anticipate the potential for future regret and factor this into their decision making process.

Deep Edge-Aware Filters (Li Xu, Jimmy Ren, Qiong Yan, Renjie Liao, Jiaya Jia)

Applications: image enhancement, tone mapping (HDR). Much previous research has focused on accelerating various filters, for example, the bilateral filter. However, only after a decade of work has a real time implementation of the bilateral filter been developed.

New filtering techniques are being developed all the time. However, existing acceleration techniques do not necessarily generalise to these new techniques. The authors proposed a uniform framework for accelerating filters with the following benefits:

  1. optimise for one but apply to all;
  2. implementation can be achieved in hardware.

They set out to generate a system which would learn how to achieve the effects of one or more (combined) filters. Learning in the colour domain does not work well because edges are not captured effectively. Optimising in the gradient domain works better though. Learning is done in the gradient domain using a Convolutional Neural Network (CNN), so that the original image needs to be reconstructed from the results of the model.

One of the major benefits of the technique is that it can learn composite filters. So, rather than applying filters A, B and C, the network will learn the results of the combined filters and then be able to generate the composite effect. Furthermore, it does this quickly, with up to 200 times acceleration.

More information on this work can be found at http://www.lxu.me/projects/deepeaf. As a photographer who has played around with filters quite extensively, this is very exciting research!

DRAW: A Recurrent Neural Network For Image Generation (Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, Daan Wierstra)

The DRAW system analyses an image progressively, breaking it down into parts and then analysing each part in sequence. But it does more than that. It “imagines” new images, effectively generating novel image content. The constructed image initially starts out as fairly blurred, but it becomes sharper as the system works on it.

[Image: digits generated by DRAW]

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio)

Objective: image caption generation. Human vision is foveated (spatial focus) and sequential (temporal dynamics). The parsing of an image can also depend on the task at hand (what information the observer is trying to derive from the image).

The authors describe a technique in which the image is segmented and a neural network is used to model the way that an observer’s attention would traverse the image. A CNN is then used to extract feature vectors from parts of the image and these are used to generate words which are aggregated to form a caption.

The system generally produces appropriate captions. When it fails to produce a reasonable caption it is readily obvious why it has failed. Some of the failures are pretty amusing.

The model can be “forced” by specifying words that should appear in the caption. The system then attempts to generate a reasonable caption incorporating these words.

Code for this project is available at https://github.com/kelvinxu/arctic-captions. Kelvin also mentioned the Microsoft COCO project.

Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network (Seunghoon Hong, Tackgeun You, Suha Kwak, Bohyung Han)

Visual tracking aims to follow a target subject across successive frames in a video. The authors used a CNN to extract features of the subject from the image. An Incremental SVM was then used to identify these features in a new image.

Learning Treatment Policies in Mobile Health (Susan Murphy)

Two objectives of mobile health programmes are monitoring of subjects and also giving them feedback. Two projects were discussed:

  1. Smoking Cessation Health; and
  2. Heartsteps Activity Coach (for heart attack patients).

Subjects are asked to wear chest or wrist bands to monitor their activity. Their smartphones also act as monitoring devices and give push feedback to the subjects. Both of these are a little invasive, so one of the challenges is to ensure that the subjects continue with the programme.

There are two types of decision times: regular intervals (where the tempo might vary according to what is being observed) and on demand from the subject.

The system takes into account the availability of the subjects for interventions. For example, if the subject is driving or walking, then no push messages will be sent to the subject. When messages are sent, the logical question is: did the intervention have the required result? To answer this question the first step is to gather data. To get truly robust data it’s necessary to make observations within the framework of a randomised clinical trial.

A Stochastic Treatment Policy can be used to retard the habituation rate of subjects.

Finding Galaxies in the Shadows of Quasars with Gaussian Processes (Roman Garnett, Shirley Ho, Jeff Schneider)

The authors describe a project which used spectral data from a large catalog of Quasars to build a Gaussian Process model for Quasar emission spectra. This model was then used to detect evidence of galaxies lying between the Quasars and Earth. Such galaxies would result in absorption of the light emitted by the Quasars (the authors were specifically looking at absorption of light associated with the Lyman-α line of atomic Hydrogen). Data used in this project can be found at http://www.sdss.org/.

Second Poster Session

The day ended with another well catered poster session. Apparently Microsoft is hosting a little shindig later this evening too.

ICML 2015 (Lille, France): Day 2

Some notes from the second day at the International Conference for Machine Learning (ICML 2015) in Lille, France. Don’t quote me on any of this because it’s just stuff that I jotted down during the course of the day. Also much of the material discussed in these talks lies outside my field of expertise. Caveat emptor.

Two Big Challenges in Machine Learning (Léon Bottou)

Machine Learning is an Exact Science. It’s also an Experimental Science. It’s also Engineering.

The experimental paradigm in Machine Learning consists of splitting a large data set into training and testing sets. Train. Test. Iterate. Done! But this paradigm is reaching its limits. Other experimental disciplines (for example, Physics, Biology) have to work a lot harder on designing, executing and interpreting experiments.

Léon referred us to the Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks project.

Machine Learning practitioners cannot rely exclusively on train/test experiments. We need to adopt best practices from other experimental sciences. Proposition: Machine Learning papers should specify the limitations in their approach. For example, for data with a Zipf distribution the majority of examples have only a single occurrence. In such a case we should aim to achieve maximum coverage across the data rather than looking at average accuracy.

A Nearly-Linear Time Framework for Graph-Structured Sparsity (Ludwig Schmidt)

Data often exhibits structured sparsity. For example, Astronomical images are mostly empty, but where there is something interesting in the image, it is well structured.

Sparsity is often accompanied by rich structure. This motivates the application of Sparse Linear Regression. From this point the talk very quickly became highly theoretical. Unfortunately, my graduate course in graph theory was too long ago for me to make too much sense of it. From what I could gather, the approach was to represent the vector of regression coefficients as a graph. Considerations for this approach are generality, statistical efficiency and computational efficiency.

Learning from Corrupted Binary Labels via Class-Probability Estimation (Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, Bob Williamson)

Consider a data set consisting of instances with binary labels. Some of these labels are either corrupted (True to False or vice versa) or censored. Corruption occurs at random. A simple approach would be to treat data as if it were not corrupted. But it’s possible to do much better than that.

We can model the situation using a pair of parameters, alpha and beta, which describe the relative frequency of corruption of each type of label. There is also an underlying clean class probability, which we aim to access via a corrupted class probability. The latter can be estimated directly from the corrupted data.

Manifold-valued Dirichlet Processes (Hyunwoo Kim, Jia Xu, Baba Vemuri, Vikas Singh)

I didn’t know anything about Dirichlet Processes, so I did a bit of reading. It seems that one application we would be interested in is using them for k-means style clustering with an unknown number of clusters.

Multi-instance multi-label learning in the presence of novel class instances (Anh Pham, Raviv Raich, Xiaoli Fern, Jesús Pérez Arriaga)

Multiple instances with multiple labels. For example, a selection of images, each of which has labels indicating some selected contents. More specifically, “Tree”, “Cloud” and “Water” in image 1; “Duck” and “Water” in image 2; etc. The novel instances are things that appear in an image but are not reflected in the label. For example, “Grass” in image 1. How do we deal with instances that are not included in the bag of known labels?

Their approach is a discriminative graphical model with exact inference based on multiclass logistic regression which caters for unknown labels.

Entropy evaluation based on confidence intervals of frequency estimates: Application to the learning of decision trees (Mathieu Serrurier, Henri Prade)

The use of entropy as a splitting criterion becomes progressively less effective as you move further down the decision tree, due to the sparsity of cases. Simple decision trees are thus prone to overfitting. Possible remedies are early stopping or pruning.
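As a reminder of the criterion in question, the entropy of the class labels at a node can be computed with a couple of lines of R (a toy sketch):

# Shannon entropy (in bits) of a vector of class labels.
entropy = function(labels) {
  p = table(labels) / length(labels)
  -sum(p * log2(p))
}

entropy(c("yes", "yes", "no", "no"))     # 1 bit: a maximally impure node
entropy(c("yes", "yes", "yes", "yes"))   # 0 bits: a pure node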

Support Matrix Machines (Luo Luo, Yubo Xie, Zhihua Zhang, Wu-Jun Li)

Consider a classification problem where the features are matrices (e.g., EEG or image data). There is redundancy or correlation between adjacent elements which would be lost if the data were represented as a vector, since it would ignore structure. One solution is a matrix analogue of SVM. Details were buried in a mountain of maths.

Attribute Efficient Linear Regression with Distribution-Dependent Sampling (Doron Kukliansky, Ohad Shamir)

Two approaches to budgeted learning:

  1. few subjects and many tests;
  2. many subjects with fewer tests.

The latter is obviously preferable from the perspective of the subjects. If the same number of tests is conducted overall then it may not have a significant effect on costs, but perhaps it’s possible to get away with fewer tests in total?

Concrete example: 4 subjects with 4 tests each versus 8 subjects with 2 (randomly sampled) tests each. Effectively this leads to a data matrix with missing elements. But it’s not really a “missing data” problem. One approach is the AERR algorithm (Linear Regression with Limited Observation) described by Elad Hazan and Tomer Koren.

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays (Junpei Komiyama, Junya Honda, Hiroshi Nakagawa)

I’ll freely admit that, despite having worked in the online gaming industry, I had to check what a Multi-armed Bandit (MAB) referred to in the context of Machine Learning. It turns out that they are not too different from what you would expect! The player has access to an array of machines, each of which produces a random reward sampled from a distribution associated with that machine. He has to decide which of these machines to play, the number of times that he plays each machine and the sequence in which he plays them. The player attempts to optimise the cumulative reward across a series of games. He has to balance exploitation of machines that he already knows to provide higher rewards with exploration of other machines (which might, in principle, yield even higher rewards).

Somewhat surprisingly, this model has been applied to diverse scenarios like resource allocation in the presence of uncertainty, but where the level of uncertainty declines with time.
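To make the idea concrete, here’s a toy simulation of Thompson Sampling for Bernoulli bandits with a single play per round (the true reward probabilities are invented for illustration; the paper itself deals with the more general multiple-plays setting):

# Thompson Sampling: play the machine with the highest reward rate sampled
# from its Beta posterior, then update that posterior with the observed reward.
set.seed(3)
p.true = c(0.3, 0.5, 0.7)            # unknown (to the player) success rates
K = length(p.true)
successes = rep(0, K)
failures  = rep(0, K)

for (t in 1:1000) {
  theta = rbeta(K, successes + 1, failures + 1)
  arm = which.max(theta)
  reward = rbinom(1, 1, p.true[arm])
  successes[arm] = successes[arm] + reward
  failures[arm]  = failures[arm] + 1 - reward
}

successes / (successes + failures)   # estimated reward rates for each machine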

Compressing Neural Networks with the Hashing Trick (Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, Yixin Chen)

Deep Learning is easier if you have access to significant computing resources. But what if you want to do it on a small device (like a phone) with a relatively low-end CPU and limited RAM? In a neural network model the weight matrices are the biggest consumers of memory. Compressing these matrices will make neural networks possible on smaller devices. A variety of techniques exist for reducing the size of these matrices, all of which introduce only a marginal amount of additional error for moderate compression (so there is definitely room for this approach without compromising efficiency). The hashing trick appears to achieve the most effective compression without severe impacts on performance.
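As I understood it, every entry of a (virtual) large weight matrix is hashed to one of a small number of shared parameters, so that the memory footprint is determined by the number of buckets rather than by the size of the matrix. A toy sketch, using a crude hash purely for illustration:

# Reconstruct a "virtual" weight matrix from a small vector of shared
# parameters: entries hashing to the same bucket share a single parameter.
hashed.weights = function(n.rows, n.cols, params) {
  bucket = outer(1:n.rows, 1:n.cols,
                 function(i, j) ((31 * i + 17 * j) %% length(params)) + 1)
  matrix(params[bucket], nrow = n.rows)
}

params = rnorm(16)                    # 16 shared parameters...
W = hashed.weights(100, 50, params)   # ... standing in for 5000 virtual weights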

Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan)

Consider algorithms running on numerically noisy hardware. From a computational perspective the noisy hardware is likely to be more efficient. Using low precision (16 bit versus 32 bit) fixed point (as opposed to floating point) arithmetic results in improvements in memory footprint as well as data transfer bandwidth. But what is the impact on Machine Learning algorithms?

The proportion of the bits in the fixed point number which are allocated to representing the fractional part has a strong influence on the results. This occurs when “conventional” round-to-nearest rounding schemes are used. However, when stochastic rounding (where numbers are rounded up or down randomly with weights depending on the distance to the ceiling or floor) is applied, the results are much better.
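A sketch of what stochastic rounding to a fixed point representation might look like (the number of fractional bits is arbitrary):

# Stochastic rounding: round up with probability equal to the fractional
# remainder, so that the rounding error is zero on average.
stochastic.round = function(x, frac.bits = 8) {
  scaled = x * 2^frac.bits
  lower  = floor(scaled)
  (lower + (runif(length(x)) < (scaled - lower))) / 2^frac.bits
}

stochastic.round(pi, frac.bits = 4)   # either 3.1250 or 3.1875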

Scalable Model Selection for Large-Scale Factorial Relational Models (Chunchen Liu, Lu Feng, Ryohei Fujimaki, Yusuke Muraoka)

Relational models attempt to reveal the latent group structure in a network and also to predict unseen links that have not yet been captured by the network.

Consistent estimation of dynamic and multi-layer block models (Qiuyi Han, Kevin Xu, Edoardo Airoldi)

A simple network can be insufficient to describe some phenomenon of interest. Instead one might consider a dynamic network, which changes over time, or a multi-layer network, incorporating information relating to different relationships (for example, connections across Facebook and LinkedIn). A single network can be modelled using an adjacency matrix. A Stochastic Block Model (SBM) incorporates a number of classes, where edges are formed with a class-dependent probability. In this model the expected adjacency matrix has a block structure once the nodes are ordered by class.

The authors worked with a SBM for multi-graphs (sequence of graph snapshots), where the edge probability matrix changed between snapshots but the classes remain static. One of the examples presented was based on the Reality Commons data set, which was analysed using a multi-graph SBM approach.

Community Detection Using Time-Dependent Personalized PageRank (Haim Avron, Lior Horesh)

Community detection is the process of finding a collection of nodes which are internally cohesive and well separated. One technique for identifying communities is based on graph diffusion. The fundamental quantity involved in this process is a diffusion vector. The classical form of the diffusion vector is the PageRank vector as described by Brin and Page.

Poster Session

A well catered poster session with an extensive range of interesting contributions. Delicious salami. It might be my downfall.

ICML 2015 (Lille, France): Day 1 (Tutorials)

Started the day with a run through the early morning streets of Lille. This city seems to wake up late because it was still nice and quiet well after sunrise. Followed by a valiant attempt to sample everything on the buffet breakfast. I’ll know where to focus my attention tomorrow.

[Image: ICML 2015]

The first day of the International Conference on Machine Learning (ICML 2015) in Lille consisted of tutorials in two parallel streams. Evidently the organisers are not aware of my limited attention span because these tutorials each had nominal durations of longer than 2 hours, punctuated by a break in the middle.

Below are my notes from the tutorials. Most of the material discussed in these talks lies outside my field of expertise. So my understanding may well be flawed. Caveat emptor. I would appreciate feedback on anything that I have got horrendously wrong.

Natural Language Understanding: Foundations and State-of-the-Art (Percy Liang)

Natural Language Understanding (NLU) is not the same as Natural Language Processing (NLP). It’s more specific, referring to the actual comprehension of and extraction of meaning from language. Understanding is a more difficult problem than simply processing since it has to deal with vagueness (incomplete information), uncertainty and ambiguity. Percy Liang made reference to the Turing Test, implying that if a computer were to succeed in passing this test then it would need to be capable of understanding language (as opposed to simply processing it). Siri and Google Now were cited as examples of current systems which have implemented a form of NLU.

The first step towards NLU is choosing an internal representation for the language data, and there are a number of options.

Once you’ve wrestled the data into the machine, the fun can begin. Percy described an analytical hierarchy, with syntax at the lowest level and pragmatics at the highest level.

Syntax

Syntax determines the structure of sentences. A large proportion of syntactic structure is extracted while constructing a dependency parse tree. However, as Percy pointed out, even then syntax can be fraught with ambiguity. He cited the following sentence as an example:

I ate some dessert with a fork.

This can have two interpretations depending on whether you regard “with a fork” as describing how the dessert was eaten or as part of the object, “some dessert with a fork”.

Semantics

Semantics refers to the extraction of literal meaning from language. The interpretation of individual words or combinations of words. This too can be a tricky process. And there are fancy words to describe why it’s tricky. First there is the problem of words with different meanings depending on context. This is known as polysemy, and it can make the interpretation of words (and groups of words) unclear for humans, never mind machines. Then there is anaphora, which Percy illustrated using

The dog chased the cat, which ran up a tree. It waited at the top.

and

The dog chased the cat, which ran up a tree. It waited at the bottom.

Clearly for both examples the meaning of “it” in the second sentence depends on a deeper understanding of the situation being described and cannot be extracted from the text alone. The Winograd Schema Challenge is an alternative to the Turing Test which attempts to identify machine intelligence based on the resolution of anaphora.

[Image: a dog chasing a cat up a tree]

Percy quoted John Rupert Firth:

You shall know a word by the company it keeps.
John Rupert Firth

The implication is that an understanding of a word is facilitated by its context. The surrounding words at least partially determine its meaning.

One approach to semantic analysis is to construct a word-context matrix. I assumed that this referred to a document-term matrix or its transpose. I have encountered these before while using the tm package, so for a moment I felt on firmer ground. Something that had not occurred to me (but in retrospect is rather obvious because these matrices can be massive) is to then apply dimensionality reduction. This would replace a sparse high-dimensional representation with a dense low-dimensional one. Percy suggested that SVD would be a suitable approach, but I imagine that PCA would work too.
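A toy illustration of the idea in R, building a document-term matrix with tm and then using SVD to project the documents into a low-dimensional latent space:

library(tm)

# A tiny corpus; in practice this would be thousands of documents.
docs = c("the cat sat on the mat",
         "the dog chased the cat",
         "the dog sat on the log")
dtm = as.matrix(DocumentTermMatrix(VCorpus(VectorSource(docs))))

# Keep only the first k singular vectors, replacing the sparse high-dimensional
# document-term matrix with a dense low-dimensional representation.
k = 2
s = svd(dtm)
docs.reduced = s$u[, 1:k] %*% diag(s$d[1:k])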

Pragmatics

I’ll freely confess that by this stage both my attention and resolve were waning. I was beginning to entertain lewd thoughts about a cheese sandwich too. So what I managed to glean about pragmatics is fairly limited.

Pragmatics is the understanding of actual or intended meaning. And, as you might guess, this is an even harder problem than semantics. Percy mentioned a couple of relevant projects in this area.

For more detailed information on Percy’s talk, check out the slides.

Bayesian Time Series Modeling: Structured Representations for Scalability (Emily Fox)

After lunch Emily Fox started out by describing the omnipresence of time series: they are literally everywhere! There are two major goals of time series analysis:

  1. understanding evolution (changes with time); and
  2. revealing relational structure (dependencies within multivariate time series).

And there are a number of challenges encountered in reaching these goals:

  • Data volume: time series data can be highly multidimensional and consist of a large number (or infinite stream) of observations;
  • The data might be irregularly sampled (different sampling times for each element of a multidimensional dataset);
  • Missing values;
  • Data of heterogeneous type and origin.

Analysis Techniques

Some suitable analysis techniques:

Hidden Markov Models (HMMs)

Hidden Markov Models divide a time series into a sequence of states, with a transition matrix which describes the probability of changes between those states. Transition probabilities depend only on the current state (this is the Markovian property). Observations are assumed to be conditionally independent given the state, although this assumption is seldom realised in practice.
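To make the mechanics concrete, here’s a tiny base R simulation of a two-state HMM with Gaussian emissions (the transition probabilities and emission means are made up for illustration):

set.seed(13)

# Transition matrix: rows are the current state, columns the next state.
P = matrix(c(0.95, 0.05,
             0.10, 0.90), nrow = 2, byrow = TRUE)
emission.means = c(0, 3)             # state-dependent means of the emissions

n = 500
states = numeric(n)
states[1] = 1
for (t in 2:n) {
  # The next state depends only on the current state (the Markovian property).
  states[t] = sample(1:2, 1, prob = P[states[t - 1], ])
}

# Observations are conditionally independent given the state sequence.
observations = rnorm(n, mean = emission.means[states], sd = 1)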

Vector Autoregressive (VAR) Models

These models are useful for modelling continuous-valued processes and are a multivariate extension of AR models.

State Space Models

Like HMMs but for continuous-valued processes. These are a superset of VAR models.

Dynamic Latent Factor Models

A Dynamic Latent Factor Model is a state-space model with a low dimensional state but high dimensional observations. The observations are projected from a high dimensional space onto a lower dimensional subspace. This works because there is generally some redundancy between the original observations. Full complexity is captured in the subspace. The covariance matrix in the subspace changes with time and the relationship between the subspace and the space of the original observations also changes as the covariance matrix evolves.

This was the first time that I had given any serious thought to latent variable models and I was intrigued. A little bit lost. But definitely intrigued.

Clustering Time Series

In Achieving a Hyperlocal Housing Price Index: Overcoming Data Sparsity by Bayesian Dynamical Modeling of Multiple Data Streams authors Ren, Fox and Bruce describe an approach for analysing multiple time series where some series are dense and others are sparse. The covariance matrix is assumed to be block diagonal, with each block consisting of a cluster of time series with similar structure. The number of clusters is determined using Bayesian Non-parametric Clustering which allows the clusters to evolve as new data are added.

Graphical Models for Time Series

Edges represent relationships between variables. If there is no edge then variables are conditionally independent. A Decomposable Graph can be broken down into Cliques and Separators.

For more information on Emily’s talk, check out the slides.

Computational Social Science (Hanna Wallach)

Hanna Wallach started off by juxtaposing the disciplines of Computer Science and Social Science. Whereas the latter is question driven, the former is method and/or data driven. With regards to research goals in the two disciplines she quoted:

Policy makers or computer scientists may be interested in finding the needle in the haystack (such as a potential terrorist threat or the right web page to display from a search), but social scientists are more commonly interested in characterizing the haystack.
Daniel J. Hopkins and Gary King

More generally the two disciplines differ in a variety of aspects:

  • Object of Study: pretty much anything for Computer Science and social processes for Social Science;
  • Driving Force: Computer Scientists are motivated by data and methods, while Social Scientists are concerned with questions;
  • Data: big and digital for Computer Science but relatively small and generally not digitised for Social Science; and
  • Research Goal: Computer Scientists aim at prediction while Social Scientists strive for explanation.

The distinct research goals are important. Explanation implies causality. The causal mechanisms should arise from the underlying theory. Data can then be used to test hypotheses based on the theory. Models derived from the data need to be interpreted within the framework of the theory and should thus only include variables which are in the theory. Prediction, on the other hand, doesn’t care about causality and there is no need for interpretation. Consequently predictive models can include any relevant variables.

Bayesian Latent Variable Models

Latent variable models again. Based on what I had learned from the previous tutorial I was feeling a little more confident about these. The values of the latent variables are inferred from the observables in an iterative fashion.

Hanna made reference to Getting It Right: Joint Distribution Tests of Posterior Simulators which provides a robust approach to simulating points from a posterior distribution.

It’s a pity that this talk was in the last session. Attendance was not what it should have been. It’s also a pity that Hanna was so reliant on her notes. Otherwise her talk was great!

Unfortunately the slides for this talk do not seem to be available.

Cocktail Party

Well organised Cocktail Party to end the first day. One beer. A dozen canapés. Zero social interactions (unless you count waiters). I suppose that I am more sociable in my mind than in real life.