Shiny Bayesian Updates

Reading Bayesian Computation with R by Jim Albert (Springer, 2009) inspired a fit of enthusiasm. Admittedly, I was on a plane coming back from Amsterdam and looking for distractions. I decided to put together a Shiny app to illustrate successive Bayesian updates. I had not yet seen anything that did this to my satisfaction. I like to think that my results come pretty close.

You can find the app here. Below is a screenshot.

bayesian-beta

Basically the idea is:

  1. start with a beta prior (you specify the α and β parameters);
  2. enter the results of an experiment (number of trials and the number of successes); then
  3. the associated likelihood and resulting posterior distribution are displayed;
  4. the prior is updated from the posterior distribution by pressing the "Update" button;
  5. repeat at your leisure (the update behind steps 3 and 4 is sketched just below).
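
For the record, the update in steps 3 and 4 is just the standard Beta-Binomial conjugate result: a Beta(α, β) prior combined with s successes in n trials gives a Beta(α + s, β + n − s) posterior. Here is a minimal sketch in R with made-up numbers (this is not the app code itself):

# Prior parameters and (hypothetical) experimental results.
alpha <- 2
beta  <- 5
trials    <- 20
successes <- 12

# The Binomial likelihood is conjugate to the Beta prior, so the posterior is
# simply Beta(alpha + successes, beta + failures).
alpha.post <- alpha + successes
beta.post  <- beta + trials - successes

# Compare the prior (dashed) and posterior (solid) densities.
p <- seq(0, 1, length.out = 101)
plot(p, dbeta(p, alpha.post, beta.post), type = "l", ylab = "density")
lines(p, dbeta(p, alpha, beta), lty = "dashed")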

I also wanted to mention a related article by my friend and colleague, Ryan Nel, which contains some very cool visuals.

Hope that somebody finds this useful and interesting. Feedback would be appreciated.

First Steps with Julia: Grabbing Packages

After watching a rather inspiring talk by John Myles White at ICML 2015, I was determined to learn more about Julia. With a long flight back to South Africa the following day, I wanted to have everything I needed to indulge in a bit of high altitude geeking. The first thing I did was to find a list of important packages. A small dose of grep and awk transformed the list into a script to download those packages. Little effort required. I went off to find a cheese sandwich.

Pkg.add("BackpropNeuralNet")
Pkg.add("Bokeh")
Pkg.add("Boltzmann")
Pkg.add("Calculus")
Pkg.add("Clustering")
Pkg.add("Convex")
Pkg.add("Cpp")
Pkg.add("DataArrays")
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")
Pkg.add("DataStructures")
Pkg.add("DecisionTree")
Pkg.add("Distances")
Pkg.add("Distributions")
Pkg.add("DSP")
Pkg.add("FunctionalCollections")
Pkg.add("Gadfly")
Pkg.add("GeneticAlgorithms")
Pkg.add("GLM")
Pkg.add("GLMNet")
Pkg.add("Graphs")
Pkg.add("HDF5")
Pkg.add("HypothesisTests")
Pkg.add("Images")
Pkg.add("JuMP")
Pkg.add("MachineLearning")
Pkg.add("Mamba")
Pkg.add("Markdown")
Pkg.add("Match")
Pkg.add("MixedModels")
Pkg.add("MLBase")
Pkg.add("Mocha")
Pkg.add("MultivariateStats")
Pkg.add("NLopt")
Pkg.add("OpenStreetMap")
Pkg.add("Optim")
Pkg.add("Orchestra")
Pkg.add("PGM")
Pkg.add("PyCall")
Pkg.add("RCall")
Pkg.add("RDatasets")
Pkg.add("Regression")
Pkg.add("Rif")
Pkg.add("StatsBase")
Pkg.add("StreamStats")
Pkg.add("TimeSeries")

Thankfully I had decent bandwidth at the hotel.

Lightning on your Twitter Feed

As an aside to a Social Media Automation project, I have put together a bot which uses data from the World Wide Lightning Location Network (WWLLN) to construct daily animated maps of global lightning activity and post them to my Twitter feed. The bot runs remotely and autonomously on an EC2 instance.

The images below are two examples of the lightning animations. You might need to open them in a separate tab or window to get them to animate. WordPress funkiness, I believe! If you would like to see these daily, follow @DataWookie.

A20150704

A20150615

Data Acquisition

New data are posted to an HTTP site every day. The bot checks for the availability of new data and then downloads it. The data are loaded into R as a data.frame and a few elementary transformations are applied, like converting the time of day into a floating point hour. Finally, the data are split into 24 one hour segments.
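
A minimal sketch of this stage is below. The URL and the column layout are hypothetical (the real WWLLN file format is not reproduced here), but the floating point hour and the hourly split are as described above.

# Hypothetical URL for a daily WWLLN extract.
url  <- "http://example.com/wwlln/A20150704.loc"
file <- basename(url)

# Only fetch the file if it has not already been downloaded.
if (!file.exists(file)) download.file(url, file)

# Hypothetical column layout: date, time (HH:MM:SS), latitude and longitude.
strokes <- read.csv(file, header = FALSE,
                    col.names = c("date", "time", "lat", "lon"))

# Convert the time of day into a floating point hour.
hms <- strsplit(as.character(strokes$time), ":")
strokes$hour <- sapply(hms, function(x) {
  x <- as.numeric(x)
  x[1] + x[2] / 60 + x[3] / 3600
})

# Split into 24 one hour segments.
segments <- split(strokes, floor(strokes$hour))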

Unfortunately there is a delay of a few days in the availability of the WWLLN data.

Plotting the Map

I generated a map for each of the data segments using ggplot2, which has functionality for creating a layer with map borders. I might consider a more elegant map at a later stage, but this gets the job done for now. On top of the borders I applied two more layers reflecting the density of lightning strokes with shading and contours. On top of that went another layer with the point locations of the lightning strokes.
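
Each frame boils down to something like the sketch below, where hour.data stands for one of the hourly segments from the previous sketch (the aesthetics are illustrative, not the exact settings I used):

library(ggplot2)

# One of the hourly segments from the previous sketch.
hour.data <- segments[[1]]

# One frame: world borders, stroke density (shading and contours) and the
# individual stroke locations on top. borders() requires the maps package.
ggplot(hour.data, aes(x = lon, y = lat)) +
  borders("world", colour = "grey60", fill = "grey90") +
  stat_density2d(aes(fill = ..level..), geom = "polygon", alpha = 0.3) +
  stat_density2d(colour = "blue") +
  geom_point(size = 0.5, colour = "red", alpha = 0.5) +
  coord_cartesian(xlim = c(-180, 180), ylim = c(-90, 90)) +
  theme_minimal()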

Plotting the Terminator

I wanted to indicate the passage of the day-night terminator across the map. The first step was to calculate the position of the Sun. Since I know the date and UTC time for each of the frames in the animation, it was a relatively simple matter to find the latitude and longitude of the sub-solar point. Actually, I had an awk script lying around which did this for a range of dates and times, so it was easy to cannibalise the maths out of that.

Getting the terminator line itself was a little trickier. My plan was to start with points along a great circle path aligned with the prime meridian and then tilt this path appropriately. Although I am sure that I should know the elements of the required rotation matrix, I'm afraid that they had somehow slipped through my mental cracks. To make matters slightly worse, I was writing this code on a flight, so I did not have access to the internet. Undeterred, I broke the transformation down into two components: a polar tilt followed by an azimuthal rotation. Those matrices I could figure out from first principles. Once I had that working, a dose of matrix multiplication composed them into the single transformation matrix I was looking for originally. Sorted!
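
In case the details are useful to anybody else, here is the gist of that calculation in R, reconstructed from the description above (sunlat and sunlon are the coordinates of the sub-solar point):

deg2rad <- function(d) d * pi / 180

terminator <- function(sunlat, sunlon, n = 360) {
  # Points on the great circle lying in the x-z plane (the prime meridian
  # circle), whose pole is the y-axis, i.e. latitude 0, longitude 90 E.
  theta <- seq(0, 2 * pi, length.out = n)
  P <- rbind(cos(theta), 0, sin(theta))
  # Polar tilt: rotate about the x-axis to lift the circle's pole to the
  # sub-solar latitude.
  a <- deg2rad(sunlat)
  Rx <- matrix(c(1, 0, 0,
                 0, cos(a), -sin(a),
                 0, sin(a), cos(a)), nrow = 3, byrow = TRUE)
  # Azimuthal rotation: rotate about the z-axis to move the pole to the
  # sub-solar longitude.
  b <- deg2rad(sunlon - 90)
  Rz <- matrix(c(cos(b), -sin(b), 0,
                 sin(b),  cos(b), 0,
                 0, 0, 1), nrow = 3, byrow = TRUE)
  # Compose the two rotations into a single transformation and apply it.
  Q <- (Rz %*% Rx) %*% P
  # Convert back to geographic coordinates.
  data.frame(lat = asin(Q[3, ]) * 180 / pi,
             lon = atan2(Q[2, ], Q[1, ]) * 180 / pi)
}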

Creating the Animation

The maps were spliced together into an animated GIF using convert from the ImageMagick package. The resulting GIFs were pretty huge: the maps had been rendered at 1024 by 512 pixel resolution, so they were fairly high definition to start with. I then used gifsicle to optimise the GIF, reducing it to 64 colours at the same time. The resulting animations did not exhibit any perceptible degradation and were a good deal smaller.
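
Both steps are external commands, so they can be wrapped in system() calls and live in the same R script as everything else. The frame file names below are hypothetical; see the convert and gifsicle documentation for the full range of options.

# Splice the hourly frames together into an animated GIF.
system("convert -delay 10 -loop 0 frame-*.png lightning.gif")

# Optimise the GIF and reduce it to 64 colours at the same time.
system("gifsicle -O3 --colors 64 lightning.gif -o lightning-optimised.gif")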

Wrapping it Up

The GIFs and a short bit of text were then uploaded using the Twitter API. A final step was to archive the data, which I shifted across to a private folder on the cloud using the rdrop2 package. All of this went into a single R script, which is run as a daily cron job on my EC2 instance.
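
For what it's worth, the archiving step is essentially a one-liner with rdrop2 (a sketch with a hypothetical file name; drop_auth() must have been run once beforehand to cache the Dropbox credentials, and a destination folder can also be specified):

library(rdrop2)

# Shift the day's data across to Dropbox.
drop_upload("A20150704.loc")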


Constructing a Word Cloud for ICML 2015

Word clouds have become a bit of a cliché, but I still think that they have a place in giving a high level overview of the content of a corpus. Here are the steps I took in putting together the word cloud for the International Conference on Machine Learning (2015).

word-cloud

  1. Extract the hyperlinks to the PDFs of all of the papers from the Conference Programme web site using a pipeline of grep and uniq.
  2. In R, extract the text from each of these PDFs using the Rpoppler package.
  3. Split the text for each page into lines and remove the first line (corresponding to the running title) from every page except the first.
  4. Intelligently handle word breaks and then concatenate lines within each document.
  5. Transform text to lower case then remove punctuation, digits and stop words using the tm package.
  6. Compile the words for all of the documents into a single data.frame.
  7. Using the dplyr package count the number of times that each word occurs across the entire corpus as well as the number of documents which contain that word. This is what the top end of the resulting data.frame looks like:
    > head(word.counts)
           word count doc
    1       can  6496 270
    2  learning  5818 270
    3 algorithm  5762 252
    4      data  5687 254
    5     model  4956 242
    6       set  4040 269
    
  8. Finally, construct a word cloud with the tagcloud package, using the word count to weight the word size and the document count to determine the grey scale. (A condensed sketch of steps 2 to 8 follows this list.)
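
Here is that condensed sketch, assuming that the PDFs have already been downloaded and that their local paths are in a character vector pdf.files (the word break handling below is more naive than what I actually did):

library(Rpoppler)
library(tm)
library(dplyr)
library(tagcloud)

extract.words <- function(f) {
  # One character string per page.
  pages <- PDF_text(f)
  # Drop the running title (first line) from every page except the first.
  pages[-1] <- sub("^[^\n]*\n", "", pages[-1])
  text <- paste(pages, collapse = " ")
  # Naive handling of hyphenated word breaks, then collapse whitespace.
  text <- gsub("-\n", "", text)
  text <- gsub("\\s+", " ", text)
  # Lower case, then strip punctuation, digits and stop words.
  text <- removeNumbers(removePunctuation(tolower(text)))
  words <- unlist(strsplit(text, " "))
  words <- words[words != "" & !(words %in% stopwords("english"))]
  data.frame(doc = f, word = words, stringsAsFactors = FALSE)
}

word.counts <- bind_rows(lapply(pdf.files, extract.words)) %>%
  group_by(word) %>%
  summarise(count = n(), doc = n_distinct(doc)) %>%
  arrange(desc(count))

# Word size from corpus counts, grey scale from document counts.
top <- head(word.counts, 300)
tagcloud(top$word, weights = top$count,
         col = gray(0.8 * (1 - top$doc / max(top$doc))))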

The first cloud above contains the top 300 words. The larger cloud below is the top 1000 words.

word-cloud-large

ICML 2015 (Lille, France): Day 5 (Workshops)

On my final day at ICML I attended the Workshop on Machine Learning Open Source Software: Open Ecosystems. The topics included not only Open Source Software (OSS) for doing Machine Learning but also researchers publishing their work under an Open Source model. The latter is obviously good for Science!

Extending the Numeric Python ecosystem beyond in-memory computing (Matthew Rocklin, @mrocklin)

Python has issues. One of the biggest is the Global Interpreter Lock (GIL), which prevents simultaneous manipulation of Python objects by different threads.

Dask falls around the middle of the implementation continuum for parallel programming, somewhere between high and low levels of abstraction. It has an interface which mimics that of Numpy. Dask arrays are like Numpy arrays but are divided into chunks to facilitate parallel computation. Any operations on the Dask array are automatically done in parallel (across all available cores). The image below is an animated GIF produced by Dask illustrating the execution of a parallel job.

You can check out the slides for this talk here.

Collaborative Filtering via Matrix Decomposition in mlpack (Ryan R. Curtin)

mlpack is a machine learning library written in C++. The design goals for mlpack are

  • scalable and fast implementation of a wide range of ML algorithms;
  • an interface that is manageable for non-experts.

The code is template based (similar to the Armadillo C++ Linear Algebra Library).

mlpack comes with a simple command line tool which does Collaborative Filtering.

$ cf -i MovieLens-100k.csv -R 10 -q query.csv -a RegSVD -n 10

Of course, under the hood this tool is implemented using the mlpack library.

BLOG: A Probabilistic Programming Language

The speaker described BLOG, a language for specifying probabilistic models. The examples presented were

  • coin flipping;
  • Bayesian Linear Regression;
  • Kalman Filter;
  • LDA.

Blitz Talks for Posters

A series of short talks introducing the various posters on display:

  • nilearn: Machine learning for Neuro-Imaging in Python;
  • KeLP: Kernel-based Learning Platform in Java;
  • Automatic Differentiation;
  • FAST toolkit for Hidden Markov Models;
  • OpenML: An online Machine Learning Community.

Julia's Approach to Open Source Machine Learning (John Myles White)

Julia is a high-level language for high-performance technical computing. It is not statically typed. Instead it has a rich, "expressive" type system, including things like immutable and parametric types. It is fully Unicode compliant, which allows you to have variable names like Θ, Φ or even 🎩.

Although it is not necessary to explicitly declare types, it is important to think about types because they can have a large impact on performance. You have the freedom to do whatever you want (at least, as far as types are concerned), but Julia will not guarantee that the resulting code will be any good. "Julia is a compiler that should never yell at you." So, designing code for Julia requires careful thought.

Julia does Just in Time (JIT) compilation. The code_native function will display the machine code generated by the JIT compiler for any function.

All of the code, including packages, is on GitHub. Testing is done with Travis and code coverage is measured with Coveralls.

More information on the language can be found at Julia: A fresh approach to numerical computing and Julia: A Fast Dynamic Language for Technical Computing.

Caffe: Community Architecture for Fast Feature Embedding

Caffe is an Open Source framework for Deep Learning. Somewhat ironically the "hard" problem in the xkcd comic below was solved quite rapidly using Caffe.

tasks

Caffe is written in C++ but has command line, MATLAB and Python interfaces. It runs on both CPU and GPU.

Caffe has a Model Zoo where trained models can be posted.

From flop to success in academic software development (Gaël Varoquaux)

"Most lines of code written by programmers in academia never reach an audience."

Gaël has been involved in a number of relatively high profile software projects including scikit-learn, Mayavi and joblib.

Computational Science (versus Computer Science) is the use of computers and mathematical models to address scientific research in a range of domains. But how to choose a good computational science problem? Tricky question. Find a need. Then define scope and vision. Target a problem that will yield other contributors and then share the common code. The contributions of others will make your code better over time.

Six steps:

  1. Focus on quality;
  2. Make good documentation and examples;
  3. Use GitHub;
  4. Limit technicality and focus on the key issues (refer to scope and vision);
  5. Packaging and releases are important;
  6. Acknowledge contributors and give credit.

Other ruminations:

  • Choose your weapons. Python is a good one.
  • Scriptability is important.
  • Strive for Quality. Do less but do it better. Be consistent.

Pareto principle: 80% of your use cases can be solved with 20% of the code.

And that was the end of my conference!

Below is a word cloud that I put together from the PDFs of the conference papers. The details of its construction can be found here.

word-cloud

ICML 2015 (Lille, France): Day 4

Sundry notes from the fourth day of the International Conference on Machine Learning (ICML 2015) in Lille, France. Some of this might not be entirely accurate. Caveat emptor.

Celeste: Variational inference for a generative model of astronomical images (Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan Adams, Matt Hoffman, Dustin Lang, David Schlegel, Prabhat)

Colour modelled as a 4 dimensional vector. The Physics (Planck's Law) places some constraints on the components of these vectors. Light density model accounts for rotation as well as asymmetry of galactic axes.

Discounting the effects of pixellation and atmosphere, the authors consider an idealised sky view in which stars are seen as delta functions. Diffraction and atmospheric effects, however, mean that stars are not actually observed as points of light. This is accounted for by a Point Spread Function (PSF). A Gaussian PSF was used and it was assumed that a given star could only contribute light to a limited range of nearby pixels, so that the Gaussian could be truncated.

Variational inference was applied to deal with the fact that the evidence term in Bayes' Theorem is intractable.

From Word Embeddings To Document Distances (Matt Kusner, Yu Sun, Nicholas Kolkin, Kilian Weinberger)

The talk was based on high quality word embeddings generated with word2vec. The question addressed: can one construct document distances from word embeddings? The proposed algorithm is called the "Word Mover's Distance" (WMD) and is related to the Earth Mover's Distance.

Results were assessed using k-Nearest Neighbours and look promising. However, the computational complexity is O(n^3 log n), where n is the number of unique words. An approximate "Word Mover's Distance" is only O(n^2), which is much better.

Latent Topic Networks: A Versatile Probabilistic Programming Framework for Topic Models (James Foulds, Shachi Kumar, Lise Getoor)

Topic Models are useful for Data Mining and Computational Social Science. However, when generating custom Topic Models, more time is spent actually formulating the model than running it. So the analyst becomes the bottleneck. The authors describe a general purpose modelling framework. This system has reduced the time required to generate a custom Topic Model from around six months to a couple of days. The framework was built in Probabilistic Soft Logic (PSL), which is a first-order logic based SRL language.

The authors refer to Modeling Scientific Impact with Topical Influence Regression by Foulds and Smyth (2013).

Scalable Deep Poisson Factor Analysis for Topic Modeling (Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, Lawrence Carin)

Authors based their work on Beta-Negative Binomial Process and Poisson Factor Analysis by Zhou et al. (2012).

Ordinal Mixed Membership Models (Seppo Virtanen, Mark Girolami)

Mixed Membership Models: observations are grouped and a mixture model is applied to each group, where the mixture components are common across groups but the mixture proportions are specific to individual groups. These models have previously provided sensible topics. More specifically, Ordinal Mixed Membership Models are based on ordinal variables, which take values in a number of ordered categories.

Applied to consumer-generated reviews of mobile software applications.

This work might be applied to usability studies.

Efficient Training of LDA on a GPU by Mean-for-Mode Estimation (Jean-Baptiste Tristan, Joseph Tassarotti, Guy Steele)

Once a statistical model has been decided on, the next question is the computational model. Will the hardware be massively parallel? Will the data be distributed? Will communication be necessary? The answers to these questions might then feed back to the selection of a statistical model.

The authors described an implementation of Latent Dirichlet Allocation (LDA) on a GPU using a Data-Parallel Algorithm. The collapsed Gibbs sampling algorithm, which takes advantage of the sparsity of the data, does not work well in a GPU context. However an "uncollapsed" version of Gibbs sampling can be implemented in a data-parallel fashion although it does have some drawbacks. The authors looked at a slight modification of this algorithm which overcomes these problems. One aspect of their algorithm is to represent the counts in a dense matrix and using "approximate counts" which reduce the memory footprint and bandwidth requirements.

Social Phenomena in Global Networks (Jon Kleinberg)

The "library" metaphor for the WWW has evolved into a "crowd" metaphor. Social interactions and graph theory have become important in describing the WWW.

Referred to A search for life on Earth from the Galileo spacecraft by Carl Sagan et al. (1993). A related thing has been done using pictures taken by Flickr users: the locations of pictures they take trace out where humans are living. Or does it simply indicate where people are living and taking pictures? The empty spaces: do they exist because people can't live there or because they don't want to live there?

Principles that affect social networks:

  • Homophily (people like you);
  • Triadic Closure (if A knows B and B knows C, then A probably knows C too); and
  • Small World Phenomenon.

These ideas go back well into the twentieth century.

Themes:

  1. Relationship between global structure and neighbourhoods of individual nodes;
  2. Socially shared information at neighbourhood level.

Neighbourhoods of individual nodes all look quite similar, with a node at the centre surrounded by a few tightly clustered groups of other nodes. You can break the global network down into one of these neighbourhood graphs for each individual. Erdös-Rényi graphs are an appropriate model for these neighbourhoods. Modelling Facebook as a collection of node neighbourhoods can give an understanding of global network structure arising from local structure.

Within a neighbourhood there are generally a relatively small number of strong ties (with frequent communication) and a comparably vast number of weak ties (with rare communication). Triangles in a network play an important role in the theory of strong and weak ties (have a look at The Strength of Weak Ties and Economic Action and Social Structure: The Problem of Embeddedness). The "embeddedness" of an edge is the number of common neighbours shared by the endpoint nodes of that edge. This is effectively the number of mutual friends. It's also the metric commonly used in Sociology.
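
A toy illustration of embeddedness using the igraph package (my own example, with a made-up graph):

library(igraph)

# A made-up friendship graph.
g <- graph_from_literal(A - B, A - C, A - D, B - C, B - D, C - E)

# Embeddedness of an edge: the number of neighbours shared by its endpoints.
embeddedness <- function(g, u, v) {
  length(intersect(as_ids(neighbors(g, u)), as_ids(neighbors(g, v))))
}

embeddedness(g, "A", "B")    # A and B have two mutual friends (C and D).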

Question: for a Facebook user in a relationship, find their partner just based on network structure. Consider 1.3 million Facebook users who are in a relationship, have 50-2000 friends and are at least 20 years old. For each of these users, rank their friends by a variety of metrics. For what fraction of users is their partner ranked at the top? Interestingly this algorithm gives better results for men than women, suggesting that for men there are fewer friends in their immediate neighbourhood who could be confused with their partner.

"It's doubly awkward because it makes public what should be private. MySpace doesn't just create social networks, it anatomizes them. It spreads them out like a digestive tract on the autopsy table. You can see what's connected to what, who's connected to whom. You can even trace the little puffs of intellectual flatus as they pass through the system." (The Globe and Mail)

Learning to Rank using Gradient Descent: Test of Time Award (Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, Greg Hullender)

The award was for the paper Learning to Rank using Gradient Descent which appeared in the Proceedings of the 22nd International Conference on Machine Learning (Germany, 2005).

Lessons learned:

  • Speed is more important than accuracy.
  • If you think that you're onto something, press on!
  • Goal-oriented research has pros and cons.

High Confidence Policy Improvement (Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh)

In a Reinforcement Learning (RL) context a policy is a mapping that assigns a probability distribution to possible actions. The system learns by being rewarded for taking the "right" actions towards achieving a goal. The authors presented a system which uses data to evolve the reinforcement learning policy. The system is labelled High Confidence because it will only very rarely return an evolved policy which is worse than the existing policy.

Functional Subspace Clustering with Application to Time Series (Mohammad Taha Bahadori, David Kale, Yingying Fan, Yan Liu)

Multivariate time series data gathered from patients in hospital. Data have high dimensionality with a complicated latent structure. One challenge is that the data may have been subjected to deformations or transformations. For example, in the time domain the signal might have been shifted or warped. How does the model discriminate between a signal that has been time warped and one that simply has different frequency content?

Banquet

The day ends with a Banquet which is included with the conference package. Apparently the food is going to be très magnifique. Twitter is hosting a bash at my hotel after that, so I think I might swing by and grab a drink on the way to bed.

Data Science and Selling Shoes

"Standing on the shoulders of giants is all well and good, but before jumping on someone’s back you might want to make sure that they can take the weight. There is a focus in the business world to use data science to sell advertisements. You may have access to the best dataset in the world, but if the people employing you only want you to find out how to best sell shoes with it, is it really worth it?" (Rachel Schutt and Cathy O’Neil, Doing Data Science)

ICML 2015 (Lille, France): Day 3

Selected scribblings from the third day at the International Conference on Machine Learning (ICML 2015) in Lille, France. I'm going out on a limb with some of this, since the more talks I attend, the more acutely aware I become of my limited knowledge of the cutting edge of Machine Learning. Caveat emptor.

Adaptive Belief Propagation (Georgios Papachristoudis, John Fisher)

Belief Propagation describes the passage of messages across a network. The focus of this talk was Belief Propagation within a tree. The authors consider an adaptive algorithm and found that their technique, AdaMP, was significantly better than the current state of the art algorithm, RCTreeBP.

The Hedge Algorithm on a Continuum (Walid Krichene, Maximilian Balandat, Claire Tomlin, Alexandre Bayen)

The content of this talk went a little over my head. I did learn something about regret though, which in the context of decision theory quantifies the negative impact of finding out that a different action would have led to a preferable result. The importance of this concept is that a regret averse individual might anticipate the potential for future regret and factor this into their decision making process.

Deep Edge-Aware Filters (Li Xu, Jimmy Ren, Qiong Yan, Renjie Liao, Jiaya Jia)

Applications: image enhancement, tone mapping (HDR). Much previous research has focused on accelerating various filters, for example, the bilateral filter. However, only after a decade of work has a real time implementation of the bilateral filter been developed.

New filtering techniques are being developed all the time. However, existing acceleration techniques do not necessarily generalise to these new techniques. The authors proposed a uniform framework for accelerating filters with the following benefits:

  1. optimise for one but apply to all;
  2. implementation can be achieved in hardware.

They set out to generate a system which would learn how to achieve the effects of one or more (combined) filters. Learning in the colour domain does not work well because edges are not captured effectively. Optimising in the gradient domain works better though. Learning is done in the gradient domain using a Convolutional Neural Network (CNN), so that the original image needs to be reconstructed from the results of the model.

One of the major benefits of the technique is that it can learn composite filters. So, rather than applying filters A, B and C, the network will learn the results of the combined filters and then be able to generate the composite effect. Furthermore, it does this quickly, with up to 200 times acceleration.

More information on this work can be found at http://www.lxu.me/projects/deepeaf. As a photographer who has played around with filters quite extensively, this is very exciting research!

DRAW: A Recurrent Neural Network For Image Generation (Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, Daan Wierstra)

The DRAW system analyses an image progressively, breaking it down into parts and then analysing each part in sequence. But it does more than that. It "imagines" new images, effectively generating novel image content. The constructed image initially starts out as fairly blurred, but it becomes sharper as the system works on it.

DRAW-numbers

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, Yoshua Bengio)

Objective: image caption generation. Human vision is foveated (spatial focus) and sequential (temporal dynamics). The parsing of an image can also depend on the task at hand (what information the observer is trying to derive from the image).

The authors describe a technique in which the image is segmented and a neural network is used to model the way that an observer's attention would traverse the image. A CNN is then used to extract feature vectors from parts of the image and these are used to generate words which are aggregated to form a caption.

The system generally produces appropriate captions. When it fails to produce a reasonable caption it is readily obvious why it has failed. Some of the failures are pretty amusing.

The model can be "forced" by specifying words that should appear in the caption. The system then attempts to generate a reasonable caption incorporating these words.

Code for this project is available at https://github.com/kelvinxu/arctic-captions. Kelvin also mentioned the Microsoft COCO project.

Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network (Seunghoon Hong, Tackgeun You, Suha Kwak, Bohyung Han)

Visual tracking aims to follow a subject across successive frames in a video. The authors used a CNN to extract features of the subject from the image. An Incremental SVM was then used to identify these features in a new image.

Learning Treatment Policies in Mobile Health (Susan Murphy)

Two objectives of mobile health programmes are monitoring of subjects and also giving them feedback. Two projects were discussed:

  1. Smoking Cessation Health; and
  2. Heartsteps Activity Coach (for heart attack patients).

Subjects are asked to wear chest or wrist bands to monitor their activity. Their smartphones also act as monitoring devices and give push feedback to the subjects. Both of these are a little invasive, so one of the challenges is to ensure that the subjects continue with the programme.

Two types of decision times: regular intervals (where tempo might vary according to what is being observed) and at subject demand.

The system takes into account the availability of the subjects for interventions. For example, if the subject is driving or walking, then no push messages will be sent to the subject. When messages are sent, the logical question is: did the intervention have the required result? To answer this question the first step is to gather data. To get truly robust data it's necessary to make observations within the framework of a randomised clinical trial.

A Stochastic Treatment Policy can be used to retard the habituation rate of subjects.

Finding Galaxies in the Shadows of Quasars with Gaussian Processes (Roman Garnett, Shirley Ho, Jeff Schneider)

The authors describe a project which used spectral data from a large catalog of Quasars to build a Gaussian Process model for Quasar emission spectra. This model was then used to detect evidence of galaxies lying between the Quasars and Earth. Such galaxies would result in absorption of the light emitted by the Quasars (the authors were specifically looking at absorption of light associated with the Lyman-α line of atomic Hydrogen). Data used in this project can be found at http://www.sdss.org/.

Second Poster Session

The day ended with another well catered poster session. Apparently Microsoft is hosting a little shindig later this evening too.

ICML 2015 (Lille, France): Day 2

Some notes from the second day at the International Conference on Machine Learning (ICML 2015) in Lille, France. Don't quote me on any of this because it's just stuff that I jotted down during the course of the day. Also much of the material discussed in these talks lies outside my field of expertise. Caveat emptor.

Two Big Challenges in Machine Learning (Léon Bottou)

Machine Learning is an Exact Science. It's also an Experimental Science. It's also Engineering.

The experimental paradigm in Machine Learning consists of splitting a large data set into training and testing sets. Train. Test. Iterate. Done! But this paradigm is reaching its limits. Other experimental disciplines (for example, Physics, Biology) have to work a lot harder on designing, executing and interpreting experiments.

Léon referred us to the Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks project.

Machine Learning practitioners cannot rely exclusively on train/test experiments. We need to adopt best practices from other experimental sciences. Proposition: Machine Learning papers should specify the limitations in their approach. For example, for data with a Zipf distribution the majority of examples have only a single occurrence. In such a case we should aim to achieve maximum coverage across the data rather than looking at average accuracy.

A Nearly-Linear Time Framework for Graph-Structured Sparsity (Ludwig Schmidt)

Data often exhibits structured sparsity. For example, Astronomical images are mostly empty, but where there is something interesting in the image, it is well structured.

Sparsity is often accompanied by rich structure. This motivates the application of Sparse Linear Regression. From this point the talk very quickly became highly theoretical. Unfortunately, my graduate course in graph theory was too long ago for me to make too much sense of it. From what I could gather, the approach was to represent the vector of regression coefficients as a graph. Considerations for this approach are generality, statistical efficiency and computational efficiency.

Learning from Corrupted Binary Labels via Class-Probability Estimation (Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, Bob Williamson)

Consider a data set consisting of instances with binary labels. Some of these labels are either corrupted (True to False or vice versa) or censored. Corruption occurs at random. A simple approach would be to treat data as if it were not corrupted. But it's possible to do much better than that.

We can model the situation using a pair of parameters, alpha and beta, which describe the relative frequency of corruption of each type of label. There is also an underlying clean class probability, which we aim to access via a corrupted class probability. The latter can be estimated directly from the corrupted data.

Manifold-valued Dirichlet Processes (Hyunwoo Kim, Jia Xu, Baba Vemuri, Vikas Singh)

I didn't know anything about Dirichlet Processes, so I did a bit of reading. It seems that one application we would be interested in is using them for k-means style clustering with an unknown number of clusters.

Multi-instance multi-label learning in the presence of novel class instances (Anh Pham, Raviv Raich, Xiaoli Fern, Jesús Pérez Arriaga)

Multiple instances with multiple labels. For example, a selection of images, each of which has labels indicating some selected contents. More specifically, "Tree", "Cloud" and "Water" in image 1; "Duck" and "Water" in image 2; etc. The novel instances are things that appear in an image but are not reflected in the label. For example, "Grass" in image 1. How do we deal with instances that are not included in the bag of known labels?

Their approach is a discriminative graphical model with exact inference based on multiclass logistic regression which caters for unknown labels.

Entropy evaluation based on confidence intervals of frequency estimates: Application to the learning of decision trees (Mathieu Serrurier, Henri Prade)

The use of entropy as a splitting criterion becomes progressively less effective as you get further down the decision tree due to the sparsity of cases. Simple decision trees are thus prone to overfitting. Possible remedies are early-stopping or pruning.

Support Matrix Machines (Luo Luo, Yubo Xie, Zhihua Zhang, Wu-Jun Li)

Consider a classification problem where the features are matrices (e.g., EEG or image data). There is redundancy or correlation between adjacent elements which would be lost if the data were represented as a vector, since it would ignore structure. One solution is a matrix analogue of SVM. Details were buried in a mountain of maths.

Attribute Efficient Linear Regression with Distribution-Dependent Sampling (Doron Kukliansky, Ohad Shamir)

Two approaches to budgeted learning:

  1. few subjects and many tests;
  2. many subjects with fewer tests.

The latter is obviously preferable from the perspective of the subjects. If the same number of tests are conducted overall, it may not have a significant effect on costs, but perhaps it is possible to get away with fewer tests in total?

Concrete example: 4 subjects with 4 tests each versus 8 subjects with 2 (randomly sampled) tests each. Effectively this leads to a data matrix with missing elements. But it's not really a "missing data" problem. One approach is the AERR algorithm (Linear Regression with Limited Observation) described by Elad Hazan and Tomer Koren.

Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays (Junpei Komiyama, Junya Honda, Hiroshi Nakagawa)

I'll freely admit that, despite having worked in the online gaming industry, I had to check what a Multi-armed Bandit (MAB) referred to in the context of Machine Learning. It turns out that they are not too different from what you would expect! The player has access to an array of machines, each of which produces a random reward which is sampled from a particular distribution associated with that machine. He has to decide which of these machines to play, the number of times that he plays each machine and the sequence in which he plays them. The player attempts to optimise the cumulative reward across a series of games. He has to balance exploitation of machines that he already knows to provide higher rewards with exploration of other machines (which might, in principle, yield even higher rewards).

Somewhat surprisingly, this model has been applied to diverse scenarios like resource allocation in the presence of uncertainty, but where the level of uncertainty declines with time.
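
For concreteness, the sketch below is the textbook Thompson Sampling strategy for a Bernoulli bandit with a single play per round (not the authors' multiple-play algorithm): each arm gets a Beta posterior, a success probability is sampled from each posterior and the arm with the largest sample is played.

set.seed(42)

p.true <- c(0.35, 0.45, 0.55)            # unknown success probabilities
K <- length(p.true)
wins <- losses <- rep(0, K)              # Beta(1, 1) prior on each arm

for (t in 1:10000) {
  # Sample a plausible success probability for each arm from its posterior
  # and play the arm with the largest sample.
  draws <- rbeta(K, wins + 1, losses + 1)
  arm <- which.max(draws)
  # Observe the reward and update that arm's posterior.
  reward <- rbinom(1, 1, p.true[arm])
  wins[arm] <- wins[arm] + reward
  losses[arm] <- losses[arm] + 1 - reward
}

wins + losses                            # most plays go to the best arm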

Compressing Neural Networks with the Hashing Trick (Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, Yixin Chen)

Deep Learning is easier if you have access to significant computing resources. But what if you want to do it on a small device (like a phone) with a relatively low-end CPU and limited RAM? In a neural network model the weight matrices are the biggest consumers of memory. Compressing these matrices will make neural networks possible on smaller devices. A variety of techniques exist for reducing the size of these matrices, all of which introduce only a marginal amount of additional error for moderate compression (so there is definitely room for this approach without compromising efficiency). The hashing trick appears to achieve the most effective compression without severe impacts on performance.

Deep Learning with Limited Numerical Precision (Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan)

Consider algorithms running on numerically noisy hardware. From a computational perspective the noisy hardware is likely to be more efficient. Using low precision (16 bit versus 32 bit) fixed point (as opposed to floating point) arithmetic results in improvements in memory footprint as well as data transfer bandwidth. But what is the impact on Machine Learning algorithms?

The proportion of the bits in the fixed point number which are allocated to representing the fractional part has a strong influence on the results. This occurs when "conventional" round-to-nearest rounding schemes are used. However, when stochastic rounding (where numbers are rounded up or down randomly with weights depending on the distance to the ceiling or floor) is applied, the results are much better.
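
The idea behind stochastic rounding is easy to illustrate: a value is rounded up with probability equal to its fractional distance from the floor, and down otherwise, so the rounding error is zero in expectation. A toy version in R (rounding to integers rather than to a fixed point grid):

stochastic.round <- function(x) {
  floor(x) + (runif(length(x)) < x - floor(x))
}

x <- rep(2.3, 10000)
mean(round(x))               # round-to-nearest always gives 2
mean(stochastic.round(x))    # stochastic rounding averages roughly 2.3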

Scalable Model Selection for Large-Scale Factorial Relational Models (Chunchen Liu, Lu Feng, Ryohei Fujimaki, Yusuke Muraoka)

Relational models attempt to reveal the latent group structure in a network and also to predict unseen links that have not yet been captured by the network.

Consistent estimation of dynamic and multi-layer block models (Qiuyi Han, Kevin Xu, Edoardo Airoldi)

A simple network can be insufficient to describe some phenomenon of interest. Instead one might consider a dynamic network, which changes over time, or a multi-layer network, including information relating to different relationships (for example, connections across Facebook and LinkedIn). A single network can be modelled using an adjacency matrix. A Stochastic Block Model (SBM) incorporates a number of classes, where edges are formed with a class-dependent probability. In this model the matrix of edge probabilities has a block structure when the nodes are ordered by class.

The authors worked with an SBM for multi-graphs (sequences of graph snapshots), where the edge probability matrix changed between snapshots but the classes remained static. One of the examples presented was based on the Reality Commons data set, which was analysed using a multi-graph SBM approach.

Community Detection Using Time-Dependent Personalized PageRank (Haim Avron, Lior Horesh)

Community detection is the process of finding a collection of nodes which are internally cohesive and well separated. One technique for identifying communities is based on graph diffusion. The fundamental quantity involved in this process is a diffusion vector. The classical form of the diffusion vector is the PageRank vector as described by Brin and Page.

Poster Session

A well catered poster session with an extensive range of interesting contributions. Delicious salami. It might be my downfall.