Kaggle: Santa's Stolen Sleigh

This morning I read Wendy Kan's interesting post on Creating Santa's Stolen Sleigh. I hadn't really thought too much about the process of constructing an optimisation competition, but Wendy gave some interesting insights on the considerations involved in designing a competition which was both fun and challenging but still computationally feasible without military grade hardware.

This seems like an opportune time to jot down some of my personal notes and also take a look at the results. I know that this sort of discussion is normally the prerogative of the winners and I trust that my ramblings won't be viewed as presumptuous.

Santa's Stolen Sleigh closed on 8 January 2016. At the top of the leaderboard:


Well done, what a phenomenal effort! Check out the interview with Woshialex & Weezy.

Problem Summary

The problem could be summarised as follows: 100 000 Christmas gifts have to be delivered from the North Pole to various locations across the Earth while minimising the Weighted Reindeer Weariness, a metric depending on weight carried and distance travelled. Distances were to be calculated using the Haversine formula which takes into account the curvature of the Earth's quasi-spherical surface. Each gift had a weight between 1 and 50 (units unspecified). In addition the sleigh itself had a weight of 10 (same unspecified and obviously rather magical units) and a weight capacity of 1000.

... when the sleigh weight was too small, there's no incentive to merge any trips.Wendy Kan

The problem could be decomposed into two parts:

  1. partition the gifts into trips, where the total weight of gifts to be delivered on any particular trip must satisfy the weight constraint of the sleigh; and
  2. choose the order in which the gifts should be delivered on each trip.

A solution would jointly solve both parts in such a way as to minimise Weighted Reindeer Weariness. It's a variation of the Travelling Salesman Problem. In the course of doing some preliminary research I had a look at a couple of videos (video one and video two), which were definitely worth watching for some background information and ideas.

The competition was sponsored by FICO, who suggested a Mixed Integer Programming formulation for the problem. FICO made their Xpress Optimization Suite available to participants.

Exploratory Analysis

The first thing I did was to take a look at the distribution of gifts across the planet. I found that they were more or less evenly distributed across all the continents. Who would have guessed that so many good children lived in Antarctica? Perhaps there's limited scope to misbehave down there?

You better watch out
You better not cry
Better not pout
I'm telling you why
Santa Claus is coming to town.
Santa Claus Is Coming To Town

An obvious approach to the problem was to cluster the gifts according to their proximity and then deliver these clusters of gifts on a single trip. But this approach posed a problem: generally the k-means algorithm is implemented using the Euclidean distance, which makes sense on a plane but not the surface of a sphere. Of course, a distance matrix could be calculated using the Haversine formula but this would be a pretty enormous matrix, certainly too large for my hardware.

I could, however, readily calculate the distance between each gift and its nearest neighbour, the distribution of which is displayed below. The majority of gifts are located less than 20 km from their nearest neighbour. Some are much closer, within a few hundred metres of each other. Others are significantly more distant. The summary data below indicates that the most isolated location is 2783 km from its nearest neighbour. Obviously the horizontal axis of the plot has been aggressively truncated!

> summary(gifts$dist_nearest)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1424   11.6500   18.1300   19.6200   25.9400 2783.0000


Let's look at the most isolated locations. The red points on the map below indicate gift locations which are at least 150 km distant from any other locations. Some of these are on extremely isolated pieces of land like Saint Helena, Coronation Island and the French Southern and Antarctic Lands. In terms of clustering, these points were going to be a little tricky and probably best delivered en route to somewhere else.


Clustering and Sorting

My first approach to solving the problem involved a simple clustering scheme. My gut feel was that in a tight cluster of gifts it would be important to deliver the heaviest ones first. I tried applying this simple heuristic to my clusters and the results were not particularly good. I then used a Genetic Algorithm (GA) to optimise the order of delivery within each cluster. The table below shows gifts allocated to two distinct trips (identified by TripId) with the order of delivery determined by the GA. These data confirmed that the best strategy was not simply to deliver the heaviest gifts first. Useful (and obvious in retrospect!).

   GiftId TripId Latitude  Longitude    Weight
1   72834      1 52.08236  -99.84075 19.450129
2   31544      1 51.99433  -99.66718 16.401965
3   13514      1 51.54432 -100.80105 17.188096
4    3711      1 51.42292  -99.61607  5.959066
5   69035      1 51.42310  -99.21229  1.000000
6   50071      2 52.14381 -100.39226 10.185356
7   75708      2 52.05063 -100.09734 11.833779
8   90582      2 51.57074 -100.11467 25.335999
9   29591      2 50.94006  -98.51085 50.000000
10  94799      2 51.40380 -100.10319 10.719852
11  64255      2 51.39730  -99.99052  1.000000
12  31418      2 50.98414 -100.38163  1.000000
13   6254      2 51.37832 -100.42426  1.000000
14  40247      2 51.18464 -100.77533  9.352351
15  77396      2 51.27686 -100.77209  2.605262
16   2600      2 50.82170  -99.46544  1.000000

My submission based on simple clustering with GA optimisation was fairly decent but nothing to tweet about.


Great Circle Solution

Thinking further about the problem it occurred to me that it might be efficient to deliver the gifts along paths that are as straight as possible, with minimal deviations. How about attempting to deliver the gifts along a Great Circle arc? For paths originating at the pole this would be equivalent to moving out and back close to a single line of longitude. It would certainly improve one aspect of Weighted Reindeer Weariness: the distance covered.

The minor arc of a great circle between two points is the shortest surface-path between them.Great Circle, Wikipedia

Putting this idea into practice and dividing the gifts up into longitudinal slices lead to an appreciable bump in my score.


My final innovation was to apply the same longitudinal slicing strategy but treat Antarctica separately. This again lead to an immediate improvement. I tried splitting off other areas of the globe, like Australia, but this didn't yield positive results.


So in the end my clustering had very little to do with gifts that were close together but instead grouped gifts which lay close to a meridional path originating at the North Pole. Definitely not the ideal approach, but one based on good physical principles and yielding a solution which was both robust and quick to compute.

I then shifted my attention to optimising the order of delivery within each quasi-meridional group. I started off with a GA, which worked pretty well but seemed a bit pedestrian in producing innovation. I then tried 2-opt as an alternative but found that it didn't make too much difference. Perhaps this was due to the linear arrangement of gifts? In the end I returned to the GA and just let that grind away until the competition closed, in the process trimming my score down to 12482212142. Progress was slow...

Below are some of the delivery routes designed using this approach. The size of the dots is proportional to weight and the dashed lines indicate the order of delivery. The first panel is an example in which it was optimal to simply deliver the gifts according to descending latitude. This worked well because all of the gifts lay very close to a straight line and no major lateral deviations were required. This ordering was, in fact, the starting point for my GA optimisation. The remaining panels show examples where it was more efficient to delay delivery of some of the gifts until the return journey. Notably, these were gifts with lower weight, where a relatively small penalty was incurred by delaying their delivery, but much was gained by delivering other gifts first.

crop-trip-plot-710 crop-trip-plot-3 crop-trip-plot-501 crop-trip-plot-971

Here are some examples of delivery schedules for gifts going to Antarctica. The range of longitudes spanned by each of these trips is larger than for the trips only going to higher latitudes, but this makes sense: once you have transported the gifts over the Equator and across the Southern Ocean, most of the travelling has already been done, so a bit of latitudinal distance does not have too much impact.

crop-trip-plot-1449 crop-trip-plot-1467 crop-trip-plot-1492 crop-trip-plot-1566

Okay, that's enough about my attempts, let's take a quick look at the leaderboard.

Leaderboard Analysis

It's apparent that the winner's final solution was a significant improvement on his initial submissions. In fact, it's interesting to see that his first three scores were 29121011015, 29120957549 and 29120751717 (all of which were a long way from the money at that stage of the competition) before plummeting to 13442292734. A series of smaller improvements whittled his score down to the final winning value of 12384507107.

On average competitors made around 7 submissions. Of the 1127 competitors, 275 made only a single submission (this was also the most common submission count). The relationship between submission count and best score is plotted below. Note that the submission counts here can differ from those reflected on the leaderboard: only submissions which resulted in an improved score are included below, whereas the leaderboard reflects the total number of submissions made, irrespective of whether they resulted in an improvement. It's evident that those who made more submissions generally ended up with better final scores. There's a couple of anomalies at 26 and 39 submissions caused by a pair of competitors who showed remarkable persistence.

Finally, looking at the number of submissions per day, we see that the rate is more or less constant (averaging around 200 per day) up until the final day of the competition when things spiked appreciably up to 510 submissions. Of those final day entries, 128 were submitted in the last hour and 5 in the final 30 seconds. The winner made his final submission around 8 minutes before the final bell.

Well, that's about it. It was a really great competition. Fortunately there'll be another similar one later on this year: yet another reason to look forward to Christmas.

Casting a Wide (and Sparse) Matrix in R

I routinely use melt() and cast() from the reshape2 package as part of my data munging workflow. Recently I've noticed that the data frames I've been casting are often extremely sparse. Stashing these in a dense data structure just feels wasteful. And the dismal drone of page thrashing is unpleasant.

So I had a look around for an alternative. As it turns out, it's remarkably easy to cast a sparse matrix using sparseMatrix() from the Matrix package. Here's an example.

First we'll put together some test data.

> set.seed(11)
> N = 10
> data = data.frame(
+   row = sample(1:3, N, replace = TRUE),
+   col = sample(LETTERS, N, replace = TRUE),
+   value = sample(1:3, N, replace = TRUE))
> data = transform(data,
+                  row = factor(row),
+                  col = factor(col))

It's just a data.frame with two fields which will be transformed into the rows and columns of the matrix and a third field which gives the values to be stored in the matrix.

> data
   row col value
1    1   E     1
2    1   L     3
3    2   X     2
4    1   W     2
5    1   T     1
6    3   O     2
7    1   M     2
8    1   I     1
9    3   E     1
10   1   M     2

Doing the cast is pretty easy using sparseMatrix() because you specify the row and column for every entry inserted into the matrix. Multiple entries for a single cell (like the highlighted records above) are simply summed, which is generally the behaviour that I am after anyway.

> library(Matrix)
> data.sparse = sparseMatrix(as.integer(data$row), as.integer(data$col), x = data$value)
> colnames(data.sparse) = levels(data$col)
> rownames(data.sparse) = levels(data$row)

And here's the result:

> data.sparse
3 x 8 sparse Matrix of class "dgCMatrix"
  E I L M O T W X
1 1 1 3 4 . 1 2 .
2 . . . . . . . 2
3 1 . . . 2 . . .

Kaggle: Walmart Trip Type Classification

Walmart Trip Type Classification was my first real foray into the world of Kaggle and I'm hooked. I previously dabbled in What's Cooking but that was as part of a team and the team didn't work out particularly well. As a learning experience the competition was second to none. My final entry put me at position 155 out of 1061 entries which, although not a stellar performance by any means, is just inside the top 15% and I'm pretty happy with that. Below are a few notes on the competition.

Before I get started, congratulations to the competitors at the top of the leaderboard! You guys killed it.


Data Preparation

Getting to my final model was quite a process, with many moments of frustration, disappointment, enlightenment and exhilaration along the way.

The first step was to clean up the data. Generally it appeared to be in pretty good shape, with nothing extraordinary jumping out at me. There were some minor issues though, for example, department labels for both "MENSWEAR" and "MENS WEAR", which needed to be consolidated into a single category.

My initial submission was a simple decision tree. This was an important part of the process for me because it established that I had a working analysis pipeline, a submission file in valid format, and a local value of the evaluation metric which was consistent with that on the public leaderboard. Submissions were gauged according to log loss and my first model scored 2.79291, quite a bit better than the Random Forest & Department Description benchmark at 5.77216, but still nowhere near competitive.


I then did a few very elementary things with the data and applied an XGBoost model, which resulted in a significant bump in model performance. I hadn't worked with XGBoost before and I discovered what some of the hype's about: even with the default parameters it produces an excellent model. It's also blazing fast.


That result was a lot more respectable. Time to dig deeper.

Feature Engineering

I realised that I would only be able to go so far with the existing features. Time to engineer some new ones. After taking a cursory look at the data, a few obvious options emerged:

  • the number of departments visited;
  • the total number of items bought (PositiveSum) or returned (NegativeSum);
  • the net number of items bought (WholeSum, being the difference between PositiveSum and NegativeSum);
  • the number of departments from which items were bought (DepartmentCount); and
  • groupings of various departments, giving new categories like Grooming (a combination of BEAUTY and PERSONAL_CARE), Baby (INFANT_APPAREL and INFANT_CONSUMABLE_HARDLINES) and Clothes (everything relating to clothing from BOYS_WEAR to PLUS_AND_MATERNITY).

Throwing those into the mix provided another, smaller improvement.


To see why these new features were effective, take a look at the data below which illustrates the clear distinction in the distribution of PositiveSum for trip types 39 and 40.

Kaggle Walmart Trip Type Classification: trip types 39 and 40

Below are the relative feature importances generated by one of my models. It's evident that both the WholeSum and PositiveSum (or its logarithm) were important. Clothes and financial services also featured highly.


Enough about my attempts, let's scope the leaderboard.

Leaderboard Analysis

I discovered something interesting while trolling the bottom end of the leaderboard page: you can download statistics for all competition entries. The data are presented as a CSV file. Here's the head.

230879,HulkBulk,"2015-10-26 18:58:32",34.53878
230879,HulkBulk,"2015-10-26 19:49:31",10.42797
230879,HulkBulk,"2015-10-26 20:03:20",7.90711
230907,"Bojan Tunguz","2015-10-26 20:12:06",34.53878
230938,Sadegh,"2015-10-26 21:41:55",34.53878
230940,"Paul H","2015-10-26 21:56:17",34.53878
230942,NxGTR,"2015-10-26 22:06:44",34.53878
230945,Chippy,"2015-10-26 22:14:40",3.44965
230940,"Paul H","2015-10-26 22:16:57",32.29692

Let's first look at the distribution of best and worst scores per competitor. The histogram below shows a peak in both best and worst scores around the "All Zeros Benchmark" at 34.53878. The majority of the field ultimately achieved best scores below 5.

Scrutinising the distribution of best scores reveals a peak between 0.6 and 0.7. Only a small fraction of the competitors (6.3%) managed to push below the 0.6 boundary, leaving the elite few (0.6%) with final scores below 0.5.

        group count percent
       (fctr) (int)   (dbl)
1  (0.45,0.5]     6  0.5655
2  (0.5,0.55]    21  1.9793
3  (0.55,0.6]    40  3.7700
4  (0.6,0.65]    93  8.7653
5  (0.65,0.7]    75  7.0688
6  (0.7,0.75]    39  3.6758
7  (0.75,0.8]    32  3.0160
8  (0.8,0.85]    31  2.9218
9  (0.85,0.9]    46  4.3355
10 (0.9,0.95]    21  1.9793

The scatter plot below shows the relationship between best and worst scores broken down by competitor.

Overplotting kind of kills that. Obviously a scatterplot is not the optimal way to visualise those data. A contour map offers a better view, yielding three distinct clusters: competitors who started off close to the "All Zeros Benchmark" and stayed there; ones who debuted near to the "All Zeros Benchmark" and subsequently improved dramatically and, finally, those whose initial entries were already substantially better than the "All Zeros Benchmark".

Next I looked for a pattern in the best or worst submissions as a function of first submission date. There's certainly evidence to suggest that many of the more competitive best scores were achieved by people who jumped onto this competition within the first week or so. Later in the competition there were more days when people joined the competition who would ultimately achieve poorer scores.

There's less information to be gleaned from looking at the same data against the date of last submission. Throughout the competition there were final entries from competitors that were close to the "All Zeros Benchmark". What happened to them? Were they discouraged or did they team up with other competitors?

The number of submissions per day remained consistently between 50 and 100 until the middle of December when it ramped up significantly, reaching a peak of 378 submissions on the last day of the competition. Of the entries on the final day, almost 20% were made during the last hour before the competition closed.

The box plot below indicates that there's a relationship between the number of submissions and best score, with competitors who made more submissions generally ending up with a better final score. There are, of course, exceptions to this general rule. The winner (indicated by the orange square) made only 6 submissions, while the rest of the top ten competitors made between 19 and 83 entries.

Some of the competitors have posted their work on a source code thread in the Forums. There'll be a lot to learn by browsing through that.

I've just finished the Santa's Stolen Sleigh competition and I'll be reporting on that in a few days time. Also working on a solution for the Homesite Quote Conversion competition, which is providing a new range of challenges.

Installing MongoDB 3.2 on Windows 7

It's not my personal choice, but I have to spend a lot of my time working under Windows. Installing MongoDB under Ubuntu is a snap. Getting it going under Windows seems to require jumping through a few more hoops. Here are my notes. I hope that somebody will find them useful.

  1. Download the installation. This will be an MSI installer package with a name like mongodb-win32-x86_64-2008plus-ssl-3.2.0-signed.msi.
  2. Run the installer with a deft double-click.

  3. Accept the License Agreement.
  4. Select the Complete installation type and click Install.
  5. Briefly browse YouTube (you won't have time to make a cup of coffee).
  6. When the installation is complete press Finish.
  7. Reboot your machine. This might not be entirely necessary, but my Windows experience tells me that it is never a bad idea.
  8. Create a folder for the data files. By default this will be C:\data\db.
  9. Create a folder for the log files. By default this will be C:\data\log.
  10. Open a command prompt, change the working directory to C:\Program Files\MongoDB\Server\3.2\bin and start the database server, mongod.exe.

At this stage you should be ready to roll. Open another command prompt and start the database client, mongo.exe which you'll find in the same folder as mongod.exe.

To make your installation a little more robust, you can also do the following:

  1. Create a configuration file at C:\Program Files\MongoDB\Server\3.2\mongod.cfg. For starters you could enter the following configuration directives:
        destination: file
        path: c:\data\log\mongod.log
        dbPath: c:\data\db
  2. Install MongoDB as a service by running
    mongod.exe --config "C:\Program Files\MongoDB\Server\3.2\mongod.cfg" --install
  3. The service can then be launched with
    net start MongoDB

    And stopping the service is as simple as

    net stop MongoDB

Review: Mastering Python Scientific Computing

Cover of Mastering Python Scientific Computing

I was asked to review "Mastering Python Scientific Computing" (authored by Hemant Kumar Mehta and published in 2015 by Packt Publishing). I was disappointed by the book. The title lead me to believe that it would help me to achieve mastery. I don't feel that it brought me any closer to this goal. To be sure, the book contains a lot of useful information. But the way that it is written is a weird compromise between high level overview and low level details. And it's the middle ground in between, where explanations are given and a link is made between the details and the global picture, which I normally find most instructive.

A complete guide for Python programmers to master scientific computing using Python APIs and tools.Mastering Python Scientific Computing

I had high hopes for this book. I tried to love this book. But in the end, despite my best efforts, I got completely frustrated after Chapter 5 and just gave up. Here's the Table of Contents and some notes on the first two chapters.

  1. The Landscape of Scientific Computing and Why Python?
    The first chapter presents an overview of Scientific Computing with some examples of applications and an outline of a typical workflow. It then digs rather deeply into the issue of error analysis before giving some background on the Python programming language. Although Python and Scientific Computing are central to this book, I found an introduction to both topics within a single chapter to be a little disconcerting.

  2. A Deeper Dive into Scientific Workflows and the Ingredients of Scientific Computing Recipes
    This chapter starts off by addressing a range of techniques used in Scientific Computing, including optimisation, interpolation and extrapolation, numerical integration and differentiation, differential equations and random number generation. Attention then shifts to Python as a platform for Scientific Computing. A lengthy list of relevant packages is given before Python's capabilities for interactive computing via IPython are introduced. The chapter concludes with a description of symbolic computing using SymPy and some examples of Python's plotting capabilities.

  3. Efficiently Fabricating and Managing Scientific Data
  4. Scientific Computing APIs for Python
  5. Performing Numerical Computing
  6. Applying Python for Symbolic Computing
  7. Data Analysis and Visualization
  8. Parallel and Large-scale Scientific Computing
  9. Revisiting Real-life Case Studies
  10. Best Practices for Scientific Computing

The book is riddled with numerous small errors. These really should have been picked up by the author and editor, or indeed by either of the two reviewers credited at the front of the book. I have submitted errata via the publisher's web site, which were duly acknowledged.

Review: Learning Shiny

Cover of Learning Shiny

I was asked to review Learning Shiny (Hernán G. Resnizky, Packt Publishing, 2015). I found the book to be useful, motivating and generally easy to read. I'd already spent some time dabbling with Shiny, but the book helped me graduate from paddling in the shallows to wading out into the Shiny sea.

The book states its objective as:

... this book intends to be a guide for the reader to understand the scope and possibilities of creating web applications in R, and from here, make their own path through a universe full of different possibilities."Learning Shiny" by Hernán G. Resnizky

And it does achieve this goal. If anything it's helpful in giving an idea of just what's achievable using Shiny. The book has been criticised for providing little more than what's to be found in the online Shiny tutorials, and there's certainly a reasonable basis for this criticism.

I'm not going to dig into the contents in any great detail, but here's the structure of the book with a few comments on each of the chapters.

  1. Introducing R, RStudio, and Shiny
    A very high level overview of R, RStudio and Shiny, what they do and how to get them installed. If you've already installed them and you have some prior experience with R then you could safely skip this chapter. I did, however, learn something new and useful about collapsing code blocks in RStudio, so perhaps don't be too hasty.

  2. First Steps towards Programming in R
    Somewhat disturbingly the first section in this chapter is entitled Object-oriented programming concepts but doesn't touch on any of the real Object Oriented support in R (S3, S4 and Reference classes). In fact, with regards to Object Oriented concepts it seemed to have rather the wrong end of the stick. The chapter does redeem itself though, giving a solid introduction to programming concepts in R, covering fundamental data types, variables, function definitions, control structures, indexing modes for each of the compound data types and reading data from a variety of sources.

  3. An Introduction to Data Processing in R
    This chapter looks at the functionality provided by the plyr, data.table and reshape2 packages, as well as some fundamental operations like sorting, searching and summarising. These are all central to getting data into a form suitable for application development. It might have made more sense to look at dplyr rather than plyr, but otherwise this chapter includes a lot of useful information.

  4. Shiny Structure - Reactivity Concepts
    Reactivity is at the core of any Shiny application: if it's not reactive then it's really just a static report. Having a solid understanding of reactivity is fundamental to your success with Shiny. Using a series of four concise example applications this chapter introduces the bipartite form of a Shiny application, the server and UI, and how they interact.

  5. Shiny in Depth - A Deep Dive into Shiny's World
    The various building blocks available for constructing a UI are listed and reviewed, starting with those used to structure the overall layout and descending to lower level widgets like radio buttons, check boxes, sliders and date selectors.

  6. Using R's Visualization Alternatives in Shiny
    Any self-respecting Shiny application will incorporate visualisation of some sort. This chapter looks at ways of generating suitable visualisations using the R's builtin graphics capabilities as well as those provided by the googleVis and ggplot2 packages. Although not covered in the book, plotly is another excellent alternative.

  7. Advanced Functions in Shiny
    Although the concept of reactivity was introduced in Chapter 4, wiring up any non-trivial application will depend on the advanced concepts addressed in this chapter. Specifically, validation of inputs; isolate(), which prevents portions of server code from activating when inputs are changed; observe(), which provides reactivity without generating any output; and ways to programmatically update the values of input elements.

  8. Shiny and HTML/JavaScript
    This chapter looks at specifying the Shiny UI with lower level code using either HTML tags, CSS rules or JavaScript. If you want to differentiate your application from all of the others, the techniques discussed here will get you going in the right direction. It does, however, assume some background in these other technologies.

  9. Interactive Graphics in Shiny
    It's also possible to drive a Shiny application by interacting with graphical elements, as illustrated here using JavaScript integration.

  10. Sharing Applications
    Means for sharing a Shiny application are discussed. The options addressed are:

    • just sharing the code directly or via GitHub (not great options because they significantly limit the range of potential users); or
    • hosting the application at http://www.shinyapps.io/ or your own server.

  11. From White Paper to a Full Application
    This chapter explores the full application development path from problem presentation and conceptual design, through coding UI.R and server.R, and then finally using CSS to perfect the appearance of the UI.

I did find a few minor errors which I submitted as errata via the publisher's web site. There are also a couple of things that I might have done differently:

  • I'm generally not a big fan of screenshots presented in a book, but in this case it would have been helpful to have a few more screenshots illustrating the effects of the code snippets.
  • Styling Shiny applications using CSS was only touched on: I think that there's a lot to be said on this subject and I would have liked to read more.

Using Checksum to Guess Message Length: Not a Good Idea!

A question posed by one of my colleagues: can a checksum be used to guess message length? My immediate response was negative and, as it turns out, a simple simulation supported this knee-jerk reaction.

Here's the situation: a piece of software has been written to process a stream of messages. Each message is a sequence of bytes, where the length of the sequence varies, and is terminated by a checksum byte, which is calculated as the cumulative bitwise XOR of the message bytes. There is no header information specifying the length of the sequence. Is it possible to use the checksum byte to locate the end of each message?

It seems like this might just be feasible. But there's a problem: a coincidental match would cause a premature end to a message. To illustrate this point, let's look at an example. Consider a message composed of the following series of bytes:

154  59 161 111 127 182 227  37   8 170 194

Now look at the checksums generated from the message as each new byte arrives.

    154 161   0 111  16 166  69  96 104 194

The first byte in the message is 154, so the checksum is just 154. The second byte in the message is 59, which doesn't match the checksum so we proceed to recalculate the checksum for the first two bytes, which is now 161. The third byte in the message is 161, which just happens to match the checksum! If we were using the checksum to determine the message length then we'd (wrongly) conclude that the message had come to an end.

Our ad hoc method for determining message length would have failed for this message in particular. But there would also be a knock-on effect because failing to correctly identify the end of one message would also mean incorrectly identifying the beginning of the next message, so the whole sequence of messages and checksums would get out of whack. That's a far bigger problem. In the interests of simplicity we'll just focus on the failure rate for a single message.

The quickest way to estimate the failure rate for our ad hoc system is simulation. I generated a population of randomised messages of increasing length and then calculated the expected failure rate as a function of message length. The results are plotted below with the message length on the x-axis and probability of falsely identifying a byte in the message as the checksum byte on the y-axis. The blue like is the average failure rate from the simulations and the orange band represents the 95% binomial confidence interval. I consolidated all failures per message. So, for example, it’s possible that a particular message might have more than one point at which the checksum byte could be mistaken. This is simply counted as a single failure.


According to these results, the likelihood of failure is relatively small if the messages are short. For example, a messsage of 16 bytes would have a 6% failure rate. However, as the messages get longer the likelihood of mistakenly identifying a checksum byte escalates significantly. By the time that the message is 256 bytes long the likelihood of a mistake is 65%. So for messages of practical lengths this is not going to be a feasible proposition.

Update: A theoretical curve (courtesy of a comment from Christine) has been added to the plot.