I’ve got a long-running optimisation problem on an EC2 instance. Yesterday it was mysteriously killed. I shrugged it off as an anomaly and restarted the job. However, this morning it was killed again. Definitely not a coincidence! So I investigated. This is what I found and how I am resolving the problem.
I had the job running on a c4.2xlarge instance with 8 vCPUs and 15 GiB of RAM. I’d also added 4 GiB of swap space. Seemed to be perfectly adequate.
Understanding the Problem
The jobs died with a curt and rather uninformative message in the console:
Hard to figure out the source of the problem. Luckily Ubuntu comes with a plethora of tools for debugging. I had a look at the output from dmesg and that immediately pointed me to the source of the problem.
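If you want to skip straight to the relevant part of a long kernel log, a small filter helps. This is only a sketch: the function name `oom_lines` is made up for illustration, and the patterns are my assumption about the phrases the kernel typically logs around an OOM kill, so adjust them to what your kernel actually prints.

```shell
# Keep only the lines of a kernel log (read from stdin) that look
# OOM-related. The patterns are assumptions, not an exhaustive list.
oom_lines() {
    grep -i -E 'oom-killer|out of memory|killed process'
}

# Typical use (dmesg may need sudo where access is restricted):
#   dmesg | oom_lines
```

Because the function only reads stdin, it works equally well on a saved log file, for example `oom_lines < /var/log/kern.log` on systems that keep one.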
The dmesg output is included below. I’ve edited out some of the irrelevant details.
That’s the first sign that something is going horribly wrong: R invoked oom-killer. The OOM Killer is responsible for killing tasks when the system is running out of memory.
Some high-level information on memory allocation. Note that all of the swap space has been used!
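The same swap figures can be checked on a live system before the kernel has to step in. The sketch below assumes the usual `SwapTotal:` and `SwapFree:` fields of `/proc/meminfo` (values in kB); the function name `swap_used_pct` is hypothetical.

```shell
# Print the percentage of swap currently in use, reading
# /proc/meminfo-style text from stdin. Integer result, rounded down.
swap_used_pct() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END {
             if (total > 0) printf "%d\n", (total - free) * 100 / total
             else print "no swap"
         }'
}

# Typical use:
#   swap_used_pct < /proc/meminfo
```

Running this periodically while the job is active would show whether swap is creeping towards 100% well before anything gets killed.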
Then some details on memory allocation to individual processes.
Note the final eight lines, which correspond to my optimisation job (it’s running in parallel with seven worker processes). The biggest memory hog is PID 3026, which is the R master task.
Then the final coup de grâce: killing PID 3026, which in turn took down the rest of the R tasks.
Fixing the Problem
Obviously memory is the issue here. I thought that the available RAM and swap were sufficient, but evidently I was mistaken. These are the options I’m exploring to solve the problem:
upgrade to a larger instance (the m4.2xlarge also has 8 vCPUs but 32 GiB of RAM, although it’s a General Purpose rather than a Compute Optimised instance); or
add an EBS volume and create a wide swathe of swap space.
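For the second option, this is roughly what turning an attached EBS volume into swap could look like. It is a sketch under assumptions: the device name `/dev/xvdf` is hypothetical (check `lsblk` for the real one), the helper names `run` and `enable_swap` are made up, and the whole volume is used as swap, which destroys any data on it. By default the commands are only printed; set `DRY_RUN=0` to actually execute them.

```shell
# Print a command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Format a block device as swap and enable it.
# WARNING: mkswap overwrites whatever is on the device.
enable_swap() {
    run sudo mkswap "$1"
    run sudo swapon "$1"
}

# Typical use (hypothetical device name):
#   enable_swap /dev/xvdf              # dry run, prints the commands
#   DRY_RUN=0 enable_swap /dev/xvdf    # really do it
#   swapon --show                      # confirm the new swap is active
```

Swap enabled this way does not survive a reboot unless it is also added to `/etc/fstab`.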