Fixing Bazel out-of-memory problems

Memory management is hard in computer systems generally. Debugging it inside a cloud-hosted build system is even worse!

There are two potential problems:

  • The Bazel server runs in a JVM, and it internally tries to allocate more objects than the max heap size it's allowed.
  • Bazel spawns subprocesses (called "actions", including test actions) and they collectively exhaust the memory on the machine or VM where Bazel runs.

I'll cover these scenarios separately since they're mostly unrelated.

Of course, the Bazel JVM heap does occupy system memory, so they're related in the sense that a smaller Bazel server footprint would allow for more actions to run, but I've never considered that to be a potential remediation.

Bazel server out-of-memory

How to tell this is happening:

  • Bazel exits with code 33 (see ExitCodes.java)
  • If the Bazel server is still running, check the output of bazel info | grep heap and see whether the used heap is near the max.
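
For example, here's what that looks like on a server that's close to the limit (key names as bazel info prints them; the numbers here are made up):

```
$ bazel info | grep heap
committed-heap-size: 4096MB
max-heap-size: 4096MB
used-heap-size: 3975MB
```

If used-heap-size keeps pressing against max-heap-size, the server is probably spending most of its time in garbage collection before it finally gives up.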

Some things you can do about it:
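
  • Give the server JVM a bigger heap via the --host_jvm_args startup option. A minimal .bazelrc sketch is below.
  • Run bazel shutdown so the server restarts and discards its accumulated state. If the heap only fills up after many incremental builds, this buys you time.
  • Trade away some incrementality for memory with --notrack_incremental_state, which tells Bazel not to retain the state it uses for fast incremental builds.

Here's the .bazelrc sketch for the first item; the 6g value is just an example, size it to your machine:

```
# .bazelrc
# Startup options go on a "startup" line; changing them restarts the server.
# -Xmx raises the JVM max heap from the default.
startup --host_jvm_args=-Xmx6g
```

The same option works one-off on the command line, before the command verb: bazel --host_jvm_args=-Xmx6g build //....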

System out-of-memory

Bazel schedules actions (build steps and test runners) based on the amount of system resources it thinks are available, using a heuristic about how much RAM a typical action requires. Two things can go wrong: either Bazel thinks more RAM is available than the system actually has free, or Bazel underestimates how much to reserve for a given action.

By default, Bazel's max concurrency is based on the heuristic that each action needs one CPU core, so the --jobs flag defaults to the number of (possibly virtual) CPUs on the machine. Note that Bazel reports progress with an "X running" indicator, which might lead you to believe the concurrency is actually higher, but that message is misleading because the count can include actions that are queued waiting for resources.

Did you know when @bazelbuild prints progress like
[12 / 100] 32 actions, 30 running
That "running" count includes "remote-cache" spawns! If you have --jobs=16 (the default on 16 core) the other 14 of them aren't actually running, they're queued for "local" spawn. https://t.co/a3w7TWZVTM

— Alex 🦅 Eagle (@Jakeherringbone) August 17, 2022

How to know this is happening:

  • The Bazel server gets killed by the operating system's out-of-memory (OOM) killer, and the client reports something like Bazel server terminated abruptly (error code: 14, error message: 'Socket Closed', log file: ...)
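
On Linux you can confirm the OOM killer was involved by checking the kernel log (a sketch; the exact message wording varies by kernel version):

```
# Look for OOM-killer activity around the time the server died
$ sudo dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'
```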

Some things you can do about it:

  • If Bazel is running inside a container, it may calculate available RAM from the host system rather than from the container's allocation; see the known issue Default local resources should respect cgroup limits on Linux. The remediation is to tell Bazel's scheduler explicitly what the container limits are by setting the --local_ram_resources flag to match the container's allocation (see the .bazelrc sketch after this list).
  • Reduce --jobs so that fewer things run concurrently. This is a blunt approach and makes all builds take longer, but it saves you the effort of figuring out which actions didn't get a big enough resource reservation.
  • Figure out which actions consume a lot of RAM, and tell Bazel's scheduler to reserve more resources for them. For tests, set the size attribute - a larger size reserves more memory per the table in that documentation (see the BUILD sketch after this list). For build actions, the Bazel team recommends using execution properties, though to be honest that looks really complex and I haven't used it myself.
  • Try out the --experimental_local_memory_estimate flag to make Bazel smarter about how much memory is actually free at the moment it schedules each subprocess.
  • Investigate using Remote Build Execution so that heavy workloads move off the machine and run on a cloud of executors.
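
To make several of these items concrete, here's a sketch. All the numbers are hypothetical; tune them to your container or CI runner:

```
# .bazelrc
# Tell the scheduler what the container actually has, instead of
# letting Bazel read the host machine's totals (RAM value in MB).
build --local_ram_resources=8192
build --local_cpu_resources=4

# Blunt fallback: cap concurrency outright.
build --jobs=4

# Optional: let Bazel sample free memory when scheduling spawns.
build --experimental_local_memory_estimate
```

And the test-size side: a hypothetical test target where "large" reserves more memory than the default "medium":

```
# BUILD
cc_test(
    name = "heavy_integration_test",
    size = "large",  # reserves more RAM per the size table in the docs
    srcs = ["heavy_integration_test.cc"],
)
```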