Estimating the effort to build a Bazel CI/CD

At Aspect, we've consulted for many companies, helping several of them to run Bazel on their CI/CD infrastructure. Since we've been through this migration several times, we can report on the typical obstacles we've seen and the engineering effort that was required to overcome them.

The goal of this post is to help engineering managers who are tasked with a project like this to understand the complexity involved and estimate the effort that will be required to implement and operate Bazel on CI. Our conclusion is that a medium-sized engineering org will need 12-24 senior-engineer-months of work, with 0.75-1.0 FTE required for maintenance and operations.

Avoid accidental discards of Bazel's analysis cache

In a large repo, Bazel spends a lot of time in the "Analysis Phase", where all the rule implementation functions are run. The result is cached in memory in the Bazel server, but it can easily be discarded just by running a bazel command with flags that differ from the previous command.

The failure mode here is subtle - performance degrades, with only a message like "options have changed, discarding analysis cache" to explain it. Often the problem is introduced simply because someone changed a CI script to add another command, say a bazel query somewhere to support a user request. Now the following Bazel command becomes slow, and it's hard to detect that this has happened.

Aspect is working on an upstream PR in Bazel to just fail the build when the analysis cache is discarded.

To prevent this, you have to add a layer in your CI design that wraps Bazel calls and ensures the same flags are always passed. Note that having a .bazelrc isn't always enough, because you might configure a flag so that it applies to only a subset of the commands it should. There are also Bazel bugs that cause analysis discards, such as bazel coverage always doing so.
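One way to enforce this is a thin wrapper script that every CI step calls instead of invoking bazel directly, so each command sees the same options. A minimal sketch, where the "ci" config name and the script path are our own conventions, not anything Bazel prescribes:

```bash
#!/usr/bin/env bash
# ci/bazel.sh - the single entry point for Bazel on CI.
# Centralizing the flags here means a newly added pipeline step can't
# accidentally pass different options and discard the analysis cache.
set -euo pipefail

command="$1"; shift

# "ci" is a hypothetical config in .bazelrc. If your Bazel version supports
# the `common` bazelrc directive, define the flags there so query, coverage,
# etc. see exactly the same options as build and test.
exec bazel "${command}" --config=ci "$@"
```

CI steps then call `ci/bazel.sh test //...` (or `ci/bazel.sh query ...`) rather than invoking bazel directly.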

Persistent runners

As Kubernetes became popular, the industry as a whole moved to ephemeral CI instances running in their own pods. For most build systems, that came with the added benefit of isolating build and test from other PRs or builds, overcoming incorrectness issues with the build system.

However, Bazel is the opposite: the correctness guarantee is built into the tool (to the extent that the build definition is hermetic). Worse, since Bazel typically manages a hermetic toolchain for each language, an ephemeral runner has to do all the up-front work of downloading and setting up those toolchains on every run.

Your first reaction might be to use the CI system's caching mechanism. For example, an article titled Caching dependencies to speed up workflows seems like just what you want! And used correctly, it does improve the situation with Bazel somewhat: the --repository_cache flag avoids re-downloading toolchains and library dependencies.
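As a sketch of what that looks like (the cache directory is a placeholder for whatever path your CI caching step saves and restores between runs):

```bash
# CI_CACHE_DIR is a hypothetical directory persisted by the CI cache step.
# --repository_cache keeps downloaded toolchains and external dependencies;
# --disk_cache additionally keeps action outputs on local disk.
bazel test //... \
  --repository_cache="${CI_CACHE_DIR}/repo-cache" \
  --disk_cache="${CI_CACHE_DIR}/disk-cache"
```

These flags can also live in a checked-in .bazelrc so every invocation picks them up consistently.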

However, on each run Bazel still has to execute the repository rules (the "repository cache" is really a downloader cache of the inputs to repository rules, not a cache of the resulting external repositories; the Bazel team has indicated they're investigating the latter, so maybe someday this will be easier).

It also has to re-do a lot of work that is normally kept in memory in the running Bazel server (a JVM process which is meant to be long-lived) - for example, evaluating all the Starlark code that turns rule definitions into the action graph. There is no way to export this state from a Bazel server and restore it into a new ephemeral instance.

You'll also want your CI runner pool to be responsive to demand: scale to near zero during off-peak hours to stay within your operations budget, but scale up during peak load so developers aren't waiting in a queue.

Fortunately, this isn't a Bazel-specific problem. Many organizations need CI runners inside their private network and want to keep them running, so there are existing resources to get you past this step.

Warm persistent runners

Going beyond the section above: even with persistent runners, there are still more obstacles.

  1. Builds during the ramp-up period of the day are slow, since they run on a fresh machine that's just as bad as an ephemeral one. The same developers are affected by this every day: whichever ones typically start earlier than their coworkers.
  2. The Bazel server and output tree (bazel-out) are sensitive to whatever workload was performed last. Pull Requests can have widely varying base commits, and as soon as a worker has to "sync backwards" through the commit history, it's likely to run over a change that invalidates Bazel caches and outputs. Then, the next request is closer to HEAD and has to re-do all this work again.

You'll need to allocate a lot of time for reasoning about how to improve the 95th percentile builds that fall into these de-optimizations.
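One partial mitigation is to pre-warm each persistent runner on a schedule, before the earliest developers arrive, so the Bazel server and bazel-out start the day near HEAD. A sketch, where the checkout path and config name are placeholders:

```bash
#!/usr/bin/env bash
# warm-runner.sh - run from an early-morning scheduled job on each persistent
# runner, so the first builds of the day don't pay the full cold-start cost.
set -euo pipefail

cd /home/ci/monorepo            # hypothetical checkout location on the runner
git fetch origin main
git checkout --detach origin/main

# Building main warms the Bazel server's in-memory analysis state and the
# output tree for the commits most pull requests will be based on.
bazel build //... --config=ci
```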

Runner health checking

When we stop using ephemeral build runners, we invite the possibility of resource leaks. For example, a test might start up Docker containers and then fail to shut them down. After such a runner performs enough builds, it may run out of file handles, disk space, memory, or other finite resources.

You'll need some way to defend against poorly behaved workloads. This could include monitoring how many builds a runner has done, or how many resources remain available, and proactively taking that runner out of the pool. Or you might add cleanup logic that finds leaked resources and closes them after builds complete.
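A sketch of such a check, run between builds; the container label, the disk threshold, and the "drain" flag file are all placeholders for whatever your runner agent understands:

```bash
#!/usr/bin/env bash
# health-check.sh - run between builds on a persistent runner.
set -euo pipefail

# Remove containers that tests may have leaked (assumes tests label their
# containers with a conventional "ci-test" label).
docker ps --quiet --filter "label=ci-test" | xargs --no-run-if-empty docker rm --force

# Take the runner out of the pool if disk is nearly full.
used_pct=$(df --output=pcent /home/ci | tail -1 | tr -dc '0-9')
if [ "${used_pct}" -ge 90 ]; then
  echo "Disk is ${used_pct}% full; draining this runner" >&2
  touch /home/ci/DRAIN          # hypothetical flag file the runner agent checks
fi
```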

Choose and deploy a remote cache

There are quite a few options, and it's challenging to understand the trade-offs between them. Typical problems with a remote cache include:

  • Network saturation on a single machine. A typical AWS instance tops out around 25 Gbps of network bandwidth, which isn't enough in a medium-sized company to serve hundreds of CI worker machines. You may need a cache that can scale horizontally across additional shards.
  • You may need replication for better uptime. Otherwise, updates to the remote cache software cause significant delays for CI users and therefore require nighttime/weekend maintenance windows.

You might also get bogged down in the remote-cache decision by trying to select a remote execution service at the same time, even if you're not likely to need one anytime soon. As long as your build inputs are deterministic, you should expect a very high cache hit rate, so most builds are fast even when every action runs locally on the CI worker machine. The right thing to tackle first is to dig into any non-determinism in your inputs and fix it.
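One way to hunt for that non-determinism (a sketch; the execution-log flags vary somewhat between Bazel versions) is to run the same build twice from a clean state, record the execution log each time, and compare the two:

```bash
# Build the same targets twice from a cold start and record execution logs.
bazel clean --expunge
bazel build //... --execution_log_json_file=/tmp/exec1.json

bazel clean --expunge
bazel build //... --execution_log_json_file=/tmp/exec2.json

# Actions whose input or output digests differ between identical builds are
# candidates for non-determinism (timestamps, absolute paths, random seeds).
diff /tmp/exec1.json /tmp/exec2.json | head -n 50
```

A raw diff can be noisy because action ordering varies between runs; the Bazel project provides an execution-log parser tool that can sort the logs first, which makes the comparison much more readable.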

Mirror files from the internet

Depending on external services degrades your uptime. For example, sometimes GitHub or other CDN providers have outages or partial outages, and your pipeline fails because it's unable to fetch the files it needs.

You'll want to set up Bazel's downloader, using --experimental_downloader_config to point at a read-through mirror you maintain.
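The config is a small file of rewrite rules. A sketch, where the mirror hostname is ours to invent and the exact directive syntax should be checked against your Bazel version's documentation:

```bash
# Rewrite public download URLs to an internal read-through mirror
# (artifacts.internal.example.com is a hypothetical host).
cat > tools/downloader.cfg <<'EOF'
rewrite github.com/(.*) artifacts.internal.example.com/github/$1
rewrite registry.npmjs.org/(.*) artifacts.internal.example.com/npm/$1
EOF

# Pass it on every invocation (or add it to .bazelrc):
bazel build //... --experimental_downloader_config=tools/downloader.cfg
```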

It's a good practice to lock down your users' ability to introduce new dependencies on the internet, not only during Bazel's downloading phase, but also in build actions and tests. You should consider network-firewalling the agents that run build actions to prevent this.

Define SLAs and monitor them

Your stakeholders want assurances that you're making CI faster and keeping it fast.

You need critical-path developer metrics, such as:

  • how long developers perceive they wait in the CI queue for a runner to become available
  • how long it takes from being scheduled onto a runner until the first Bazel action spawns
  • how long it takes to be notified of a failing unit test

You also want to support the health of the pipeline by monitoring the greenness ratio of main and watching for flakiness that exceeds a tolerable threshold.

You also probably want some underlying metrics for diagnosing slowness, such as the rate of invalidations for external repository and analysis caches.

As with any monitoring project, you have to start from data collection, then make a robust pipeline to store, aggregate, and visualize, and then create alerts for your on-call devops engineer.

This might involve working with your production on-call and monitoring owners if you want to share common instances of monitoring software like Prometheus and Grafana. Alternatively, they might ask your DevOps team to deploy its own instances of these services, so plan ahead for both options.
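Even a crude measurement pushed from your CI wrapper can seed those dashboards. A sketch, assuming a Prometheus Pushgateway reachable at a hypothetical internal address:

```bash
#!/usr/bin/env bash
# Record how long the CI Bazel invocation took and push it to a Pushgateway.
set -euo pipefail

start=$(date +%s)
status=0
bazel test //... --config=ci || status=$?
end=$(date +%s)

# Metric and label names here are our own conventions, not anything standard.
cat <<EOF | curl --silent --data-binary @- \
  "http://pushgateway.internal.example.com:9091/metrics/job/ci_bazel/pipeline/${CI_PIPELINE:-unknown}"
ci_bazel_duration_seconds $((end - start))
ci_bazel_exit_status ${status}
EOF

exit "${status}"
```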

Keep the build green

You'll have to decide what "green" even means. See our earlier article, Monorepo Shared Green.

Then you have to have policies and mechanisms in place to repair a red master branch. This needs to be quick - product teams will howl if they're blocked from shipping, a red master can accumulate compounding breakages, and developers may already be writing code that depends on the culprit.

You'll need to identify breakages, alert a build cop, and communicate the status of the red build. Engineers who later sync into the red region may ask "why is this broken?", not realizing they picked up a bad base commit, so you may want a stable git ref that engineers clone from. Once you've found the culprit, you need to revert quickly, ideally without waiting on the usual CI and code review processes. You must ensure that the breakage is escalated until it's resolved, as the build cop isn't always reliable. A policy should be in place to defend the build cop from angry engineers who think they should have been given time to "fix forward" their broken commit.
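For the stable git ref, a lightweight approach is a post-merge job that advances a well-known ref only when the full build on main is green; engineers then branch from that ref instead of an arbitrary (possibly red) commit. A sketch, with the ref name being just a convention:

```bash
#!/usr/bin/env bash
# Run only after the post-merge pipeline on main has passed.
set -euo pipefail

# Advance the "last known green" ref to the commit that just passed.
git push origin "+HEAD:refs/heads/last-green"
```

During a red period, developers can base new work on last-green and avoid picking up the culprit commit.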

In some cases, the breakage needs a post-mortem. Why wasn't it caught in pre-commit testing? What follow-up actions are needed to reduce the buildcop burden and keep main green more of the time?

Reduce cost

Everyone's budgets are being closely watched in 2023, and your Bazel rollout should come with decreased costs. This means it's not good enough to patch over cache misses by throwing more machines at the problem, for example with Bazel remote build execution (RBE). Instead, you have to monitor determinism so that the cache hit rate stays high.

Constant tuning is required to keep the scale-up/scale-down curve closely matched with demand. Are agents running idle for too long? Could work be better "bin-packed" onto available machines?

Remote Execution

The previous section says to avoid RBE when it's just a workaround for a poor cache hit rate. Even for changes that really do invalidate much of the cache (a Node.js or Go SDK version change, for example), the engineer working on that change is willing to tolerate a slower CI round trip, since it's expected to be a heavy migration project.

However, in larger orgs (maybe over 250 engineers in a highly connected monorepo), you may have regular product engineering work that invalidates an expensive part of the graph, AND that part of the graph is "wide" - lots of work could run in parallel. In this scenario you want to add more compute power to each Bazel worker, and this is what Remote Build Execution is for.

There are SaaS offerings, or you might already be using a remote cache system that includes RBE, like Buildbarn.
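Whichever backend you choose, enabling it is mostly a matter of flags; a sketch, where the endpoint, instance name, and job count are placeholders for your provider's values:

```bash
# Point Bazel at a remote execution backend (hypothetical endpoint).
bazel build //... \
  --remote_executor=grpcs://remote.internal.example.com:443 \
  --remote_instance_name=main \
  --remote_download_minimal \
  --jobs=200
```

--jobs should be raised well beyond the local core count so Bazel actually exploits the backend's parallelism, and --remote_download_minimal avoids pulling every intermediate output back to the CI runner.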

Keep delivery and deploys working

Your CI pipeline is also responsible for creating release artifacts. Engineers should not ship releases from their local machines, containing whatever state their git branch is in, as this is not reproducible.

You have to decide what artifacts to deliver for a given build. See our article on Selective Delivery for one approach you can take.

The continuous delivery pipeline has to stamp builds with version control info, so that monitoring systems can report whether a server crash-loop is correlated with a new version of the app. However, stamped artifacts are non-deterministic and ruin the cache hit rate.
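Bazel's workspace status mechanism is the usual way to do this while containing the damage, since only actions that opt into stamping re-run when the stamp changes. A sketch, where the script path and key names are our own conventions beyond the keys Bazel defines:

```bash
#!/usr/bin/env bash
# tools/workspace_status.sh (hypothetical path): prints key/value pairs.
# Keys prefixed with STABLE_ re-trigger stamped actions when they change;
# other keys are treated as volatile and don't invalidate the cache.
echo "STABLE_GIT_COMMIT $(git rev-parse HEAD)"
echo "STABLE_GIT_BRANCH $(git rev-parse --abbrev-ref HEAD)"
```

The release pipeline then builds with bazel build --stamp --workspace_status_command=tools/workspace_status.sh (against whatever release targets you define), while the ordinary test pipeline leaves --stamp off so its cache hit rate is unaffected.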

Also, you want to lock down the credentials available to the test-running machines, while still allowing CD to push to your artifact store.

For these security and performance reasons, you need a second pipeline to build these artifacts.

This sounds hard. How long will it take?

Yes, it really is hard. If you'd like another opinion, take a look at Son's excellent post: sluongng.hashnode.dev/bazel-in-ci-part-2-wo.. which covers these topics in more technical detail.

From what we've seen, a "medium" sized company should expect to spend 6-12 months of engineering effort for 2 full time senior build / dev-infra engineers (with Bazel & CI/CD experience), and then ongoing maintenance and tuning equivalent to 0.75-1 full time build / dev-infra engineer.

Can I just use Remote Build Execution with a simple CI runner?

Answering this requires understanding how Bazel works.

The Bazel host machine (the CI runner) first has a lot of work to do: executing repository rules (the repository cache only saves their downloaded inputs, not their results) and running analysis to build the action graph. All of that has to happen before any individual actions can be sent to remote executors to be run. On a naive setup like ephemeral workers, that can add minutes to each CI job, and the overhead grows with the size of the repository.

So, adding RBE to a naive setup will probably increase overall costs and not give the best-case performance of Bazel's incremental model.

What if you can't budget that much time for this project?

No surprise: as the authors of this post, we've already built all of these lessons into our product, Aspect Workflows. Everything in this post can be handled for you.

Learn more, and sign up for a free trial at https://aspect.build/workflows.