Monorepo Shared Green

The journey to monorepo shouldn't stop when all the code is in a single version control repository. After all, much of the touted benefit of monorepos is the increased code sharing and consistency between projects. If each team continues to use the term "repo" for their top-level folder in the monorepo, and works in isolation from the other folders, then there was no benefit in moving them in! Instead, we want to continue the "mono" effort, bringing the concept to more of the developer workflow.

Aspect writes a lot about Bazel, which is the "monobuild" for your monorepo, allowing for code sharing between projects. In this post, I want to cover "mono-CI/CD". That is, how many continuous integration pipelines should a monorepo have, and how many continuous delivery mechanisms? I'll advocate for the "shared green" model.

First, one concept is needed:

Buildcopping

Each red/green status in the repo needs to be kept green. Discovering a breakage (green->red) and repairing it quickly is the job of a "Build Cop".

Some teams are not so great at being build cops. Usually the responsibility is unclear ("hey, anyone looking at why master is red?") or is assigned to someone who has more pressing product development work and who isn't really proficient at reading logs and reasoning about what broke. They often don't have authority to revert bad commits, and instead ask the commit author to resolve it. And too often, authors are attached to what they landed and spend precious time repairing the problem by rolling forward (adding new commits) rather than reverting.

In a large organization, it's more economical to have a small rotation of people who are well-trained and have an obvious runbook for keeping the pipeline green, than for each team to do this themselves.

It's very risky to deploy a service when tests are failing. In some cases, it's even a violation of regulatory compliance. So, when it's time to release, any brokenness on CI is finally a critical issue to product owners. Thus, a red repo halts deployments and can cause real cost to the business.

The Hard Way: each project has their own pipeline and status

To gate commits, we need to decide which pipelines to run a change against. We could use the dependency graph to determine which targets are potentially affected by a change, but the tools available for this are incorrect, slow, or both:

tinder/bazel-diff is very incorrect
target-determinator is very slow
the Skyframe system Google uses for this is occasionally incorrect, as Ulf recently reminded me
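To make the idea concrete, here is a minimal sketch of "affected target" selection as a walk over the reverse dependency graph. It is not how bazel-diff, target-determinator, or Skyframe work internally; the target labels and the `DEPS` map are hypothetical, and the sketch assumes an accurate dependency map is already in hand, which is exactly the part that tends to be incorrect or slow to compute.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them.

    deps maps each target to the list of targets it depends on directly.
    """
    # Invert the graph so we can ask "who depends on this target?"
    rdeps = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            rdeps[dep].add(target)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical dependency graph, for illustration only.
DEPS = {
    "//lib/auth": [],
    "//lib/db": [],
    "//svc/api": ["//lib/auth", "//lib/db"],
    "//svc/web": ["//svc/api"],
    "//svc/batch": ["//lib/db"],
}

print(sorted(affected_targets(DEPS, ["//lib/auth"])))
# prints ['//lib/auth', '//svc/api', '//svc/web']
```

The traversal itself is cheap. The difficulty is that any edge missing from the map silently produces a pipeline that should have run and didn't.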

It's even harder to do CD. Can we release our service? Which tests should be green? Do you release if a library you depend on has a failing test? How do you communicate to engineers why the CD system didn't produce an output for their commit? There are no satisfying solutions on the market today.

The Easy Way: shared green

Shared green simply says, there is one build&test pipeline for the monorepo. This "monostatus" applies across all libraries, applications, and services in the repository. If anything is red, nothing can release.

It's easy for engineers to reason about and requires no code - all CI systems have a way to gate the delivery step on the success of the testing step.
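In practice the gate is CI configuration rather than code, but the rule is small enough to state as a sketch. The status strings and target names below are assumptions made for illustration, not the vocabulary of any particular CI system.

```python
def red_targets(target_statuses):
    """List every target whose latest status is anything other than green."""
    return sorted(
        name for name, status in target_statuses.items() if status != "green"
    )


def can_release(target_statuses):
    """Shared green: nothing releases unless everything is green."""
    return not red_targets(target_statuses)


# Hypothetical statuses from a single monorepo pipeline run.
statuses = {
    "//lib/auth:test": "green",
    "//svc/api:test": "red",
    "//svc/web:test": "green",
}

print(can_release(statuses))   # prints False
print(red_targets(statuses))   # prints ['//svc/api:test']
```

Note that `//svc/web` is blocked even though its own tests pass. That is the trade the model makes for a rule that needs no explanation.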

Making shared green scale

The fundamental requirement of a shared green is that it has to be almost always green. Red regions block teams from releasing, and the more they are blocked, the more severe the pushback from those teams. Worse, if your time-to-repair is longer than the interval between breakages landing, they will compound and be much harder to reason about and resolve. Also, another team may have landed a change that depends on the commit that needs to be reverted, so that the oncall now has to revert more than one commit.
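The compounding effect can be illustrated with a small sketch that merges overlapping red periods. It assumes, for simplicity, that each breakage is repaired a fixed time after it lands, regardless of what else is broken; the real situation is likely worse, since overlapping breakages are harder to diagnose. The timestamps are made-up minutes.

```python
def red_intervals(breakages):
    """Merge overlapping red periods into continuous red regions.

    breakages is a list of (landed_at, time_to_repair) pairs in minutes.
    """
    merged = []
    for landed_at, time_to_repair in sorted(breakages):
        repaired_at = landed_at + time_to_repair
        if merged and landed_at <= merged[-1][1]:
            # Landed before the previous breakage was repaired:
            # the red region is extended instead of ending.
            merged[-1][1] = max(merged[-1][1], repaired_at)
        else:
            merged.append([landed_at, repaired_at])
    return [tuple(region) for region in merged]


# A breakage lands every 60 minutes.
fast_repair = [(0, 30), (60, 30), (120, 30)]   # repaired in 30 minutes
slow_repair = [(0, 90), (60, 90), (120, 90)]   # repaired in 90 minutes

print(red_intervals(fast_repair))  # prints [(0, 30), (60, 90), (120, 150)]
print(red_intervals(slow_repair))  # prints [(0, 210)]
```

With a 30-minute repair time the repo returns to green between breakages. With a 90-minute repair time the same three breakages become one uninterrupted 210-minute red region.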

How can we stay green more of the time to avoid these shared-green failure modes?

Reduce the time between a bad commit landing and the breakage being reported. Perhaps introduce a "Failing" status in CI, meaning the build and test run is still in progress but is already known to end up red.
Reduce the time it takes the oncall to respond. Make sure the paging system works well, and escalate to secondary. Avoid "false positive" pages where oncall is paged for flakes, as this makes them less likely to respond to a real breakage.
Reduce the time it takes the oncall to repair the build. Point directly to the commit and give instructions for how to revert that commit, or build a button into your UI that reverts immediately.
Reduce the time it takes for the fix commit to be reported as green. This is simply a matter of keeping the master pipeline fast.
Give the oncall authority. No one may question a revert that's performed to keep the CI green.
Post-mortem each breakage. In-flight semantic collisions occur when two PRs are green individually but red when combined - if this becomes frequent, you may need a Merge Queue to re-test green PRs when landing (especially those which are further behind HEAD).
Allow a "break the glass" in CD for teams who want to release despite red. Audit when this happens and work to reduce the frequency this is needed.
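As one illustration of "point directly to the commit", the sketch below narrows a commit history down to the suspects and produces the revert commands for the oncall. The commit IDs and the status strings ("green", "red", "untested") are hypothetical. `git revert --no-edit` is a real git invocation, but whether reverts can be pushed straight to master depends on your branch protection setup.

```python
def suspect_commits(history):
    """Return the commits that could have turned the build red.

    history is a list of (sha, status) pairs, oldest first. The suspects are
    the commits after the most recent green, up to and including the first red.
    """
    green_indexes = [i for i, (_, status) in enumerate(history) if status == "green"]
    start = green_indexes[-1] + 1 if green_indexes else 0

    suspects = []
    for sha, status in history[start:]:
        suspects.append(sha)
        if status == "red":
            return suspects
    return []  # nothing has gone red since the last green


def revert_commands(suspects):
    """Revert newest first, so each revert applies cleanly on top of the last."""
    return [f"git revert --no-edit {sha}" for sha in reversed(suspects)]


# Hypothetical history: CI skipped b2, so it is a suspect along with c3.
HISTORY = [("a1", "green"), ("b2", "untested"), ("c3", "red"), ("d4", "red")]

for command in revert_commands(suspect_commits(HISTORY)):
    print(command)
# prints:
# git revert --no-edit c3
# git revert --no-edit b2
```

If CI tests every commit individually, the suspect list is always a single commit, which is what makes a one-click revert button practical.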

Can it really work at my scale?

This is a tough question. Here are data-points that I know:

Google circa 2015 had 2 billion SLOC and 50k engineers. There was no snapshot of the google3 monorepo where all the Starlark code could even successfully parse, let alone be analyzed, built and tested. No chance for shared green.
A large Aspect customer has 2 million SLOC and 500 engineers. They are still on shared green, but without the "break the glass" for CD. On-call is sometimes hard, and there are stretches of redness on master that prevent deployment. More investment in on-call responsiveness, as suggested above, could provide some relief, improving the experience and leaving room for more growth.
All other Aspect customers have a shared green.

This suggests that if your company is 2-3 orders of magnitude smaller than 2015 Google (100-1000 times smaller in both SLOC and number of engineers) then you may be able to keep a shared green.
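The gap between the two data points above can be checked with quick arithmetic. The figures are the ones quoted in this post; nothing else is assumed.

```python
import math

google_2015 = {"sloc": 2_000_000_000, "engineers": 50_000}
aspect_customer = {"sloc": 2_000_000, "engineers": 500}


def orders_of_magnitude(larger, smaller):
    """How many powers of ten separate the two values."""
    return math.log10(larger / smaller)


sloc_gap = orders_of_magnitude(google_2015["sloc"], aspect_customer["sloc"])
engineer_gap = orders_of_magnitude(
    google_2015["engineers"], aspect_customer["engineers"]
)

print(round(sloc_gap, 1))      # prints 3.0 (1000 times fewer lines of code)
print(round(engineer_gap, 1))  # prints 2.0 (100 times fewer engineers)
```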

Merge Queues

If you encounter in-flight collisions a lot, then you may need to run tests a third time. In addition to testing the developer's snapshot and the result of the merge, you can also run the tests right before merging. This uses more resources of course, and also slows down the developer, because you need to test linearly, one commit (or batch of commits) at a time.

A "Merge Queue" is a separate developer workflow system that manages this. GitHub offers one, documented under "Managing a merge queue". There's also a great research paper from Uber describing their fancy Merge Queue design: "Keeping Master Green at Scale".

However, if in-flight collisions are rare, then we recommend you just allow the build cop to revert a colliding commit. The build cop has to monitor for other failures on master anyway, and the cost of that person reverting the occasional commit a few times a month is typically less than the cost of running a Merge Queue.