At Aspect we've helped with a number of build system migrations at various scales, and along the way we've developed our own set of guiding principles.
Incubate net promoters
Software is written by humans, so human psychology matters. Humans are tribal, drawn to cosmetics, and jump to conclusions with their "system one" mind.
The initial reaction to Bazel will form a first impression that’s hard to correct later. People will feel their tribe is offended: "that might work in language X but us Y language developers would never do it that way". They'll mistake something cosmetic for something inherent: "the way the errors get presented in CI doesn't make any sense". They'll infer that one slow step means the whole thing is slower.
Even uninformed opinions can matter a lot. Early influential users will spread these around the engineering culture, causing people who had no strong opinion to become entrenched, making your later job either easier or harder.
For example, when rolling out a new invariant as a CI check, make the "how to fix" instruction obvious. If the person feels that the check caught a real issue that was trivial to get past, they'll feel it was an enjoyable experience. If they think you're throwing a roadblock in their way, they'll resent it.
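As a rough illustration of a check that tells people how to get past it, here is a minimal sketch. The invariant (every directory with `.py` sources has a `BUILD.bazel` file) and the fix command `bazel run //:gazelle` are assumptions made up for this example, not something prescribed above; substitute whatever invariant and fix apply in your repository.

```python
import os
import sys

# Hypothetical fix command; the target label is an assumption for this sketch.
FIX_COMMAND = "bazel run //:gazelle"


def find_missing_build_files(root):
    """Return directories under root that hold .py sources but no BUILD.bazel."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        has_sources = any(name.endswith(".py") for name in filenames)
        if has_sources and "BUILD.bazel" not in filenames:
            missing.append(os.path.relpath(dirpath, root))
    return sorted(missing)


def format_failure(missing):
    """Build a failure message that ends with the exact command to run."""
    lines = ["CHECK FAILED: %d package(s) have sources but no BUILD.bazel:" % len(missing)]
    lines += ["  - " + directory for directory in missing]
    lines += [
        "",
        "How to fix: run this from the repository root, then commit the result:",
        "  " + FIX_COMMAND,
    ]
    return "\n".join(lines)


def main(root):
    """Return a process exit code: 0 when the invariant holds, 1 otherwise."""
    missing = find_missing_build_files(root)
    if missing:
        print(format_failure(missing), file=sys.stderr)
        return 1
    return 0
```

A CI step could call `sys.exit(main("."))`. The point of the sketch is the last lines of the message: the person who hits the failure sees the one command that resolves it.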
Don't disrupt workflows
Keep the Makefile!
If the developer types make test or yarn build or npm run serve today, maybe they still can. Avoid changing the topmost user-facing part of the tooling where possible.
Most product developers aren't interested in build system details and don't care about whatever change you've made. They don't want it reflected in what they need to type. Retraining is expensive and burns goodwill.
Also remember that many workflows take place in an editor. Bazel disrupts the paths on disk where editor extensions look for libraries. Be aware of this problem, and proactively find and advertise workarounds that keep editors happy.
Change one thing at a time
Build system changes can cause subtle regressions, where cause and effect are at a distance, and the developer encountering the problem and the build system engineer who caused it are totally unfamiliar with the domain of the other. This makes it hard to diagnose problems.
Just as in a git bisect workflow, a sane, linear history of events makes it much easier to reason about what happened. Based on the delta of what changed, you can make assertions about the possible blast radius, and whether that can explain the problem.
So, we try to change only one thing at a time. Use pre-factoring changes to the old build system to do things like break up a cycle in the dependency graph (but avoid such code changes if they're not load-bearing for the Bazel migration; see "Leave the code alone" below). Use post-factorings to make related cleanups you noticed during the migration. Whatever it costs, resist the urge to combine these!
One ideal outcome from this principle: you can use bazel build --subcommands to see the flags passed to some tool X, then compare with how that tool was called by the legacy build system, and any differences should be intentional and required by this migration step.
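The comparison described above can be partly mechanized. The following is a minimal sketch, assuming you have already copied the two command lines into strings: one from the legacy build's logs and one from the output of `bazel build --subcommands`. The function name `diff_flags` and the tool name `mytool` are hypothetical.

```python
import shlex


def diff_flags(legacy_cmd, bazel_cmd):
    """Return (only_in_legacy, only_in_bazel) argument lists for two commands.

    Arguments are compared as whole shell words, so a flag and its value
    show up separately. Order is preserved within each list.
    """
    legacy = shlex.split(legacy_cmd)
    bazel = shlex.split(bazel_cmd)
    only_in_legacy = [arg for arg in legacy if arg not in bazel]
    only_in_bazel = [arg for arg in bazel if arg not in legacy]
    return only_in_legacy, only_in_bazel


# Hypothetical command lines for an imaginary tool.
removed, added = diff_flags(
    "mytool --opt -O2 in.c",
    "mytool --opt -O3 in.c",
)
print("only in legacy:", removed)  # only in legacy: ['-O2']
print("only in Bazel: ", added)    # only in Bazel:  ['-O3']
```

Expect some noise in practice, since input and output paths under Bazel usually differ from the legacy ones. Everything left over after accounting for paths should be a difference you can explain.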
The ratchet is a mechanical tool to ensure "no backsliding".
Whenever you tighten the semantics of the build, by ensuring some new invariant holds, you should have a ratchet in place to make sure it stays that way. For example, if you fix some type-check errors, you should make sure the CI system will mark any subsequent changes red if they re-introduce those type-check errors.
Combined with "change one thing at a time" this can give you incredible power to work in a huge codebase. For example, if you enable just one type-check error code at a time, you can fix only those errors, and use your ratchet to make sure they don't come back. You can then rinse and repeat with low-risk changes, while ensuring that the system as a whole is eventually converging on the correct behavior.
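One way to picture the ratchet is as a checked-in baseline of allowed error counts that can only go down. The sketch below is an illustration under assumptions, not a description of any particular tool: the error codes, the dictionary layout, and the function names are all hypothetical. Codes absent from the baseline are treated as not yet ratcheted.

```python
def find_regressions(baseline, current):
    """Return (code, allowed, actual) for each ratcheted code that got worse.

    baseline: maps each ratcheted error code to its maximum allowed count.
    current:  maps error codes to the counts seen in this build.
    """
    return [
        (code, allowed, current.get(code, 0))
        for code, allowed in sorted(baseline.items())
        if current.get(code, 0) > allowed
    ]


def tighten(baseline, current):
    """Return a new baseline lowered wherever the current build improved."""
    return {
        code: min(allowed, current.get(code, 0))
        for code, allowed in baseline.items()
    }


# Hypothetical counts: E100 was fully fixed earlier, E200 is in progress,
# and E300 has not been ratcheted yet, so it is ignored.
baseline = {"E100": 0, "E200": 5}
current = {"E100": 1, "E200": 3, "E300": 40}

print(find_regressions(baseline, current))  # [('E100', 0, 1)]
print(tighten(baseline, current))           # {'E100': 0, 'E200': 3}
```

CI would go red whenever `find_regressions` returns anything, and the tightened baseline would be committed after each round of fixes so the new, lower counts become the limit.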
A migration can't leave a developer experience in a bad state before improving it.
Alex explains this in a BazelCon talk (youtu.be/UwuRGpVpmbo?t=398): we want to maximize the benefits of Bazel while deferring costs and known risks. We have to keep making improvements.
Don't say "we'll just leave a TODO here to come back and fix the performance regression". That TODO will be there longer than you think, maybe forever. In the meantime dissatisfaction might cause escalation to decision makers who de-fund the migration work.
This is related to the YouTube results for "changing a tire while driving". Even though a big migration is underway, it's critical to the business that it be non-disruptive.
Close the loop
Before calling a task "done", think about "acceptance criteria". What does "done" mean, and did you fix the root cause, or only one proximate cause?
As an example, answering a technical question for one user helps that user today. They may ask the same question again later, along with a hundred of their colleagues. Answering their question doesn't close the loop. Instead, figure out what documentation they would naturally have consulted, or what error message they were presented with. Go fix those things, then just send them a pointer to that fix.
For Bazel this often means adding validation steps, constraining legal values for attributes, and fixing error messages, in addition to getting into a habit of always improving documentation.
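To make the idea of constraining legal values and improving the error message concrete, here is a small sketch in plain Python. The attribute name `environment`, its allowed values, and the documentation path are hypothetical. In an actual Starlark rule, the closest equivalents are the `values` parameter of `attr.string` and a call to `fail()` with a similar message.

```python
# Hypothetical attribute and docs location, used only for illustration.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")
DOCS_POINTER = "docs/build/deploy.md"


def validate_environment(target, value):
    """Return value if it is legal, else raise with a message that closes the loop.

    The message names the target, echoes the bad value, lists the legal
    ones, and points at the documentation the user would look for next.
    """
    if value not in ALLOWED_ENVIRONMENTS:
        raise ValueError(
            "%s: environment = %r is not valid. Expected one of: %s. See %s."
            % (target, value, ", ".join(ALLOWED_ENVIRONMENTS), DOCS_POINTER)
        )
    return value
```

Calling `validate_environment("//app:deploy", "production")` raises an error whose text already answers the question a user would otherwise bring to you.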
Leave the code alone
Sometimes it’s Bazel that should change, not the code: things like writing to the source folder, the choice of working directory for a test, or expecting a certain filesystem layout.
We want to avoid changes that break the legacy build, and don’t want developers to have an impression that Bazel requires code changes that are really just different idioms.
If the code does have to change, maybe it can be in a superficial way. For example, you could add comments, such as Gazelle directives, that inform the tooling without making any load-bearing changes that could break things.
Dependencies are an important case as well. We shouldn't change versions of any third-party library just because Bazel is managing them.
Leave few fingerprints
As a consultant, you want to make sure clients own their own code during and after a migration. You also want to touch only the build system, while leaving the owners of the code to make modifications.
You can run a tool which changes the code in known-safe ways (like running a formatter), then attribute changes to that tool rather than yourself.
If you're editing a bunch of code, you probably didn't follow the "leave the code alone" principle.