Deterministic npm dependencies with Bazel

Determinism is the property of a build system that identical inputs always produce identical outputs. A build tool might embed a timestamp in its output, use an unstable sort order, or include local filesystem paths. Each of these makes the output non-deterministic and will ruin build performance in a caching build system like Bazel. This article discusses a typical source of de-optimized frontend builds that we've seen at clients who use npm packages. In later articles we'll see how similar problems exist in many other language ecosystems, such as Python.

How non-determinism impacts your build

Under the hood, Bazel hashes the contents of each installed npm package that is an input to your build. These hashes are used to determine if a target needs to rebuild and to check for cache hits before rebuilding.

If separate npm or yarn installs produce diverging input hashes for your npm dependencies, you won't share remote cache hits between machines, and you will rebuild targets on every machine each time your npm repository rule(s) run. Since npm dependencies are inputs to most, if not all, of your NodeJS targets, this problem can affect a large part of your build graph, leading to unexpected rebuilds, long CI times and frustrated developers.

Developers who are affected by this may perceive Bazel as slow. On CI, workers that can't share cache hits have to duplicate work to build the same targets, which is both slow and expensive at scale.

Fixing this problem will reduce the frequency of long local builds, improve CI times and reduce your CI compute costs. At scale, the productivity gains and cost savings can be significant. Even for small to medium sized projects, frequent rebuilds due to non-determinism in your npm dependencies can be frustrating and break your focus.

Non-determinism in npm installs

While npm packages downloaded from the registry are deterministic, the postinstall steps of many npm packages can generate non-deterministic files. These files may contain absolute paths, timestamps or other non-deterministic bytes that are baked in when they are generated during postinstall. "Native" packages which use a system like Make to compile non-JS sources at install time are a typical source of trouble, for example:

Diffing @npm repositories' node_modules
Files f1/node_modules/ssh2/build/Makefile and f2/node_modules/ssh2/build/Makefile differ
Files f1/node_modules/ssh2/build/config.gypi and f2/node_modules/ssh2/build/config.gypi differ

Vernacular NodeJS build systems typically don't take the file contents of npm dependencies into account when caching, so non-determinism in these files does not affect their incrementality.

When building with Bazel, which provides much stronger guarantees for correctness and reproducibility, these generated files are counted as inputs, and non-determinism in them can lead to mass cache misses which significantly impact the incrementality of your build.
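To make the effect concrete, here is a minimal sketch of how a baked-in absolute path changes a content digest. The `fake_postinstall` function is a hypothetical stand-in for a real postinstall step, and `cksum` is used only for portability in place of the SHA-256 digest that Bazel uses by default.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a postinstall step: it writes a file that
# embeds the absolute path of the directory it runs in.
fake_postinstall() {
  local dir="$1"
  mkdir -p "$dir/build"
  (cd "$dir" && printf 'srcdir := %s\n' "$PWD" > build/Makefile)
}

# Checksum of one file's contents. cksum stands in for a real content
# digest here; only the "same bytes, same value" property matters.
content_digest() {
  cksum < "$1" | cut -d' ' -f1
}

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT

# Same "package", same install step, two different install locations.
fake_postinstall "$work/install_1"
fake_postinstall "$work/install_2"

digest_1="$(content_digest "$work/install_1/build/Makefile")"
digest_2="$(content_digest "$work/install_2/build/Makefile")"

if [ "$digest_1" = "$digest_2" ]; then
  echo "digests match"
else
  echo "digests differ"
fi
```

Running this prints `digests differ`: the two files were produced by the same step from the same inputs, yet any cache keyed on their contents treats them as different inputs.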

Checking for non-determinism

If you're seeing full rebuilds of large parts of your build graph every time your npm repository rule(s) run and lower than expected remote cache hit rates, non-determinism in your npm dependencies could be the cause.

A simple way to determine if your build is suffering from non-deterministic npm dependencies is to diff the results of two separate runs of your npm repository rules at different output bases. Running on different output bases is important so that the absolute path of the npm or yarn install differs between the two runs.

If your build runs on multiple platforms then you should also run this check on every platform you use, since postinstall steps will generate different files on different platforms. This may be the case, for example, if your developers build on macOS but your CI runs Linux workers.

In some cases, non-determinism may creep in due to non-hermetic system inputs such as the version of gcc or Xcode installed on the system. To check for this you'd have to diff separate runs of your npm repository rules from machines with different configurations.

Example

To illustrate this process, I created an example repository with a check_npm_determinism.sh script that performs a diff of two different npm installs to different output bases.

The script runs bazel --output_base=<path> fetch @npm//... twice on different output bases and then compares the resulting node_modules folders with diff,

diff -qr "$node_modules_1" "$node_modules_2"

If there is non-determinism in the npm install the output may look something like this,

$ ./check_npm_determinism.sh 
Fetching first @npm
Starting local Bazel server and connecting to it...
INFO: All external dependencies fetched successfully.
Loading: 7 packages loaded
Fetching second @npm
Starting local Bazel server and connecting to it...
INFO: All external dependencies fetched successfully.
Loading: 7 packages loaded
Diffing @npm repositories' node_modules
Files tmp/fetch_1/external/npm/_/node_modules/ssh2/lib/protocol/crypto/build/Makefile and tmp/fetch_2/external/npm/_/node_modules/ssh2/lib/protocol/crypto/build/Makefile differ
Files tmp/fetch_1/external/npm/_/node_modules/ssh2/lib/protocol/crypto/build/config.gypi and tmp/fetch_2/external/npm/_/node_modules/ssh2/lib/protocol/crypto/build/config.gypi differ

In this case, the postinstall step of ssh2 generates a build folder with a Makefile and a config.gypi file that contain absolute paths.

NB: If you want to see the contents of the differing files rather than just their names, drop the -q flag from the diff call.
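The comparison step of such a check can be sketched as follows. This is not the script from the example repository, only an illustration under the assumption that both fetches have already been run with bazel --output_base=<path> fetch @npm//... and that you pass the two resulting node_modules folders as arguments. The function name `compare_node_modules` is made up for this sketch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Compare two node_modules trees recursively.
# Returns 0 when no differences are found and 1 when diff reports
# differences (or fails), so the result can gate a CI job.
compare_node_modules() {
  local node_modules_1="$1"
  local node_modules_2="$2"
  echo "Diffing @npm repositories' node_modules"
  # -q lists only the names of differing files; -r recurses into folders.
  if diff -qr "$node_modules_1" "$node_modules_2"; then
    echo "No differences found"
    return 0
  fi
  return 1
}

# Run the comparison only when two folders are given on the command line,
# so the function can also be reused from another script.
if [ "$#" -eq 2 ]; then
  compare_node_modules "$1" "$2"
fi
```

Because the script exits non-zero when the trees differ, it can be used directly as a pass/fail step.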

Resolving non-determinism

In the simple case, if the files that are non-deterministic are not needed for your build, you can simply delete them in a postinstall script. In the example repository WORKSPACE file, I simply delete the node_modules/ssh2/lib/protocol/crypto/build folder as it is the only one that contains problematic files.

In this case, I did this by adding && rm -rf ./node_modules/ssh2/lib/protocol/crypto/build to npm install using the args attribute of npm_install,

npm_install(
    name = "npm",
    package_json = "//:package.json",
    package_lock_json = "//:package-lock.json",
    args = [
        "&&",
        "rm -rf ./node_modules/ssh2/lib/protocol/crypto/build"
    ],
)

but you could also do this in the package.json postinstall script or in a bash script called from the postinstall script.
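As a rough sketch of the bash script option, the cleanup could look like the following. The function name and the idea of keeping the paths in a list are assumptions for illustration; the only path taken from this article is the ssh2 build folder.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Delete generated folders that are not needed by the build.
# The list below is an assumption for this example; replace it with
# the paths that your own diff reports.
remove_nondeterministic_outputs() {
  local root="$1"
  local paths=(
    "node_modules/ssh2/lib/protocol/crypto/build"
  )
  local p
  for p in "${paths[@]}"; do
    # ${root:?} aborts if root is empty, so this can never expand to "/...".
    rm -rf "${root:?}/$p"
    echo "cleaned $p"
  done
}

# Default to the current directory, which is the package root when npm
# runs a lifecycle script.
remove_nondeterministic_outputs "${1:-.}"
```

The postinstall entry in package.json would then invoke this script by whatever path you store it under. Keeping the paths in one list makes it easy to extend when the determinism check flags a new package.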

In more complex cases, where the problematic files are needed for the build, you may need to patch the offending npm dependency so it doesn't generate non-deterministic files or land a fix upstream that does the same.

Catching regressions

Whether or not you've found and fixed non-determinism in your npm dependencies, your build could regress in the future when new npm dependencies are added or existing ones are upgraded.

It is best practice to periodically run this check on CI (or manually if you must). Running it on every change set could be too slow if it becomes the bottleneck on no-op change sets. A periodic cron job on CI that runs once a day or more, however, can catch regressions in npm dependency determinism when they happen, so you can keep your build times fast and your developers productive.