Lazy tool fetching under Bazel


Without a tool like Bazel to coordinate downloading build inputs and invoking the tools that need them, many tools work around the gap by doing this for themselves. This "lazy downloading" pattern is common because it simplifies the developer experience. For example, web testing tools will download a copy of Chromium or Firefox so that the test fixture has a headless browser to use.

The download is probably large, or comes from a third party, which is why it wasn't just embedded into the tool's package. A large download requires a cache so you don't wait for the fetch every time you run the tool. And a third-party download introduces a trust issue into the toolchain, so it ought to have exactly pinned versions and content integrity hashes to ensure it hasn't been tampered with.

Today, I'll explore an example in which a tool downloads a modified Node.js interpreter.

That tool is https://github.com/vercel/pkg. As they describe it:

This command line interface enables you to package your Node.js project into an executable that can be run even on devices without Node.js installed.

Scanning further in the documentation, we see that the "lazy download" pattern is being used:

pkg has so called "base binaries" - they are actually same node executables but with some patches applied. They are used as a base for every executable pkg creates. pkg downloads precompiled base binaries before packaging your application. If you prefer to compile base binaries from source instead of downloading them, you may pass --build option to pkg. First ensure your computer meets the requirements to compile original Node.js: BUILDING.md

See pkg-fetch for more info.

So, pkg-fetch downloads from a release archive when you run it. It does the things mentioned above: it keeps a cache of the downloads and verifies their integrity hashes (see https://github.com/vercel/pkg-fetch/blob/main/lib/expected-shas.json). Platform-dependent binaries are such a mess in Node.js (see, for example, how esbuild struggles with them) that this pattern gets used a lot in the Node ecosystem.

The Problem

The pkg tool works fine under Node.js, but Bazel wants to be more prescriptive: we model external downloads, such as the one pkg-fetch performs, as a "repository rule" producing an "external repository", which you can download with bazel fetch. Bazel itself verifies the content integrity hash of the download and manages its own cache. Once the fetch is done, the rest of the build can be performed offline.

Then, when Bazel performs a build step (an "action"), it should run in a network-isolated sandbox. You can add a tags=["requires-network"] marker to permit build actions to connect over the network, so the pkg tool could run as-is under a Bazel action. But the cache behavior is what we really miss. Either we have to make the build non-hermetic, allowing pkg to use some folder outside the Bazel sandbox as its cache, or we have a slow fetch every time we run the tool. In my experiments, the download took 12 seconds while the tool itself ran in 2 seconds.

This gets even worse when Bazel does Remote Build Execution. Now, each run of the tool may happen on a different machine. Even if we took the non-hermetic approach above to put a cache folder somewhere on the disk, we'll mostly get cache misses.

A Solution

To make this tool more "Bazel-idiomatic", we want to divide the responsibilities between fetch and build. The download should happen in a repository rule (the built-in http_file is sufficient), and then the action that runs pkg should be network-sandboxed and have the downloaded file supplied as a declared hermetic input.

We'll wrap http_file with a simple macro so that we can include a copy of the hashes, which is needed because we are bypassing the expected-shas.json file in the pkg-fetch library I linked to above. Here's the code listing: https://github.com/aspect-build/bazel-examples/blob/main/vercel_pkg/defs/vercel_pkg_fetch.bzl. We can see it in action like this:
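
As a rough sketch of what such a macro has to compute (this is not the linked listing), the helper below turns a table of hashes into the attributes for one http_file call per platform. It is written in the subset of Python that Starlark shares. The helper name, the version constants, and the release URL layout are assumptions for illustration, and the hash values are placeholders to be copied from expected-shas.json.

```python
# Illustrative constants; use the versions your copy of pkg actually expects.
PKG_FETCH_VERSION = "3.4"
PKG_FETCH_NODE_VERSION = "16.16.0"

# Placeholders only: copy the real values from expected-shas.json.
PKG_FETCH_SHA256 = {
    "linux-x64": "<sha256 from expected-shas.json>",
    "macos-arm64": "<sha256 from expected-shas.json>",
}

def pkg_fetch_http_file_attrs(shas = PKG_FETCH_SHA256):
    """Returns one dict of http_file attributes per platform (hypothetical helper)."""
    result = []
    for platform, sha256 in shas.items():
        asset = "node-v{}-{}".format(PKG_FETCH_NODE_VERSION, platform)
        result.append({
            # For example pkg_fetch_node_macos_arm64, the repository name
            # used with `bazel fetch` in this post.
            "name": "pkg_fetch_node_" + platform.replace("-", "_"),
            "sha256": sha256,
            # Assumed URL layout of the pkg-fetch release archive.
            "urls": ["https://github.com/vercel/pkg-fetch/releases/download/v{}/{}".format(
                PKG_FETCH_VERSION,
                asset,
            )],
        })
    return result
```

In a .bzl file, the macro would then loop over these dicts and call http_file(**attrs) for each one, so that Bazel checks every download against its pinned hash.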

vercel_pkg % bazel fetch @pkg_fetch_node_macos_arm64//file
INFO: All external dependencies fetched successfully.
Loading: 0 packages loaded

Now we want to run pkg and have it use this file instead of fetching over the network.

As I wrote in https://blog.aspect.dev/genrule-bestrule, a Bazel macro is a great way to wrap a generic tool to have a more ergonomic expression in a BUILD file. I like to start from this "public user interface" when designing a solution. We want it to look like this:

vercel_pkg(
    name = "example_build",
    out = "example",
    entry_point = "index.js",
    node_major_version = 16,
)
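
Behind that interface, the macro has to turn these attributes into a pkg command line. Here is a minimal sketch, assuming the --targets and --output flags of the pkg CLI and its node<major>-<platform>-<arch> target naming. The helper name is hypothetical, and the complete solution linked at the end of this post may build its arguments differently.

```python
def pkg_args(entry_point, out, node_major_version, platform):
    """Builds the argument list for one pkg invocation (hypothetical helper).

    platform is a "<platform>-<arch>" string such as "macos-arm64".
    """
    target = "node{}-{}".format(node_major_version, platform)
    return [entry_point, "--targets", target, "--output", out]

# The attributes from the vercel_pkg call, targeting macOS on arm64.
EXAMPLE_ARGS = pkg_args("index.js", "example", 16, "macos-arm64")
```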

Now, our macro can wrap the supplied bin entry that appears in the package.json file of the pkg package with load("@npm//:pkg/package_json.bzl", pkg_bin = "bin"). But there's a crucial missing ingredient: does the tool give us a way to point to a file we've already downloaded?

Fortunately, the library was designed to accept an environment variable PKG_CACHE_PATH: https://github.com/vercel/pkg#environment. The value of this variable can be any folder we construct under Bazel, as long as it looks the same as what the pkg-fetch library would have created. My typical approach for this is to run the tool outside of Bazel and see what that folder looks like. A naive use of http_file creates a folder structure that doesn't work with the tool - it expects a version number path segment, and then a filename with a certain format. The downloaded_file_path attribute of http_file lets us control this, however, so by constructing it this way:

# Match the path of what the `pkg` program would do dynamically.
downloaded_file_path = "v{}/fetched-v{}-{}".format(
    PKG_FETCH_VERSION, PKG_FETCH_NODE_VERSION, platform)

we create a folder we can use with PKG_CACHE_PATH.
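
To make the layout concrete, here is a small sketch in the subset of Python that Starlark shares. The version numbers are illustrative, the external/<repository>/file/ prefix assumes the usual on-disk layout of an http_file repository, and both helper names are hypothetical.

```python
def pkg_fetch_file_path(pkg_fetch_version, node_version, platform):
    """Path of the base binary relative to the cache folder."""
    return "v{}/fetched-v{}-{}".format(pkg_fetch_version, node_version, platform)

def pkg_cache_root(downloaded_file):
    """Strips the version folder and filename, leaving the folder for PKG_CACHE_PATH."""
    return downloaded_file.rsplit("/", 2)[0]

REL = pkg_fetch_file_path("3.4", "16.16.0", "macos-arm64")
# REL == "v3.4/fetched-v16.16.0-macos-arm64"

# Assumed location of the http_file output relative to the execution root.
CACHE = pkg_cache_root("external/pkg_fetch_node_macos_arm64/file/" + REL)
# CACHE == "external/pkg_fetch_node_macos_arm64/file"
```

The value handed to PKG_CACHE_PATH is the folder that contains the version segment, not the downloaded file itself.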

You can see a complete solution for vercel_pkg here: https://github.com/aspect-build/bazel-examples/blob/main/vercel_pkg/defs/vercel_pkg.bzl