rctx.download custom headers coming to Bazel 7.1

In most cases, Bazel is responsible for fetching files from the internet. In a WORKSPACE file you may have seen this as http_archive or http_file. Behind the scenes, these repository rules call the repository_ctx#download API, https://bazel.build/rules/lib/builtins/repository_ctx#download (or its download_and_extract variant).

Note that some rulesets decided not to use the Bazel downloader. For example, rules_go uses go mod download (https://pkg.go.dev/cmd/go#hdr-Download_modules_to_local_cache), and rules_python uses pip download.

The Bazel Downloader API is great because the files are always cached in the Bazel repository_cache. If you enable --experimental_remote_downloader you can even use your Remote Cache to serve as a read-through proxy of the files. The Downloader also has a rich configuration ability which Alex wrote about earlier: https://blog.aspect.dev/configuring-bazels-downloader

As ruleset authors, we've relied heavily on this Bazel API, but we ran into some cases that didn't work. For example, the Docker registry is an HTTP server with lots of surprises. At least, that has been our experience implementing a repository rule in rules_oci that fetches container images from registries like DockerHub.

Docker pull and HTTP Headers

In rules_oci, we don't want to run docker pull. We want to use the Bazel downloader to get the benefits listed above.

Internally, a container image is just a few tar archives called "layers", tied together by a "manifest", which is a straightforward JSON file. Like everything else in the software industry, this JSON file has its variations. We are mostly interested in the v2 version, as it is widely used, while v1 is deprecated and no longer in use.

However, some people don't like to move with everybody else. This encouraged Docker to implement a fallback behavior: if you request a manifest from a registry without an Accept HTTP header listing specific MIME types, the registry will happily assume that you are an old client that cannot understand the v2 manifest, and gladly downgrade the manifest to v1.

Unfortunately, Bazel's downloader had no way to set headers, so it was not suitable for fetching Docker images. We had to work around this by reverting to curl in our repository rule. But then the downloaded files are no longer cached, so this is not a principled solution.

Thanks to Bazel being open-source, this was correctable! I filed the issue for this: https://github.com/bazelbuild/bazel/issues/17829. Then I worked with the Bazel team to land a fix for it. Thanks to Fabian and Tiago for reviewing the PR and getting it landed!

The feature that allows setting arbitrary HTTP headers is already on HEAD and described on the documentation page, but it has not been cherry-picked into Bazel's 7.1 release branch yet. However, it can be used today by putting the commit hash 2b8885ed954e58d09d47290d47882a7298c88334 into the .bazelversion file.

Alpine

This problem of setting headers is not unique to Docker registries. Alpine packages also have a property that requires us to send some headers. However, this case is a little different from the Docker one. Alpine uses an archive format which is simply multiple gzip streams of tars concatenated into one file. It is a lot like `.deb` archives.

An .apk has three segments:

signature: contains the package signature.
control: contains metadata about the package.
data: contains files such as libs and executables.
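
To make the "multiple gzip streams combined" layout concrete, here is a minimal sketch in plain Python (not Starlark, so it can be run on its own). It finds where each gzip member starts and ends in a concatenated blob. The helper name gzip_member_offsets and the fake_apk bytes are made up for illustration, and the payloads are short byte strings standing in for the real tar segments.

```python
import gzip
import zlib


def gzip_member_offsets(blob):
    """Return a list of (start, end) byte offsets, one per gzip member in blob."""
    offsets = []
    pos = 0
    while pos < len(blob):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(blob[pos:])
        if not decomp.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        # unused_data holds whatever follows the member that just ended.
        end = len(blob) - len(decomp.unused_data)
        offsets.append((pos, end))
        pos = end
    return offsets


# Stand-in for an .apk: three independently gzipped payloads, concatenated.
segments = [b"signature", b"control", b"data"]
fake_apk = b"".join(gzip.compress(s) for s in segments)

for start, end in gzip_member_offsets(fake_apk):
    print(start, end, gzip.decompress(fake_apk[start:end]))
```

Each printed line shows the byte span of one segment followed by its decompressed payload, which is the information needed to address a segment by byte position.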

The problem with .apk packages is that, unlike the control and data segments, the signature segment is not guaranteed to be stable, meaning its contents might change without notice. This is a problem in the Bazel world because, to be reproducible, everything we fetch from the internet has to be identical even five years from now.

Luckily, the signature segment is optional: some packages don't even have it, and it can be omitted if needed. However, since an .apk is served as a single file containing all three segments, we can't simply ask for the last two. This is where HTTP range requests come in: we fetch only the last two segments of the archive so that the result is always stable. To do that, we send a Range header that tells the HTTP server to return only the range of bytes we specify.
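
As a rough sketch of what such a request looks like, the plain Python below (not Starlark, so it runs standalone) builds an open-ended Range header and applies it to a fake package held in memory. The names open_ended_range and serve_range are hypothetical, serve_range is only a toy stand-in for an HTTP server, and the signature length is known here only because the fake package is built locally.

```python
import gzip
import hashlib


def open_ended_range(first_byte):
    """Build a Range header asking for everything from first_byte to the end."""
    if first_byte < 0:
        raise ValueError("first_byte must be non-negative")
    return {"Range": "bytes=%d-" % first_byte}


def serve_range(body, headers):
    """Toy stand-in for a server honouring an open-ended 'bytes=N-' range."""
    spec = headers["Range"]
    if not (spec.startswith("bytes=") and spec.endswith("-")):
        raise ValueError("only open-ended byte ranges are handled here")
    return body[int(spec[len("bytes="):-1]):]


# Fake package: an unstable signature segment followed by the stable tail.
signature = gzip.compress(b"signature v1")
stable_tail = gzip.compress(b"control") + gzip.compress(b"data")
fake_apk = signature + stable_tail

# Skip the signature by starting the range right after it.
headers = open_ended_range(len(signature))
fetched = serve_range(fake_apk, headers)

print(headers)
print(hashlib.sha256(fetched).hexdigest() == hashlib.sha256(stable_tail).hexdigest())
```

The second line printed is True: the bytes returned for the range are exactly the control and data segments, so their checksum does not depend on the signature contents. In a repository rule, a dict of this shape is what would be passed as the headers argument of rctx.download. How the starting offset is determined for a real package is outside this sketch.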

Trying it out

Here's an example of a simplified oci_pull-style repository rule that demonstrates how setting the Accept header changes the registry's response.

def _simple_pull_impl(rctx):
    url = "https://index.docker.io/v2/fluxcd/flux/manifests/1.25.4"
    # <token> below is a placeholder. Get a real token with:
    #   curl -fsSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:fluxcd/flux:pull"
    auth_pattern = {
        url: { "type": "pattern", "pattern": "Bearer <password>", "password": "<token>" }
    }
    headers = {
        "Accept": "application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.docker.distribution.manifest.v2+json",
    }
    rctx.download(
        url = url,
        auth = auth_pattern,
        # If you comment out the headers argument, the schema version will be 1.
        headers = headers,
        output = "manifest.json"
    )
    manifest = json.decode(rctx.read("manifest.json"))
    print("schemaVersion: %s" % manifest["schemaVersion"])

    rctx.file("BUILD.bazel", """filegroup(name = "manifest", srcs = ["manifest.json"])""")

simple_pull = repository_rule(
    implementation = _simple_pull_impl,
)

With HTTP headers the sky is the limit. We have only talked about a few specific problems, but this feature allows Bazel users to do much more than that.

Imagine you want to download a big archive from Artifactory, but it takes too much time. Well, you can accelerate the fetch by downloading different parts of the archive in parallel with multiple simultaneous HTTP requests. I'll post more on that later.
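
Until then, here is a small sketch of the first step only: cutting a file of known size into byte ranges, one per request. It is plain Python rather than Starlark, the function name split_ranges is made up for illustration, and it assumes the total size is already known and that the server supports range requests. Issuing the requests concurrently and stitching the parts back together are left out.

```python
def split_ranges(total_size, parts):
    """Return Range header values that together cover bytes 0..total_size-1."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    parts = min(parts, total_size)  # never create empty chunks
    chunk, extra = divmod(total_size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional byte each.
        length = chunk + (1 if i < extra else 0)
        end = start + length - 1  # end positions in a Range header are inclusive
        ranges.append("bytes=%d-%d" % (start, end))
        start = end + 1
    return ranges


print(split_ranges(10, 3))
# ['bytes=0-3', 'bytes=4-6', 'bytes=7-9']
```

Each string would be sent as the value of a Range header on its own request.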