Version: v3.19.x

Background Information on Mutation

Mutation webhooks in Kubernetes are a nuanced concept with many gotchas. This page explores some of the background of mutation webhooks in Kubernetes, their operational and syntactical implications, and how Gatekeeper tries to provide value on top of the basic Kubernetes webhook ecosystem.

Mutation Chaining

A key difference between mutating webhooks and validating webhooks is that mutating webhooks are called in series, whereas validating webhooks are called in parallel.

This makes sense, since validating webhooks can only approve or deny (or warn) for a given input and have no other side effects. This means that the result of one validating webhook cannot impact the result of any other validating webhook, and it's trivial to aggregate all of the validation responses as they come in: reject if at least one deny comes in, return all warnings and denies that are encountered back to the user.
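The aggregation rule described above can be sketched in a few lines. This is an illustrative model only, not the API server's implementation; `ValidationResponse` and `aggregate` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResponse:
    """Hypothetical stand-in for one validating webhook's verdict."""
    allowed: bool
    message: str = ""
    warnings: list = field(default_factory=list)


def aggregate(responses):
    """Combine independent verdicts: reject if any webhook denied,
    and report every deny message and warning that was collected."""
    denies = [r.message for r in responses if not r.allowed]
    warnings = [w for r in responses for w in r.warnings]
    return {"allowed": not denies, "denies": denies, "warnings": warnings}


responses = [
    ValidationResponse(allowed=True),
    ValidationResponse(allowed=False, message="missing owner label"),
    ValidationResponse(allowed=True, warnings=["image uses :latest"]),
]
result = aggregate(responses)
print(result)
```

Because no response depends on any other, reordering `responses` never changes the allow/deny decision, which is what makes parallel invocation safe.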

Mutation, however, changes what the input resource looks like. This means that the output of one mutating webhook can have an effect on the output of another mutating webhook. For example, if one mutating webhook adds a sidecar container, and another webhook sets imagePullPolicy to Always, then the new sidecar container means that this second webhook has one more container to mutate.
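To make the ordering effect concrete, here is a small sketch that models the two webhooks from the example as plain functions over a pod-like dictionary. The function names, the sidecar image, and the pod contents are made up for illustration.

```python
import copy


def add_sidecar(pod):
    """Hypothetical mutator: append a sidecar container if none is present."""
    out = copy.deepcopy(pod)
    names = {c["name"] for c in out["spec"]["containers"]}
    if "sidecar" not in names:
        out["spec"]["containers"].append(
            {"name": "sidecar", "image": "example.com/sidecar:1.0"}
        )
    return out


def set_pull_policy(pod):
    """Hypothetical mutator: set imagePullPolicy to Always on every container."""
    out = copy.deepcopy(pod)
    for container in out["spec"]["containers"]:
        container["imagePullPolicy"] = "Always"
    return out


pod = {"spec": {"containers": [{"name": "app", "image": "example.com/app:1.0"}]}}

# Same two mutators, two different call orders.
sidecar_first = set_pull_policy(add_sidecar(pod))
policy_first = add_sidecar(set_pull_policy(pod))

print(sidecar_first["spec"]["containers"])
print(policy_first["spec"]["containers"])
```

When the sidecar is injected first, both containers end up with `imagePullPolicy: Always`. When the policy webhook runs first, the sidecar is added afterwards and is left without the policy, unless the webhooks are invoked again.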

The biggest practical issue with this call-in-sequence behavior is latency. Validation webhooks, which are called in parallel, have a combined latency equal to that of the slowest-responding webhook. Mutation webhooks have a total latency that is the sum of the latencies of all mutating webhooks that are called. This makes mutation much more latency-sensitive.
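As a rough model of the difference, assuming each webhook has a fixed response time and ignoring API server overhead, the two totals can be computed as follows. The latency figures are made up for illustration.

```python
def parallel_latency(latencies_ms):
    """Webhooks called in parallel finish when the slowest one responds."""
    return max(latencies_ms, default=0)


def serial_latency(latencies_ms):
    """Webhooks called in series each wait for the previous one to respond."""
    return sum(latencies_ms)


latencies_ms = [20, 35, 50]  # made-up per-webhook response times

print(parallel_latency(latencies_ms))  # 50
print(serial_latency(latencies_ms))    # 105
```

Under this model, adding another validating webhook only matters if it is slower than the current slowest one, while every additional mutating webhook adds its full response time to the request.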

This can be particularly harmful for something like external data, where a webhook reaches out to a secondary service to gather necessary information. This extra hop can be especially expensive if the external calls are not minimized. Gatekeeper translates external data references scattered across multiple mutators into a single batched call per external data provider, and calls each provider in parallel, minimizing latency.
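The batching idea can be sketched as follows. This is not Gatekeeper's implementation; it only illustrates the general technique of grouping lookups by provider and querying the providers concurrently. The provider names, keys, and the `fake_provider` function are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def batch_by_provider(references):
    """Group (provider, key) references into one de-duplicated key list per provider."""
    batches = defaultdict(list)
    for provider, key in references:
        if key not in batches[provider]:
            batches[provider].append(key)
    return dict(batches)


def fetch_all(batches, call_provider):
    """Issue exactly one call per provider, with all providers queried concurrently."""
    if not batches:
        return {}
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = {
            name: pool.submit(call_provider, name, keys)
            for name, keys in batches.items()
        }
        return {name: future.result() for name, future in futures.items()}


def fake_provider(name, keys):
    """Stand-in for a network call to an external data provider."""
    return {key: f"{name}:{key}" for key in keys}


# References as they might be scattered across several mutators.
references = [
    ("registry-info", "nginx"),
    ("team-lookup", "billing"),
    ("registry-info", "redis"),
    ("registry-info", "nginx"),
]
batches = batch_by_provider(references)
results = fetch_all(batches, fake_provider)
print(batches)
print(results)
```

Four scattered references collapse into two provider calls here, and the wall-clock cost is roughly that of the slower provider rather than the sum of four round trips.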

Mutation Recursion

Not only are mutators chained, but they recurse as well. This is due not only to Kubernetes' reinvocation policy, but also to the nature of the Kubernetes control plane itself, since controllers may modify resources periodically. Whether because of the reinvocation policy or because of control plane behavior, mutators are likely to operate on their own output. This carries some operational risk. Consider a mutating webhook that prepends a hostname to a Docker image reference (e.g. prepend gcr.io/). If written naively, each successive mutation would add another prefix, leading to results like gcr.io/gcr.io/gcr.io/my-favorite-image:latest. Because of this, Kubernetes requires mutation webhooks to be idempotent.
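A minimal sketch of the difference between the naive mutation and an idempotent one, modeling each webhook as a function on the image string. The function names are hypothetical.

```python
PREFIX = "gcr.io/"


def naive_prepend(image):
    """Always prepends, so every invocation changes the value again."""
    return PREFIX + image


def idempotent_prepend(image):
    """Prepends only when the prefix is missing, so repeat calls are no-ops."""
    return image if image.startswith(PREFIX) else PREFIX + image


def apply_n(mutate, image, n):
    """Simulate a mutator being invoked n times on its own output."""
    for _ in range(n):
        image = mutate(image)
    return image


print(apply_n(naive_prepend, "my-favorite-image:latest", 3))
# gcr.io/gcr.io/gcr.io/my-favorite-image:latest
print(apply_n(idempotent_prepend, "my-favorite-image:latest", 3))
# gcr.io/my-favorite-image:latest
```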

This is a good idea, but there is one problem: webhooks that are idempotent in isolation may not be idempotent as a group. Let's take the above mutator and make it idempotent. We'll give it the following behavior: "if an image reference does not start with gcr.io/, prepend gcr.io/". This makes the webhook idempotent, for sure. But what if there is another team working on the cluster, and they want their own image mutation rule: "if an image reference for the billing namespace does not start with billing.company.com/, prepend billing.company.com/"? Each of these webhooks would be idempotent in isolation, but each one's output no longer satisfies the other's check, so when chained together you'll see results like billing.company.com/gcr.io/billing.company.com/gcr.io/my-favorite-image:latest.
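The interaction can be reproduced with a short simulation. The two functions below model the rules quoted above; `run_chain` is a hypothetical helper that invokes both mutators in sequence for a number of rounds, standing in for reinvocation or repeated updates.

```python
def gcr_mutator(image, namespace):
    """Idempotent in isolation: prepend gcr.io/ unless already present."""
    prefix = "gcr.io/"
    return image if image.startswith(prefix) else prefix + image


def billing_mutator(image, namespace):
    """Idempotent in isolation: in the billing namespace, prepend
    billing.company.com/ unless already present."""
    prefix = "billing.company.com/"
    if namespace != "billing" or image.startswith(prefix):
        return image
    return prefix + image


def run_chain(image, namespace, rounds):
    """Invoke both mutators in series, repeated for the given number of rounds."""
    history = []
    for _ in range(rounds):
        for mutator in (gcr_mutator, billing_mutator):
            image = mutator(image, namespace)
        history.append(image)
    return history


history = run_chain("my-favorite-image:latest", "billing", 2)
for image in history:
    print(image)
# billing.company.com/gcr.io/my-favorite-image:latest
# billing.company.com/gcr.io/billing.company.com/gcr.io/my-favorite-image:latest
```

Each round adds another pair of prefixes, so the chain never settles on a stable value in the billing namespace, even though each mutator passes an idempotence check on its own.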

At small scales, with small teams, it's relatively easy to ensure that mutations don't interfere with each other. At larger scales, or when multiple non-communicating parties each have their own rules to set, it can be hard, or even impossible, to maintain this requirement of "global idempotence".

Gatekeeper attempts to make this easier by designing mutation in such a way that "global idempotence" is an emergent property of all mutators, no matter how they are configured. Here is a proof, where we attempt to show that our language for expressing mutation always converges on a stable result.

Summary

By using Gatekeeper for mutation, it is possible to reduce the number of mutation webhooks, which should reduce overall admission latency. It should also help prevent decoupled management of mutation policies from violating the Kubernetes API server's requirement of idempotence.