kubectl drain skip bug could cause failures
- kubectl drain has multiple real edge-case bugs, not one new universal “skip bug,” and some can evict pods even when cordoning failed.
- One open April 2026 bug shows drain retrying every 5 seconds after a 429 eviction rejection, even if the pod moved elsewhere.
- That matters because drain is the maintenance primitive behind upgrades and scale-downs, and operators often trust its success output too much.

Kubernetes node drain is supposed to be the safe version of “take this machine out of service.” You cordon the node, evict the pods, wait for replacements, then do the reboot or upgrade. But the catch is that `kubectl drain` is a client-side workflow, and a few edge-case bugs show that the workflow can get confused in ways operators really do care about.

### What is drain actually supposed to do?

`kubectl drain` marks a node unschedulable, then evicts or deletes the pods on it, while respecting things like PodDisruptionBudgets unless you explicitly bypass them. It also refuses some categories of pods unless you pass flags like `--ignore-daemonsets` or `--force`. The official docs are pretty clear on the contract: don’t touch the machine until drain completes. (kubernetes.io)

### So what’s the bug people are talking about?

Turns out there isn’t one single newly disclosed “skip bug.” There are several drain edge cases in GitHub issues, and social posts seem to be collapsing them together. The most current one is an open Kubernetes issue from April 6, 2026. In that case, drain keeps retrying an eviction every 5 seconds after a `429 Too Many Requests` response, usually from a blocking PodDisruptionBudget, even if the pod has since been recreated on another node. (kubernetes.io)

### Why is that dangerous?

Because drain is supposed to answer a simple question: “Is this node empty enough to take down?” If the command keeps tracking a pod that no longer belongs to the node, automation can stall forever. That can jam cluster upgrades, autoscaling scale-downs, or maintenance windows. In the issue report, the failure mode is especially awkward for StatefulSets, where a replacement pod can come back with the same name on a different node while drain still thinks it owns the old problem. (github.com)

### Is there also a real “skip” failure?

Yes, but it’s older, and it’s a different bug.
A 2023 issue, later marked fixed by Kubernetes PR `#122574`, described drain incorrectly ignoring a terminating pod during control-plane disruption. The bug came from bad logic around a pod `Get` call during API-server trouble, which could make drain treat the pod as effectively gone when it was still terminating. Basically, under the wrong conditions, drain could move on too early. (github.com)

### Can drain continue even when cordon failed?

Also yes. Another issue showed `kubectl drain` continuing to evict pods even after the node cordon step was denied by an admission webhook. That means an operator could see an error about being unable to cordon a protected node, but drain would still proceed with evictions if the user had pod-eviction rights. That issue was filed in March 2024 and later closed, but it’s a good example of why “drain started” and “drain behaved safely” are not the same thing. (github.com)

### Does this mean drain is unsafe?

Not broadly. It means drain is safe when the cluster state is boring, and a lot less reassuring when the cluster is already stressed, blocked by PDBs, or wobbling during control-plane events. Kubernetes itself documents that drain relies on eviction behavior, retries, and pod termination completing cleanly. Once those assumptions break, client-side edge cases matter a lot. (github.com)

### What should operators do differently?

Treat drain as something to verify, not just invoke. Watch the node’s actual pod list. Watch replacement pods become Ready elsewhere. Be careful with `--disable-eviction`, because that bypasses PodDisruptionBudgets. And if your automation treats a long-running drain as “still making progress,” add checks for pods that have moved nodes or for cordon failures that should be fatal. (kubernetes.io)

### Bottom line?

The real story is smaller than the social warning but still serious. `kubectl drain` is not one atomic server-side action. It’s a sequence of client decisions, and when the cluster gets weird, those decisions can be wrong in ways that either hang maintenance or make it look safer than it is. (kubernetes.io)
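
To make the "verify, don't just invoke" advice concrete, here is a minimal Python sketch of the checks an automation wrapper might run against objects it has already fetched, for example by parsing `kubectl get node <name> -o json` and `kubectl get pods -A -o json`. It is not how `kubectl drain` works internally. The function names are hypothetical, and the filtering is an assumption that roughly mirrors a drain run with `--ignore-daemonsets`: DaemonSet-owned pods and static (mirror) pods are treated as expected leftovers. It relies only on standard API fields: `spec.unschedulable`, `spec.nodeName`, `metadata.uid`, `metadata.ownerReferences`, and the `kubernetes.io/config.mirror` annotation.

```python
from typing import Any, Dict, List, Optional, Tuple

Obj = Dict[str, Any]  # a Node or Pod object as parsed from `-o json` output

MIRROR_ANNOTATION = "kubernetes.io/config.mirror"  # set on static (mirror) pods


def is_cordoned(node: Obj) -> bool:
    """True if the Node has spec.unschedulable set, i.e. the cordon took effect."""
    return bool(node.get("spec", {}).get("unschedulable", False))


def _expected_leftover(pod: Obj) -> bool:
    """Assumption: DaemonSet and mirror pods are allowed to remain on the node."""
    meta = pod.get("metadata", {})
    owners = meta.get("ownerReferences") or []
    if any(owner.get("kind") == "DaemonSet" for owner in owners):
        return True
    return MIRROR_ANNOTATION in (meta.get("annotations") or {})


def pods_still_on_node(pods: List[Obj], node_name: str) -> List[str]:
    """Return namespace/name for pods still bound to the node that should be gone."""
    remaining = []
    for pod in pods:
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        if _expected_leftover(pod):
            continue
        meta = pod["metadata"]
        remaining.append(f"{meta.get('namespace', 'default')}/{meta['name']}")
    return remaining


def eviction_target_is_stale(original_uid: str, node_name: str,
                             current_pod: Optional[Obj]) -> bool:
    """True if retrying eviction no longer makes sense for this node.

    The pod is stale if it is gone, was recreated under the same name
    (different UID), or is now scheduled on another node.
    """
    if current_pod is None:
        return True
    if current_pod.get("metadata", {}).get("uid") != original_uid:
        return True
    return current_pod.get("spec", {}).get("nodeName") != node_name


def safe_to_take_down(node: Obj, pods: List[Obj]) -> Tuple[bool, List[str]]:
    """Combine the checks: a failed cordon is fatal, and so is any remaining pod."""
    name = node["metadata"]["name"]
    problems = []
    if not is_cordoned(node):
        problems.append("node is not cordoned")
    problems += [f"pod still on node: {p}" for p in pods_still_on_node(pods, name)]
    return (not problems, problems)
```

Comparing UIDs rather than names is the design choice that matters for StatefulSets: a replacement pod keeps its name but gets a new UID, so a name-only check cannot tell the old pod from its successor on another node. A wrapper would call `safe_to_take_down` after drain exits, and `eviction_target_is_stale` inside any retry loop of its own, before deciding whether a long-running drain is still making progress.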