| 1 | +# WG Node Lifecycle Charter |
| 2 | + |
| 3 | +This charter adheres to the conventions described in the [Kubernetes Charter README] and uses |
| 4 | +the Roles and Organization Management outlined in [wg-governance]. |
| 5 | + |
| 6 | +[Kubernetes Charter README]: /committee-steering/governance/README.md |
| 7 | + |
| 8 | +## Scope |
| 9 | + |
| 10 | +The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple |
| 11 | +projects independently addressing similar issues. The goal of this working group is to develop |
| 12 | +unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across |
| 13 | +projects and addressing scenarios that impede node drain or cause improper pod termination. Our |
| 14 | +objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with |
| 15 | +existing APIs and behaviors. |
| 16 | + |
| 17 | +To properly solve node drain, we must first understand the node lifecycle. This includes |
| 18 | +provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node |
| 19 | +shutdown. This then impacts both the node and pod autoscaling, load balancing, and the applications |
| 20 | +running in the cluster. All of these areas have issues and would benefit from a unified approach. |
| 21 | + |
| 22 | +### In scope |
| 23 | + |
| 24 | +- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs |
| 25 | + and extending the current ones. This includes exploring extension to or interactions with the Node |
| 26 | + object. |
| 27 | +- Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting |
| 28 | + the Node API to expose additional state or status in order to coalesce other core Kubernetes and |
| 29 | + community APIs around node lifecycle management. |
| 30 | +- Improve the disruption model that is currently implemented by the API-initiated Eviction API and PDBs. |
| 31 | + Improve the descheduling, availability and migration capabilities of today's application |
| 32 | + workloads. Also explore the interactions with other eviction mechanisms. |
| 33 | +- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle. |
| 34 | + Graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000) |
| 35 | + feature to GA and resolve the associated node shutdown issues. |
| 36 | +- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and |
| 37 | + the new disruption model/evictions. |
| 38 | +- Explore the cloud provider use cases and how they can hook into the node lifecycle, so that |
| 39 | + users can use the same APIs or configurations across the board. |
| 40 | +- Migrate users of the eviction-based kubectl-like drain (kubectl, cluster autoscaler, karpenter, |
| 41 | + ...) to use the new approach. |
| 42 | + |
| 43 | + |
| 44 | +### Out of scope |
| 45 | + |
| 46 | +- Implementing cloud-provider-specific logic. The goal is to have a high-level API that the providers |
| 47 | + can use, hook into, or extend. |
| 48 | +- Infrastructure provisioning or deprovisioning solutions, or physical infrastructure lifecycle |
| 49 | + management solutions. |
| 50 | + |
| 51 | +## Stakeholders |
| 52 | + |
| 53 | +- SIG Apps |
| 54 | +- SIG Architecture |
| 55 | +- SIG Autoscaling |
| 56 | +- SIG CLI |
| 57 | +- SIG Cloud Provider |
| 58 | +- SIG Cluster Lifecycle |
| 59 | +- SIG Network |
| 60 | +- SIG Node |
| 61 | +- SIG Scheduling |
| 62 | +- SIG Storage |
| 63 | + |
| 64 | +Stakeholders span from multiple SIGs to a broad set of end users, |
| 65 | +public and private cloud providers, Kubernetes distribution providers, |
| 66 | +and cloud provider end-users. Here are some user stories: |
| 67 | + |
| 68 | +- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without |
| 69 | + any required manual interventions. I also want to be able to observe the node drain via the API |
| 70 | + and check on its progress. I also want to be able to discover workloads that are blocking the node |
| 71 | + drain. |
| 72 | +- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet |
| 73 | + and other actors should use a new eviction API to gracefully remove pods. This should enable new |
| 74 | + migration strategies that prefer to surge (upscale) pods first rather than downscale them. It |
| 75 | + should also allow other users/components to monitor pods that are gracefully removed/terminated |
| 76 | + and provide better behaviour in terms of de/scheduling, scaling and availability. |
| 77 | +- As an end user, I cannot bear the cost of blue-green upgrades, especially with special hardware |
| 78 | + accelerators; it's far too expensive. It is more cost-effective to coordinate a drain and then |
| 79 | + upgrade. |
| 80 | +- As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing |
| 81 | + Kubernetes to help CSPs safely remove hardware will reduce operational costs. |
| 82 | +- The cost of doing accelerator maintenance in today's world can be massive. And since |
| 83 | + hardware accelerators tend to need more love and care, having software support to coordinate |
| 84 | + maintenance will reduce operational costs. |
| 85 | + |
| 86 | +## Deliverables |
| 87 | + |
| 88 | +The WG will coordinate requirements gathering and design, eventually leading to |
| 89 | +KEP(s) and code associated with the ideas. |
| 90 | + |
| 91 | +Areas we expect to explore: |
| 92 | + |
| 93 | +- An API to express node drain/maintenance. |
| 94 | + Currently tracked in https://github.com/kubernetes/enhancements/issues/4212. |
| 95 | +- An API to solve the problems with the API-initiated Eviction API and PDBs. |
| 96 | + Currently tracked in https://github.com/kubernetes/enhancements/issues/4563. |
| 97 | +- An API to remove pods from endpoints before they terminate. |
| 98 | + Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y. |
| 99 | +- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve |
| 100 | + a wide range of issues. |
| 101 | + |
| 102 | +We expect to provide reference implementations of the new APIs including but not limited to |
| 103 | +controllers, API validation, integration with existing core components and extension points for the |
| 104 | +ecosystem. This should be accompanied by E2E / Conformance tests. |
| 105 | + |
| 106 | +## Roles and Organization Management |
| 107 | + |
| 108 | +This WG adheres to the Roles and Organization Management outlined in [wg-governance] |
| 109 | +and opts in to updates and modifications to [wg-governance]. |
| 110 | + |
| 111 | +[wg-governance]: /committee-steering/governance/wg-governance.md |
| 112 | + |
| 113 | +## Timelines and Disbanding |
| 114 | + |
| 115 | +The working group will disband when the KEPs we create are completed. We will |
| 116 | +review whether the working group should disband if appropriate SIG ownership |
| 117 | +can't be reached. |