AEP-7942 Vertical Pod Autoscaling for DaemonSets with Heterogeneous Resource Requirements #7942
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bernot-dev

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi @bernot-dev. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retitle AEP-7942 Vertical Pod Autoscaling for DaemonSets with Heterogeneous Resource Requirements

/assign

/ok-to-test
After thinking more about it, I don’t think this is a good idea. Here’s why:
- DaemonSets are supposed to be the same on all nodes, and this change goes against that rule.
- There are different types of nodes, and it would be very hard to decide how each DaemonSet pod should work on each one.
- Nodes in a cluster can be replaced often (like spot instances). If DaemonSets behave differently on each node, it could cause problems when nodes change.
Because of this, I think this idea might create more problems than it solves. It would make things more complicated and harder to manage.
But again, this is my opinion... maybe others think differently :)
@omerap12 Thanks for your comments. I'll respond to each point individually.

1. According to the documentation, a DaemonSet ensures that all (or some) Nodes run a copy of a Pod. I am not aware of any promise of "sameness" beyond that. There is a reference to using multiple DaemonSets to specify different resource requests. This proposal aims to ease the operational burden of managing multiple DaemonSets that would otherwise need to be created strictly to manage resource allocation. In addition, this proposal aims to address situations where determining the correct resource allocations in advance is infeasible because it depends on other workloads within the cluster. As another example, consider a cluster storage daemon. The correct resource allocations have little to do with characteristics of the node. Rather, they depend on how storage is used by other workloads scheduled to the node. If none of the pods scheduled to the node use storage at all, then minimal resources are needed. If other pods scheduled to the node are heavy storage users, the daemon may need a large resource allocation. Without the changes described in this proposal, that operational burden falls to humans. This proposal allows that operational burden to shift to Kubernetes.

2. The type of node is not considered under this proposal. Only the actual usage history is relevant. This is consistent with the current behavior of VPA, which also does not consider node types. I agree that considering node types with VPA would add additional complexity. There may be use cases for that kind of behavior, but I consider that concern out of scope for the current proposal.

3. There are several reasons a Kubernetes user may want to incorporate spot instances into their cluster. Making their cluster easier to manage is not one of them. I think there are probably some cases where VPA over spot instances makes sense, and probably many where it doesn't. At any rate, this proposal does not close the door to any existing options for managing workloads, including over spot instances.
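To make the operational burden concrete, here is a hypothetical sketch of the multiple-DaemonSet workaround described above; all names, labels, images, and request values are invented for illustration:

```yaml
# Two DaemonSets that differ only in nodeSelector and resource requests,
# maintained separately purely to manage resource allocation.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: storage-daemon-light
spec:
  selector:
    matchLabels: {app: storage-daemon, tier: light}
  template:
    metadata:
      labels: {app: storage-daemon, tier: light}
    spec:
      nodeSelector:
        storage-tier: light              # assumed label on low-usage nodes
      containers:
      - name: daemon
        image: example.com/storage-daemon:1.0   # placeholder image
        resources:
          requests: {cpu: 50m, memory: 64Mi}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: storage-daemon-heavy
spec:
  selector:
    matchLabels: {app: storage-daemon, tier: heavy}
  template:
    metadata:
      labels: {app: storage-daemon, tier: heavy}
    spec:
      nodeSelector:
        storage-tier: heavy              # assumed label on storage-heavy nodes
      containers:
      - name: daemon
        image: example.com/storage-daemon:1.0
        resources:
          requests: {cpu: "1", memory: 1Gi}
```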
@bernot-dev
I disagree; at that point you are basically prescribing that a k8s cluster should only have homogeneous workloads in order to keep pods per node the same for all workloads. The problem this design aims to solve is very real: most k8s clusters do not have a homogeneous set of nodes and/or workloads, and most agents (which is what DaemonSets are used for) scale relative to the number and activity of the pods. Think log gatherers, metrics gatherers, or security daemons that track syscalls. I do agree on the general point, though, that it will be hard/impossible to make this useful in scenarios where nodes are very short-lived. I also like your idea of using an instance-type average as a starting point rather than the DaemonSet setting.
@alvaroaleman Thanks for your thoughts! The challenge is implementing this in environments where nodes are short-lived. Even with the node type idea, it could still be tricky. For example, with Karpenter, the nodePool settings could generate many different node types, making them hard to track. I also considered using node labels, but different node types could share the same labels, which would be problematic as well.
It seems like there may be disagreement on this non-goal:
There are many ways that you could try to estimate resource requests for a new pod that is created when a new node is added to a DaemonSet. However, in the options that I have considered, there is not a clear "winner." Using node types might be better than nothing in some scenarios, but it comes at the additional cost of tracking usage over each node type grouping. Furthermore, there is no guarantee that the node type has any relationship whatsoever with the daemon's workload. It is a rough heuristic that may suit some use cases, but certainly is not universally applicable. Also, it raises the question of what should happen when a new node type is introduced to the cluster. I don't believe there is an obvious answer for what the user would expect in that scenario.

Another option could be applying the average over the entire DaemonSet, instead of a node type average. Again, it's possible you would get a better recommendation from the outset, but it is not guaranteed.

Yet another option would be setting a resource request proportional to the node size. So, a 32M memory request on a node with 4G of memory translates to a 128M (4x) request on a node with 16G (4x) of memory capacity. You could even do maths to determine a weighted average over different node sizes. Each of these options assumes that knowing the size of the node tells you something about the workload of the Daemon. I reject that assumption. I understand there likely exists a weak correlation between the resource consumption of a Daemon and the size of the node it is on for typical workloads. However, I do not believe that the correlation is strong enough to justify adding complexity to VPA.

Furthermore, the lifespan of the initial resource requests is expected to be short under VPA. Making a better guess at the initial resource allocation may affect the pod for O(minutes) before it receives a recommendation founded in the desired data. The impact of the initial recommendation is minimal, except in short-lived workloads, which are probably not a good fit for VPA to begin with.

Finally, a new node may not have workloads scheduled immediately. If a node is added and a corresponding log gatherer daemon is added, that log gatherer will have very little work to do until other workloads are added that start producing logs. That could happen quickly or it could happen slowly. It could be the case that the first pod scheduled is an application under load with extremely verbose logging enabled. Or it could just be that the workloads scheduled only produce sparse logs, and it never needs to scale up. Making a guess based on what is happening on other nodes does not necessarily improve the guess.

For these reasons, I affirm my position that this should remain a non-goal.
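For reference, the proportional-to-node-size option sketched above amounts to the following, with symbols introduced here purely for illustration ($r$ being the daemon's request, $C$ the node's memory capacity):

$$r_{\text{new}} = r_{\text{ref}} \cdot \frac{C_{\text{new}}}{C_{\text{ref}}}, \qquad \text{e.g.}\ 32\text{M} \cdot \frac{16\text{G}}{4\text{G}} = 128\text{M}$$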
@bernot-dev Thanks for the proposal, I really think this would be a useful addition! I've seen the very same problem that you're describing: DaemonSet resource consumption can depend heavily on what specific Pods are on a Node and what they do (lots of outgoing/incoming traffic, lots of writes to disk, lots of logs produced, etc.). As pointed out by this proposal, this may not be true for specific DaemonSets, and you still want VPA to adjust the requests according to usage, but on a per-Pod (i.e. per-Node) level. If you want/need this, you could enable it with the new `scope` field.

@bernot-dev I'm interested a bit in the details of implementing this: while it seems a reasonable enough change on a high level, it also seems quite intrusive to the way vpa-recommender currently aggregates usage samples, makes recommendations, and stores them. Have you put together some prototype to prove this is feasible already, or some rough ideas on how to make this work? In the end, we would probably need to:
Answers to these open points will most likely have a huge effect on VPA's API; therefore, I think we should clarify them a bit more before we can merge this proposal. PS: I saw this part in the proposal
but this is too high level for me to understand what the changes to the internals and the API would be. In a bit more detail:
Does this mean you are proposing to change
Oh, right, I forgot about that in my list above. Currently, we have a VPACheckpoint object per VPA and container; now we would need one per Node, and we would also need to change the initialization logic to feed this into the internal histogram state on recommender reboot!
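For illustration, today's checkpoint objects are keyed by VPA object and container name; a per-node variant might look like the sketch below, where the `nodeName` field is an assumption for discussion rather than anything specified in this proposal:

```yaml
# Sketch only: a per-(VPA, container, node) checkpoint. The nodeName field
# is hypothetical; the actual API shape is one of the open questions above.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscalerCheckpoint
metadata:
  name: log-agent-vpa-agent-node-a1b2c3
spec:
  vpaObjectName: log-agent-vpa
  containerName: agent
  nodeName: node-a1b2c3          # hypothetical addition
```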
I do find this non-goal to be a small step backwards. At the moment we use VPAs for DaemonSets. The results we get are mostly-ok, but certainly not perfect. However, it's now up to the user to pick the starting value for the pod, rather than allowing the VPA to do it.
```diff
+ // Scope indicates at what level the recommendations should be calculated and applied.
+ // By default, the scope is all pods under the controller specified in `TargetRef`. Scope
+ // may also be set to "node" when the TargetRef is a DaemonSet, to apply recommendations
+ // independently to the pod scheduled to each node under the DaemonSet.
+ // +kubebuilder:validation:Enum=;node
+ Scope VPAScope `json:"scope,omitzero" protobuf:"bytes,4,opt,name=scope"`
```
Nit: the spacing here doesn't match the top part of this struct
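For readers following along, a VPA manifest using this field might look like the sketch below. The DaemonSet name is invented, and placing `scope` at the top level of the VPA `spec` is an assumption read off the struct tag, not something the diff confirms:

```yaml
# Sketch only: a node-scoped VPA for a DaemonSet, using the proposed field.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: log-agent-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: log-agent            # hypothetical DaemonSet
  updatePolicy:
    updateMode: Auto
  scope: node                  # proposed: recommendations per node, not per controller
```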
I wonder if it makes sense to expand this to be more general. For example, copying `topologySpreadConstraints` and allowing for any `topologyKey`? I assume it doesn't change the solution much, and it may allow for more options in the future?
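Purely as a hypothetical illustration of that more general shape (the `recommendationScope` and `topologyKey` field names below are invented for discussion, not part of the proposal):

```yaml
# Hypothetical alternative API shape, sketched for discussion only.
spec:
  recommendationScope:
    topologyKey: kubernetes.io/hostname       # per-node; equivalent to scope: node
    # topologyKey: topology.kubernetes.io/zone would aggregate samples per zone
```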
This is a really interesting idea. `kubernetes.io/hostname` definitely aligns with the goals here. I can't quite picture what problem is solved by using `topology.kubernetes.io/zone` or `topology.kubernetes.io/region` with VPA. But maybe there's something that makes sense.

It seems there are some references to topologyKey being deprecated. Not sure what the story is there. Do you know if it is still supported?
> `kubernetes.io/hostname` definitely aligns with the goals here. I can't quite picture what problem is solved by using `topology.kubernetes.io/zone` or `topology.kubernetes.io/region` with VPA. But maybe there's something that makes sense.

Yeah, this is where I landed too. I really can't see any other use other than node, but maybe somebody does have a use case that we just don't know about. It also feels like topologyKey is close to what we want. I'm very much on the fence on this idea. I'm curious what others think?

> It seems there are some references to topologyKey being deprecated. Not sure what the story is there. Do you know if it is still supported?

I don't know the history here. `topologyKey` is used in `topologySpreadConstraints`, and that doesn't seem to be going away.
```
### Upgrade / Downgrade Strategy

<!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.

Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to maintain previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to make use of the enhancement?
-->

### Version Skew Strategy

<!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- Does this enhancement involve coordinating behavior in the control plane and nodes?
- How does an n-3 kubelet or kube-proxy without this feature available behave when this feature is used?
- How does an n-1 kube-controller-manager or kube-scheduler without this feature available behave when this feature is used?
- Will any other components on the node change? For example, changes to CSI,
  CRI or CNI may require updating that component before the kubelet.
-->
```
In other recent AEPs we filled this section in.
@bernot-dev - would love to know if you've had a chance to prototype this? I think clarifying some of the points above from @voelzmo would be really great. Also, at least from my perspective, there is a big need for something like this; I have heard internally and from customers, time and time again, that it would be very useful. Really hoping we can all work together to refine this and ship something that can solve a longstanding problem here :)
The user should pick a starting value for resources in the DaemonSet in any case, especially because requests and limits remain proportional.

The current behavior is cluster-scoped. With the node-scoped behavior, we don't have to track/query the cluster-scoped usage at all. To implement a starting value based on other nodes, we would have to track the aggregated usage in addition to the individual usage in order to fall back on unscoped behavior. This would be ongoing work that would only be used occasionally.

Also, as I discussed previously, I question the value of the initial recommendation. There's no particular reason to believe that common DaemonSets would immediately (or ever) scale to levels similar to other nodes. A daemon that collects observability signals will need other workloads to observe first. Compared to (for instance) adding a pod to a deployment with a load balancer in front of it, many common DaemonSets will have much less predictable workloads.
Thanks so much for this, both @bernot-dev and @bboreham! If I understand correctly (and please correct me if I'm wrong - I haven't had time to dive deep, but I did review the idea briefly), we are defining recommendations for DaemonSet pods in a 1:1 pod-to-node relationship. From what I see in the PR, it's not strictly limited to DaemonSets, but let's set that aside for now. Essentially, if pod X is scheduled on node Y, the recommendation is adjusted accordingly. That means this recommendation doesn't account for other DaemonSet pods, correct? Since each pod is scheduled on a different node, they wouldn't influence each other.

What happens when a node deregisters from the cluster and a new node is added? Does the recommendation process restart in that case? I think this approach makes a lot of sense for static clusters where nodes are permanent, but I'm curious how it behaves in cloud provider environments where nodes are more dynamic.

Edit: I had another idea - rather than using the node name or label, we could apply a hash function to the CPU and RAM allocatable resources of the node. I believe this approach addresses the issues I mentioned earlier.
I agree that the user should pick a starting value, but at the moment there is no requirement for a starting value to be picked, which makes this change feel backwards incompatible. For example, if someone has a cluster-scoped DaemonSet VPA configured, and the DaemonSet spec is configured without Pod resources, any new Pod will be created with resources set, thanks to the VPA. However, if they were to change that VPA to use a node-scoped DaemonSet VPA, then new Pods will start with no resources set, until enough time has passed and the node-scoped recommendation is applied. I worry that a user will naively enable the node-scoped feature, without knowing that this will change the behaviour of the VPA in a backwards incompatible way.
Why can't this new feature be implemented in a way that tracks both cluster-scoped and node-scoped metrics? This ongoing work will be used every time a new node is created. For any cluster with autoscaling enabled, this possibly happens many times a day.
From what I've observed, a DaemonSet Pod only exists because the workload needs it. If enough Pods are created at the same time, the node could start its life already full.
What if we also deployed a second VPA that is cluster-scoped and in `Initial` mode? Trying to have a single VPA represent all of this sounds a bit complicated from the API perspective.
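A sketch of that two-VPA idea, with invented names and the proposed `scope` field assumed; note that VPA today advises against multiple VPAs matching the same pods, so this pairing would presumably need explicit support:

```yaml
# Sketch: a node-scoped VPA for ongoing adjustment, paired with a
# cluster-scoped VPA in Initial mode to set requests at pod creation.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: log-agent-node-scoped
spec:
  targetRef: {apiVersion: apps/v1, kind: DaemonSet, name: log-agent}
  updatePolicy:
    updateMode: Auto        # adjusts each pod from its own node's usage history
  scope: node               # proposed field
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: log-agent-initial
spec:
  targetRef: {apiVersion: apps/v1, kind: DaemonSet, name: log-agent}
  updatePolicy:
    updateMode: Initial     # only sets requests when a new pod is created
```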
I wonder if it makes sense to take a step back and figure out what we want from this and the right API to build it on top of? I wonder if in-place will allow other types of workloads (Deployments, StatefulSets, etc.) to have per-pod recommendations, since the cost of resizing is now lower. I don't know if that's a possibility in our future, but it may make sense to build this feature to allow for future development.
This would not work in the case where a node with the same number of CPUs runs at a different speed.
I understand. So do you have any ideas regarding the issues I brought up?
This is a bit more complicated to implement than node name, as it requires an extra API call from the admission-controller to see the labels. However, that is a minor concern.
I see. I'm just wondering how this behaves in a large cluster with many frequently replaced instances.
Would it be too far-fetched to expand this proposal to also include StatefulSets? The story supporting this use case comes again from Observability, this time considering the distributor/ingester setup on the write path, which is common across a number of projects (Thanos, Cortex, and Mimir, for example). Here requests are routed to individual pods of the StatefulSet by hashing some property of the incoming data (for example, a label set). When you also introduce replication, this can lead to quite unbalanced resource usage across the pods in the StatefulSet. Per-pod resource recommendations would be tremendously valuable in this case.
I think it's a reasonable request, but at the moment I'm not sure if the admission-controller knows which host the pod will land on. DaemonSets have the host baked into the Pod spec, but I'm unsure about StatefulSets. Maybe the in-place feature will allow this in the future?
What type of PR is this?
/kind feature
What this PR does / why we need it:
Enhancement proposal for VPA behavior for DaemonSets. See change for details.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: