Keda 2.5 does not cleanly update from 2.4 #2381

Closed · bpinske opened this issue Dec 3, 2021 · 19 comments · Fixed by #2593
Labels: bug (Something isn't working)

@bpinske (Contributor) commented Dec 3, 2021

Report

There appears to be a bug that prevents a clean and safe upgrade from keda 2.4 to keda 2.5, possibly related to this PR, which changed metric names, or this one. This affects pre-existing ScaledObjects that are present at the time of the upgrade from 2.4 to 2.5.

The symptom is that the HPA control loop attempts to evaluate a metric which does not actually exist in the Kubernetes external metrics API. Below is a snippet of kubectl describe hpa output, where the new keda 2.5 metric name format is being queried, along with the log output of the external metrics API:

Metrics:                                               ( current / target )
  "s1-prometheus-burrow_lag" (target average value):   <unknown> / 120M
  resource cpu on pods  (as a percentage of request):  108% (3267m) / 100%
  Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/s001/s1-prometheus-burrow_lag?labelSelector=app=myApp' | jq
Error from server: No matching metrics found for s1-prometheus-burrow_lag
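
To see the mismatch at a glance, one can compare the external metric names the HPAs reference with the names the metrics server actually serves. This is a hedged diagnostic sketch, not from the original report; the v2beta2 resource name and the jq/jsonpath expressions are assumptions:

# external metric names referenced by HPAs (namespace <tab> metric name)
kubectl get horizontalpodautoscalers.v2beta2.autoscaling -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.spec.metrics[?(@.type=="External")].external.metric.name}{"\n"}{end}'

# metric names the KEDA metrics server currently exposes
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq -r '.resources[].name' | sort -u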

Reverting to Keda 2.4 immediately fixes the issue and resumes using the old names.
When in the errored state, remediation was possible by deleting all ScaledObjects and recreating them; this appeared to trigger a reconciliation of the recreated ScaledObjects, after which the new-style metric became available.
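
A hedged sketch of that remediation, assuming the ScaledObject manifests are kept on disk (the directory and namespace are illustrative):

# delete the stuck ScaledObjects; KEDA removes the corresponding HPAs
kubectl delete -f scaledobjects/ -n s001
# recreate them; the operator reconciles and recreates the HPAs with the new metric names
kubectl apply -f scaledobjects/ -n s001
# confirm the new-style names are now served
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq -r '.resources[].name'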

Expected Behavior

Keda 2.5 is expected to work immediately and reliably out of the box for existing ScaledObject definitions.

Actual Behavior

Upgrading from Keda 2.4 to Keda 2.5 is disruptive for pre-existing ScaledObject-managed HPAs. The new-style metrics are inaccessible from the keda metrics API server.

Steps to Reproduce the Problem

  1. Deploy Keda 2.4
  2. Create ScaledObjects with Prometheus triggers (a minimal sketch follows this list)
  3. Upgrade to Keda 2.5
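
A minimal ScaledObject sketch for step 2, only to make the reproduction concrete; the workload name, Prometheus address, query, and threshold are placeholders rather than the reporter's actual configuration:

kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-burrow-lag
  namespace: default
spec:
  scaleTargetRef:
    name: example-deployment              # placeholder workload
  minReplicaCount: 1
  maxReplicaCount: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder address
        metricName: burrow_lag
        query: 'sum(burrow_lag{app="example"})'
        threshold: "120"
EOF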

This issue may be difficult to reproduce. It only occurred in 2 out of my 30 kubernetes clusters, but it consistently happened within those 2. I am entirely unclear as to why those 2 clusters persistently had the issue: they should be configured identically to the rest.

Logs from KEDA operator

 Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name=myApp:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

KEDA Version

2.5.0

Kubernetes Version

1.20

Platform

Other

Scaler Details

Prometheus

Anything else?

No response

@bpinske added the bug label on Dec 3, 2021
@bpinske changed the title from "Keda 2.5 does not cleaning update from 2.4 prometheus" to "Keda 2.5 does not cleanly update from 2.4 prometheus" on Dec 4, 2021
@JorTurFer (Member) commented:

Hi,
The change in the name is expected behavior. Has the metrics server been updated to v2.5 too?
If you query the available metrics manually, are you getting the old names? The update should be automatic in both cases: the operator should update the HPA and the metrics server should expose it without any extra action on your side.

@bpinske (Contributor, Author) commented Dec 5, 2021

This issue resurfaced on me again after a delay of several hours. This is the second time it has surfaced, each time failing several hours after the initial rollout. I will note that, following the first incident, I deleted and recreated all ScaledObjects and HPA objects while already on keda 2.5, to ensure there wouldn't be any potentially stale references left over. Since I have now hit the issue a second time, in multiple environments, this has not helped.

I support many environments, and across them I see two sets of behaviours:

  1. One where keda 2.5 works without issue
  2. One where keda 2.5 works for approximately 9 hours before the new-style metrics begin failing to resolve. This has happened 5 times now across 3 days and 3 environments.

For scenario 1, I have the following example where I DO see a mismatch between the enumeration of the available resources and what's actually queryable. This behaviour is consistent and reproducible across environments.

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "prometheus-https---thanos-example-com-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
    ...
  ]
}

(⎈)➜  ~ kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/pool/s0-prometheus-burrow_lag?labelSelector=scaledobject.keda.sh/name=myApp' | jq
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "s0-prometheus-burrow_lag",
      "metricLabels": null,
      "timestamp": "2021-12-05T23:30:10Z",
      "value": "0"
    }
  ]
}

Just as part of writing this up, I noticed that if I restart the (already 2.5) keda metrics API server, it begins returning the correctly named metrics when enumerating, and the data resolves properly as well:

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "s0-prometheus-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
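
For reference, the restart described above can be done with a rollout restart. The namespace and deployment name below are common defaults and depend on how KEDA was installed, so treat them as assumptions:

kubectl rollout restart deployment/keda-operator-metrics-apiserver -n keda
kubectl rollout status deployment/keda-operator-metrics-apiserver -n keda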

In scenario 2, where keda actually fails, I receive the following error messages and am unable to query the metrics.

apiVersion="autoscaling/v2beta2" type="Warning" reason="FailedGetExternalMetric" message="unable to get external metric s001/s2-prometheus-burrow_lag_sensor/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s2-prometheus-burrow_lag_sensor"

Unfortunately I don't have a live example of the API output from today, but when I was investigating this for the first time I saw the following unusual output. I now wonder if there is a cache expiring, or maybe a leader-election change of some sort, that is causing the metric names to revert. I do believe the two new-format metrics below were fixed as a result of deleting and re-creating the ScaledObject definition. I wonder whether restarting the metrics API server would also have refreshed the metrics to the point where they resolve correctly. But needing to restart keda every few hours remains undesirable behaviour :)

Friday morning example (broken):

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq '.resources[].name'
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag_sensor"
"prometheus-https---thanos-example-com-burrow_lag"
"s0-prometheus-burrow_lag_sensor"
"s1-prometheus-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"

@JorTurFer (Member) commented Dec 6, 2021

But KEDA doesn't expose the same metric in different ways depending on its lifetime. Could you maybe have more than one instance running? The index is calculated inside the scaler in every GetMetricSpecForScaling call and is evaluated internally, so I can't understand this output, for example:

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq '.resources[].name'
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag_sensor"
"prometheus-https---thanos-example-com-burrow_lag"
"s0-prometheus-burrow_lag_sensor"
"s1-prometheus-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"

Even if the index had been wrong, you should see all of them as s0-xxxx 🤔
I'm not sure whether the metrics are cached at the k8s level (maybe they are, and that's the problem). I know that the KEDA Metrics Server caches the metric name, but again, the metric name should contain sx-xxxx.
Do you know more about this, @zroubalik @coderanger?

I have tried on an EKS v1.21 cluster and it works correctly, and the behavior is the same on an AKS v1.20.
I'm not able to reproduce it :(

@JorTurFer (Member) commented:

Just to double-check, you have tried deleting and recreating the ScaledObjects, right? I mean, in the clusters where you have problems, did you also delete and recreate them?

@bpinske (Contributor, Author) commented Dec 6, 2021

Yes, I had deleted and recreated the ScaledObjects after upgrading to keda 2.5.

Here is a screenshot of the metric values over time. Note the flatline occurring in the middle of the night. During that flatline we were receiving the error message below, which continued until we reverted to keda 2.4, at which point metrics began to flow again using the old naming convention.
[screenshot: metric values over time, showing the flatline]

During the flatline period, once the issue starts, the metrics server begins throwing 500s because it's unable to resolve requests for the metric name:

  Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

[screenshot]

Given that I seem pretty capable of reproducing this issue in production :(, is there a guide documented anywhere on how to attach gdb or delve to the metrics server directly when the issue arises? Or somewhere I can find symbols to do that?

@JorTurFer (Member) commented:

Could you run kubectl get pods -n {KEDA_NAMESPACE} -o jsonpath="{..imageID}" and paste the output please?

@bpinske (Contributor, Author) commented Dec 6, 2021

kedacore/keda@sha256:8fba3ab792c0e9d14ab046cda739e0925a39277c991122fd40474a59958bbd19 
ghcr.io/kedacore/keda-metrics-apiserver@sha256:77e4967dc13cb8b3c6f1dcf0b6c5ad9e44e09daa621b567d73f7318627551756

@JorTurFer (Member) commented:

The images look good :/
Tomorrow I will prepare an environment with KEDA v2.4 and try to reproduce the problem with a ScaledObject using Prometheus triggers, updating to v2.5 (it's 2:00 here now).

It's weird, because the difference is not at the trigger level but at the ScaledObject level, and I updated KEDA in our (company) clusters without any issues: basically the new version of the operator updated the HPAs and the new version of the metrics server exposes them. We use RabbitMQ triggers, but as I said, the change is at the ScaledObject level 🤔

Maybe there is some specific behavior in the Prometheus scaler, but I don't think so... Let's see tomorrow.

@glassnick commented:

Hi there.

Has there been any update on this issue?

We're also seeing a similar issue. We have 3 clusters where we have upgraded from 2.4.0 to 2.5.0, and two of them are producing errors.
We are using Azure Service Bus for the events and getting the output below. As you can see, the metric "s1-azure-servicebus-st-xxx" is showing as <unknown>.

kubectl describe hpa keda-hpa-file-xxx

Name:                                                              keda-hpa-file-xxx
Namespace:                                                         default
Labels:                                                            app.kubernetes.io/managed-by=Helm
                                                                   scaledobject.keda.sh/name=file-xxx
Annotations:                                                       <none>
CreationTimestamp:                                                 Fri, 10 Dec 2021 11:51:54 +0000
Reference:                                                         Deployment/file-xxx
Metrics:                                                           ( current / target )
  "s1-azure-servicebus-st-xxx" (target average value):   <unknown> / 5
  "s1-azure-servicebus-mi-xxx" (target average value):  0 / 5
Min replicas:                                                      1
Max replicas:                                                      15
Deployment pods:                                                   1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s1-azure-servicebus-mi-xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                   Age                    From                       Message
  ----     ------                   ----                   ----                       -------
  Warning  FailedGetExternalMetric  34s (x782 over 3h18m)  horizontal-pod-autoscaler  unable to get external metric default/s1-azure-servicebus-st-xxx/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-azure-servicebus-st-xxx

@JorTurFer (Member) commented:

🤦
I have to apologize, I didn't have much time and I totally forgot about this issue :(
I will take a look during the next week.
Sorry

@glassnick commented:

For this particular metric, when querying the metrics manually, we're seeing s0 instead of s1 in the metric name...

"s0-azure-servicebus-st-xxx"

FYI, we are seeing issues with multiple metrics.
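
The manual check described here would look roughly like the following; the namespace and ScaledObject name are taken from the HPA output above and may differ in practice:

# names the metrics server is actually exposing for this scaler
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq -r '.resources[].name' | grep azure-servicebus

# the served name (s0-...) returns a value, while the name the HPA asks for (s1-...)
# fails with "No matching metrics found"
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-azure-servicebus-st-xxx?labelSelector=scaledobject.keda.sh/name=file-xxx' | jq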

@JorTurFer (Member) commented:

Could your problem be related to this?

@bpinske changed the title from "Keda 2.5 does not cleanly update from 2.4 prometheus" to "Keda 2.5 does not cleanly update from 2.4" on Dec 23, 2021
@bpinske (Contributor, Author) commented Dec 23, 2021

^ That actually does sound somewhat related. I definitely observed mismatches between the metric names reported by the metrics server and what the HPA loop was actually querying.

@JorTurFer (Member) commented:

Okay, there are 2 different workarounds if you are affected by this error:

  • Update the ScaledObject to bump its generation (update the manifest; don't delete and recreate it; see the sketch below)
  • Restart the KEDA pods to recreate the cache

Could you check whether these workarounds mitigate the problem? Just to confirm whether this is the root cause before investing time digging into it.
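
A hedged sketch of the first workaround (the second, restarting the KEDA pods, is the rollout restart shown earlier in the thread). Only a spec change bumps .metadata.generation; annotation or label changes do not. The resource name, namespace, and field values are illustrative:

# nudge a spec field to bump the generation, then put it back
kubectl patch scaledobject myApp -n s001 --type=merge -p '{"spec":{"pollingInterval":31}}'
kubectl patch scaledobject myApp -n s001 --type=merge -p '{"spec":{"pollingInterval":30}}'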

@bpinske (Contributor, Author) commented Dec 23, 2021

First I'll need to see if I can actually find a way to reproduce this issue reliably: so far I've only been able to trigger it in my 4 largest environments :)

I'll see about trying to reproduce this today in a dev space. If I find a way to reproduce it, what I'll actually do is the following (rough sketch below):

  1. Build keda myself and cherry-pick in your cache-deletion commit to see if that fixes it.
  2. If the above doesn't fix it, revert this and this, one at a time, to try to bisect down to the real cause.
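
A rough sketch of that plan; the commit references are placeholders and the build/publish step is only indicative, since the exact Makefile targets and variables should be checked against the KEDA repo:

git clone https://github.com/kedacore/keda.git && cd keda
git checkout v2.5.0
git cherry-pick <cache-deletion-commit>            # placeholder commit
# for the bisection path, revert the suspect changes one at a time instead:
# git revert --no-edit <suspect-merge-commit> -m 1
# build and push custom images (variable names are assumptions):
IMAGE_REGISTRY=registry.example.com IMAGE_REPO=myorg make publish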

@JorTurFer (Member) commented:

Nice!
Thanks for your help ❤️
Thinking about it, updating the ScaledObject name is probably enough to bypass the cache, because the cache key is generated from the name and namespace.

@glassnick commented:

Our team will be testing this in the new year.

@tomkerkhove (Member) commented:

@glassnick Did you manage to give this a try?

@bpinske (Contributor, Author) commented Jan 18, 2022

I tried to reproduce the error that I had originally been seeing, but I was unable to.

I had observed the error only in my largest environments, so I suspected it was related to the number of objects in the cluster. Unfortunately, just throwing large numbers of objects at a dev environment was not enough to reproduce the error cases I had seen.

My next step is kind of the nuclear one: running keda in my largest environments, where I can reproduce the bug, under the delve debugger and breakpointing exactly what's failing. It might be a while before I have time to continue at that level of debugging for my own particular issue, though.
