AWS Cloudwatch Scaler metrics pulling logic is not optimized #2242

Closed
fivesheep opened this issue Nov 1, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@fivesheep
Contributor

Report

The metrics pulling logic of the CloudWatch scaler is not optimized, and there is a chance of not getting any data point back from AWS due to CloudWatch's eventually consistent model. Some of the well-known tricks for exporting CloudWatch data with the GetMetricData API need to be applied:

  • Round the start time and end time down to improve performance and the chance of getting data back from CloudWatch. This is also suggested in the Request Parameters section of the GetMetricData API documentation from AWS.
  • Avoid requesting the most recent data point too early, to avoid dirty data. This too is caused by the eventually consistent model. Say we have a metric with a 5-minute resolution (period) to pull and the current time is 11:06. If we try to get the latest data point (which will be recorded for 11:05) and set the start time to 11:00 and the end time to 11:05, CloudWatch will very likely return no data point for that API call, and very likely the next cycle from 11:05 to 11:10 won't return any data either if the end time is too close to the current time. One way to avoid this is to extend the time range by additional cycles, for example always adding two to three cycles to the range when requesting data: to get the data point for 11:05, set the start time to 10:55 or even 10:50 (it won't increase the cost), also set the ScanBy request parameter to TimestampDescending, and always take the first data point returned in the output. In the current scaler code there is a parameter metricCollectionTime that can be used for this, but ScanBy is not set, so the order in which data points are returned is undefined (not documented). It would make sense to always set this parameter when sending the request (see the sketch after this list).
  • For the same eventual-consistency reason, the most recent data you get from CloudWatch, especially for high-resolution metrics (1 minute and below), might not be accurate. It would make sense to have an option to skip the most recent data point and take the one before it instead.
  • The Unit parameter is not supported. As documented by AWS, this parameter cannot be skipped in some cases:
    If you omit Unit in your request, all data that was collected with any unit is returned, along with the corresponding units that were specified when the data was reported to CloudWatch. If you specify a unit, the operation returns only data that was collected with that unit specified. If you specify a unit that does not match the data collected, the results of the operation are null. CloudWatch does not perform unit conversions.
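To make these points concrete, here is a minimal sketch of such a request using the aws-sdk-go v1 CloudWatch client. The helper name buildMetricDataInput, the extraPeriods parameter, and the hard-coded ELB metric are only for illustration and are not part of the scaler code:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// buildMetricDataInput rounds the start and end times down to the metric period,
// widens the window backwards by a couple of extra periods so CloudWatch has a
// better chance of returning at least one settled data point, and asks for
// results newest-first so the caller can simply take the first value.
func buildMetricDataInput(period time.Duration, extraPeriods int) *cloudwatch.GetMetricDataInput {
	// Round the end time down to a period boundary, as suggested by the
	// GetMetricData documentation.
	endTime := time.Now().UTC().Truncate(period)
	// Extending the window backwards does not change the cost of the call,
	// but it makes an empty result far less likely.
	startTime := endTime.Add(-time.Duration(extraPeriods+1) * period)

	return &cloudwatch.GetMetricDataInput{
		StartTime: aws.Time(startTime),
		EndTime:   aws.Time(endTime),
		// Without ScanBy the ordering of returned data points is undefined;
		// TimestampDescending makes Values[0] the most recent one.
		ScanBy: aws.String(cloudwatch.ScanByTimestampDescending),
		MetricDataQueries: []*cloudwatch.MetricDataQuery{
			{
				Id: aws.String("c1"),
				MetricStat: &cloudwatch.MetricStat{
					Metric: &cloudwatch.Metric{
						Namespace:  aws.String("AWS/ApplicationELB"),
						MetricName: aws.String("RequestCountPerTarget"),
						Dimensions: []*cloudwatch.Dimension{{
							Name:  aws.String("TargetGroup"),
							Value: aws.String("targetgroup/my-targetgroup/123456"),
						}},
					},
					Period: aws.Int64(int64(period.Seconds())),
					Stat:   aws.String("Sum"),
					// If the metric was reported with an explicit unit, Unit must
					// match it, otherwise CloudWatch returns null results.
					Unit: aws.String(cloudwatch.StandardUnitCount),
				},
			},
		},
	}
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := cloudwatch.New(sess)

	out, err := client.GetMetricData(buildMetricDataInput(60*time.Second, 2))
	if err != nil {
		log.Fatal(err)
	}
	// With TimestampDescending the first value is the freshest; to skip a
	// possibly incomplete most-recent data point, take Values[1] instead.
	if len(out.MetricDataResults) > 0 && len(out.MetricDataResults[0].Values) > 0 {
		fmt.Println(*out.MetricDataResults[0].Values[0])
	}
}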

Expected Behavior

The CloudWatch scaler shall always be able to get a stable value back from AWS.

Actual Behavior

Due to the eventually consistent model of the CloudWatch API, the scaler might not be able to get a value.

Steps to Reproduce the Problem

  1. Choose a metric with 1-minute resolution, such as the ELB RequestCount metric. Example config:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: aws-cloudwatch-test-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: my-deployment
  pollingInterval: 60
  minReplicaCount: 4
  maxReplicaCount: 100
  triggers:
  - type: aws-cloudwatch
    metadata:
      identityOwner: operator
      namespace: AWS/ApplicationELB
      dimensionName: TargetGroup
      dimensionValue: targetgroup/my-targetgroup/123456
      metricName: RequestCountPerTarget
      targetMetricValue: "100"
      minMetricValue: "0"
      metricStat: "Sum"
      metricStatPeriod: "60"
      metricCollectionTime: "60"
      awsRegion: "us-west-2"
  2. Identify the metric from the metrics API server with the command kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq
  3. Use a loop to monitor the value from the metrics API server and compare it with the AWS dashboard:
while true; do  kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-aws-cloudwatch-AWS-ApplicationELB-TargetGroup-targetgroup-xxxxxx | jq .items; sleep 15; done

Sometimes it returns 0.

Logs from KEDA operator

example

KEDA Version

2.4.0

Kubernetes Version

No response

Platform

Amazon Web Services

Scaler Details

Cloudwatch

Anything else?

The CloudWatch client can also be cached. It has the same lifecycle as the scaler and can renew its token by itself, so we don't need to make two calls to create this client every time metrics data is pulled. This would improve performance and cost.

@fivesheep fivesheep added the bug Something isn't working label Nov 1, 2021
@fivesheep
Contributor Author

cc @zroubalik

@zroubalik
Member

zroubalik commented Nov 2, 2021

> The CloudWatch client can also be cached. It has the same lifecycle as the scaler and can renew its token by itself, so we don't need to make two calls to create this client every time metrics data is pulled. This would improve performance and cost.

#2187 should fix this problem, am I right @fivesheep?

@fivesheep
Contributor Author

@zroubalik Not exactly, they serve different purposes. What I mean is that we can move the following code from GetCloudwatchMetrics() to the NewAwsCloudwatchScaler function


func (c *awsCloudwatchScaler) GetCloudwatchMetrics() (float64, error) {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(c.metadata.awsRegion),
	}))

	var cloudwatchClient *cloudwatch.CloudWatch
	if c.metadata.awsAuthorization.podIdentityOwner {
		creds := credentials.NewStaticCredentials(c.metadata.awsAuthorization.awsAccessKeyID, c.metadata.awsAuthorization.awsSecretAccessKey, "")

		if c.metadata.awsAuthorization.awsRoleArn != "" {
			creds = stscreds.NewCredentials(sess, c.metadata.awsAuthorization.awsRoleArn)
		}

		cloudwatchClient = cloudwatch.New(sess, &aws.Config{
			Region:      aws.String(c.metadata.awsRegion),
			Credentials: creds,
		})
	} else {
		cloudwatchClient = cloudwatch.New(sess, &aws.Config{
			Region: aws.String(c.metadata.awsRegion),
		})
	}

and let awsCloudwatchScaler own a reference to the client and use it directly in the GetCloudwatchMetrics call, to avoid creating the CloudWatch client on every call. The client can be reused since it refreshes its token by itself and has the same lifecycle as awsCloudwatchScaler.
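Roughly, a minimal sketch of the idea (the credentials/role handling from the snippet above is left out, and the trimmed-down metadata struct is only for illustration, not the scaler's actual definition):

package scalers

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// Sketch only: type and field names mirror the scaler, but the metadata struct
// is reduced to what the example needs.
type awsCloudwatchMetadata struct {
	awsRegion string
}

type awsCloudwatchScaler struct {
	metadata         *awsCloudwatchMetadata
	cloudwatchClient *cloudwatch.CloudWatch // created once, reused on every poll
}

// NewAwsCloudwatchScaler builds the CloudWatch client a single time. The SDK
// client refreshes its credentials by itself, so it can safely live as long
// as the scaler does.
func NewAwsCloudwatchScaler(meta *awsCloudwatchMetadata) *awsCloudwatchScaler {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(meta.awsRegion),
	}))
	return &awsCloudwatchScaler{
		metadata:         meta,
		cloudwatchClient: cloudwatch.New(sess),
	}
}

func (c *awsCloudwatchScaler) GetCloudwatchMetrics() (float64, error) {
	// Reuse c.cloudwatchClient here instead of rebuilding the session and
	// client on every call.
	_ = c.cloudwatchClient
	return 0, nil
}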

I did not make that change in PR #2243, since I saw that some other places, such as the AWS SQS scaler, use the same pattern and will need to be improved in the same way.

@zroubalik
Member

zroubalik commented Nov 2, 2021

Yeah, now I see what you meant, you are right. Do you think you could add this change in a separate PR once #2243 is merged? It would be a nice improvement.

@zroubalik
Member

The AWS-related stuff hasn't been updated for a while, so it definitely deserves some improvements. Unfortunately there hasn't been anybody with enough experience so far.

@fivesheep
Contributor Author

@zroubalik Sure, I will send a separate PR after that.

@zroubalik
Member

@fivesheep I guess we can close this one, right? Or is there anything else to be done?

@fivesheep
Contributor Author

@zroubalik Yup, I think we can close this one now. thx!
