AWS Cloudwatch Scaler metrics pulling logic is not optimized #2242

Closed
fivesheep opened this issue Nov 1, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@fivesheep
Contributor

Report

The metrics pulling logic of the CloudWatch scaler is not optimized, and there is a chance of not getting any data point back from AWS due to CloudWatch's eventually consistent model. Some of the well-known tricks for exporting CloudWatch data with the GetMetricData API need to be applied:

  • Round the start time and end time down to improve performance and the chance of getting data back from CloudWatch. This is also suggested in the Request Parameters section of the GetMetricData API documentation from AWS.
  • Avoid requesting the most recent data point too early, to avoid dirty data. This too is caused by the eventually consistent model. Say we have a metric with a 5-minute resolution (period) to pull and the current time is 11:06. If we try to get the latest data point (which will be recorded for 11:05) and set the start time to 11:00 and the end time to 11:05, CloudWatch will very likely return no data point for that API call, and very likely the next cycle from 11:05 to 11:10 won't return any data either if the end time is too close to the current time. One way to avoid this is to extend the time range by additional cycles, for example always adding two to three cycles to the range when requesting data: to get the data point for 11:05, set the start time to 10:55 or even 10:50 (it won't increase the cost), also set the ScanBy request parameter to TimestampDescending, and always take the first data point returned in the output. In the current scaler code there is a parameter metricCollectionTime that can be used for this, but ScanBy is not set, so the order in which data points are returned is undefined (not documented). It would make sense to always set this parameter when sending the request (see the sketch after this list).
  • For the same eventual-consistency reason, the most recent data you get from CloudWatch, especially for high-resolution metrics (1 minute and below), might not be accurate. It would make sense to have an option to skip the most recent data point and take the one before it instead.
  • The Unit parameter is not supported. As documented by AWS, this parameter cannot be skipped in some cases:
    If you omit Unit in your request, all data that was collected with any unit is returned, along with the corresponding units that were specified when the data was reported to CloudWatch. If you specify a unit, the operation returns only data that was collected with that unit specified. If you specify a unit that does not match the data collected, the results of the operation are null. CloudWatch does not perform unit conversions.
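To make these points concrete, here is a minimal sketch of such a request using the aws-sdk-go v1 CloudWatch client. The helper name buildMetricDataInput, the extraPeriods parameter, and the hard-coded ELB metric are only for illustration and are not part of the scaler code:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// buildMetricDataInput rounds the start and end times down to the metric period,
// widens the window backwards by a couple of extra periods so CloudWatch has a
// better chance of returning at least one settled data point, and asks for
// results newest-first so the caller can simply take the first value.
func buildMetricDataInput(period time.Duration, extraPeriods int) *cloudwatch.GetMetricDataInput {
	// Round the end time down to a period boundary, as suggested by the
	// GetMetricData documentation.
	endTime := time.Now().UTC().Truncate(period)
	// Extending the window backwards does not change the cost of the call,
	// but it makes an empty result far less likely.
	startTime := endTime.Add(-time.Duration(extraPeriods+1) * period)

	return &cloudwatch.GetMetricDataInput{
		StartTime: aws.Time(startTime),
		EndTime:   aws.Time(endTime),
		// Without ScanBy the ordering of returned data points is undefined;
		// TimestampDescending makes Values[0] the most recent one.
		ScanBy: aws.String(cloudwatch.ScanByTimestampDescending),
		MetricDataQueries: []*cloudwatch.MetricDataQuery{
			{
				Id: aws.String("c1"),
				MetricStat: &cloudwatch.MetricStat{
					Metric: &cloudwatch.Metric{
						Namespace:  aws.String("AWS/ApplicationELB"),
						MetricName: aws.String("RequestCountPerTarget"),
						Dimensions: []*cloudwatch.Dimension{{
							Name:  aws.String("TargetGroup"),
							Value: aws.String("targetgroup/my-targetgroup/123456"),
						}},
					},
					Period: aws.Int64(int64(period.Seconds())),
					Stat:   aws.String("Sum"),
					// If the metric was reported with an explicit unit, Unit must
					// match it, otherwise CloudWatch returns null results.
					Unit: aws.String(cloudwatch.StandardUnitCount),
				},
			},
		},
	}
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := cloudwatch.New(sess)

	out, err := client.GetMetricData(buildMetricDataInput(60*time.Second, 2))
	if err != nil {
		log.Fatal(err)
	}
	// With TimestampDescending the first value is the freshest; to skip a
	// possibly incomplete most-recent data point, take Values[1] instead.
	if len(out.MetricDataResults) > 0 && len(out.MetricDataResults[0].Values) > 0 {
		fmt.Println(*out.MetricDataResults[0].Values[0])
	}
}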

Expected Behavior

The CloudWatch scaler shall always be able to get a stable value back from AWS.

Actual Behavior

Due to the eventually consistent model of the CloudWatch API, the scaler might not be able to get a value.

Steps to Reproduce the Problem

  1. Choose a metric with 1-minute resolution, such as the ELB RequestCount metric. Example config:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: aws-cloudwatch-test-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: my-deployment
  pollingInterval: 60
  minReplicaCount: 4
  maxReplicaCount: 100
  triggers:
  - type: aws-cloudwatch
    metadata:
      identityOwner: operator
      namespace: AWS/ApplicationELB
      dimensionName: TargetGroup
      dimensionValue: targetgroup/my-targetgroup/123456
      metricName: RequestCountPerTarget
      targetMetricValue: "100"
      minMetricValue: "0"
      metricStat: "Sum"
      metricStatPeriod: "60"
      metricCollectionTime: "60"
      awsRegion: "us-west-2"
  2. Identify the metric from the metrics API server with the command kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq
  3. Use a loop to monitor the value from the metrics API server and compare it with the AWS dashboard:
while true; do  kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-aws-cloudwatch-AWS-ApplicationELB-TargetGroup-targetgroup-xxxxxx | jq .items; sleep 15; done

Sometimes it returns 0.

Logs from KEDA operator

example

KEDA Version

2.4.0

Kubernetes Version

No response

Platform

Amazon Web Services

Scaler Details

Cloudwatch

Anything else?

The CloudWatch client can also be cached. It has the same lifecycle as the scaler and can renew its token by itself, so we don't need to make two calls to create this client every time metrics data is pulled. This would improve performance and cost.

@fivesheep fivesheep added the bug Something isn't working label Nov 1, 2021
@fivesheep
Contributor Author

cc @zroubalik

@zroubalik
Member

zroubalik commented Nov 2, 2021

> The CloudWatch client can also be cached. It has the same lifecycle as the scaler and can renew its token by itself, so we don't need to make two calls to create this client every time metrics data is pulled. This would improve performance and cost.

#2187 should fix this problem, am I right @fivesheep?

@fivesheep
Contributor Author

@zroubalik Not exactly, they serve different purposes. What I mean is that we can move the following code from GetCloudwatchMetrics() to the NewAwsCloudwatchScaler function


func (c *awsCloudwatchScaler) GetCloudwatchMetrics() (float64, error) {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(c.metadata.awsRegion),
	}))

	var cloudwatchClient *cloudwatch.CloudWatch
	if c.metadata.awsAuthorization.podIdentityOwner {
		creds := credentials.NewStaticCredentials(c.metadata.awsAuthorization.awsAccessKeyID, c.metadata.awsAuthorization.awsSecretAccessKey, "")

		if c.metadata.awsAuthorization.awsRoleArn != "" {
			creds = stscreds.NewCredentials(sess, c.metadata.awsAuthorization.awsRoleArn)
		}

		cloudwatchClient = cloudwatch.New(sess, &aws.Config{
			Region:      aws.String(c.metadata.awsRegion),
			Credentials: creds,
		})
	} else {
		cloudwatchClient = cloudwatch.New(sess, &aws.Config{
			Region: aws.String(c.metadata.awsRegion),
		})
	}

and let awsCloudwatchScaler own a reference to the client and use it directly in the GetCloudwatchMetrics call, to avoid creating the CloudWatch client on every call. The client can be reused since it refreshes its token by itself and has the same lifecycle as awsCloudwatchScaler.
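Roughly, a minimal sketch of the idea (the credentials/role handling from the snippet above is left out, and the trimmed-down metadata struct is only for illustration, not the scaler's actual definition):

package scalers

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// Sketch only: type and field names mirror the scaler, but the metadata struct
// is reduced to what the example needs.
type awsCloudwatchMetadata struct {
	awsRegion string
}

type awsCloudwatchScaler struct {
	metadata         *awsCloudwatchMetadata
	cloudwatchClient *cloudwatch.CloudWatch // created once, reused on every poll
}

// NewAwsCloudwatchScaler builds the CloudWatch client a single time. The SDK
// client refreshes its credentials by itself, so it can safely live as long
// as the scaler does.
func NewAwsCloudwatchScaler(meta *awsCloudwatchMetadata) *awsCloudwatchScaler {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(meta.awsRegion),
	}))
	return &awsCloudwatchScaler{
		metadata:         meta,
		cloudwatchClient: cloudwatch.New(sess),
	}
}

func (c *awsCloudwatchScaler) GetCloudwatchMetrics() (float64, error) {
	// Reuse c.cloudwatchClient here instead of rebuilding the session and
	// client on every call.
	_ = c.cloudwatchClient
	return 0, nil
}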

I did not make that change in PR #2243, since I saw that some other places, such as the AWS SQS scaler, use the same pattern and will need to be improved in the same way.

@zroubalik
Member

zroubalik commented Nov 2, 2021

Yeah, now I see what you meant, you are right. Do you think you could add this change in a separate PR once #2243 is merged? It would be a nice improvement.

@zroubalik
Member

The AWS-related stuff hasn't been updated for a while, so it definitely deserves some improvements. Unfortunately there hasn't been anybody with enough experience so far.

@fivesheep
Contributor Author

@zroubalik Sure, I will send a separate PR after that.

@zroubalik
Member

@fivesheep I guess we can close this one, right? Or is there anything else to be done?

@fivesheep
Contributor Author

@zroubalik Yup, I think we can close this one now. thx!
