feat: Timestamp for Health Status (#16972) #18660

Open
wants to merge 14 commits into
base: master

@mkieweg mkieweg commented Jun 14, 2024

This PR contains an implementation of the enhancement proposal #16972.
Closes #16972

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@mkieweg mkieweg requested a review from a team as a code owner June 14, 2024 10:52
@mkieweg mkieweg marked this pull request as draft June 14, 2024 10:59
@mkieweg mkieweg marked this pull request as ready for review June 17, 2024 10:23
Member

@jannfis jannfis left a comment

Thanks! I have a couple of initial things, PTAL :)

Contributor

@svghadi svghadi left a comment

Thanks @mkieweg for the PR. I have left a suggestion. I think the section below also needs an update.

if orig.Status.Health.Status != newStatus.Health.Status {
	message := fmt.Sprintf("Updated health status: %s -> %s", orig.Status.Health.Status, newStatus.Health.Status)
	ctrl.logAppEvent(orig, argo.EventInfo{Reason: argo.EventReasonResourceUpdated, Type: v1.EventTypeNormal}, message, context.TODO())
}
var newAnnotations map[string]string
if orig.GetAnnotations() != nil {
	newAnnotations = make(map[string]string)
	for k, v := range orig.GetAnnotations() {
		newAnnotations[k] = v
	}
	delete(newAnnotations, appv1.AnnotationKeyRefresh)
}
patch, modified, err := diff.CreateTwoWayMergePatch(
	&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: orig.GetAnnotations()}, Status: orig.Status},
	&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: newAnnotations}, Status: *newStatus}, appv1.Application{})

It looks like CreateTwoWayMergePatch will always detect drift here because of the new timestamp field, and will try to patch the status even when the app health status has not actually changed. We might need to explicitly set the timestamp to the old/orig value if there is no change in app health status. Update: not required.

@mkieweg
Author

mkieweg commented Jun 19, 2024

Thanks for pointing that out, I'll give it a look

@svghadi
Contributor

svghadi commented Jun 19, 2024

Hi @mkieweg, I think you can ignore my previous suggestion regarding changes due to CreateTwoWayMergePatch in the persistAppStatus func. When I looked at the code again, I saw that the values passed to persistAppStatus are actually deep copies of each other

app := origApp.DeepCopy()

and will differ only when explicitly updated, which happens in two places. We need to ensure the timestamp is updated in these places (it looks like 2. is already handled in your change).

  1. app.Status.Health.Status = health.HealthStatusUnknown

  2. app.Status.Health = *compareResult.healthStatus

@mkieweg
Author

mkieweg commented Jun 19, 2024

> Hi @mkieweg, I think you can ignore my previous suggestion regarding changes due to CreateTwoWayMergePatch in persistAppStatus func. When I looked at the code again, I see the values passed to persistAppStatus func are actually deep copy of each other
>
> app := origApp.DeepCopy()
>
> and will differ only when explicitly updated, which happens at 2 places. We need to ensure timestamp is updated at these places (looks like 2. is already handled in your change).
>
> 1. https://github.com/argoproj/argo-cd/blob/029b5acd5462e153b72fe0f7b1c5fa38e311d05c/controller/appcontroller.go#L1545
>
> 2. https://github.com/argoproj/argo-cd/blob/029b5acd5462e153b72fe0f7b1c5fa38e311d05c/controller/appcontroller.go#L1634

Hi @svghadi, I now also set the timestamp when we set the health status to Unknown. The second case is indeed already handled by CompareAppState.

@mkieweg mkieweg requested review from jannfis and svghadi June 20, 2024 10:45
@jessesuen
Member

jessesuen commented Jun 21, 2024

We actually used to have a timestamp, but the updating of the timestamp as part of every app reconciliation would cause a ton of K8s API server pressure. I need to find the PR that removed this, but this could actually cause a performance problem.

@jessesuen
Member

I found the old issue. It is here: #1340

Basically, we used to update status.observedAt, but this update would cause immense pressure on K8s API.

@jessesuen
Member

> this could actually cause a performance problem.

I took a closer look at this implementation and noticed that it doesn't needlessly update timestamps (as we did with the problematic status.observedAt field):

			// If the status didn't change, we don't want to update the timestamp
			if healthStatus.Status == statuses[i].Health.Status {
				now = lastTransitionTime
			}

So my performance concerns may be alleviated.

@blakepettersson
Member

In the E2E test TestNamespacedConfigMap, the output has for some reason changed from "my-map Synced configmap/my-map created" to "my-map Synced Healthy configmap/my-map created".

Contributor

@svghadi svghadi left a comment

I ran some tests and found some corner cases that need to be handled:

  1. The timestamp for individual resources is set to the Application's LastTransitionTime even when the resource status has not changed.
  2. The timestamp for the Application's health is not updated when the status changes from a non-Healthy state to Healthy.

Comment on lines 66 to 71
// If the status didn't change, we don't want to update the timestamp
if healthStatus.Status == statuses[i].Health.Status {
now = lastTransitionTime
}

resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}

Contributor

This section sets the health of the application's individual resources. If there is no status change, I feel the timestamp should be set to the lastTransitionTime of that particular resource rather than to the application's lastTransitionTime, as the two values are not necessarily the same.

Suggested change
-// If the status didn't change, we don't want to update the timestamp
-if healthStatus.Status == statuses[i].Health.Status {
-	now = lastTransitionTime
-}
-resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}
+// If the status didn't change, we don't want to update the timestamp
+if healthStatus.Status == statuses[i].Health.Status {
+	now = statuses[i].Health.LastTransitionTime
+}
+resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}

Author

Let me have a look and get back to you on this. I vaguely remember that the current implementation only persists the timestamp at the application level. So while setting it at the resource level here might make sense, I'm not sure it has any benefit if the rest of the implementation remains unchanged, which is what I would advocate for at the moment.

Contributor

@svghadi svghadi Sep 3, 2024

I looked at the code. The resource-level timestamp is indeed difficult to achieve due to the current implementation.
Since the resource timestamp will contain the app health timestamp, I feel it may not be that useful. How about we skip it at the resource level? Just have it at the application health level, i.e. .status.health. If you change the timestamp to a pointer type, I think we can ignore it at resource level and it won't show up in .status.resources.

Comment on lines 89 to 95
if persistResourceHealth {
appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
} else {
appHealth.LastTransitionTime = now
}
} else if healthStatus.Status == health.HealthStatusHealthy {
appHealth.LastTransitionTime = lastTransitionTime

Contributor

I think we need to handle the case where the status transitions from non-Healthy to Healthy. Since we have access to the old status values, would it make sense to set the time value outside the for loop, after the new application status has been calculated? Something like:

diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
 
 		if health.IsWorse(appHealth.Status, healthStatus.Status) {
 			appHealth.Status = healthStatus.Status
-			if persistResourceHealth {
-				appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
-			} else {
-				appHealth.LastTransitionTime = now
-			}
-		} else if healthStatus.Status == health.HealthStatusHealthy {
-			appHealth.LastTransitionTime = lastTransitionTime
 		}
 	}
+	if persistResourceHealth {
+          if app.Status.Health.Status == appHealth.Status {
+		  appHealth.LastTransitionTime = lastTransitionTime
+          } else {
+                 appHealth.LastTransitionTime = metav1.Now()
+          }
+	}
 	if persistResourceHealth {
 		app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
 	} else {

Author

I've managed to repro the issue. Unfortunately, it persists with your suggestion. I'll dig deeper tomorrow.

Author

I've added another test to appcontroller_test.go where the issue you're describing doesn't happen. I think the code works as expected, but maybe you can have another look :)

Contributor

I ran a manual test by creating a pod with a delayed start. The time is updated for the Missing->Progressing state change, but it is not updated when the state moves from Progressing->Healthy.

timestamp-no-update.mov

Author

That's interesting. Let me have a closer look

Author

I've added some more extensive tests. They currently fail because I couldn't find a way to make them work with the way we're using mocks. Any suggestions on how to make TestUpdateHealthStatusProgression work are welcome.

Member

> I've managed to repro the issue. Unfortunately it persists with your suggestion.

Hmm, I haven't been able to repro this after applying @svghadi's suggestion.

kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:06Z","status":"Progressing"}
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:31Z","status":"Healthy"}

Member

@mkieweg I added some changes which should fix your tests

Member

@blakepettersson blakepettersson Jun 27, 2024

@hiddeco wrote in #16972 that this doesn't quite account for all edge cases.

> While I am super supportive of this being added, I have an edge case that I think falls into this category but is not covered by just this addition.
>
> Within Argo CD, the sync operations are completely separated from the application health. However, in certain scenarios, you do want to know if the current state of the health check is the outcome of a previous sync operation.
>
> With the proposal, this would become possible by comparing the .finishedAt from the .status.operationState against the .lastUpdateTime. This, however, only works when you sync workload resources, which means that it will not help in scenarios where you only have non-workload resources, which will always end up Healthy without causing a transition.
> ...
> I wonder if it would be an idea to introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.

By modifying @svghadi's suggestion slightly I think that would be achievable by doing this (haven't tested this so caveat emptor):

diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
 
 		if health.IsWorse(appHealth.Status, healthStatus.Status) {
 			appHealth.Status = healthStatus.Status
-			if persistResourceHealth {
-				appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
-			} else {
-				appHealth.LastTransitionTime = now
-			}
-		} else if healthStatus.Status == health.HealthStatusHealthy {
-			appHealth.LastTransitionTime = lastTransitionTime
 		}
 	}

 	if persistResourceHealth {
+          if app.Status.Health.Status == appHealth.Status && time.Since(lastTransitionTime.Time) < 30*time.Second {
+		  appHealth.LastTransitionTime = lastTransitionTime
+          } else {
+                 appHealth.LastTransitionTime = metav1.Now()
+          }
 		app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
 	} else {

Copy link

My suggestion however would be to have the heartbeat timestamp as a separate field (as e.g. "last observed" or "heartbeat"), as otherwise it could create a false impression that a transition did take place.

@svghadi
Contributor

svghadi commented Sep 3, 2024

Hi @mkieweg, did you get a chance to work on this again? If you're busy with other tasks, I'm happy to jump in and continue with it.

@mkieweg
Author

mkieweg commented Sep 3, 2024

Hi @svghadi, sorry I had to pause working on this due to shifting priorities. I was planning on picking this back up this week. I'm not sure how much time I can actually commit, so if you're happy to take over that would be okay with me

@blakepettersson
Member

@svghadi I can also help out if needed

@svghadi
Contributor

svghadi commented Sep 3, 2024

@mkieweg - Sure. Could I get collaborator access to your argo-cd fork, or should I cherry-pick the changes and continue in a new PR?

@blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR?

@mkieweg
Author

mkieweg commented Sep 3, 2024

@svghadi I've added you as a contributor to my fork. Can do the same for @blakepettersson if needed

@blakepettersson
Member

> @blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR?

Ideally I'd like to see his use case addressed in this PR as well, since we have a particular use case in mind for it (it's related to Kargo).

@blakepettersson
Member

> Can do the same for @blakepettersson if needed

Sounds good to me, thanks!

@svghadi
Contributor

svghadi commented Sep 4, 2024

Hi @blakepettersson, I was thinking about how to incorporate @hiddeco's use case into this PR. It seems a bit challenging without introducing a new field, as @mkieweg's use case requires a timestamp that doesn't update periodically.

> I propose to add a lastUpdateTime timestamp to app.status.health indicating how long the current status is active for.

In contrast, @hiddeco's use case needs a periodically updating status timestamp.

> introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.

Here are my thoughts:

  • Introduce a new observedAt field under application health, which updates periodically to indicate freshness.
  • Make this configurable to address @jessesuen's concern about performance impact.

The health status would look something like this:

status:
  health:
    lastTransitionTime: "2024-09-03T08:51:32Z"
    observedAt: "2024-09-03T08:51:32Z"   # configurable 
    status: Progressing

Thoughts/suggestions?

@blakepettersson
Member

blakepettersson commented Sep 4, 2024

Hi @svghadi,

I agree that we'll need two separate fields, one for the health status and one to indicate "freshness".

> Introduce a new observedAt field under application health, which updates periodically to indicate freshness.

👍

> Make this configurable to address @jessesuen's concern about performance impact.

I think it's fine for now to keep it at max 30 secs - if there is a need to configure this at a later stage we can address that in another PR.


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@f4c519a). Learn more about missing BASE report.
Report is 13 commits behind head on master.

Files with missing lines       Patch %   Lines
controller/appcontroller.go    88.23%    1 Missing and 1 partial ⚠️
controller/state.go            66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18660   +/-   ##
=========================================
  Coverage          ?   55.86%           
=========================================
  Files             ?      320           
  Lines             ?    44422           
  Branches          ?        0           
=========================================
  Hits              ?    24815           
  Misses            ?    17042           
  Partials          ?     2565           

@svghadi
Contributor

svghadi commented Sep 12, 2024

Hi @blakepettersson, the PR is ready for review.

Changes:

  • Adds a new .status.health.lastTransitionTime field that reflects when the last health status change occurred.
  • Adds a new .status.health.observedAt field that is periodically updated (every 30s) to reflect the freshness of the observed state.
  • A new event handler is set up to update the observedAt field every 30s. It enqueues events into the appRefreshQueue with a comparison option that ensures only the status is refreshed, without performing a resource-level comparison. If there are concerns about overloading the queue, we can set up a separate queue to process these events.
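
The periodic enqueueing described in the last bullet could be sketched roughly as follows. This is an assumption-laden illustration, not the PR's handler: a buffered channel stands in for the appRefreshQueue, and `refreshRequest`, `enqueueStatusRefresh`, and `runHeartbeat` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// refreshRequest is a hypothetical queue item; statusOnly marks requests that
// should refresh status fields without a resource-level comparison.
type refreshRequest struct {
	appKey     string
	statusOnly bool
}

// enqueueStatusRefresh adds one status-only request per application and
// reports how many were queued. A full queue drops the request instead of
// blocking, since the next tick will try again.
func enqueueStatusRefresh(apps []string, queue chan<- refreshRequest) int {
	queued := 0
	for _, key := range apps {
		select {
		case queue <- refreshRequest{appKey: key, statusOnly: true}:
			queued++
		default:
		}
	}
	return queued
}

// runHeartbeat enqueues a refresh for every app on each tick until ctx ends.
func runHeartbeat(ctx context.Context, interval time.Duration, apps []string, queue chan<- refreshRequest) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			enqueueStatusRefresh(apps, queue)
		}
	}
}

func main() {
	queue := make(chan refreshRequest, 8)
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()

	// 50ms stands in for the 30s interval so the example finishes quickly.
	go runHeartbeat(ctx, 50*time.Millisecond, []string{"guestbook"}, queue)
	<-ctx.Done()
	fmt.Println("queued requests:", len(queue))
}
```

Dropping requests when the queue is full is one possible way to keep the heartbeat from piling up behind slower work; a dedicated queue, as mentioned above, would be another.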

The screen grab below shows that observedAt is updated every 30s without affecting the lastTransitionTime field.

Screen.Recording.2024-09-12.at.6.53.06.PM.mov

Member

@blakepettersson blakepettersson left a comment

LGTM (pending the minor lint fixes), thanks @svghadi!

mkieweg and others added 14 commits September 19, 2024 15:31
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com>
Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Due to implementation limitations, setting LastTransitionTime at the resource level is challenging.
Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing
in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known
use case right now.

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Projects
Status: Ready for final review
Development

Successfully merging this pull request may close these issues.

Add Timestamps to Health Status
6 participants