feat: Timestamp for Health Status (#16972) #18660

mkieweg · 2024-06-14T10:52:14Z

This PR contains an implementation of the enhancement proposal #16972.
Closes #16972

jannfis

Thanks! I have a couple of initial things, PTAL :)

controller/health.go

svghadi

Thanks @mkieweg for the PR. I have left a suggestion. I think below section also needs an update.

argo-cd/controller/appcontroller.go

Lines 1777 to 1791 in 029b5ac

	if orig.Status.Health.Status != newStatus.Health.Status {
		message := fmt.Sprintf("Updated health status: %s -> %s", orig.Status.Health.Status, newStatus.Health.Status)
		ctrl.logAppEvent(orig, argo.EventInfo{Reason: argo.EventReasonResourceUpdated, Type: v1.EventTypeNormal}, message, context.TODO())
	}
	var newAnnotations map[string]string
	if orig.GetAnnotations() != nil {
		newAnnotations = make(map[string]string)
		for k, v := range orig.GetAnnotations() {
			newAnnotations[k] = v
		}
		delete(newAnnotations, appv1.AnnotationKeyRefresh)
	}
	patch, modified, err := diff.CreateTwoWayMergePatch(
		&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: orig.GetAnnotations()}, Status: orig.Status},
		&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: newAnnotations}, Status: *newStatus}, appv1.Application{})

It looks like CreateTwoWayMergePatch will always detect drift here because of the new timestamp field and will try to patch the status even when the app health status has not actually changed. We might need to explicitly set the timestamp to the old/orig value if there is no change in app health status. Update: not required.

controller/health.go

mkieweg · 2024-06-19T10:29:06Z

Thanks for pointing that out, I'll give it a look

svghadi · 2024-06-19T13:44:43Z

Hi @mkieweg, I think you can ignore my previous suggestion regarding changes due to CreateTwoWayMergePatch in the persistAppStatus func. When I looked at the code again, I saw that the values passed to persistAppStatus are actually deep copies of each other

argo-cd/controller/appcontroller.go

Line 1504 in 029b5ac

app := origApp.DeepCopy()

and will differ only when explicitly updated, which happens at 2 places. We need to ensure timestamp is updated at these places (looks like 2. is already handled in your change).

1. argo-cd/controller/appcontroller.go, line 1545 in 029b5ac

	app.Status.Health.Status = health.HealthStatusUnknown

2. argo-cd/controller/appcontroller.go, line 1634 in 029b5ac

	app.Status.Health = *compareResult.healthStatus

mkieweg · 2024-06-19T17:02:00Z

Hi @svghadi, I've added setting the timestamp in case we set the health status to unknown. The second one is indeed already handled by CompareAppState

jessesuen · 2024-06-21T18:41:47Z

We actually used to have a timestamp, but the updating of the timestamp as part of every app reconciliation would cause a ton of K8s API server pressure. I need to find the PR that removed this, but this could actually cause a performance problem.

jessesuen · 2024-06-21T20:14:31Z

I found the old issue. It is here: #1340

Basically, we used to update status.observedAt, but this update would cause immense pressure on K8s API.

jessesuen · 2024-06-21T21:59:01Z

this could actually cause a performance problem.

I took a closer look at this implementation, and noticed that we don't needlessly update timestamps (as we did with the problematic status.observedAt field)

			// If the status didn't change, we don't want to update the timestamp
			if healthStatus.Status == statuses[i].Health.Status {
				now = lastTransitionTime
			}

So my performance concerns may be alleviated.

blakepettersson · 2024-06-21T22:37:38Z

In the E2E test TestNamespacedConfigMap the output for some reason has changed from my-map Synced configmap/my-map created to my-map Synced Healthy configmap/my-map created.

svghadi

I ran some tests and found some corner cases that need to be handled:

Timestamp for individual resources is updated to Application's LastTransitionTime even when there is no change in the resource status.
The timestamp for Application's health is not updated when the status changes from Non Healthy to Healthy state.

svghadi · 2024-06-22T09:34:29Z

controller/health.go

+			// If the status didn't change, we don't want to update the timestamp
+			if healthStatus.Status == statuses[i].Health.Status {
+				now = lastTransitionTime
+			}
+
+			resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}


This section sets health of individual resources of the application. If there is no status change, I feel it should be set to lastTransitionTime of that particular resource instead of setting it to application's lastTransitionTime, as the values may not necessarily be the same.

Suggested change

-			// If the status didn't change, we don't want to update the timestamp
-			if healthStatus.Status == statuses[i].Health.Status {
-				now = lastTransitionTime
-			}
-			resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}
+			// If the status didn't change, we don't want to update the timestamp
+			if healthStatus.Status == statuses[i].Health.Status {
+				now = statuses[i].Health.LastTransitionTime
+			}
+			resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now}

Let me have a look and get back to you on this. I vaguely remember that the current implementation only persists the timestamp at the application level. So while setting it at the resource level here might make sense, I'm not sure it has any benefit if the rest of the implementation remains unchanged, which is what I would advocate for at the moment.

I looked at the code. The resource-level timestamp is indeed difficult to achieve due to the current implementation.
Since the resource timestamp will contain the app health timestamp, I feel it may not be that useful. How about we skip it at the resource level? Just have it at the application health level, i.e. .status.health. If you change the timestamp to a pointer type, I think we can ignore it at resource level and it won't show up in .status.resources.

svghadi · 2024-06-22T09:56:22Z

controller/health.go

+			if persistResourceHealth {
+				appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
+			} else {
+				appHealth.LastTransitionTime = now
+			}
+		} else if healthStatus.Status == health.HealthStatusHealthy {
+			appHealth.LastTransitionTime = lastTransitionTime


I think we need to handle the case where the status transitions from Non Healthy -> Healthy. Since we have access to the old status values, would it make sense to set the time value outside the for loop, after the new status of the application is calculated? Something like:

diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
 			if health.IsWorse(appHealth.Status, healthStatus.Status) {
 				appHealth.Status = healthStatus.Status
-				if persistResourceHealth {
-					appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
-				} else {
-					appHealth.LastTransitionTime = now
-				}
-			} else if healthStatus.Status == health.HealthStatusHealthy {
-				appHealth.LastTransitionTime = lastTransitionTime
 			}
 		}
+	if persistResourceHealth {
+		if app.Status.Health.Status == appHealth.Status {
+			appHealth.LastTransitionTime = lastTransitionTime
+		} else {
+			appHealth.LastTransitionTime = metav1.Now()
+		}
+	}
 	if persistResourceHealth {
 		app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
 	} else {

I've managed to repro the issue. Unfortunately it persists with your suggestion. I'll look into it deeper tomorrow

I've added another test to appcontroller_test.go where the issue you're describing doesn't happen. I think the code works as expected, but maybe you can have another look :)

I ran a manual test by creating a pod with delayed start. The time is updated for Missing->Progressing state change but it doesn't get updated when state moves from Progressing->Healthy.

timestamp-no-update.mov

That's interesting. Let me have a closer look

I've added some more extensive testing. Currently they break because I couldn't find a way to make them work with the way we're using mocks. Any suggestions on how to make TestUpdateHealthStatusProgression work are welcome

I've managed to repro the issue. Unfortunately it persists with your suggestion.

Hmm, I haven't been able to repro this after applying @svghadi's suggestion.

kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:06Z","status":"Progressing"}
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:31Z","status":"Healthy"}

@mkieweg I added some changes which should fix your tests

@hiddeco wrote in #16972 that this doesn't quite account for all edge cases.

While I am super supportive of this being added, I have an edge case that I think falls into this category but is not covered by just this addition.

Within Argo CD, the sync operations are completely separated from the application health. However, in certain scenarios, you do want to know if the current state of the health check is the outcome of a previous sync operation.

With the proposal, this would become possible by comparing the .finishedAt from the .status.operationState against the .lastUpdateTime. This, however, only works when you sync workload resources, which means that it will not help in scenarios where you only have non-workload resources, which will always end up Healthy without causing a transition.
...
I wonder if it would be an idea to introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.

By modifying @svghadi's suggestion slightly I think that would be achievable by doing this (haven't tested this so caveat emptor):

diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
 			if health.IsWorse(appHealth.Status, healthStatus.Status) {
 				appHealth.Status = healthStatus.Status
-				if persistResourceHealth {
-					appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
-				} else {
-					appHealth.LastTransitionTime = now
-				}
-			} else if healthStatus.Status == health.HealthStatusHealthy {
-				appHealth.LastTransitionTime = lastTransitionTime
 			}
 		}
 	if persistResourceHealth {
+		if app.Status.Health.Status == appHealth.Status && time.Since(lastTransitionTime.Time) < 30*time.Second {
+			appHealth.LastTransitionTime = lastTransitionTime
+		} else {
+			appHealth.LastTransitionTime = metav1.Now()
+		}
 		app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
 	} else {

My suggestion however would be to have the heartbeat timestamp as a separate field (as e.g. "last observed" or "heartbeat"), as otherwise it could create a false impression that a transition did take place.

controller/appcontroller_test.go

svghadi · 2024-09-03T09:15:24Z

Hi @mkieweg, did you get a chance to work on this again? If you're busy with other tasks, I'm happy to jump in and continue with it.

mkieweg · 2024-09-03T11:10:59Z

Hi @svghadi, sorry I had to pause working on this due to shifting priorities. I was planning on picking this back up this week. I'm not sure how much time I can actually commit, so if you're happy to take over that would be okay with me

blakepettersson · 2024-09-03T11:33:57Z

@svghadi I can also help out if needed

svghadi · 2024-09-03T12:35:28Z

@mkieweg - Sure. Could I get collaborator access to your argo-cd fork, or should I cherry-pick the changes and continue in a new PR?

@blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR?

mkieweg · 2024-09-03T14:32:37Z

@svghadi I've added you as a contributor to my fork. Can do the same for @blakepettersson if needed

blakepettersson · 2024-09-03T14:55:30Z

@blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR?

Ideally I'd like to see his use case addressed in this PR as well, since we have a particular use case in mind for it (it's related to Kargo).

blakepettersson · 2024-09-03T14:55:45Z

Can do the same for @blakepettersson if needed

Sounds good to me, thanks!

svghadi · 2024-09-04T05:47:53Z

Hi @blakepettersson , I was thinking about how to incorporate @hiddeco 's use case into this PR. It seems a bit challenging without introducing a new field, as @mkieweg's use case requires a timestamp that doesn't update periodically.

I propose to add a lastUpdateTime timestamp to app.status.health indicating how long the current status is active for.

In contrast, @hiddeco 's use case needs a periodically updating status timestamp.

introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.

Here are my thoughts:

Introduce a new observedAt field under application health, which updates periodically to indicate freshness.
Make this configurable to address @jessesuen 's concern about performance impact.

The health status would look something like this:

status:
  health:
    lastTransitionTime: "2024-09-03T08:51:32Z"
    observedAt: "2024-09-03T08:51:32Z"   # configurable 
    status: Progressing

Thoughts/suggestions?

blakepettersson · 2024-09-04T08:53:28Z

Hi @svghadi,

I agree that we'll need two separate fields, one for the health status and one to indicate "freshness".

Introduce a new observedAt field under application health, which updates periodically to indicate freshness.

👍

Make this configurable to address @jessesuen 's concern about performance impact.

I think it's fine for now to keep it at max 30 secs - if there is a need to configure this at a later stage we can address that in another PR.

codecov · 2024-09-09T12:31:21Z

Codecov Report

Attention: Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@f4c519a). Learn more about missing BASE report.
Report is 13 commits behind head on master.

Files with missing lines	Patch %	Lines
controller/appcontroller.go	88.23%	1 Missing and 1 partial ⚠️
controller/state.go	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18660   +/-   ##
=========================================
  Coverage          ?   55.86%           
=========================================
  Files             ?      320           
  Lines             ?    44422           
  Branches          ?        0           
=========================================
  Hits              ?    24815           
  Misses            ?    17042           
  Partials          ?     2565


svghadi · 2024-09-12T13:46:59Z

Hi @blakepettersson, the PR is ready for review.

Changes:

Adds a new .status.health.lastTransitionTime field that reflects when the last health status change occurred.
Adds a new .status.health.observedAt field that is periodically updated (every 30s) to reflect the freshness of the observed state.
A new event handler is set up to update the observedAt field every 30s. It enqueues events into the appRefreshQueue with a comparison option that ensures only the status is refreshed, without performing a resource-level comparison. If there are concerns about overloading the queue, we can set up a separate queue to process these events.

The screen grab below shows that observedAt is updated every 30s without affecting the lastTransitionTime field.

Screen.Recording.2024-09-12.at.6.53.06.PM.mov

blakepettersson

LGTM (pending the minor lint fixes), thanks @svghadi!

Due to implementation limitations, setting LastTransitionTime at the resource level is challenging. Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known use case right now.

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

mkieweg requested a review from a team as a code owner June 14, 2024 10:52

mkieweg mentioned this pull request Jun 14, 2024

feat: Timestamp for Health Status (#16972) #17263

Closed

14 tasks

mkieweg marked this pull request as draft June 14, 2024 10:59

mkieweg force-pushed the status-transition-time branch from 11944ca to 6d65cda Compare June 14, 2024 13:33

mkieweg marked this pull request as ready for review June 17, 2024 10:23

jannfis reviewed Jun 18, 2024

svghadi reviewed Jun 18, 2024

mkieweg requested review from jannfis and svghadi June 20, 2024 10:45

svghadi reviewed Jun 22, 2024

mkieweg force-pushed the status-transition-time branch from 8d4030b to 1be8574 Compare June 25, 2024 10:18

hiddeco mentioned this pull request Jun 25, 2024

Add Timestamps to Health Status #16972

Open

blakepettersson reviewed Jun 27, 2024

svghadi force-pushed the status-transition-time branch from e62a31f to 1036899 Compare September 9, 2024 11:36

svghadi force-pushed the status-transition-time branch from 1036899 to f83469b Compare September 12, 2024 13:24

blakepettersson approved these changes Sep 19, 2024

blakepettersson requested a review from jessesuen September 19, 2024 09:45

mkieweg and others added 14 commits September 19, 2024 15:31

add lastTransitionTime to health status

f5ba87a

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

address first feedback

708710b

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

set transition time if health status is unknown

6a0275d

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

extend health improvement tests

4138bd5

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

add apoplication controller test

faf2e0a

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

use require for NoError

dfc3ab2

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

more extensive tests for health state changes

70d7e62

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

Apply suggestions from code review

5a66f83

Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com> Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>

Code review suggestions

7c1a328

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

remove obsolete assert

822ef3a

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

remove obsolete assert

036aa96

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

Add observedAt field in health status

2ad33c4

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

Fix ci lint

bfe15ee

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

svghadi force-pushed the status-transition-time branch from f83469b to bfe15ee Compare September 19, 2024 10:07

blakepettersson requested a review from ishitasequeira September 20, 2024 09:57

blakepettersson added the ready-for-review label Sep 20, 2024
