Timeout for sync operations #6055

Open
dominykas opened this issue Apr 19, 2021 · 18 comments

@dominykas

dominykas commented Apr 19, 2021

Summary

At the moment, if for whatever reason the sync process gets stuck (e.g. because some resource fails to start up properly and keeps on retrying), the sync will never complete and will keep on "Syncing".

There should be an option to add a timeout, after which the sync process would terminate. Depending on selfHeal rules, etc, there may be a need to automatically retry, or alternatively, the application should just stay in the failed state until manually resolved.

I did my best to search for similar requests; aside from a brief note in #1886, I couldn't find anything. Sorry if I missed it.

Motivation

At the moment, we've set up alerting for sync operations that are taking too long, which at least notifies someone to look at things and usually means a manual intervention.

When an application is in a "Syncing" state, manual intervention becomes rather tricky: one cannot delete resources to get them recreated (especially when things are stuck in some sync wave), perform a partial sync, and so on.

Moreover, simply hitting "Terminate" is not always sufficient if the application has autosync enabled, as it would just retry, putting it into a forever "Syncing" state. Disabling autosync in some cases might also be problematic and require multiple steps, because it might be set from a parent application - which means that the parent application autosync also needs to be disabled (so that it does not just resync and re-enable the autosync).

Proposal

syncPolicy:
    syncTimeout: 600 # seconds, default: unlimited
    onSyncTimeout: "fail" # or "retry" (?), or "waitForUpdate" (?)

Some of the things that might need consideration:

  • Should selfHeal just retry? Or should that be configurable? The previous sync might not have completed in full, so hooks/postsync actions might not have executed.
  • Should new commits result in a new sync operation? Same as above, essentially. Arguably, new commits could be the fix.
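
One way to read the proposal, as a minimal sketch and not anything Argo CD implements: the field names `syncTimeout` and `onSyncTimeout` and the values `fail`, `retry` and `waitForUpdate` come from the YAML above, while the function `timeout_action` is hypothetical. It assumes a missing `syncTimeout` means unlimited and that a missing `onSyncTimeout` falls back to `fail`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values proposed in the issue for onSyncTimeout; none of them exist in Argo CD.
ON_TIMEOUT_VALUES = {"fail", "retry", "waitForUpdate"}


def timeout_action(sync_policy: dict, started_at: datetime, now: datetime) -> Optional[str]:
    """Return the proposed onSyncTimeout value if the sync has timed out, else None."""
    timeout_seconds = sync_policy.get("syncTimeout")
    if timeout_seconds is None:
        # Proposed default: unlimited, so the sync never times out.
        return None
    if now - started_at <= timedelta(seconds=timeout_seconds):
        return None
    # Assumption: "fail" is the fallback when onSyncTimeout is not set.
    action = sync_policy.get("onSyncTimeout", "fail")
    if action not in ON_TIMEOUT_VALUES:
        raise ValueError(f"unknown onSyncTimeout value: {action!r}")
    return action


started = datetime(2021, 4, 19, 12, 0, tzinfo=timezone.utc)
policy = {"syncTimeout": 600, "onSyncTimeout": "fail"}
print(timeout_action(policy, started, started + timedelta(seconds=601)))  # fail
print(timeout_action(policy, started, started + timedelta(seconds=30)))   # None
print(timeout_action({}, started, started + timedelta(days=1)))           # None
```

What happens after the returned action (whether selfHeal or a new commit may start another sync) is exactly the open question in the list above, so the sketch stops at the decision.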

@dominykas added the enhancement (New feature or request) label Apr 19, 2021

@RaviHari

I would like to work on this issue.

@hanzala1234
Contributor

Is there any update on that?

@prima101112

@RaviHari is there any update on this? I've been hitting this issue because a pre-hook failed and the app is now locked in an always-syncing state.

@RaviHari

@prima101112 and @hanzala1234 sorry for the delay. I will get started on this and keep you posted in this thread.

@LS80

LS80 commented Jun 9, 2022

@RaviHari Did you get round to starting on this?

@grezar

grezar commented Jul 29, 2022

+1

@yabeenico

+1

@crenshaw-dev
Member

crenshaw-dev commented Aug 12, 2022

> Moreover, simply hitting "Terminate" is not always sufficient

I've also seen "Terminate" simply cause the sync operation to get stuck in "Terminating." This was in an app with ~1k resources.

If Ravi or anyone else puts up a PR, I'd be happy to review.

@pritam-acquia

+1

@mhonorio

Looking forward to this feature too. I have a lot of applications getting stuck, and a timeout would be great so that they don't block other, unrelated resources.

@neiser

neiser commented Jan 3, 2023

It seems like @RaviHari has lost interest in this, at least he stopped responding. We'd still appreciate that feature very much (we're using the app-of-apps pattern and sometimes it just gets stuck, and a timeout would really help). Any chance someone else can implement this?

@Sayrus

Sayrus commented Feb 21, 2023

To work around sync being stuck due to hooks or operations taking too long, I've implemented the following:

Sayrus@817bc34

It's equivalent to clicking Terminate after reaching the timeout. The operation ends up as Sync Failed, which blocks self-healing from auto-syncing the application (Skipping auto-sync: failed previous sync attempt to xxxx). This is probably not the best way to do it, but it works.

@LS80

LS80 commented Oct 9, 2023

Another way to work around it is to run the following as a CronJob.

from datetime import datetime, timedelta
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')

try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

config.load_incluster_config()

api = client.CustomObjectsApi()

apps = api.list_namespaced_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    namespace='argocd',
    plural='applications'
)['items']

syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']

def apps_to_timeout():
    now = datetime.utcnow()
    logging.debug(f"Time now {now.isoformat()}")

    for app in syncing_apps:
        app_name = app['metadata']['name']
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'].removesuffix('Z'))
        logging.debug(f"App '{app_name}' started syncing at {sync_started.isoformat()}")

        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_name

apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

session = requests.session()
session.cookies.set('argocd.token', argocd_token)

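for app_name in apps:
    # DELETE on the operation endpoint terminates the app's running sync.
    response = session.delete(f"https://{argocd_server}/api/v1/applications/{app_name}/operation")
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_name}'")
    else:
        logging.error(f"[{response.status_code}] Failed to terminate sync operation for '{app_name}'")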
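
The selection step of the script can be checked without a cluster by pulling it into a pure function. This is a minimal sketch, assuming the same `status.operationState` fields the script reads (`phase`, `startedAt`). The function name `timed_out_apps` and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def timed_out_apps(apps, now, timeout):
    """Return names of apps whose running operation started more than `timeout` ago."""
    names = []
    for app in apps:
        state = app.get('status', {}).get('operationState', {})
        if state.get('phase') != 'Running':
            continue
        # Drop the trailing 'Z' so fromisoformat yields a naive UTC datetime.
        started = datetime.fromisoformat(state['startedAt'].removesuffix('Z'))
        if now - started > timeout:
            names.append(app['metadata']['name'])
    return names


# Hypothetical Application objects, reduced to the fields used above.
sample_apps = [
    {'metadata': {'name': 'stuck'},
     'status': {'operationState': {'phase': 'Running', 'startedAt': '2023-10-09T08:00:00Z'}}},
    {'metadata': {'name': 'recent'},
     'status': {'operationState': {'phase': 'Running', 'startedAt': '2023-10-09T09:50:00Z'}}},
    {'metadata': {'name': 'done'},
     'status': {'operationState': {'phase': 'Succeeded', 'startedAt': '2023-10-09T07:00:00Z'}}},
    {'metadata': {'name': 'never-synced'}},
]

now = datetime(2023, 10, 9, 10, 0, 0)  # naive, interpreted as UTC
print(timed_out_apps(sample_apps, now, timedelta(minutes=60)))  # ['stuck']
```

Passing `now` in as an argument keeps the comparison deterministic, which the script's inline clock call does not allow.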

@aslafy-z
Contributor

aslafy-z commented Oct 9, 2023

@alexec would you mind taking a look at #15603?

@travis-jorge

Has there been any progress on implementing this? We have this issue daily.

@riuvshyn

riuvshyn commented Oct 4, 2024

Same here. We are using an external cron job to detect and terminate "stuck" syncs, which is very inconvenient and painful to maintain. We have been waiting for this for years already 🙏🏽

@jessebye

jessebye commented Oct 7, 2024

@riuvshyn could you share the cronjob? 🙏 We could really use that while waiting for this feature to get implemented.

@LS80

LS80 commented Oct 8, 2024

We currently have this as a CronJob.

from datetime import datetime, timedelta, UTC
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')
logging.getLogger("kubernetes.client.rest").setLevel(os.environ.get("KUBE_LOG_LEVEL", "info").upper())

try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

try:
    config.load_incluster_config()
except config.config_exception.ConfigException:
    try:
        config.load_kube_config()
    except config.config_exception.ConfigException:
        raise Exception("Could not configure kubernetes client.")

api = client.CustomObjectsApi()

apps = api.list_cluster_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    plural='applications'
)['items']

syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']

def apps_to_timeout():
    now = datetime.now(UTC)
    logging.debug(f"Time now {now.isoformat()}")

    for app in syncing_apps:
        app_name = app['metadata']['name']
        app_namespace = app['metadata']['namespace']
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'])
        logging.debug(f"App '{app_namespace}/{app_name}' started syncing at {sync_started.isoformat()}")

        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_namespace, app_name

apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

session = requests.session()
session.cookies.set('argocd.token', argocd_token)
session.headers.update({'Content-Type': 'application/json'})

responses = []
for app_namespace, app_name in apps:
    logging.debug(f"Terminating sync operation for '{app_namespace}/{app_name}'")
    response = session.delete(
        f"https://{argocd_server}/api/v1/applications/{app_name}/operation",
        params={'appNamespace': app_namespace}
    )
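    logging.debug(f"[{response.status_code}] {response.text}")
    # Only report success when the API accepted the termination request.
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_namespace}/{app_name}'")
    else:
        logging.error(f"Failed to terminate sync operation for '{app_namespace}/{app_name}'")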
    responses.append(response)

if not all(response.ok for response in responses):
    logging.error("Some sync operations failed to terminate")
    sys.exit(1)
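
The script above imports `datetime.UTC` and passes the `startedAt` value straight to `datetime.fromisoformat`; both `UTC` and parsing of a trailing `Z` need Python 3.11 or later. On an older interpreter, a sketch like the following could stand in. `parse_started_at` is a hypothetical helper name, and it assumes `startedAt` is a UTC timestamp ending in `Z`, as the earlier script's `removesuffix('Z')` suggests.

```python
from datetime import datetime, timedelta, timezone


def parse_started_at(value: str) -> datetime:
    """Parse a UTC timestamp such as '2024-10-08T09:00:00Z' into an aware datetime."""
    if value.endswith('Z'):
        # Before Python 3.11, fromisoformat rejects 'Z'; an explicit offset is accepted.
        value = value[:-1] + '+00:00'
    return datetime.fromisoformat(value)


started = parse_started_at('2024-10-08T09:00:00Z')
now = datetime(2024, 10, 8, 10, 30, tzinfo=timezone.utc)  # timezone.utc predates datetime.UTC
print(started.isoformat())                     # 2024-10-08T09:00:00+00:00
print(now - started > timedelta(minutes=60))   # True
```

Both datetimes are timezone-aware here, so the subtraction is valid; mixing an aware `now` with a naive `started` would raise a `TypeError`.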
