Celery/SQS micro-optimizations for performance improvement #1290
Conversation
note: this still keeps email MFA in the Celery worker task for speediness
Could you share more details on why you think it's worth tweaking polling_interval, acks_late and concurrency?
Yes, of course.
Poll Interval:
Concurrency:
Ack Late: from the Celery docs (https://docs.celeryproject.org/en/stable/userguide/tasks.html): "The acks_late setting would be used when you need the task to be executed again if the worker (for some reason) crashes mid-execution. It's important to note that the worker isn't known to crash, and if it does it's usually an unrecoverable error that requires human intervention (bug in the worker, or task code)."
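To make the trade-off concrete, here is a minimal sketch of the setting under discussion (task_acks_late is Celery's documented name for it; the comments paraphrase the quoted docs, nothing here is from this PR's diff):

```python
# With late acks, the broker message (here, the SQS message) is acknowledged
# only *after* the task finishes. If the worker crashes mid-execution, the
# message is redelivered and the task runs again instead of being lost.
task_acks_late = True

# Celery's default is early acknowledgement: the message is acked as soon as
# it is received. This avoids duplicate executions when workers don't crash.
celery_default_acks_late = False
```

The quoted docs argue the default is usually fine because worker crashes are rare and tend to need human intervention anyway.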
@jimleroyer @Moro-Code we've been doing a lot of work around this lately - is this still relevant?
@sastels Yes, I'd say we should keep this. It seems like the tweaking can squeeze out more performance. We could test it thoroughly with our new load tests (which we didn't yet have when this PR was created).
@jimleroyer can we close this?
Let's keep this please; Quan made good micro-optimizations here that can help us, but we never merged them in.
Summary | Résumé
Trello
feat: moving email to another celery worker instance
note: MFA emails still go through the normal Celery worker
fix: updated the Celery SQS polling interval to 0.5 seconds
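The 0.5-second polling change corresponds to the SQS transport option that Celery passes through to Kombu; a sketch of the relevant configuration (polling_interval is the documented option name, and the alternative values are the ones tried in this PR):

```python
# Broker transport options handed to the SQS transport via Celery config.
# polling_interval controls how often (in seconds) the transport polls SQS
# for new messages; lower values reduce pickup latency at the cost of more
# (empty) receive calls.
broker_transport_options = {
    "polling_interval": 0.5,  # values tested in this PR: 1, 0.5 and 0.3
}
```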
Test instructions | Instructions pour tester la modification
Testing setup:
Tested different settings locally while monitoring SQS queues on staging/AWS
Tested with --pool=gevent, prefetch-multiplier [10, 100, 124], polling_interval [1, 0.5, 0.3], and acks_late
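The combinations above can be sketched as Celery configuration (setting names are Celery's documented options; the values come from the test matrix above, not a recommendation):

```python
# One point from the tested matrix, expressed as Celery settings.
worker_pool = "gevent"             # --pool=gevent on the worker command line
worker_prefetch_multiplier = 10    # also tried 100 and 124
broker_transport_options = {"polling_interval": 0.5}  # also tried 1 and 0.3
task_acks_late = True
```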
Locally, with one instance of worker_email, I was able to do 10-15 rps to the REST endpoint. Theoretically, with 10 pods, this equates to 100-150 emails per second, or 6,000-9,000 emails per minute.
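Spelling out that extrapolation (the per-pod rps and the pod count are the figures from this test, not new measurements):

```python
# Extrapolate the locally measured per-pod throughput to a 10-pod deployment.
pods = 10
rps_low, rps_high = 10, 15  # requests/second observed with one worker_email

per_second = (rps_low * pods, rps_high * pods)            # emails/second
per_minute = (per_second[0] * 60, per_second[1] * 60)     # emails/minute

print(per_second)  # (100, 150)
print(per_minute)  # (6000, 9000)
```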
The main observation is that the bottleneck in this system is the SQS signalling that initiates the Celery workers. As future work, we might want to consider moving to a more performant broker such as RabbitMQ or Redis, and/or bulking notification.id values together so a worker can handle more notifications in one run.
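The "bulking notification.id together" idea could look roughly like this (chunk_ids and the batch size of 50 are hypothetical illustrations, not part of this PR):

```python
def chunk_ids(notification_ids, batch_size=50):
    """Yield successive fixed-size batches of notification ids, so that one
    worker task can process a whole batch instead of a single id per
    SQS message. batch_size of 50 is illustrative only."""
    for i in range(0, len(notification_ids), batch_size):
        yield notification_ids[i : i + batch_size]
```

Each batch would then be enqueued as a single task, cutting the number of SQS messages (and polls) by the batch size.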
I tried playing around with large prefetch values (124), but this didn't have much impact on throughput.
Unresolved questions / Out of scope | Questions non résolues ou hors sujet
Note: if we agree to move forward with this PR, it will not get committed until the notification-manifest PR gets created and approved (to be tested on staging)