
Celery/SQS micro-optimizations for performance improvement #1290

Open · wants to merge 5 commits into main
Conversation

@QuanNguyen-CDS (Contributor) commented May 6, 2021

Summary | Résumé

Trello
feat: moving email to another Celery worker instance
note: MFA emails still go through the normal Celery worker
fix: updated Celery polling of SQS to 0.5 seconds

Test instructions | Instructions pour tester la modification

Testing setup:

  • created a simple REST endpoint (Flask) able to handle a modest 800 rps, tested with Apache Bench (ab); see the sketch after this list
  • modified the AWS SES client to send requests to this REST endpoint
  • created a Python script to push emails to notification-api, and ran 3 instances to simulate load
  • tested different settings locally while monitoring the SQS queues on staging/AWS
  • tested with --pool=gevent, prefetch-multiplier [10, 100, 124], polling_interval [1, 0.5, 0.3], and acks_late
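
For context, a minimal sketch of the kind of Flask stub that could serve as the stand-in REST endpoint described above; the route, port, and response body are illustrative assumptions, not the actual endpoint used in this test.

```python
from flask import Flask, jsonify

app = Flask(__name__)


# Stand-in for the email provider API: accept the request and return
# immediately so the endpoint can sustain a high request rate while
# Apache Bench (ab) measures throughput.
@app.route("/fake-ses/send-email", methods=["POST"])
def fake_send_email():
    return jsonify({"MessageId": "fake-message-id"}), 200


if __name__ == "__main__":
    app.run(port=5050, threaded=True)
```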

Locally, with one instance of worker_email, I was able to do 10-15 rps to the REST endpoint. With 10 pods, this theoretically equates to 100-150 emails per second, or 6,000-9,000 emails per minute.
The main observation is that the bottleneck in this system is the SQS signalling used to trigger the Celery workers. As future work, we might want to consider moving to a more performant broker such as RabbitMQ or Redis, and/or batching notification IDs together so a worker can handle more notifications in one invocation.
I tried larger prefetch values (up to 124), but this did not have much impact on throughput.

Unresolved questions / Out of scope | Questions non résolues ou hors sujet

Note: if we agree to move forward with this PR, it will not be merged until the notification-manifest PR is created and approved (to be tested on staging).

@AntoineAugusti (Contributor) left a comment

Could you share more details why you think it's worth tweaking polling_interval, acks_late and concurrency?

@QuanNguyen-CDS (Contributor, Author)

Could you share more details why you think it's worth tweaking polling_interval, acks_late and concurrency?

Yes, of course.

Poll Interval:
While setting up the REST server, I was also able to set up a manual requests-per-second metric. Running the email task locally and queuing hundreds of requests in SQS, I observed fluctuations in the processing of tasks; I'd describe it as "bursty". I hypothesized that the Celery tasks were sitting idle until the next poll cycle.
When I shortened the poll interval to 0.5 seconds, the "burstiness" went away.
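
As a rough sketch (not the project's actual config module), the setting in question is the SQS transport's polling_interval, passed through Celery's broker_transport_options; the app name and broker URL below are illustrative assumptions.

```python
from celery import Celery

app = Celery("notify_worker", broker="sqs://")

app.conf.broker_transport_options = {
    # Seconds the SQS transport sleeps between polls (kombu's default is 1s).
    # Polling every 0.5s shrinks the idle window between poll cycles that
    # produced the "bursty" processing described above.
    "polling_interval": 0.5,
}
```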

Concurrency:
The sending of emails is highly specialized: bursts of thousands of emails in a short period of time.
Additionally, the task itself is I/O bound:

  • fetching the notification from the database
  • fetching the template from the database
  • sending the data for attachments
  • sending the email itself using the provider client
  • and then another database update on the notification

The purpose of increasing concurrency is to take these I/O-bound tasks and spread them out. If the tasks were CPU bound, we would want to limit concurrency to the number of CPUs allocated; here, however, we want to account for tasks waiting on I/O actions to complete. A sketch of the settings is included below.
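
A minimal sketch of the pool and concurrency settings this reasoning points to, assuming the gevent pool mentioned in the test setup; the app name and concurrency value are illustrative, not the values ultimately chosen.

```python
from celery import Celery

app = Celery("notify_worker", broker="sqs://")

# I/O-bound email tasks spend most of their time waiting on the database and
# the provider client, so a green-thread pool with a concurrency well above
# the CPU count keeps the worker busy; CPU-bound work would instead be capped
# at roughly one process per allocated CPU.
app.conf.worker_pool = "gevent"
app.conf.worker_concurrency = 100
```

The same can be expressed on the command line with `celery -A <your_app> worker --pool=gevent --concurrency=100`.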

Ack Late:
Watching this video: https://www.youtube.com/watch?v=ceJ-vy7fvus @ 13:52 gave me the idea to play around with -Ofair, prefetch_multiplier and acks_late. They seem to have arrived at those settings while running into problems with a mix of slow and fast Celery tasks. I played around with the prefetch multiplier, but it didn't have any real effect on the performance of the system. Reading the description for acks_late, it seems to make sense to enable it:

https://docs.celeryproject.org/en/stable/userguide/tasks.html
https://docs.celeryproject.org/en/stable/faq.html#faq-acks-late-vs-retry

From the documentation:
If your task is idempotent you can set the acks_late option to have the worker acknowledge the message after the task returns instead.

The acks_late setting would be used when you need the task to be executed again if the worker (for some reason) crashes mid-execution. It’s important to note that the worker isn’t known to crash, and if it does it’s usually an unrecoverable error that requires human intervention (bug in the worker, or task code).
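
A minimal sketch of what enabling late acknowledgement would look like in Celery configuration, per the documentation quoted above; this assumes the email tasks are idempotent, and the app name and prefetch value shown are illustrative (the prefetch value is just one of those tried in this PR).

```python
from celery import Celery

app = Celery("notify_worker", broker="sqs://")

# Acknowledge (delete) the SQS message only after the task returns, so a
# worker crash mid-execution leaves the message to be redelivered rather
# than lost; this is only safe if the task is idempotent.
app.conf.task_acks_late = True

# Prefetch multiplier experimented with alongside acks_late; values from 10
# to 124 were tried in this PR without a measurable effect on throughput.
app.conf.worker_prefetch_multiplier = 10
```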

@sastels (Collaborator) commented Aug 3, 2021

@jimleroyer @Moro-Code we've been doing a lot of work around this lately - is this still relevant?

@jimleroyer (Member) commented Aug 5, 2021

@sastels Yes, I'd say we should keep this. It seems like the tweaking can squeeze out more performance. We could test it thoroughly with our new load tests (which we didn't have when this PR was created).

Base automatically changed from master to main February 17, 2022 18:17
@jzbahrai (Collaborator)

@jimleroyer can we close this?

@jimleroyer (Member)

Let's keep this please; Quan made good micro-optimizations here that could help us, but we never merged them in.

@jimleroyer jimleroyer changed the title notification emails to be pushed to another celery worker Celery/SQS micro-optimizations for performance improvement May 27, 2022
@jimleroyer jimleroyer changed the title Celery/SQS micro-optimizations for performance improvement Prototype: Celery/SQS micro-optimizations for performance improvement May 27, 2022
@sastels sastels added the Prototype label Jun 1, 2022
@sastels sastels changed the title Prototype: Celery/SQS micro-optimizations for performance improvement Celery/SQS micro-optimizations for performance improvement Jun 1, 2022