
[Bug] A large backlog of Key_Shared subscription messages will result in fullgc and OOM #21045

Open
jdfrozen opened this issue Aug 22, 2023 · 7 comments
Labels
area/broker · category/reliability · help wanted · Stale

Comments

@jdfrozen

Search before asking

  • I searched in the issues and found nothing similar.

Version

2.7.x

Minimal reproduce step

1. Build up a large backlog of messages on a Key_Shared subscription
2. Attach multiple consumers to the subscription

What did you expect to see?

The broker keeps functioning normally.

What did you see instead?

1. Frequent GC on the broker
2. Broker full GC
3. Broker OOM

Broker GC monitoring (screenshot).

I added "-XX:+HeapDumpOnOutOfMemoryError" to the broker startup parameters. When the full GC occurred, I analyzed the resulting heap dump with MAT (screenshot).

Anything else?

Root cause: redeliveryMessages holds a very large number of message positions

PersistentStickyKeyDispatcherMultipleConsumers.java

@Override
protected synchronized Set<PositionImpl> getMessagesToReplayNow(int maxMessagesToRead) {
    if (isDispatcherStuckOnReplays) {
        // If we're stuck on replay, we want to move forward reading on the topic (until the overall max-unacked
        // messages kicks in), instead of keep replaying the same old messages, since the consumer that these
        // messages are routing to might be busy at the moment
        this.isDispatcherStuckOnReplays = false;
        return Collections.emptySet();
    } else {
        return super.getMessagesToReplayNow(maxMessagesToRead);
    }
}

Are you willing to submit a PR?

  • I'm willing to submit a PR!
jdfrozen added the type/bug label on Aug 22, 2023
@jdfrozen (Author)

So when this Key_Shared subscription has many consumers and some of them are slow, messages whose stickyKeyHash routes to a slow consumer are added to messagesToReplay instead of being dispatched. With a large backlog, that replay set keeps growing, which causes this problem (a simplified sketch of the mechanism is shown below).
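
For readers unfamiliar with the dispatcher internals, here is a minimal, self-contained sketch of that mechanism. It is not the actual broker code: the class, hash ring, and permit handling (ReplaySetSketch, Consumer, hasPermits) are illustrative stand-ins for PersistentStickyKeyDispatcherMultipleConsumers and its redeliveryMessages container.

import java.util.TreeSet;

// Illustrative sketch only: shows how positions routed to a consumer without
// permits pile up in an in-memory replay set when the backlog is large.
public class ReplaySetSketch {

    static final class Consumer {
        final String name;
        long permits;                               // messages it can still receive
        Consumer(String name, long permits) { this.name = name; this.permits = permits; }
        boolean hasPermits() { return permits > 0; }
    }

    public static void main(String[] args) {
        Consumer fast = new Consumer("fast", Long.MAX_VALUE);  // healthy consumer
        Consumer slow = new Consumer("slow", 0);               // slow consumer, no permits left
        Consumer[] hashRing = { fast, slow };                  // keyHash selects the consumer

        // Analogue of redeliveryMessages: positions parked for later redelivery.
        TreeSet<Long> messagesToReplay = new TreeSet<>();

        // Simulate reading through a large backlog of entries.
        for (long entryId = 0; entryId < 1_000_000; entryId++) {
            int keyHash = Long.hashCode(entryId);              // stand-in for the sticky key hash
            Consumer target = hashRing[Math.floorMod(keyHash, hashRing.length)];
            if (target.hasPermits()) {
                target.permits--;                              // dispatched immediately
            } else {
                messagesToReplay.add(entryId);                 // parked in memory: this set keeps growing
            }
        }

        // Roughly half of the backlog (every key that maps to the slow consumer)
        // now sits in broker memory waiting for replay.
        System.out.println("Positions waiting for replay: " + messagesToReplay.size());
    }
}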

@jdfrozen (Author)

I also added "-XX:+HeapDumpBeforeFullGC" to the broker startup parameters.

mattisonchao added the category/reliability and area/broker labels and removed the type/bug label on Aug 23, 2023
@mattisonchao (Member) commented Aug 23, 2023

KEY_SHARED is a fairly strict subscription mode. It is very sensitive to the consumption (acknowledgement) rate because it must preserve message order: when consumers are added to the subscription, the key hash ranges are recalculated, and the broker has to keep some message indexes in memory to avoid breaking delivery order (each key is delivered to only one consumer at a time).

Therefore, it's expected behaviour. You can check why some of your consumers can't catch up, or consider whether you can use another subscription mode such as SHARED.

But anyway, you are right: we should have a limit on this container's memory usage so that one topic cannot affect the whole broker. :)
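
If per-key ordering is not required, the SHARED mode suggested above is selected on the consumer. A minimal client-side sketch; the service URL, topic, and subscription name are placeholders:

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class SharedConsumerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // placeholder broker URL
                .build();

        // Key_Shared preserves per-key ordering but forces the broker to park
        // messages whose key routes to a busy consumer; Shared lets any consumer
        // with available permits take the next message.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/my-topic")    // placeholder topic
                .subscriptionName("my-subscription")              // placeholder subscription
                .subscriptionType(SubscriptionType.Shared)        // instead of SubscriptionType.Key_Shared
                .subscribe();

        consumer.close();
        client.close();
    }
}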

@jdfrozen (Author)

I verified that setting max-unacked-messages-per-subscription to a value as small as 1000 avoids the full GC.
For verification I used the namespace-level policy ("pulsar-admin namespaces get-max-unacked-messages-per-subscription").
We would like to set the policy at the topic level. We are using version 2.7.4; is the topic-level policy stable enough?
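
For reference, the namespace-level policy tested above can also be applied through the Java admin client. A minimal sketch, assuming an admin endpoint at http://localhost:8080 and the public/default namespace; method availability may vary between releases, so treat this as an outline rather than a verified 2.7.4 snippet:

import org.apache.pulsar.client.admin.PulsarAdmin;

public class SetMaxUnackedPerSubscription {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder admin URL
                .build();

        // Cap unacked messages per subscription so the redelivery state
        // cannot grow without bound; 1000 is the value from the test above.
        admin.namespaces().setMaxUnackedMessagesPerSubscription("public/default", 1000);

        System.out.println("Current value: "
                + admin.namespaces().getMaxUnackedMessagesPerSubscription("public/default"));

        admin.close();
    }
}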

@mattisonchao (Member) commented Aug 25, 2023

Hi @jdfrozen,
2.7.x is a pretty old version, so I am unsure whether it works properly. But you can give it a try. :)

@github-actions (bot)

The issue had no activity for 30 days; marking it with the Stale label.

github-actions bot added the Stale label on Sep 24, 2023
@lhotari (Member) commented Sep 4, 2024

One of the root causes behind this issue is described in #23200. It's addressed by #23231 and #23226.
I believe the OOM issue was already mitigated by #17804.
