
Refactor zookeeper session timeout handling into an interface #6347

Merged
merged 7 commits into apache:master on Jun 5, 2020

Conversation

codelipenghui
Contributor

@codelipenghui codelipenghui commented Feb 17, 2020

Master Issue: #6259

Motivation

In Pulsar, brokers use Zookeeper as the configuration store and for maintaining broker metadata; we refer to these as the Global Zookeeper and the Local Zookeeper.

The Global Zookeeper maintains the namespace policies, cluster metadata, and partitioned topic metadata. To reduce read operations on Zookeeper, each broker keeps a cache of the Global Zookeeper data, which is updated when a znode changes. Currently, when a session timeout happens on the Global Zookeeper, a new session is started. The broker does not create any EPHEMERAL znodes on the Global Zookeeper.

The Local Zookeeper maintains the local cluster metadata, such as broker load data, topic ownership data, managed ledger metadata, and Bookie rack information. Both broker load data and topic ownership data are stored as EPHEMERAL znodes on the Local Zookeeper. Currently, when a session timeout happens on the Local Zookeeper, the broker shuts itself down.

Shutting down a broker results in an ownership change for all topics that the broker owned. However, we have encountered many problems related to the current session timeout handling, such as brokers with long JVM GC pauses or a Local Zookeeper under high load. The latter in particular can cause all brokers to shut down.

So, the purpose of this proposal is to improve session timeout handling on the Local Zookeeper and avoid unnecessary broker shutdowns.

Approach

As with the Global Zookeeper session timeout handling, and the Zookeeper session timeout handling in BookKeeper, a new session should be started when the current session times out.

If a new session fails to start, the broker retries several times; the number of retries is controlled by broker configuration. If the session still cannot be started after exhausting the retries, the broker still needs to shut down, since this likely indicates a problem with the Zookeeper cluster. The user needs to restart the broker after the Zookeeper cluster returns to normal.
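A minimal sketch of the bounded-retry policy described above. This is illustrative only: the names `startNewSession`, `maxRetries`, and `backoffMs` are hypothetical stand-ins, not the actual Pulsar or ZooKeeper client API.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch: retry starting a new ZooKeeper session a bounded
// number of times, then give up so the caller can shut the broker down.
public class SessionRecreation {
    interface SessionStarter {
        boolean startNewSession(); // hypothetical: true if a new session was established
    }

    // Returns true if a session was re-established within maxRetries attempts.
    static boolean tryRecreateSession(SessionStarter starter, int maxRetries, long backoffMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (starter.startNewSession()) {
                return true; // session recovered, broker keeps running
            }
            TimeUnit.MILLISECONDS.sleep(backoffMs); // wait before the next attempt
        }
        return false; // retries exhausted: caller should shut the broker down
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a ZooKeeper cluster that recovers on the third attempt.
        final int[] calls = {0};
        boolean ok = tryRecreateSession(() -> ++calls[0] >= 3, 5, 1);
        System.out.println("recovered=" + ok + " attempts=" + calls[0]);
    }
}
```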

If a new session starts successfully, the situation is slightly more complicated, so I will introduce each scenario separately.

Topic ownership data handling

The topic ownership data records all namespace bundles owned by the broker; an EPHEMERAL znode is created in Zookeeper for each owned namespace bundle. When a session timeout happens on the Local Zookeeper, all EPHEMERAL znodes maintained by this broker are deleted automatically. We need a mechanism to avoid unnecessary ownership transfer of the bundles. Since the broker caches its owned bundles in memory, it can use this cache to re-own the bundles.

First, the broker should check whether the znode for the bundle exists and whether this broker is the owner. If the znode exists and the owner is this broker, the znode may simply not have been deleted yet; the broker should then check whether the ephemeral owner is the current session ID. If not, the broker should wait for the znode to be deleted.

Then the broker tries to own the bundle. If the broker owns the bundle successfully, the bundle is not owned by any other broker, and the broker should check whether to preload the topics under the bundle. If the broker fails to own the bundle, the bundle is owned by another broker, and the broker should unload the bundle.

Theoretically, the mechanism can guarantee that the ownership of most bundles will not change during the session timeout.
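The re-ownership decision flow above can be sketched as a small state check. This is a simplified model, not the actual Pulsar code: `brokerUrl`, `znodeOwner`, and `znodeEphemeralSession` are hypothetical inputs standing in for the znode lookup and the ephemeral-owner field that ZooKeeper stores on ephemeral znodes.

```java
// Illustrative sketch of the per-bundle re-ownership decision after a new
// session starts. The inputs are hypothetical stand-ins for real ZK lookups.
public class BundleReOwnership {
    enum Action { KEEP, WAIT_FOR_DELETION, UNLOAD }

    // Decide what to do for one cached bundle after a new session starts.
    // znodeOwner == null means the ownership znode no longer exists.
    static Action decide(String brokerUrl, long currentSessionId,
                         String znodeOwner, Long znodeEphemeralSession) {
        if (znodeOwner == null) {
            return Action.KEEP; // znode gone: re-create it and keep the bundle
        }
        if (znodeOwner.equals(brokerUrl)) {
            // Our znode still exists; if it belongs to the old session,
            // wait for ZooKeeper to delete it before re-owning.
            return znodeEphemeralSession == currentSessionId
                    ? Action.KEEP : Action.WAIT_FOR_DELETION;
        }
        return Action.UNLOAD; // another broker took ownership meanwhile
    }

    public static void main(String[] args) {
        System.out.println(decide("broker-1", 2L, null, null));
        System.out.println(decide("broker-1", 2L, "broker-1", 1L));
        System.out.println(decide("broker-1", 2L, "broker-2", 3L));
    }
}
```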

Broker load data handling

The load data is used for namespace bundle load balancing, so its handling does not need to be overly complicated. The only effect of stale load data is that it interferes with the choice of a candidate broker for a namespace bundle; even without selecting the optimal broker, the load manager will continue to relocate namespace bundles.

So for broker load data handling, we only need to guarantee that the broker's load data can be reported successfully.
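A minimal sketch of that guarantee: after the session re-establishes, retry the load report until it succeeds (or give up after a bound, since stale load data only degrades placement, not correctness). `reportLoad` is a hypothetical stand-in, not the real Pulsar load manager API.

```java
// Illustrative sketch: keep retrying the broker load report after a session
// re-establishes, so load balancing sees fresh data again.
public class LoadReportRetry {
    interface Reporter {
        boolean reportLoad(); // hypothetical: true if the report was written
    }

    // Returns the attempt number that succeeded, or -1 if all attempts failed.
    static int reportUntilSuccess(Reporter r, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (r.reportLoad()) {
                return i;
            }
        }
        return -1; // stale load data only degrades placement, not correctness
    }

    public static void main(String[] args) {
        // Simulate a report that succeeds on the second attempt.
        final int[] n = {0};
        System.out.println(reportUntilSuccess(() -> ++n[0] == 2, 5));
    }
}
```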

Other scenario handling

There are also other usage scenarios for the Local Zookeeper: the BookKeeper client, managed ledger metadata, Bookie rack information, and schema metadata. None of these scenarios create any EPHEMERAL znodes on Zookeeper. Pulsar uses a Zookeeper cache for the Local Zookeeper, and the cache is invalidated when a session timeout occurs.
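The cache-invalidation behavior can be sketched as a session-event listener that clears the local cache on expiry, so stale metadata is re-read from Zookeeper on the next access. The event and class names here are hypothetical, not Pulsar's actual `ZooKeeperCache` API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: drop all cached metadata when the session expires;
// entries reload lazily from ZooKeeper on the next read.
public class ZkCacheInvalidation {
    enum SessionEvent { CONNECTED, SESSION_EXPIRED }

    static class MetadataCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        void put(String path, String data) { cache.put(path, data); }
        int size() { return cache.size(); }

        void onSessionEvent(SessionEvent e) {
            if (e == SessionEvent.SESSION_EXPIRED) {
                cache.clear(); // invalidate everything owned by the old session
            }
        }
    }

    public static void main(String[] args) {
        MetadataCache c = new MetadataCache();
        c.put("/admin/policies/ns1", "{...}");
        c.onSessionEvent(SessionEvent.SESSION_EXPIRED);
        System.out.println(c.size());
    }
}
```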

Verifying this change

The test is coming.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@codelipenghui codelipenghui self-assigned this Feb 17, 2020
@codelipenghui codelipenghui added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Feb 17, 2020
@codelipenghui codelipenghui added this to the 2.6.0 milestone Feb 17, 2020
@codelipenghui codelipenghui linked an issue Feb 17, 2020 that may be closed by this pull request
Member

@sijie sijie left a comment


@codelipenghui thank you for working on this. I am just wondering how re-connection is handled by this change? I didn't see how you re-create zookeeper sessions.

@codelipenghui
Contributor Author

@addisonj
Contributor

addisonj commented Feb 19, 2020

@codelipenghui I am dealing with an issue in the pulsar proxy where the list of brokers gets out of sync and requires a proxy restart. I came across this PR while investigating, and while it doesn't fix some other issues in the proxy, I imagine it would help. However, I just noticed that there are two instances of ZookeeperCacheLoader with almost the same contents: one is the instance you changed (org.apache.pulsar.discovery.service.web.ZookeeperCacheLoader) and the other is for the proxy (org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader).

If we also want to alleviate some of these same issues for the proxy, we will need to either make the change in the proxy version as well OR refactor so that both share the class (not sure why they don't right now; it doesn't appear problematic).

@joefk
Contributor

joefk commented Feb 19, 2020

@sijie @mateo @codelipenghui
1) This is a very high risk fix. It should have been written up as a PIP and discussed; I think an issue and PR is not the right approach to make this change. Specifically, I would like to see it spelled out how the broker ceases topic activity in the window between expiry of the timeout and recreation of the session. In particular, there should be a verifiable assertion that write ordering on a topic is not broken by this change.

2) For backward compatibility, there should be an option to completely turn this off (retries = 0), and it should be the default.

3) The desirable outcome of this change ("Theoretically, the mechanism can guarantee that the ownership of most bundles will not change during the session timeout.") seems possible only in situations where bundle/topic ownership changes are slow. This change will benefit a particular use case where topic activity is slow. It is not desirable for fast, responsive systems, where if a broker is unresponsive (long JVM GC pause etc.), the desired behaviour is for the broker to shut down, so that a quick failover of the bundle is triggered by the next lookup.

4) The other issues mentioned (JVM GC, ZookeeperCacheLoader etc.) are problems that need addressing, but this fix is not the solution to them.

@codelipenghui
Contributor Author

@joefk Thanks for reminding me. I am drafting a PIP and will send it to the email thread to start a discussion soon.

Member

@jiazhai jiazhai left a comment


+1 for this change. It seems the old behavior is not changed, and this will not affect the current code.

@codelipenghui
Contributor Author

/pulsarbot run-failure-checks

1 similar comment
@codelipenghui
Contributor Author

/pulsarbot run-failure-checks

@codelipenghui
Contributor Author

/pulsarbot run-failure-checks

Member

@sijie sijie left a comment


@codelipenghui the change looks good, although I don't think it completely implements PIP-57. It might be worth updating the issue to "refactor zookeeper session timeout handling into an interface", because this pull request is more about abstracting the interface than about implementing the retry logic.

@codelipenghui
Contributor Author

@sijie Ok

@codelipenghui codelipenghui changed the title PIP 57: Improve Zookeeper Session Timeout Handling Refactor zookeeper session timeout handling into an interface Jun 5, 2020
@codelipenghui codelipenghui merged commit 6826040 into apache:master Jun 5, 2020
@codelipenghui codelipenghui deleted the zk_session_expire branch June 5, 2020 08:48
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Jul 24, 2020
…#6347)

Refactor zookeeper session timeout handling into an interface
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…#6347)

Refactor zookeeper session timeout handling into an interface
5 participants