Allow to use Khepri database to store metadata instead of Mnesia #7206

dcorbacho · 2023-02-07T13:59:16Z

Why

Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata:

virtual hosts' properties
intenal users
queue, exchange and binding declarations (not queues data)
runtime parameters and policies
...

Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition.

How

@kjnilsson created Ra, a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions.

We created Khepri, a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions.

This pull request integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia.

Behavior changes

While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug.

After Khepri is enabled, there are significant changes of behavior that you should be aware of!

Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least (Number of nodes in the cluster ÷ 2) + 1 number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result.

For instance in a cluster of 5 RabbitMQ nodes:

If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database.
If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database.

Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. Here is a list of operations and what to expect:

Operation	Works in a minority partition
Opening a TCP connection	✅
Create an AMQP channel	✅
Declare an exchange	❌
Declare a binding	❌
Declare a queue	❌
Publish a message to an exchange	✅ No guarantee of the message being successfully routed to all queues, but local classic queues might work
Consume from a queue	✅ No guarantee of message consumption, but local classic queues might work
Declare a virtual host	❌
Modify a virtual host	❌
Delete a virtual host	❌
Declare a internal user	❌
Modify an internal user	❌
Delete an internal user	❌
Declare a runtime parameter or policy	❌
Modify a runtime parameter or policy	❌
Delete a runtime parameter or policy	❌
Add a RabbitMQ node to a cluster	❌
Remove a RabbitMQ node from a cluster	❌
Enable a feature flag	❌
Enable a RabbitMQ plugin	Depends on the plugin needs w.r.t. the database
Disable a RabbitMQ plugin	✅

This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready.

TODO: Review the list above and possibly expand it
TODO: Write behavior changes in plugins

How to switch to Khepri

To enable Khepri, you need to enable the khepri_db feature flag:

rabbitmqctl enable_feature_flag khepri_db

When the khepri_db feature flag is enabled, the migration code performs the following two tasks:

It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses mnesia_to_khepri:sync_cluster_membership/1 from the khepri_mnesia_migration application.
It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses mnesia_to_khepri:copy_tables/4 from khepri_mnesia_migration to do it.

This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up.

Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment.

Implementation details on the integration itself

In the past months, all accesses to Mnesia were isolated in a collection of rabbit_db* modules. This is where the integration of Khepri mostly takes place: we use a function called rabbit_khepri:handle_fallback/1 which selects the database and perform the query or the transaction. Here is an example from rabbit_db_vhost:

Up until RabbitMQ 3.12.x:

get(VHostName) when is_binary(VHostName) ->
    get_in_mnesia(VHostName).

Starting with RabbitMQ 3.13.0:

get(VHostName) when is_binary(VHostName) ->
    rabbit_khepri:handle_fallback(
      #{mnesia => fun() -> get_in_mnesia(VHostName) end,
        khepri => fun() -> get_in_khepri(VHostName) end}).

This rabbit_khepri:handle_fallback/1 function relies on two things:

the fact that the khepri_db feature flag is enabled, in which case it always executes the Khepri-based variant.
the ability or not to read and write to Mnesia tables otherwise.

Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the khepri_mnesia_migration application and rabbit_khepri:handle_fallback/1 is a wrapper on top of it that knows about the feature flag.

However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though:

Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example:
```
case rabbit_khepri:is_enabled(RemoteNode) of
    true  -> can_join_using_khepri(RemoteNode);
    false -> can_join_using_mnesia(RemoteNode)
end
```

Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example:

case rabbit_khepri:get_feature_state() of
    enabled -> members_using_khepri();
    _       -> members_using_mnesia()
end

Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies.

Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called rabbit_mnesia_tables_to_khepri_db which indicates a list of Mnesia tables and an associated converter module. Here is an example in the rabbitmq_recent_history_exchange plugin:

-rabbit_mnesia_tables_to_khepri_db(
   [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]).

The converter module — rabbit_db_rh_exchange_m2k_converter in this example — is is fact a "sub" converter module called but rabbit_db_m2k_converter. See the documentation of a mnesia_to_khepri converter module to learn more about these modules.

@kjnilsson

[Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * intenal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database. Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db*` modules. This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and perform the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 4. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module — `rabbit_db_rh_exchange_m2k_converter` in this example — is is fact a "sub" converter module called but `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

@kjnilsson

[Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * intenal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database. Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db*` modules. This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and perform the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 4. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module — `rabbit_db_rh_exchange_m2k_converter` in this example — is is fact a "sub" converter module called but `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

@kjnilsson

[Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * intenal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database. Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db*` modules. This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and perform the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 4. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module — `rabbit_db_rh_exchange_m2k_converter` in this example — is is fact a "sub" converter module called but `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

@kjnilsson

[Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * intenal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database. Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db*` modules. This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and perform the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 4. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module — `rabbit_db_rh_exchange_m2k_converter` in this example — is is fact a "sub" converter module called but `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

@kjnilsson

[Why] Mnesia is a very powerful and convenient tool for Erlang applications: it is a persistent disc-based database, it handles replication accross multiple Erlang nodes and it is available out-of-the-box from the Erlang/OTP distribution. RabbitMQ relies on Mnesia to manage all its metadata: * virtual hosts' properties * intenal users * queue, exchange and binding declarations (not queues data) * runtime parameters and policies * ... Unfortunately Mnesia makes it difficult to handle network partition and, as a consequence, the merge conflicts between Erlang nodes once the network partition is resolved. RabbitMQ provides several partition handling strategies but they are not bullet-proof. Users still hit situations where it is a pain to repair a cluster following a network partition. [How] @kjnilsson created Ra [1], a Raft consensus library that RabbitMQ already uses successfully to implement quorum queues and streams for instance. Those queues do not suffer from network partitions. We created Khepri [2], a new persistent and replicated database engine based on Ra and we want to use it in place of Mnesia in RabbitMQ to solve the problems with network partitions. This patch integrates Khepri as an experimental feature. When enabled, RabbitMQ will store all its metadata in Khepri instead of Mnesia. This change comes with behavior changes. While Khepri remains disabled, you should see no changes to the behavior of RabbitMQ. If there are changes, it is a bug. After Khepri is enabled, there are significant changes of behavior that you should be aware of. Because it is based on the Raft consensus algorithm, when there is a network partition, only the cluster members that are in the partition with at least `(Number of nodes in the cluster ÷ 2) + 1` number of nodes can "make progress". In other words, only those nodes may write to the Khepri database and read from the database and expect a consistent result. For instance in a cluster of 5 RabbitMQ nodes: * If there are two partitions, one with 3 nodes, one with 2 nodes, only the group of 3 nodes will be able to write to the database. * If there are three partitions, two with 2 nodes, one with 1 node, none of the group can write to the database. Because the Khepri database will be used for all kind of metadata, it means that RabbitMQ nodes that can't write to the database will be unable to perform some operations. A list of operations and what to expect is documented in the associated pull request and the RabbitMQ website. This requirement from Raft also affects the startup of RabbitMQ nodes in a cluster. Indeed, at least a quorum number of nodes must be started at once to allow nodes to become ready. To enable Khepri, you need to enable the `khepri_db` feature flag: rabbitmqctl enable_feature_flag khepri_db When the `khepri_db` feature flag is enabled, the migration code performs the following two tasks: 1. It synchronizes the Khepri cluster membership from the Mnesia cluster. It uses `mnesia_to_khepri:sync_cluster_membership/1` from the `khepri_mnesia_migration` application [3]. 2. It copies data from relevant Mnesia tables to Khepri, doing some conversion if necessary on the way. Again, it uses `mnesia_to_khepri:copy_tables/4` from `khepri_mnesia_migration` to do it. This can be performed on a running standalone RabbitMQ node or cluster. Data will be migrated from Mnesia to Khepri without any service interruption. Note that during the migration, the performance may decrease and the memory footprint may go up. Because this feature flag is considered experimental, it is not enabled by default even on a brand new RabbitMQ deployment. More about the implementation details below: In the past months, all accesses to Mnesia were isolated in a collection of `rabbit_db*` modules. This is where the integration of Khepri mostly takes place: we use a function called `rabbit_khepri:handle_fallback/1` which selects the database and perform the query or the transaction. Here is an example from `rabbit_db_vhost`: * Up until RabbitMQ 3.12.x: get(VHostName) when is_binary(VHostName) -> get_in_mnesia(VHostName). * Starting with RabbitMQ 3.13.0: get(VHostName) when is_binary(VHostName) -> rabbit_khepri:handle_fallback( #{mnesia => fun() -> get_in_mnesia(VHostName) end, khepri => fun() -> get_in_khepri(VHostName) end}). This `rabbit_khepri:handle_fallback/1` function relies on two things: 1. the fact that the `khepri_db` feature flag is enabled, in which case it always executes the Khepri-based variant. 4. the ability or not to read and write to Mnesia tables otherwise. Before the feature flag is enabled, or during the migration, the function will try to execute the Mnesia-based variant. If it succeeds, then it returns the result. If it fails because one or more Mnesia tables can't be used, it restarts from scratch: it means the feature flag is being enabled and depending on the outcome, either the Mnesia-based variant will succeed (the feature flag couldn't be enabled) or the feature flag will be marked as enabled and it will call the Khepri-based variant. The meat of this function really lives in the `khepri_mnesia_migration` application [3] and `rabbit_khepri:handle_fallback/1` is a wrapper on top of it that knows about the feature flag. However, some calls to the database do not depend on the existence of Mnesia tables, such as functions where we need to learn about the members of a cluster. For those, we can't rely on exceptions from Mnesia. Therefore, we just look at the state of the feature flag to determine which database to use. There are two situations though: * Sometimes, we need the feature flag state query to block because the function interested in it can't return a valid answer during the migration. Here is an example: case rabbit_khepri:is_enabled(RemoteNode) of true -> can_join_using_khepri(RemoteNode); false -> can_join_using_mnesia(RemoteNode) end * Sometimes, we need the feature flag state query to NOT block (for instance because it would cause a deadlock). Here is an example: case rabbit_khepri:get_feature_state() of enabled -> members_using_khepri(); _ -> members_using_mnesia() end Direct accesses to Mnesia still exists. They are limited to code that is specific to Mnesia such as classic queue mirroring or network partitions handling strategies. Now, to discover the Mnesia tables to migrate and how to migrate them, we use an Erlang module attribute called `rabbit_mnesia_tables_to_khepri_db` which indicates a list of Mnesia tables and an associated converter module. Here is an example in the `rabbitmq_recent_history_exchange` plugin: -rabbit_mnesia_tables_to_khepri_db( [{?RH_TABLE, rabbit_db_rh_exchange_m2k_converter}]). The converter module — `rabbit_db_rh_exchange_m2k_converter` in this example — is is fact a "sub" converter module called but `rabbit_db_m2k_converter`. See the documentation of a `mnesia_to_khepri` converter module to learn more about these modules. [1] https://github.com/rabbitmq/ra [2] https://github.com/rabbitmq/khepri [3] https://github.com/rabbitmq/khepri_mnesia_migration See #7206. Co-authored-by: Jean-Sébastien Pédron <jean-sebastien@rabbitmq.com> Co-authored-by: Diana Parra Corbacho <dparracorbac@vmware.com> Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

Khepri needs ra, and unless khepri is a native bazel dep, we still need to declare ra in the classic fashion

…_subset_of_nodes_coming_online` [Why] The testcase was broken as part of the work on Khepri (#7206): all nodes were started, making it an equivalent of the `successful_discovery` testcase. [How] We drop the first entry in the list of nodes given to `rabbit_ct_broker_helpers`. This way, it won't be started at all while still being listed in the classic config parameter.

dcorbacho marked this pull request as draft February 7, 2023 13:59

mergify bot added bazel make labels Feb 7, 2023

dcorbacho force-pushed the khepri branch from 90e9862 to 35f12db Compare February 14, 2023 14:01

the-mikedavis mentioned this pull request Feb 14, 2023

erlang_ls: Look for lib includes in extra_deps #7300

Merged

12 tasks

dcorbacho force-pushed the khepri branch 2 times, most recently from c9ac116 to be22e98 Compare February 16, 2023 15:39

dcorbacho force-pushed the khepri branch from 0468fdb to d9a6e89 Compare February 27, 2023 14:06

dcorbacho force-pushed the khepri branch 5 times, most recently from 0f152cb to 9110c61 Compare March 27, 2023 10:53

dcorbacho force-pushed the khepri branch 5 times, most recently from 416b25d to 7159b36 Compare April 17, 2023 10:18

dcorbacho force-pushed the khepri branch from cac5143 to aaa6cac Compare April 18, 2023 11:53

dcorbacho force-pushed the khepri branch 8 times, most recently from 44a6eac to 869f259 Compare April 28, 2023 10:23

the-mikedavis force-pushed the khepri branch from d077065 to a36e9f9 Compare May 1, 2023 19:01

dcorbacho force-pushed the khepri branch 2 times, most recently from c059f92 to e5cc386 Compare May 11, 2023 21:24

dumbbell force-pushed the khepri branch from b2ece0e to 0d2537a Compare September 28, 2023 15:58

dumbbell force-pushed the khepri branch from 60fac85 to 05f3bb0 Compare September 29, 2023 07:32

dumbbell force-pushed the khepri branch from 57b5ff5 to d11cc0e Compare September 29, 2023 08:13

dumbbell force-pushed the khepri branch from 4f69889 to 2b9564d Compare September 29, 2023 10:01

dumbbell force-pushed the khepri branch from 2b9564d to e038944 Compare September 29, 2023 13:20

dcorbacho and others added 2 commits September 29, 2023 16:00

Partially revert commit 3253fe4

0bbb188

Khepri needs ra, and unless khepri is a native bazel dep, we still need to declare ra in the classic fashion

dumbbell force-pushed the khepri branch from e038944 to 0bbb188 Compare September 29, 2023 14:00

dumbbell marked this pull request as ready for review September 29, 2023 14:45

dcorbacho merged commit 4b6292b into main Sep 29, 2023
18 checks passed

dcorbacho deleted the khepri branch September 29, 2023 14:50

michaelklishin mentioned this pull request Sep 29, 2023

bazel run gazelle #9585

Merged

dumbbell mentioned this pull request Dec 8, 2023

Use Khepri to store vhosts, users and runtime parameters #4703

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to use Khepri database to store metadata instead of Mnesia #7206

Allow to use Khepri database to store metadata instead of Mnesia #7206

dcorbacho commented Feb 7, 2023 •

edited by dumbbell

Loading

Allow to use Khepri database to store metadata instead of Mnesia #7206

Allow to use Khepri database to store metadata instead of Mnesia #7206

Conversation

dcorbacho commented Feb 7, 2023 • edited by dumbbell Loading

Why

How

Behavior changes

How to switch to Khepri

Implementation details on the integration itself

dcorbacho commented Feb 7, 2023 •

edited by dumbbell

Loading