Add additional idempotence check to cover Kafka server restart, while EthConnect stays running #227

Merged 8 commits on Aug 31, 2022

@peterbroadhurst peterbroadhurst commented Aug 25, 2022

EthConnect works on an "at target" nonce allocation model (see the comparison with "at source" in the FFTM readme), meaning it gives an "at least once" delivery assurance to the blockchain, backed by Apache Kafka.

In the case that the REST API Gateway is tuned for high performance, with an in-flight count in the many hundreds, and the Kafka servers are restarted, the consumer groups might redeliver many messages. This is undesirable, because each redelivered message can be submitted to the blockchain again as a duplicate transaction.
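
As a rough illustration of why a restart leads to redelivery, here is a minimal simulation of at-least-once consumption: messages are handed to the application before their offsets are committed, so after a restart delivery resumes from the last committed offset. This is a toy model only. The `consumer` type, its methods, and `simulateRestart` are hypothetical names and do not correspond to the EthConnect code or to any Kafka client API.

```go
package main

import "fmt"

// consumer is a toy stand-in for a consumer group member (hypothetical type).
type consumer struct {
	committed int      // offset the group resumes from after a restart
	delivered []string // every message handed to the application, in order
}

// deliverFromCommitted hands log[committed:upTo] to the application.
// It does not commit anything, mirroring processing that is still in flight.
func (c *consumer) deliverFromCommitted(log []string, upTo int) {
	c.delivered = append(c.delivered, log[c.committed:upTo]...)
}

// commit records that everything before offset has been fully handled.
func (c *consumer) commit(offset int) { c.committed = offset }

// simulateRestart delivers the first inFlight messages, commits only
// committedBeforeRestart of them, then resumes from the committed offset.
func simulateRestart(log []string, inFlight, committedBeforeRestart int) []string {
	c := &consumer{}
	c.deliverFromCommitted(log, inFlight)
	c.commit(committedBeforeRestart)
	// Restart: anything delivered but not committed is delivered again.
	c.deliverFromCommitted(log, len(log))
	return c.delivered
}

func main() {
	log := []string{"tx-0", "tx-1", "tx-2", "tx-3"}
	// Three messages in flight, only one committed before the restart.
	fmt.Println(simulateRestart(log, 3, 1))
}
```

In this model, the larger the gap between delivered and committed offsets (the in-flight count), the more messages are seen twice after a restart.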

EthConnect already has the concept of an idempotency key, on the front-side of the REST API Gateway, using ackmode as added in #175 to get an immediate receipt, combined with supplying your own custom ID. However, that is only checked in the REST API Gateway boundary layer today.

See the following diagram as a reference, showing how a Kafka at-least-once redelivery results in this duplication:
[Diagram: EthConnect at-least-once issue]

This PR proposes adding an additional idempotency check at the point we receive the message from Kafka. Note that this does not change the fundamental nature of the "at target" architecture, which remains at-least-once, and redelivery can still occur in some failure scenarios (for full idempotent delivery e2e with Ethereum nonces you would need the "at source" ordering architecture of https://github.com/hyperledger/firefly-evmconnect, based on the new FFTM architecture).

But this PR does protect against something like a planned HA rolling restart of a Kafka cluster causing duplicate transaction submissions through redelivery.

[Diagram: ethconnect_kafka_idempotency_enhancement]

The new check is only enabled when:

The check covers two scenarios:

  1. The transaction is already in-flight in TX Processor when the redelivery occurs
  2. The transaction has already been assigned a transaction hash when the redelivery occurs
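
The two scenarios above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the `processor` type, its `accept` and `complete` methods, and the in-memory maps standing in for the in-flight list and the receipt store are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// processor is a toy stand-in for the TX Processor (hypothetical type).
type processor struct {
	mu       sync.Mutex
	inflight map[string]bool   // message IDs currently being processed
	txHashes map[string]string // stand-in for the receipt store: msgID -> tx hash
}

func newProcessor() *processor {
	return &processor{inflight: map[string]bool{}, txHashes: map[string]string{}}
}

// accept decides whether a message delivered from Kafka should be processed.
// It returns false with a reason when the delivery is a duplicate.
func (p *processor) accept(msgID string) (bool, string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	// Scenario 1: the transaction is already in flight.
	if p.inflight[msgID] {
		return false, "in-flight"
	}
	// Scenario 2: the transaction has already been assigned a hash.
	if _, ok := p.txHashes[msgID]; ok {
		return false, "already-submitted"
	}
	p.inflight[msgID] = true
	return true, "accepted"
}

// complete records the transaction hash and clears the in-flight entry.
func (p *processor) complete(msgID, txHash string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.inflight, msgID)
	p.txHashes[msgID] = txHash
}

func main() {
	p := newProcessor()
	for _, id := range []string{"msg-1", "msg-1"} {
		ok, why := p.accept(id)
		fmt.Println(id, ok, why)
	}
	p.complete("msg-1", "0xabc")
	ok, why := p.accept("msg-1")
	fmt.Println("msg-1", ok, why)
}
```

Both checks sit under one lock in this sketch so that a redelivery cannot slip in between the in-flight lookup and the receipt lookup; whether the real implementation is structured that way is not stated here.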

One complexity in the change was making it so the two different components could both access the receipt store. For that I moved the store out into a new package called receipts.
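
To illustrate the general idea of sharing one store between two components, here is a minimal sketch in which both depend on a small interface rather than on each other. Everything in it is an assumption for illustration: the `Store` interface, the `Receipt` type, `memStore`, and the `gateway` and `txProcessor` types are hypothetical, and the signatures do not claim to match the actual receipts package.

```go
package main

import (
	"fmt"
	"sync"
)

// Receipt is a hypothetical, heavily reduced receipt record.
type Receipt struct {
	ID     string
	TxHash string
}

// Store is a hypothetical interface that a shared package could expose.
type Store interface {
	AddReceipt(r Receipt) error
	// GetReceipt returns (nil, nil) when no receipt exists for id.
	GetReceipt(id string) (*Receipt, error)
}

// memStore is an in-memory Store that is safe for concurrent use.
type memStore struct {
	mu   sync.RWMutex
	data map[string]Receipt
}

func newMemStore() *memStore { return &memStore{data: map[string]Receipt{}} }

func (s *memStore) AddReceipt(r Receipt) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[r.ID] = r
	return nil
}

func (s *memStore) GetReceipt(id string) (*Receipt, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.data[id]
	if !ok {
		return nil, nil
	}
	return &r, nil
}

// gateway and txProcessor are two independent consumers of the same Store.
type gateway struct{ store Store }

func (g *gateway) lookup(id string) (*Receipt, error) { return g.store.GetReceipt(id) }

type txProcessor struct{ store Store }

// alreadySubmitted reports whether a receipt with a transaction hash exists.
func (p *txProcessor) alreadySubmitted(id string) (bool, error) {
	r, err := p.store.GetReceipt(id)
	if err != nil {
		return false, err
	}
	return r != nil && r.TxHash != "", nil
}

func main() {
	store := newMemStore()
	gw := &gateway{store: store}
	proc := &txProcessor{store: store}

	_ = store.AddReceipt(Receipt{ID: "msg-1", TxHash: "0xabc"})
	r, _ := gw.lookup("msg-1")
	dup, _ := proc.alreadySubmitted("msg-1")
	fmt.Println(r.TxHash, dup)
}
```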

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
…essor impl

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
…een persisted

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
// Then check LevelDB - we should find the entry
r, err := p.receiptStore.GetReceipt(inflight.msgID)
if err != nil {
	return false, err

@vdamle vdamle Aug 25, 2022

Do we need to add a richer return status to account for transient receipt store access issues? I'm wondering what the implication of indicating a false negative is for the application. Here, we appear to be saying "we don't know if there is a receipt with this ID or not, assume this is not idempotent".

@peterbroadhurst peterbroadhurst Aug 25, 2022

I did track through the result of this error return, and it will actually end up coming back as an error reply in Kafka, which would overwrite any earlier "good" reply if there was one.

The error would be very generic, just whatever came from the LevelDB/MongoDB persistence layer, rather than being specific to the idempotency check.

I couldn't think of a better option here:

  • Infinite retry felt wrong under the lock
  • Silently ignoring felt wrong, because we can't be sure an event went back at all in that case

But you're absolutely right, I should wrap this in a more detailed explanation!

Contributor Author

Thanks for this @vdamle - I've added a more descriptive error, but am open to other suggestions too
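
For readers following the thread, one common way to make such an error more descriptive is to wrap the persistence-layer error with context about the idempotency check. The sketch below uses the standard library's `fmt.Errorf` with `%w`; it is not the change made in this PR (EthConnect has its own errors package), and `receiptGetter`, `failingStore`, `checkDuplicate`, and the message text are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errStoreClosed stands in for whatever the persistence layer returns.
var errStoreClosed = errors.New("leveldb: closed")

// receiptGetter is a hypothetical minimal view of a receipt store.
type receiptGetter interface {
	GetReceipt(msgID string) (txHash string, err error)
}

// failingStore always fails, to exercise the error path.
type failingStore struct{}

func (failingStore) GetReceipt(string) (string, error) { return "", errStoreClosed }

// checkDuplicate reports whether msgID already has a transaction hash.
// A store failure is returned with context rather than being treated as
// "not a duplicate", so the caller can tell the two cases apart.
func checkDuplicate(store receiptGetter, msgID string) (bool, error) {
	txHash, err := store.GetReceipt(msgID)
	if err != nil {
		return false, fmt.Errorf("idempotency check failed for message %s: unable to query receipt store: %w", msgID, err)
	}
	return txHash != "", nil
}

func main() {
	_, err := checkDuplicate(failingStore{}, "msg-1")
	fmt.Println(err)
	// The original cause is still reachable through the wrapper.
	fmt.Println(errors.Is(err, errStoreClosed))
}
```

Wrapping with `%w` keeps the underlying cause available to `errors.Is`, while the message itself says which check failed and for which message.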

Contributor

Perfect, looks great now.

…tore

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
codecov-commenter commented Aug 26, 2022

Codecov Report

Merging #227 (59541cb) into main (a2a305b) will decrease coverage by 0.27%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
- Coverage   97.23%   96.95%   -0.28%     
==========================================
  Files          70       70              
  Lines        7660     7717      +57     
==========================================
+ Hits         7448     7482      +34     
- Misses        163      184      +21     
- Partials       49       51       +2     
Impacted Files                                   Coverage Δ
ethconnect/internal/events/logprocessor.go       98.31% <0.00%> (-1.69%) ⬇️
ethconnect/cmd/ethconnect.go                     91.12% <0.00%> (-1.38%) ⬇️
ethconnect/internal/errors/errors.go             100.00% <0.00%> (ø)
ethconnect/internal/tx/txnprocessor.go           100.00% <0.00%> (ø)
ethconnect/internal/messages/messages.go         100.00% <0.00%> (ø)
ethconnect/internal/rest/mongoreceipts.go
ethconnect/internal/rest/leveldbreceipts.go
ethconnect/internal/rest/memreceipts.go
ethconnect/internal/rest/mongwrapper.go
ethconnect/internal/receipts/mongoreceipts.go    100.00% <0.00%> (ø)
... and 7 more


@peterbroadhurst peterbroadhurst marked this pull request as ready for review August 31, 2022 14:11
@vdamle vdamle merged commit 0e6f5b0 into hyperledger:main Aug 31, 2022
@vdamle vdamle deleted the kafka-dup-check branch August 31, 2022 15:26