Add additional idempotence check to cover Kafka server restart, while EthConnect stays running #227

peterbroadhurst · 2022-08-25T20:50:05Z

EthConnect works on an "at target" nonce allocation (see the comparison with "at source" in the FFTM readme), meaning it gives an "at least once" delivery assurance to the blockchain, backed by Apache Kafka.

When the REST API Gateway is tuned for high performance, with an in-flight count in the many hundreds, and the Kafka servers are restarted, the consumer groups might redeliver many messages. This is undesirable, because each redelivered message can be submitted to the blockchain a second time.

EthConnect already has the concept of an idempotency key, on the front-side of the REST API Gateway, using ackmode as added in #175 to get an immediate receipt, combined with supplying your own custom ID. However, that is only checked in the REST API Gateway boundary layer today.

See the following diagram as a reference, showing how a Kafka at-least-once redelivery results in this duplication:

This PR proposes adding an additional idempotency check, at the point we receive the message from Kafka. Note this does not change the fundamental nature of the "at target" architecture: delivery is still at-least-once, and duplicates remain possible in some failure scenarios (for full idempotent delivery e2e with Ethereum nonces you would need the "at source" ordering architecture of https://github.com/hyperledger/firefly-evmconnect based on the new FFTM architecture).

But this PR does protect against redelivery caused by something like a planned HA rolling restart of a Kafka cluster.

The new check is only enabled when:

fly-ackmode=receipt has been specified in the transaction submission, via the APIs described in #175 (Add circuit breaker to stop runaway producers losing messages, and immediate receipt option)
The REST API Gateway and the Kafka<->Ethereum module are co-located in the same address space

The check covers two scenarios:

The transaction is already in-flight in TX Processor when the redelivery occurs
The transaction has already been assigned a transaction hash when the redelivery occurs
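
As a rough illustration of the two scenarios above, here is a minimal Go sketch. It is not the code from this PR: the `Processor`, `ReceiptStore`, `Receipt` and `memStore` names, and the convention that a nil receipt means "not found", are assumptions made for the example. It also ignores locking, which a real processor would need.

```go
package main

import "fmt"

// Receipt is a hypothetical, simplified stand-in for a stored receipt.
type Receipt struct {
	MsgID  string
	TxHash string // empty until a transaction hash has been assigned
}

// ReceiptStore is a hypothetical lookup interface. In this sketch a nil
// receipt with a nil error means "no entry for this ID".
type ReceiptStore interface {
	GetReceipt(msgID string) (*Receipt, error)
}

// memStore is an in-memory ReceiptStore used only for illustration.
type memStore map[string]*Receipt

func (m memStore) GetReceipt(msgID string) (*Receipt, error) {
	return m[msgID], nil
}

// Processor tracks in-flight message IDs and can consult the receipt store.
type Processor struct {
	inflight map[string]bool
	store    ReceiptStore
}

// IsDuplicate reports whether a message arriving from Kafka looks like a
// redelivery of one that has already been accepted.
func (p *Processor) IsDuplicate(msgID string) (bool, error) {
	// Scenario 1: the transaction is still in flight.
	if p.inflight[msgID] {
		return true, nil
	}
	// Scenario 2: a transaction hash has already been recorded for this ID.
	r, err := p.store.GetReceipt(msgID)
	if err != nil {
		return false, fmt.Errorf("idempotency check failed for %s: %w", msgID, err)
	}
	return r != nil && r.TxHash != "", nil
}

func main() {
	p := &Processor{
		inflight: map[string]bool{"msg-inflight": true},
		store: memStore{
			"msg-mined": {MsgID: "msg-mined", TxHash: "0x01"},
		},
	}
	for _, id := range []string{"msg-inflight", "msg-mined", "msg-new"} {
		dup, err := p.IsDuplicate(id)
		fmt.Println(id, dup, err)
	}
}
```

In this sketch only `msg-new` would be accepted for processing; the other two IDs are treated as redeliveries.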

One complexity in the change was making the receipt store accessible to both components. For that I moved it out into a new package called receipts.
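
The idea of one receipt store shared by two co-located components can be sketched as below. This is only an illustration under stated assumptions: the `Store` interface, `memStore`, `gateway` and `txProcessor` types and their methods are hypothetical and kept in a single file, whereas the PR places the store in a separate `receipts` package.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical receipt store API that both components depend on.
type Store interface {
	AddReceipt(id string, receipt map[string]interface{}) error
	GetReceipt(id string) (map[string]interface{}, error)
}

// memStore is an in-memory Store, guarded by a mutex because the two
// components may call it from different goroutines.
type memStore struct {
	mu       sync.RWMutex
	receipts map[string]map[string]interface{}
}

func newMemStore() *memStore {
	return &memStore{receipts: make(map[string]map[string]interface{})}
}

func (s *memStore) AddReceipt(id string, receipt map[string]interface{}) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.receipts[id] = receipt
	return nil
}

func (s *memStore) GetReceipt(id string) (map[string]interface{}, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.receipts[id], nil
}

// gateway stands in for the REST API Gateway, which writes receipts.
type gateway struct{ store Store }

func (g *gateway) recordReceipt(id, txHash string) error {
	return g.store.AddReceipt(id, map[string]interface{}{"transactionHash": txHash})
}

// txProcessor stands in for the Kafka<->Ethereum module, which reads them.
type txProcessor struct{ store Store }

// knownTxHash returns the recorded hash for id, or "" if there is none.
func (p *txProcessor) knownTxHash(id string) (string, error) {
	r, err := p.store.GetReceipt(id)
	if err != nil || r == nil {
		return "", err
	}
	h, _ := r["transactionHash"].(string)
	return h, nil
}

func main() {
	shared := newMemStore() // one instance, injected into both components
	gw := &gateway{store: shared}
	proc := &txProcessor{store: shared}

	if err := gw.recordReceipt("msg-1", "0xabc"); err != nil {
		panic(err)
	}
	h, err := proc.knownTxHash("msg-1")
	fmt.Println(h, err)
}
```

Depending on a small interface rather than a concrete store keeps either component testable with an in-memory implementation.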

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>


vdamle · 2022-08-25T22:06:37Z

internal/tx/txnprocessor.go

+	// Then check LevelDB - we should find the entry
+	r, err := p.receiptStore.GetReceipt(inflight.msgID)
+	if err != nil {
+		return false, err


Do we need to add a richer return status to account for transient receipt store access issues? I'm wondering what the implication of indicating a false negative is for the application. Here, we appear to be saying "we don't know if there is a receipt with this ID or not, assume this is not idempotent".

peterbroadhurst · 2022-08-25

I did track through the result of this error return, and it will actually end up coming back as an error reply in Kafka, which would overwrite any earlier "good" reply if there was one.

The error would be very generic, just whatever came from the LevelDB/MongoDB persistence layer, rather than being specific to the idempotency check.

I couldn't think of a better option here:

Infinite retry felt wrong under the lock

Silently ignoring felt wrong, because we can't be sure an event went back at all in that case

But you're absolutely right, I should wrap this in a more detailed explanation!

peterbroadhurst · 2022-08-26

Thanks for this @vdamle - I've added a more descriptive error, but am open to other suggestions too

vdamle · 2022-08-31

Perfect, looks great now.


codecov-commenter · 2022-08-26T01:36:41Z

Codecov Report

Merging #227 (59541cb) into main (a2a305b) will decrease coverage by 0.27%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #227      +/-   ##
==========================================
- Coverage   97.23%   96.95%   -0.28%     
==========================================
  Files          70       70              
  Lines        7660     7717      +57     
==========================================
+ Hits         7448     7482      +34     
- Misses        163      184      +21     
- Partials       49       51       +2

Impacted Files	Coverage Δ
ethconnect/internal/events/logprocessor.go	`98.31% <0.00%> (-1.69%)`	⬇️
ethconnect/cmd/ethconnect.go	`91.12% <0.00%> (-1.38%)`	⬇️
ethconnect/internal/errors/errors.go	`100.00% <0.00%> (ø)`
ethconnect/internal/tx/txnprocessor.go	`100.00% <0.00%> (ø)`
ethconnect/internal/messages/messages.go	`100.00% <0.00%> (ø)`
ethconnect/internal/rest/mongoreceipts.go
ethconnect/internal/rest/leveldbreceipts.go
ethconnect/internal/rest/memreceipts.go
ethconnect/internal/rest/mongwrapper.go
ethconnect/internal/receipts/mongoreceipts.go	`100.00% <0.00%> (ø)`
... and 7 more


peterbroadhurst added 4 commits August 25, 2022 08:09

Idempotency check on the way into the inflight pool

a5b01bc

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

Update initialization for idempotency check receipt store, and txprocessor impl

9a92d57

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

Add unit tests for idempotence

5993b07

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

Check on TX Hash, in case redelivery happens after full receipt has been persisted

c8ed44c

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

vdamle reviewed Aug 25, 2022


Provide more detailed error when idempotency check fails on receipt store

20aa7de

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

peterbroadhurst added 3 commits August 28, 2022 21:35

Better logging and handle redelivery with extra receipt

02eac70

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

Do not store receipts when we get a redelivery notification

59541cb

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

Store special record if we lose the reply

b873a7f

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>

peterbroadhurst marked this pull request as ready for review August 31, 2022 14:11

peterbroadhurst requested review from jimthematrix and awrichar as code owners August 31, 2022 14:11

vdamle approved these changes Aug 31, 2022


vdamle merged commit 0e6f5b0 into hyperledger:main Aug 31, 2022

vdamle deleted the kafka-dup-check branch August 31, 2022 15:26

peterbroadhurst mentioned this pull request Sep 1, 2022

Two phase init was not passing smartContractGW to receipt store #228

Merged
