Etcd sends a corrupt snapshot or a snapshot missing its hash in response to a snapshot API call, which causes the restoration to fail. #18340

ishan16696 opened this issue Jul 17, 2024 · 7 comments

@ishan16696
Contributor

ishan16696 commented Jul 17, 2024

What happened?

It has been observed that, during restoration of an etcd cluster from an etcd snapshot (taken via the snapshot API call), the snapshot was missing the hash value or was corrupted, which caused the restoration to fail.

FATA[0002] Failed to restore snapshot: failed to restore from the base snapshot: snapshot missing hash but --skip-hash-check=false

What did you expect to happen?

Is there a way to detect a corrupted snapshot or a missing hash early, right after taking the snapshot, rather than waiting until restoration?

I have the following methods in mind:

  1. Take a snapshot via the snapshot API call, start an embedded etcd with a different data-dir path, and try to restore from this latest snapshot. If the restoration fails, take the snapshot again (retry until the restoration succeeds); otherwise the snapshot isn't corrupted.

This method will work, but starting an embedded etcd and waiting for the restoration to complete can be a time-consuming and costly process.

  2. Is there a way to just calculate the hash of the db up to revision x and compare it with the hash of the snapshot (with the appended hash removed)? If it matches, the snapshot's integrity is intact; otherwise, retry taking the snapshot until the hashes match.

But I'm not sure how to calculate the hash of the db up to revision x. Is there any API call available for that?

I guess the HashKV API call won't work here: the value returned by HashKV up to revision x can't equal the hash of a snapshot taken up to revision x (with the appended hash removed), because HashKV calculates the hash of all MVCC key-values, whereas the snapshot is a copy of the etcd db, which also contains cluster information, so the hashes will not be the same.

  3. Calculate the hash of the snapshot (with the appended hash removed) and compare it with the hash value that etcd appends to the snapshot. If it matches, the snapshot's integrity is intact; otherwise, retry taking the snapshot until the hashes match. A sketch of this check follows below.
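
A minimal sketch of what I mean in option 3, assuming (as the restore error message and the server code later in this thread suggest) that the snapshot file is the db bytes followed by a 32-byte sha256 digest appended by the server; verifySnapshotHash is just an illustrative helper, not an etcd API:

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// verifySnapshotHash checks that the sha256 of everything before the final
// 32 bytes matches those final 32 bytes. It assumes the digest is present;
// etcdutl's restore additionally uses a file-size heuristic to detect a
// missing digest, which is skipped here for brevity.
func verifySnapshotHash(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	size, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}
	if size <= sha256.Size {
		return fmt.Errorf("file too small to contain an appended hash")
	}

	// Read the trailing 32-byte digest appended by the server.
	appended := make([]byte, sha256.Size)
	if _, err := f.ReadAt(appended, size-sha256.Size); err != nil {
		return err
	}

	// Hash the db portion (everything before the appended digest).
	h := sha256.New()
	if _, err := io.Copy(h, io.NewSectionReader(f, 0, size-sha256.Size)); err != nil {
		return err
	}
	if !bytes.Equal(h.Sum(nil), appended) {
		return fmt.Errorf("snapshot hash mismatch: snapshot is corrupt or incomplete")
	}
	return nil
}

func main() {
	if err := verifySnapshotHash(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("snapshot hash OK")
}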

How can we reproduce it (as minimally and precisely as possible)?

Not sure; it should be a rare scenario, but it might occur more often than we think, since we only learn about this error during restoration, which itself is rare because we don't restore frequently (due to persistent volumes).

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.4.26

$ etcdctl version
etcdctl version: 3.4.26
API version: 3.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

We don't have etcd logs.
@ahrtr
Member

ahrtr commented Jul 19, 2024

There are two possible reasons:

  • The "snapshot" wasn't actually a snapshot, it might be just copied from the db file directly. Please run etcdutl snapshot restore path-2-snapshot --skip-hash-check=true to double check;
  • Some errors happened when generating the snapshot. The command mentioned above should fail. Did you see any error on either the client side or the server side when generating the snapshot?

@ishan16696
Contributor Author

ishan16696 commented Jul 21, 2024

The "snapshot" wasn't actually a snapshot, it might be just copied from the db file directly.

It was a real snapshot: we call the snapshot API to take it (we name it a full-snapshot).

Please run etcdutl snapshot restore path-2-snapshot --skip-hash-check=true to double-check.

We tried, but the restoration failed with a fatal error:

INFO[0002] successfully fetched data of base snapshot in 1.5047977750000001 seconds [CompressionPolicy:gzip]  actor=restorer
unexpected fault address 0x776bdb053000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x776bdb053000 pc=0xf19fd6]

goroutine 1 [running]:
runtime.throw({0x207bb9d?, 0x0?})

Did you see any error on either the client side or the server side when generating the snapshot?

Unfortunately I don't have logs. I have seen this occur twice: the first time was in one of our test clusters, which doesn't have an observability stack, so I'm unable to get the logs; the other occurrence was reported by one of our community users: gardener/etcd-backup-restore#749

@ahrtr
Member

ahrtr commented Jul 21, 2024

We tried, but the restoration failed with a fatal error:

INFO[0002] successfully fetched data of base snapshot in 1.5047977750000001 seconds [CompressionPolicy:gzip]  actor=restorer
unexpected fault address 0x776bdb053000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x776bdb053000 pc=0xf19fd6]

goroutine 1 [running]:
runtime.throw({0x207bb9d?, 0x0?})

It means the snapshot operation actually failed. So the received snapshot isn't a complete snapshot.

@ishan16696
Contributor Author

So the received snapshot isn't a complete snapshot.

Yes, it seems so... Is there a way to verify the integrity of the snapshot either on the etcd side, before sending the snapshot, or on the etcd client side?

For verifying the integrity of the snapshot on the etcd client side, I thought of the following; it's similar to how the restoration verifies the snapshot before restoring.

Calculate the hash of the snapshot (with the appended hash removed) and compare it with the hash value that etcd appends to the snapshot. If it matches, the snapshot's integrity is intact; otherwise, retry taking the snapshot until the hashes match.

What do you think?
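
Roughly, something like this retry wrapper is what I have in mind (a sketch only; both callbacks are hypothetical: take would stream a snapshot to dbPath, e.g. via the Maintenance.Snapshot API, and verify would check the appended sha256 as sketched earlier in this issue):

package main

import (
	"context"
	"fmt"
	"time"
)

// snapshotWithVerify retries take-then-verify until the hash check passes or
// the attempts are exhausted.
func snapshotWithVerify(ctx context.Context, dbPath string, attempts int,
	take func(ctx context.Context, dbPath string) error,
	verify func(dbPath string) error,
) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = take(ctx, dbPath); lastErr != nil {
			continue
		}
		if lastErr = verify(dbPath); lastErr == nil {
			return nil // snapshot saved and its integrity verified
		}
		time.Sleep(time.Second) // simple fixed backoff between retries
	}
	return fmt.Errorf("snapshot not verified after %d attempts: %w", attempts, lastErr)
}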

@ahrtr
Member

ahrtr commented Jul 22, 2024

You need to use the client-side error to detect such a failure.

resp, err := ss.Recv()
if err != nil {
	switch err {
	case io.EOF:
		m.lg.Info("completed snapshot read; closing")
	default:
		m.lg.Warn("failed to receive from snapshot stream; closing", zap.Error(err))
	}
	pw.CloseWithError(err)
	return
}
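
From the caller's point of view, that error surfaces when reading from the io.ReadCloser returned by Maintenance.Snapshot, since the goroutine above closes the pipe with the error. A minimal sketch of the assumed caller-side handling (saveSnapshot and dbPath are illustrative names; the import path shown is the v3.5+ client module, 3.4 uses go.etcd.io/etcd/clientv3):

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// saveSnapshot streams a snapshot to dbPath and propagates any error the
// snapshot stream reported (injected via pw.CloseWithError in the client).
func saveSnapshot(ctx context.Context, cli *clientv3.Client, dbPath string) error {
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()

	f, err := os.Create(dbPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy returns nil on a clean EOF; any mid-stream failure shows up
	// as a non-nil error here, which is the signal to take the snapshot again.
	if _, err := io.Copy(f, rc); err != nil {
		return fmt.Errorf("snapshot stream failed: %w", err)
	}
	return f.Sync()
}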

Also, I just had a quick read of the server-side implementation, and it seems there is a minor issue with pb.SnapshotResponse.RemainingBytes: the value of total doesn't include the sha256 checksum, so when RemainingBytes == 0 the server side may not have sent out the checksum yet. But your issue isn't caused by this minor issue.

total := snap.Size()
size := humanize.Bytes(uint64(total))
start := time.Now()
ms.lg.Info("sending database snapshot to client",
	zap.Int64("total-bytes", total),
	zap.String("size", size),
	zap.String("storage-version", storageVersion),
)
for total-sent > 0 {
	// buffer just holds read bytes from stream
	// response size is multiple of OS page size, fetched in boltdb
	// e.g. 4*1024
	// NOTE: srv.Send does not wait until the message is received by the client.
	// Therefore the buffer can not be safely reused between Send operations
	buf := make([]byte, snapshotSendBufferSize)
	n, err := io.ReadFull(pr, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return togRPCError(err)
	}
	sent += int64(n)
	// if total is x * snapshotSendBufferSize. it is possible that
	// resp.RemainingBytes == 0
	// resp.Blob == zero byte but not nil
	// does this make server response sent to client nil in proto
	// and client stops receiving from snapshot stream before
	// server sends snapshot SHA?
	// No, the client will still receive non-nil response
	// until server closes the stream with EOF
	resp := &pb.SnapshotResponse{
		RemainingBytes: uint64(total - sent),
		Blob:           buf[:n],
		Version:        storageVersion,
	}
	if err = srv.Send(resp); err != nil {
		return togRPCError(err)
	}
	h.Write(buf[:n])
}
// send SHA digest for integrity checks
// during snapshot restore operation
sha := h.Sum(nil)
ms.lg.Info("sending database sha256 checksum to client",
	zap.Int64("total-bytes", total),
	zap.Int("checksum-size", len(sha)),
)
hresp := &pb.SnapshotResponse{RemainingBytes: 0, Blob: sha, Version: storageVersion}
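
So a client consuming the raw stream should not treat RemainingBytes == 0 as end-of-stream; it has to keep receiving until the server closes the stream with io.EOF, as the comments above note. A rough sketch of such a consumer (drainSnapshot is an illustrative name, not etcd code; the pb import path shown is the v3.5+ API module):

package main

import (
	"io"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

// drainSnapshot copies every received Blob (db bytes plus the trailing
// 32-byte sha256) to w and stops only on io.EOF, never when RemainingBytes
// reaches 0.
func drainSnapshot(stream pb.Maintenance_SnapshotClient, w io.Writer) error {
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			return nil // stream closed cleanly; the sha256 trailer was already written
		}
		if err != nil {
			return err // transport or server error; the snapshot is incomplete
		}
		if _, err := w.Write(resp.Blob); err != nil {
			return err
		}
	}
}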

@ishan16696
Contributor Author

You need to use the client-side error to detect such a failure.

We do handle the error on the client side while taking the etcd snapshot, but I guess the client side didn't throw any error.

@ishan16696
Contributor Author

The value of total doesn't include the sha256 checksum, so when RemainingBytes == 0 the server side may not have sent out the checksum yet. But your issue isn't caused by this minor issue.

Why is this issue not caused by that? TBH, to me it feels like it is: the server sends the snapshot but fails to send the sha256 checksum, and because of that no client-side error is detected, so the snapshot appears to have been taken successfully, but it then fails during restoration because the hash check/validation fails.
