Skip to content

Commit

Permalink
Add retry around wclayer operations for process isolated containers
Browse files Browse the repository at this point in the history
This change adds a simple retry loop to handle some behavior on RS5. Loopback VHDs
used to be mounted in a different manor on RS5 (ws2019) which led to some
very odd cases where things would succeed when they shouldn't have, or we'd simply
timeout if an operation took too long. Many parallel invocations of this code path
and stressing the machine seem to bring out the issues, but all of the possible failure
paths that bring about the errors we have observed aren't known.

On 19h1+ this retry loop shouldn't be needed, but the logic is to leave the loop if everything succeeded so this is harmless
and shouldn't need a version check.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
  • Loading branch information
dcantah committed Aug 3, 2021
1 parent 4775815 commit 76cd63c
Showing 1 changed file with 43 additions and 14 deletions.
57 changes: 43 additions & 14 deletions internal/layers/layers.go
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ import (
"path/filepath"

hcsschema "github.com/Microsoft/hcsshim/internal/hcs/schema2"
"github.com/Microsoft/hcsshim/internal/hcserror"
"github.com/Microsoft/hcsshim/internal/log"
"github.com/Microsoft/hcsshim/internal/ospath"
"github.com/Microsoft/hcsshim/internal/uvm"
Expand Down Expand Up @@ -84,23 +85,51 @@ func MountContainerLayers(ctx context.Context, layerFolders []string, guestRoot,
}
path := layerFolders[len(layerFolders)-1]
rest := layerFolders[:len(layerFolders)-1]
if err := wclayer.ActivateLayer(ctx, path); err != nil {
return "", err
}
defer func() {
if err != nil {
_ = wclayer.DeactivateLayer(ctx, path)
// Simple retry loop to handle some behavior on RS5. Loopback VHDs used to be mounted in a different manor on RS5 (ws2019) which led to some
// very odd cases where things would succeed when they shouldn't have, or we'd simply timeout if an operation took too long. Many
// parallel invocations of this code path and stressing the machine seem to bring out the issues, but all of the possible failure paths
// that bring about the errors we have observed aren't known.
//
// On 19h1+ this *shouldn't* be needed, but the logic is to break if everything succeeded so this is harmless and shouldn't need a version check.
for i := 0; i < 5; i++ {
lErr := func() error {
if err := wclayer.ActivateLayer(ctx, path); err != nil {

This comment has been minimized.

Copy link
@msscotb

msscotb Aug 3, 2021

Contributor

Why does the loop include ActivateLayer? Have we seen problems there in addition to PrepareLayer?

return err
}
defer func() {
if err != nil {
_ = wclayer.DeactivateLayer(ctx, path)
}
}()

if err := wclayer.PrepareLayer(ctx, path, rest); err != nil {
return err
}
defer func() {
if err != nil {
_ = wclayer.UnprepareLayer(ctx, path)
}
}()

return nil
}()

if lErr != nil {
// Common errors seen from the RS5 behavior mentioned above is ERROR_NOT_READY and ERROR_DEVICE_NOT_CONNECTED. The former occurs when HCS
// tries to grab the volume path of the disk but it doesn't succeed, usually because the disk isn't actually mounted. DEVICE_NOT_CONNECTED
// has been observed after launching multiple containers in parallel on a machine under high load. This has also been observed to be a trigger
// for ERROR_NOT_READY as well.
if hcserr, ok := lErr.(*hcserror.HcsError); ok {
if hcserr.Err == windows.ERROR_NOT_READY || hcserr.Err == windows.ERROR_DEVICE_NOT_CONNECTED {
continue
}
}
// This was a failure case outside of the commonly known error conditions, don't retry here.
return "", lErr
}
}()

if err := wclayer.PrepareLayer(ctx, path, rest); err != nil {
return "", err
break
}
defer func() {
if err != nil {
_ = wclayer.UnprepareLayer(ctx, path)
}
}()

mountPath, err := wclayer.GetLayerMountPath(ctx, path)
if err != nil {
Expand Down

0 comments on commit 76cd63c

Please sign in to comment.