Add retry around wclayer operations for process isolated containers #1091

dcantah · 2021-08-02T23:47:27Z

This change adds a simple retry loop to handle some behavior on RS5. Loopback VHDs
used to be mounted in a different manner on RS5 (ws2019), which led to some
very odd cases where things would succeed when they shouldn't have, or we'd simply
time out if an operation took too long. Running many parallel invocations of this code path
while stressing the machine seems to bring out the issues, but not all of the failure
paths behind the errors we have observed are known.

On 19h1+ this retry loop shouldn't be needed, but the loop exits as soon as everything succeeds, so it is harmless there
and shouldn't need a version check.
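
To illustrate the shape of the loop described above, here is a minimal sketch. It is not the code from this PR: `retry`, `flakyOp`, and `errNotReady` are hypothetical names invented for the example, and the attempt count and delay are arbitrary. It shows why no version check is needed: the loop returns on the first success, so a host that never hits the failure only pays for a single call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for a transient layer failure;
// it is not an hcsshim identifier.
var errNotReady = errors.New("device not ready")

// retry calls op up to attempts times, sleeping delay between failed attempts.
// It returns nil as soon as op succeeds, so a healthy host makes exactly one call.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Do not sleep after the final failed attempt.
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w", attempts, err)
}

// flakyOp returns an operation that fails the first `failures` times it is
// called and succeeds afterwards, plus a pointer to its call counter.
func flakyOp(failures int) (op func() error, calls *int) {
	calls = new(int)
	op = func() error {
		*calls++
		if *calls <= failures {
			return errNotReady
		}
		return nil
	}
	return op, calls
}

func main() {
	op, calls := flakyOp(2)
	err := retry(5, 10*time.Millisecond, op)
	fmt.Println(err, *calls) // prints "<nil> 3"
}
```

Wrapping the last error with `%w` keeps the underlying failure available to `errors.Is` when every attempt fails.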

Signed-off-by: Daniel Canter <dcanter@microsoft.com>

dcantah · 2021-08-03T00:04:31Z

Should hopefully help #919

dcantah · 2021-08-03T17:13:31Z

@msscotb I did when trying to get the PrepareLayer issue to reproduce 😆 I got ERROR_DEVICE_NOT_CONNECTED

dcantah · 2021-08-05T00:20:52Z

@msscotb Any other feedback for this?

msscotb · 2021-08-05T05:58:27Z

internal/layers/layers.go

+				}
+
+				defer func() {
+					if err != nil {


Doesn't err need to be set to PrepareLayer result for the deferred DeactivateLayer to execute?

dcantah · 2021-08-05

Nope. With a named return value, e.g. (err error), whatever is returned at line 107 or from the PrepareLayer call gets assigned to err before the deferred function runs. So when the defer runs, it has the return value of PrepareLayer to check against.

Here's a quick example: https://play.golang.org/p/cID3RHPwl88
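
For readers who can't open the playground link, a self-contained sketch of the same language rule follows. It is not the code from layers.go: `mountLayer`, `errPrepare`, and the `cleanedUp` flag are hypothetical and exist only to make the behavior observable. The flag stands in for the deferred DeactivateLayer call.

```go
package main

import (
	"errors"
	"fmt"
)

// errPrepare is a hypothetical error standing in for a failed prepare step.
var errPrepare = errors.New("prepare failed")

// mountLayer is a hypothetical stand-in for an activate-then-prepare sequence.
// Both results are named, so the return statement assigns them before any
// deferred function runs.
func mountLayer(prepare func() error) (cleanedUp bool, err error) {
	defer func() {
		// err already holds the value produced by the return statement below.
		if err != nil {
			cleanedUp = true // stands in for the deferred cleanup call
		}
	}()
	return false, prepare()
}

func main() {
	cleaned, err := mountLayer(func() error { return errPrepare })
	fmt.Println(cleaned, err) // prints "true prepare failed"

	cleaned, err = mountLayer(func() error { return nil })
	fmt.Println(cleaned, err) // prints "false <nil>"
}
```

The deferred function can both read and modify named results, which is why the cleanup check works without an explicit assignment to err before the return.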

katiewasnothere

lgtm

AbelHu · 2021-11-11T02:41:45Z

@dcantah @msscotb Will this fix be included in moby? We have seen this failure in docker many times.

Related work items: microsoft#930, microsoft#962, microsoft#1004, microsoft#1008, microsoft#1039, microsoft#1045, microsoft#1046, microsoft#1047, microsoft#1052, microsoft#1053, microsoft#1054, microsoft#1057, microsoft#1058, microsoft#1060, microsoft#1061, microsoft#1063, microsoft#1064, microsoft#1068, microsoft#1069, microsoft#1070, microsoft#1071, microsoft#1074, microsoft#1078, microsoft#1079, microsoft#1081, microsoft#1082, microsoft#1083, microsoft#1084, microsoft#1088, microsoft#1090, microsoft#1091, microsoft#1093, microsoft#1094, microsoft#1096, microsoft#1098, microsoft#1099, microsoft#1102, microsoft#1103, microsoft#1105, microsoft#1106, microsoft#1108, microsoft#1109, microsoft#1115, microsoft#1116, microsoft#1122, microsoft#1123, microsoft#1126

Add retry around wclayer operations for process isolated containers

dcantah requested a review from a team as a code owner August 2, 2021 23:47

dcantah force-pushed the retry-layerops branch 2 times, most recently from e0edb8f to 0df4d76 Compare August 2, 2021 23:54

dcantah linked an issue Aug 3, 2021 that may be closed by this pull request

ContainerCannotRun: hcsshim::PrepareLayer - failed failed in Win32: The device is not ready. (0x15) #919

Closed

dcantah force-pushed the retry-layerops branch 2 times, most recently from b483e89 to 76cd63c Compare August 3, 2021 04:58

dcantah force-pushed the retry-layerops branch from 76cd63c to 01b9911 Compare August 5, 2021 02:08

msscotb reviewed Aug 5, 2021

msscotb approved these changes Aug 5, 2021

katiewasnothere approved these changes Aug 5, 2021

dcantah merged commit b8f71ac into microsoft:master Aug 6, 2021

marosset mentioned this pull request Aug 12, 2021

Projected downwardAPI should provide container's memory request fails in Windows kubernetes/kubernetes#101908

Closed

zhiweiv mentioned this pull request Aug 13, 2021

Pods unable to be created on windows nodes microsoft/Windows-Containers#109

Closed

dcantah mentioned this pull request Aug 26, 2021

[release/0.8] Cherry-pick PrepareLayer fixes #1131

Merged

claudiubelu mentioned this pull request Oct 20, 2021

Windows: "Win32: The device is not ready" random failures kubernetes/kubernetes#105784

Closed

princepereira pushed a commit to princepereira/hcsshim that referenced this pull request Aug 29, 2024

Merge pull request microsoft#1091 from dcantah/retry-layerops

8e73ef5

Add retry around wclayer operations for process isolated containers
