Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry around wclayer operations for process isolated containers #1091

Merged
merged 1 commit into from
Aug 6, 2021

Conversation

dcantah
Copy link
Contributor

@dcantah dcantah commented Aug 2, 2021

This change adds a simple retry loop to handle some behavior on RS5. Loopback VHDs
used to be mounted in a different manor on RS5 (ws2019) which led to some
very odd cases where things would succeed when they shouldn't have, or we'd simply
timeout if an operation took too long. Many parallel invocations of this code path
and stressing the machine seem to bring out the issues, but all of the possible failure
paths that bring about the errors we have observed aren't known.

On 19h1+ this retry loop shouldn't be needed, but the logic is to leave the loop if everything succeeded so this is harmless
and shouldn't need a version check.

Signed-off-by: Daniel Canter dcanter@microsoft.com

@dcantah dcantah requested a review from a team as a code owner August 2, 2021 23:47
@dcantah dcantah force-pushed the retry-layerops branch 2 times, most recently from e0edb8f to 0df4d76 Compare August 2, 2021 23:54
@dcantah
Copy link
Contributor Author

dcantah commented Aug 3, 2021

Should hopefully help #919

@dcantah
Copy link
Contributor Author

dcantah commented Aug 3, 2021

@msscotb I did when trying to get the PrepareLayer issue to reproduce 😆 I got ERROR_DEVICE_NOT_CONNECTED

@dcantah
Copy link
Contributor Author

dcantah commented Aug 5, 2021

@msscotb Any other feedback for this?

This change adds a simple retry loop to handle some behavior on RS5. Loopback VHDs
used to be mounted in a different manor on RS5 (ws2019) which led to some
very odd cases where things would succeed when they shouldn't have, or we'd simply
timeout if an operation took too long. Many parallel invocations of this code path
and stressing the machine seem to bring out the issues, but all of the possible failure
paths that bring about the errors we have observed aren't known.

On 19h1+ this retry loop shouldn't be needed, but the logic is to leave the loop if everything succeeded so this is harmless
and shouldn't need a version check.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
}

defer func() {
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't err need to be set to PrepareLayer result for the deferred DeactivateLayer to execute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, if you have a named return value, e.g. (err error) then the return value of line 107 or the PrepareLayer call will get assigned to err after completion. So when defer runs it will have the return value of PrepareLayer to check against.

Here's a quick example: https://play.golang.org/p/cID3RHPwl88

Copy link
Contributor

@katiewasnothere katiewasnothere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@AbelHu
Copy link

AbelHu commented Nov 11, 2021

@dcantah @msscotb Will this fix be included in moby? We have seen this failure in docker many times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ContainerCannotRun: hcsshim::PrepareLayer - failed failed in Win32: The device is not ready. (0x15)
4 participants