
Add reconnect logic for stdio pipes #1197

Merged (1 commit) on Oct 29, 2021

Conversation

@dcantah (Contributor) commented Oct 15, 2021

This change adds retry logic on the stdio relay if the server end of the named pipe
disconnects. This is a common case if containerd restarts, for example.
The current approach is to make an io.Writer wrapper that handles the
reconnection logic on a write failure if it can be determined that the error
is from a disconnect.

This exposes a new shim config option to tailor the retry timeout as well as
an annotation so it can be set per container.

Signed-off-by: Daniel Canter dcanter@microsoft.com
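For context, here is a minimal sketch of what such a reconnecting writer can look like. Everything below (the reconnectWriter type, dialPipe, and isDisconnectError) uses hypothetical names and placeholder logic for illustration; it is not the PR's actual code.

// Hypothetical sketch of a reconnecting writer. The real change wraps a
// Windows named-pipe connection; dialPipe and isDisconnectError are stand-ins.
package cmd

import (
    "errors"
    "net"
    "time"
)

type reconnectWriter struct {
    conn    net.Conn
    pipe    string        // pipe path to re-dial on disconnect
    timeout time.Duration // how long to keep retrying before giving up
}

// isDisconnectError reports whether err looks like the server end going away;
// the real check would be named-pipe specific.
func isDisconnectError(err error) bool {
    return errors.Is(err, net.ErrClosed) // placeholder heuristic
}

// dialPipe re-dials the pipe; on Windows this would use go-winio's DialPipe.
func dialPipe(path string) (net.Conn, error) {
    return net.Dial("unix", path) // placeholder for the real dial
}

func (w *reconnectWriter) Write(p []byte) (int, error) {
    n, err := w.conn.Write(p)
    if err == nil || !isDisconnectError(err) {
        return n, err
    }
    // Server end went away (e.g. containerd restarted): re-dial until the
    // configured timeout elapses, then give up with the original error.
    deadline := time.Now().Add(w.timeout)
    for time.Now().Before(deadline) {
        if c, dErr := dialPipe(w.pipe); dErr == nil {
            w.conn = c
            return w.conn.Write(p)
        }
        time.Sleep(time.Second)
    }
    return n, err
}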

@dcantah (Contributor, Author) commented Oct 18, 2021

Interesting linter warning lol. I feel it's helpful to include the unit regardless but let me know if we'd rather just ignore this in the linter.

Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
Error: ST1011: var retryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
Error: ST1011: var timeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)

@kevpar (Member) commented Oct 19, 2021

> Interesting linter warning lol. I feel it's helpful to include the unit regardless but let me know if we'd rather just ignore this in the linter.
>
> Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
> Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
> Error: ST1011: var ioRetryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
> Error: ST1011: var retryTimeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)
> Error: ST1011: var timeoutInSec is of type time.Duration; don't use unit-specific suffix "Sec" (stylecheck)

Once it's a time.Duration it doesn't have any implicit units anymore though, does it? Adding Sec here seems misleading for a duration value.
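Concretely, the rename the linter is asking for is just dropping the suffix, since the Duration value carries its own units. The 10-second value and the ioRetryTimeout name below are only illustrative, not the PR's actual default:

// Flagged by ST1011: the "Sec" suffix implies a unit the type already encodes.
var ioRetryTimeoutInSec time.Duration = 10 * time.Second

// Preferred: no unit suffix; the value is simply "an amount of time".
var ioRetryTimeout = 10 * time.Second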

@kevpar (Member) commented Oct 19, 2021

I thought it might be helpful to provide some context on the use of time.Duration values in Go. The general idea of this type is it represents an abstract "amount of time" rather than any specific units. Though it is actually backed by an int64 representing nanoseconds, it's preferred to not rely on this detail. Any time you want to convert some specific number of time units (e.g. "5 seconds") into a Duration, or vice-versa, you should do so by multiplying or dividing by the time unit constants.

e.g.

var d time.Duration = 5 * time.Second                      // Create a duration representing 5 seconds.
var timeInMilliseconds int64 = int64(d / time.Millisecond) // Convert d back into a number of millisecond time units.

More info on this can be found under the Constants section of the time package docs.

@dcantah (Contributor, Author) commented Oct 19, 2021

@kevpar Welp, changing everything to InSec was not worth it. Swapping back.

@dcantah (Contributor, Author) commented Oct 19, 2021

Error: SA5011: possible nil pointer dereference (staticcheck)
Error: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
Error: SA5011: possible nil pointer dereference (staticcheck)
Error: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
Error: SA5011: possible nil pointer dereference (staticcheck)
Error: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)

Not seeing this lint error locally; I assume it's referring to the conn on the retryWriter. Also, I just changed that back to what it used to be, so I'm even more confused. I'm running the same version locally and everything. I wish it would display which line it's referring to.

@dcantah force-pushed the retry-stdio-conns2 branch 2 times, most recently from 3de0457 to ef166c0 on October 19, 2021 at 14:49
@dcantah (Contributor, Author) commented Oct 19, 2021

I'm not seeing this on the runs on my fork when I push either: https://github.com/dcantah/hcsshim/runs/3940750279?check_suite_focus=true. Trying to squash here and see if it does anything.

Edit: And the answer's a no.

@msscotb (Contributor) left a comment

LGTM pending resolution of the time.Duration warnings.

@kevpar (Member) commented Oct 19, 2021

I question whether the IO retry timeout should actually be configurable via annotation. In general I think trying to make everything configurable all the time just results in a messier, harder to understand system. I can't think of a good use case where someone would want different IO retry timeouts for different containers. Can you give some context on your thinking here?

@dcantah (Contributor, Author) commented Oct 19, 2021

@kevpar My thinking started and ended with the fact that most of the deployment-wide config options (the containerd shim options) have annotation equivalents, so this was mostly following the status quo. It's also just a quick way to check whether it's set without passing around a timeout field or passing the shim options to newHcsTask and pals.

@dcantah (Contributor, Author) commented Oct 19, 2021

> I question whether the IO retry timeout should actually be configurable via annotation. In general I think trying to make everything configurable all the time just results in a messier, harder to understand system. I can't think of a good use case where someone would want different IO retry timeouts for different containers. Can you give some context on your thinking here?

I replied above as well, but I don't have a strong justification (or opinion) for this being configurable per container. The only use case I can really think of would be tests. I can remove it.

@dcantah force-pushed the retry-stdio-conns2 branch 3 times, most recently from dbdd9b5 to e30f98f on October 21, 2021 at 13:32
@dcantah (Contributor, Author) commented Oct 21, 2021

This linter issue makes no sense to me. I don't see it locally on the same version, and even installing staticcheck directly and running it doesn't make this pop up. What the heck is going on?

Edit: If you set it to only check for issues in the current PR, the linter error goes away. Guess it was something in the task_hcs.go file that was already there?

@dcantah force-pushed the retry-stdio-conns2 branch 2 times, most recently from 9017e5a to c9cc26a on October 25, 2021 at 20:36
@ambarve (Contributor) left a comment

LGTM!

@kevpar (Member) commented Oct 27, 2021

> PR Feedback #1
>
> • Get rid of annotation to configure per container.
> • Swap to context.Background() instead of TODO()
> • Add back in return after comment that was an accident.
>
> Signed-off-by: Daniel Canter dcanter@microsoft.com

Pretty sure the #1 in this commit will cause an issue reference.

@dcantah (Contributor, Author) commented Oct 27, 2021

> PR Feedback #1
>
> • Get rid of annotation to configure per container.
> • Swap to context.Background() instead of TODO()
> • Add back in return after comment that was an accident.
>
> Signed-off-by: Daniel Canter dcanter@microsoft.com
>
> Pretty sure the #1 in this commit will cause an issue reference.

It's just going to get squashed on check-in

@kevpar (Member) commented Oct 27, 2021

> PR Feedback #1
>
> • Get rid of annotation to configure per container.
> • Swap to context.Background() instead of TODO()
> • Add back in return after comment that was an accident.
>
> Signed-off-by: Daniel Canter dcanter@microsoft.com
>
> Pretty sure the #1 in this commit will cause an issue reference.
>
> It's just going to get squashed on check-in

IMO just something to keep in mind and avoid in the future.

@msscotb (Contributor) left a comment

LGTM

@dcantah (Contributor, Author) commented Oct 27, 2021

@kevpar You'd given this a couple rounds of review, so even though we're at the usual two approvals, let me know if you want to give this one more scan.


func (nprw *nPipeRetryWriter) Write(p []byte) (n int, err error) {
    for {
        n, err = nprw.Conn.Write(p)
@kevpar (Member) commented:
There is a case where we could write n > 0 bytes, but also get a disconnected error. You may want to track how many bytes we have written so far, and write from that position in p.
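A rough sketch of what tracking the written offset could look like, building on the excerpt above; isDisconnectError and reconnect are hypothetical helpers, and this is not the PR's final code:

func (nprw *nPipeRetryWriter) Write(p []byte) (int, error) {
    written := 0
    for {
        n, err := nprw.Conn.Write(p[written:])
        written += n
        if err == nil || written == len(p) {
            return written, err
        }
        if !isDisconnectError(err) { // hypothetical helper
            return written, err
        }
        // Partial write followed by a disconnect: reconnect and resume from
        // the current offset instead of rewriting from the start of p.
        if rerr := nprw.reconnect(); rerr != nil { // hypothetical helper
            return written, rerr
        }
    }
}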

@dcantah (Contributor, Author): +1

@dcantah (Contributor, Author): Added

@kevpar (Member): I'm not seeing any change here?

@kevpar (Member): Nevermind, GitHub had me on an old revision.

@kevpar (Member) left a comment

LGTM

@kevpar (Member) commented Oct 28, 2021

Looks like CI is failing though...

@dcantah (Contributor, Author) commented Oct 28, 2021

> Looks like CI is failing though...

Yep 😑 I can't make sense of this: it doesn't fail locally, it passed on the last revision, and now it's back. It ALSO doesn't fail on my fork when I push, which kicks off the CI as well. Color me confused. I left it from last night so you could see what I was talking about.

Edit: And now it passes 😂

@dcantah (Contributor, Author) commented Oct 28, 2021

Gonna squash the commits and check-in

@kevpar (Member) commented Oct 29, 2021

> Gonna squash the commits and check-in

Seems reasonable. We can look into the CI flakiness separately.

This change adds retry logic on the stdio relay if the server end of the named pipe
disconnects. This is a common case if containerd restarts, for example.
The current approach is to make an io.Writer wrapper that handles the
reconnection logic on a write failure if it can be determined that the error
is from a disconnect. A new shim config option is exposed to tailor the retry timeout.

This change also adds cenkalti/backoff/v4 as a dependency to be used for handling
the exponential backoff logic for the stdio connection retry attempts. Retrying
at a fixed interval is a bit naive, as all of the shims would potentially
be trying to reconnect to 3 pipes continuously, all in <timeout> bursts.
This allows us to space out the connections, set an upper limit on retry
intervals, and add an element of randomness to the retry attempts.

Signed-off-by: Daniel Canter <dcanter@microsoft.com>
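As a rough illustration of how cenkalti/backoff/v4 can drive those reconnect attempts: the intervals and the redialPipe helper below are assumptions for the sketch, not the values or code the PR actually uses.

package main

import (
    "fmt"
    "net"
    "time"

    "github.com/cenkalti/backoff/v4"
)

// redialPipe is a hypothetical stand-in for re-dialing the named pipe
// (on Windows this would be done with go-winio's DialPipe).
func redialPipe(path string) (net.Conn, error) {
    return net.Dial("unix", path)
}

// reconnectWithBackoff retries the dial with exponential backoff and jitter
// so many shims don't all hammer the pipe server in lockstep bursts.
func reconnectWithBackoff(path string, timeout time.Duration) (net.Conn, error) {
    expo := backoff.NewExponentialBackOff()
    expo.InitialInterval = 200 * time.Millisecond // illustrative values only
    expo.MaxInterval = 2 * time.Second            // upper bound between attempts
    expo.MaxElapsedTime = timeout                 // stop retrying after the configured timeout

    var conn net.Conn
    err := backoff.Retry(func() error {
        c, dErr := redialPipe(path)
        if dErr != nil {
            return dErr // retried after the next (randomized) interval
        }
        conn = c
        return nil
    }, expo)
    if err != nil {
        return nil, fmt.Errorf("reconnecting to %s: %w", path, err)
    }
    return conn, nil
}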