Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.17.0] Context deadline exceeded (timeout) cloning large repo #20680

Closed
parnic opened this issue Aug 5, 2022 · 6 comments · Fixed by #20689
Closed

[1.17.0] Context deadline exceeded (timeout) cloning large repo #20680

parnic opened this issue Aug 5, 2022 · 6 comments · Fixed by #20689
Labels
Milestone

Comments

@parnic
Copy link
Contributor

parnic commented Aug 5, 2022

Description

Starting with v1.17.0, one of our large repos (70+ GB, using LFS) is failing trying to clone on some connections.

When the clone fails, the client reports some variation of:

error: 24576 bytes of body are still expected0 MiB | 1.83 MiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

and the server logs show, e.g.:

2022/08/05 03:27:25 ...ers/web/repo/http.go:484:serviceRPC() [E] [62ec8cb5-3] Fail to serve RPC(upload-pack) in /mnt/gitea/data/repositories/org/repo.git: context deadline exceeded -

I tracked this down to a change in commit 35fdefc for #18363 where serviceRPC() was no longer allowed to run indefinitely, but carries a context with an unchangeable default timeout of 360 seconds (once I found this, I confirmed that our clones were dying around the 6 minute mark).

edit: collapsed this workaround because a PR is up with a different approach

We are locally using a custom build with this quick patch in order to set an extremely large timeout to allow us to work around the issue:

diff --git a/modules/setting/setting.go b/modules/setting/setting.go
index f5e624946..585c4589c 100644
--- a/modules/setting/setting.go
+++ b/modules/setting/setting.go
@@ -125,6 +125,7 @@ var (
        StartupTimeout       time.Duration
        PerWriteTimeout      = 30 * time.Second
        PerWritePerKbTimeout = 10 * time.Second
+       ServiceRpcTimeout    time.Duration
        StaticURLPrefix      string
        AbsoluteAssetURL     string

@@ -723,6 +724,7 @@ func loadFromConf(allowEmpty bool, extraConfig string) {
        StartupTimeout = sec.Key("STARTUP_TIMEOUT").MustDuration(0 * time.Second)
        PerWriteTimeout = sec.Key("PER_WRITE_TIMEOUT").MustDuration(PerWriteTimeout)
        PerWritePerKbTimeout = sec.Key("PER_WRITE_PER_KB_TIMEOUT").MustDuration(PerWritePerKbTimeout)
+       ServiceRpcTimeout = sec.Key("SERVICE_RPC_TIMEOUT").MustDuration(ServiceRpcTimeout)

        defaultAppURL := string(Protocol) + "://" + Domain
        if (Protocol == HTTP && HTTPPort != "80") || (Protocol == HTTPS && HTTPPort != "443") {
diff --git a/routers/web/repo/http.go b/routers/web/repo/http.go
index 6a85bca16..82835cb62 100644
--- a/routers/web/repo/http.go
+++ b/routers/web/repo/http.go
@@ -474,11 +474,12 @@ func serviceRPC(ctx gocontext.Context, h serviceHandler, service string) {
        cmd := git.NewCommand(h.r.Context(), service, "--stateless-rpc", h.dir)
        cmd.SetDescription(fmt.Sprintf("%s %s %s [repo_path: %s]", git.GitExecutable, service, "--stateless-rpc", h.dir))
        if err := cmd.Run(&git.RunOpts{
-               Dir:    h.dir,
-               Env:    append(os.Environ(), h.environ...),
-               Stdout: h.w,
-               Stdin:  reqBody,
-               Stderr: &stderr,
+               Dir:     h.dir,
+               Env:     append(os.Environ(), h.environ...),
+               Stdout:  h.w,
+               Stdin:   reqBody,
+               Stderr:  &stderr,
+               Timeout: setting.ServiceRpcTimeout,
        }); err != nil {
                if err.Error() != "signal: killed" {
                        log.Error("Fail to serve RPC(%s) in %s: %v - %s", service, h.dir, err, stderr.String())

I can make a PR out of this if this is desired, but I'd kind of rather this go back to the old behavior instead of adding a new 6-minute timeout footgun where, even if targeted logging was added to point people to correcting the failure, admins would need to find this config variable and change it for any decently-large repo. 1.16.x's behavior was much preferred.

Gitea Version

1.17.0

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.37.1

Operating System

Ubuntu 20.04 aarch64

How are you running Gitea?

Linux arm64 release on an AWS instance

Database

PostgreSQL

@parnic parnic added the type/bug label Aug 5, 2022
@lunny
Copy link
Member

lunny commented Aug 5, 2022

Yes, please. And I think you can give a default timeout value.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Aug 5, 2022

IMO there is no need to introduce another config option.

There is a context for serviceRPC, the timeout should be controlled by it. It might be possible to make Git's Run respect the context.Deadline() or have a long-enough timeout.

@parnic
Copy link
Contributor Author

parnic commented Aug 5, 2022

IMO there is no need to introduce another config option.

I agree.

There is a context for serviceRPC, the timeout should be controlled by it. It might be possible to make Git's Run respect the context.Deadline()

I added some logging to Run(), and, in my limited testing, only a few contexts come in with a deadline set on them (serviceRPC's context has no deadline). Would it be better here to fall back to the context deadline whenever the given Timeout is <= 0 and skip setting a timeout at all if both of those are unspecified? I guess that wouldn't work since then we'd never be using this fallback default value. I think no matter what, serviceRPC() should be able to indicate that it doesn't want to be cut off just because a timeout expired. Maybe a fourth Run() variant that is timeout-free is appropriate?

or have a long-enough timeout.

To me, the problem with this is, as always, "how long is long enough?" 6 minutes may have seemed long enough, but when the scenario is "however long upload-pack takes to deliver an arbitrarily-sized repository through an arbitrarily-fast network connection," I'm not sure there can ever be a "long enough." It's especially problematic in this case because even if you choose something that seems "long enough," if it fails for someone there is no way for the admin to make changes to allow it to complete. I've set our timeout here to 24 hours, for example, but I can tell you that if my clone fails 24 hours in for no other reason than "there was a timeout set" and I have to start over again, depending on what part of the clone it's at, I'd be less-than-thrilled as a result.

@wxiaoguang
Copy link
Contributor

Yup, at the moment, the git.Run will use default timeout if Timeout<=0.

IMO there might be 2 methods:

  • Method 1: If there is a deadline of the use serviceRPC context, use it. If no, use a long-enough timeout (for example, 1 year) timeout.
  • Method 2: Introduce a new const like UseContextTimeout = -2. Then Run(Timeout: UseContextTimeout), if there is no deadline in the context, then do not use timeout when running the git command.

parnic added a commit to parnic/gitea that referenced this issue Aug 5, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes go-gitea#20680
parnic-sks pushed a commit to parnic-sks/gitea that referenced this issue Aug 5, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes go-gitea#20680
@wxiaoguang wxiaoguang added this to the 1.17.1 milestone Aug 6, 2022
lafriks pushed a commit that referenced this issue Aug 6, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes #20680

Co-authored-by: Lunny Xiao <xiaolunwen@gmail.com>
parnic added a commit to parnic/gitea that referenced this issue Aug 6, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes go-gitea#20680

Co-authored-by: Lunny Xiao <xiaolunwen@gmail.com>
wxiaoguang pushed a commit that referenced this issue Aug 7, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes #20680
vsysoev pushed a commit to IntegraSDL/gitea that referenced this issue Aug 10, 2022
This enables git.Command's Run to optionally use the given context directly so its deadline will be respected. Otherwise, it falls back to the previous behavior of using the supplied timeout or a default timeout value of 360 seconds.

repo's serviceRPC() calls now use the context's deadline (which is unset/unlimited) instead of the default 6-minute timeout. This means that large repo clones will no longer arbitrarily time out on the upload-pack step, and pushes can take longer than 6 minutes on the receive-pack step.

Fixes go-gitea#20680

Co-authored-by: Lunny Xiao <xiaolunwen@gmail.com>
@ScottBeeson
Copy link

So... what do I do if I'm getting this error?

Clone: context deadline exceeded

@zeripath
Copy link
Contributor

zeripath commented Nov 10, 2022

https://docs.gitea.io/en-us/config-cheat-sheet/#git---timeout-settings-gittimeout

Adjust/set [git.timeout] MIGRATE appropriately.

zeripath added a commit to zeripath/gitea that referenced this issue Nov 10, 2022
There are far too many error reports regarding timeouts from migrations.
We should adjust error report to suggest increasing this timeout.

Ref go-gitea#20680

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue Nov 12, 2022
There are far too many error reports regarding timeouts from migrations.
We should adjust error report to suggest increasing this timeout.

Ref #20680

Signed-off-by: Andrew Thornton <art27@cantab.net>
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants