failed to get filesystem from image: connection reset by peer #1717

Closed
mosheavni opened this issue Aug 11, 2021 · 14 comments · Fixed by #2837 or #2853

@mosheavni

mosheavni commented Aug 11, 2021

Actual behavior
The kaniko build always fails with this error:

error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

This is the entire log:

INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to builder 
INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to networks 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Retrieving image my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 from registry my-registry.azurecr.io 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest my-registry.azurecr.io/backend/bb/base:latest 
INFO[0001] Retrieving image my-registry.azurecr.io/backend/bb/base:latest from registry my-registry.azurecr.io 
INFO[0001] Built cross stage deps: map[0:[/app/_build/deploy/rel/bb_release /usr/local/cuda-9.1] 1:[/app/nnconfig]] 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Unpacking rootfs as cmd RUN pip install jinja2-cli==0.7.0 requires it. 
error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

Expected behavior
Build to succeed

To Reproduce
Not sure it is reproducible; other builds are OK, but this specific image and a few others fail with a similar error.
The executor command used:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4

Additional Information

  • Dockerfile
    Please provide either the Dockerfile you're trying to build or one that can reproduce this error.
  • Build Context
    Please provide or clearly describe any files needed to build the Dockerfile (ADD/COPY commands)
  • Kaniko Image (fully qualified with digest): gcr.io/kaniko-project/executor:debug digest sha256:fcccd2ab9f3892e33fc7f2e950c8e4fc665e7a4c66f6a9d70b300d7a2103592f

I want to better understand the nature of this error, what can cause it, and what the possible fixes are.
Thanks.
Triage Notes for the Maintainers

Description Yes/No
Please check if this a new feature you are proposing
Please check if the build works in docker but not in kaniko
Please check if this error is seen when you use --cache flag
Please check if your dockerfile is a multistage dockerfile
@Crapshit

Crapshit commented Sep 9, 2021

We have the exact same issue; it was already reported in issue #1627.
It seems a fix that adds a new flag has already been merged (#1685, #6380). Waiting for a newer release (> 1.6.0).
I also have a support ticket open with the Microsoft Azure team; they said they found the root cause and are rolling out a fix.
But I don't have an ETA for this.

@Crapshit

I got feedback from Microsoft last Friday:

RCA:
Azure storage uses multiple frontend nodes within a single storage scale unit to serve multiple storage accounts. As part of regular maintenance of these nodes, we reboot them after draining the existing requests and then put them back into production. As a result of investigating this incident, we have learned that it is possible for storage front-end nodes to be rebooted while requests are still draining. This closes any existing connections to the front-end node, causing a connection reset in the associated pipelines. The precise cause of why the requests are not drained fully is still under investigation, but it is likely due to faulty feedback logic around when nodes get taken down.
Because the load balancer distributes requests evenly across many front-end nodes, clients are unlikely to experience a reset like this if they retry the request. We are still looking into ways we can proactively detect and mitigate these nodes in the short term. Longer term, we will have a permanent fix to prevent this issue from happening.

Resolution

  1. We have decreased the reboot frequency on the impacted storage scale units, spreading the reboots further apart to reduce the impact.
  2. We are further investigating validating that all requests finish draining before we reboot the front-end nodes.

@mehdibenfeguir

Any update on this issue?
I'm experiencing the exact same issue as @mosheavni.

@Crapshit

@mehdibenfeguir as a workaround we are using the "--image-fs-extract-retry" flag in Kaniko 1.7.0.
I have had no new reports of these connection resets since.
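
For reference, a minimal sketch of the issue's original executor invocation with this workaround flag added (the retry value of 5 is illustrative):

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4 \
  --image-fs-extract-retry=5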

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

Do you mean this image:
gcr.io/kaniko-project/executor:v1.7.0-debug ?

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

This is the result with --image-fs-extract-retry 5:

@mehdibenfeguir

The argument worked, but retrying gives the same result:

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

INFO[0012] Unpacking rootfs as cmd RUN mkdir -p /t && cp -r /twgl/common-service/.gradle /t requires it. 
WARN[0037] Retrying operation after 1s due to read tcp MY_FIRST_IP:38536->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0061] Retrying operation after 2s due to read tcp MY_FIRST_IP:38960->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0089] Retrying operation after 4s due to read tcp MY_FIRST_IP:39254->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0123] Retrying operation after 8s due to read tcp MY_FIRST_IP:39662->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0157] Retrying operation after 16s due to read tcp MY_FIRST_IP:40112->MY_SECOND_IP:443: read: connection reset by peer 
error building image: error building stage: failed to get filesystem from image: read tcp MY_FIRST_IP:40540->MY_SECOND_IP:443: read: connection reset by peer
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
[Bitbucket] Notifying commit build result
[Bitbucket] Build result notified
Finished: FAILURE

@mehdibenfeguir

@Crapshit could you please help?

@Crapshit

We are not getting connection resets that often in our environment.
And with the mentioned flag I don't see any issues anymore.
I could even see the resets with Docker itself, but Docker retries 5 times by default, so it never failed in CI/CD pipelines...

@pierreyves-lebrun

The argument worked, but retrying gives the same result:

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

Experiencing the same issue here; --image-fs-extract-retry 5 didn't seem to help at all.

@lzd-1230

lzd-1230 commented Nov 5, 2022

I've run into this problem now, and I have tested on different machines with different networks.
One of them hits this problem every time; the error output is below:

INFO[0032] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: read tcp 172.17.0.3:60130->104.18.123.25:443: read: connection reset by peer

I'm confused about why the COPY instruction in the Dockerfile would trigger a network connection to 104.18.123.25:443 (I don't fully understand how kaniko works internally), and it seems to be a network-related error.
I've tried many times, both via container exec and in the CI pipeline, and it always gets stuck at this COPY instruction.
Besides that, in the CI pipeline I got the following errors:

INFO[0034] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 7; INTERNAL_ERROR

and

INFO[0014] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 3; INTERNAL_ERROR; received from peer

The only suggestions I've found in GitHub issues are --image-fs-extract-retry and --push-retry.
I would appreciate it if you could point me to a way to find the root cause, or show me how to debug it!
/(ㄒoㄒ)/~~

@Crapshit

Crapshit commented Nov 5, 2022

We had the same issue.
A simple Dockerfile with a FROM and a COPY statement was enough to trigger it.
My interpretation is that COPY requires extracting the filesystem of the FROM image, and the download of that image failed because of the connection reset.
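
A minimal sketch of such a Dockerfile (the base image and file names are illustrative, not taken from this thread); the COPY step is what forces kaniko to unpack the base image's rootfs, and that is where the layer download and the connection reset occur:

# Dockerfile
FROM node:16-alpine
WORKDIR /app
# COPY needs the base image filesystem unpacked, so kaniko downloads and extracts it here
COPY package*.json ./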

@lzd-1230

lzd-1230 commented Nov 6, 2022

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to trigger it. My interpretation is that COPY requires extracting the filesystem of the FROM image, and the download of that image failed because of the connection reset.

Thanks for your prompt reply. I've solved it by using a local image registry to host the FROM image.
It was probably getting stuck when kaniko pulled the image from the official registry, which is slow to access from my region.
Additionally, can I configure kaniko to change the default image registry from docker.io to a mirror registry when pulling such FROM images?
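
For reference (not confirmed in this thread): recent kaniko releases provide a --registry-mirror flag that redirects pulls of docker.io images to a mirror. A minimal sketch, reusing the executor invocation from this issue with an illustrative mirror hostname:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --registry-mirror mirror.example.com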
