failed to get filesystem from image: connection reset by peer #1717

Closed
mosheavni opened this issue Aug 11, 2021 · 14 comments · Fixed by #2837 or #2853

@mosheavni

mosheavni commented Aug 11, 2021

Actual behavior
The kaniko build always fails with this error:

error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

This is the entire log:

INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to builder 
INFO[0001] Resolved base name my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 to networks 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Retrieving image my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 from registry my-registry.azurecr.io 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest my-registry.azurecr.io/backend/bb/base:latest 
INFO[0001] Retrieving image my-registry.azurecr.io/backend/bb/base:latest from registry my-registry.azurecr.io 
INFO[0001] Built cross stage deps: map[0:[/app/_build/deploy/rel/bb_release /usr/local/cuda-9.1] 1:[/app/nnconfig]] 
INFO[0001] Retrieving image manifest my-registry.azurecr.io/inhouse/dockers/build-image:otp23.2 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Unpacking rootfs as cmd RUN pip install jinja2-cli==0.7.0 requires it. 
error building image: error building stage: failed to get filesystem from image: read tcp 10.50.99.48:34650->52.XXX.141.196:443: read: connection reset by peer

Expected behavior
Build to succeed

To Reproduce
Not sure it is reproducible; other builds are OK, but this specific image and a few others fail with a similar error.
The executor command used:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4

Additional Information

  • Dockerfile
    Please provide either the Dockerfile you're trying to build or one that can reproduce this error.
  • Build Context
    Please provide or clearly describe any files needed to build the Dockerfile (ADD/COPY commands)
  • Kaniko Image (fully qualified with digest): gcr.io/kaniko-project/executor:debug digest sha256:fcccd2ab9f3892e33fc7f2e950c8e4fc665e7a4c66f6a9d70b300d7a2103592f

I want to better understand the nature of this error, what can cause it, and what the possible fixes are.
Thanks.
Triage Notes for the Maintainers

Description Yes/No
Please check if this a new feature you are proposing
Please check if the build works in docker but not in kaniko
Please check if this error is seen when you use --cache flag
Please check if your dockerfile is a multistage dockerfile
@Crapshit

Crapshit commented Sep 9, 2021

We have the exact same issue; it was already reported in issue #1627.
It seems a fix that adds a new flag has already been merged (#1685, #6380). Waiting for a newer release (> 1.6.0).
I also have a support ticket open with the Microsoft Azure team; they said they found the root cause and are rolling out a fix.
But I don't have an ETA for this.

@Crapshit

I got feedback from Microsoft last Friday:

RCA:
Azure storage uses multiple frontend nodes within a single storage scale unit to serve multiple storage accounts. As part of regular maintenance of these nodes, we reboot them after draining the existing requests and then put them back into production. As a result of investigating this incident, we have learned that it is possible for storage front-end nodes to be rebooted while requests are still draining. This closes any existing connections to the front-end node, causing a connection reset in the associated pipelines. The precise cause of why the requests are not drained fully is still under investigation, but it is likely due to faulty feedback logic around when nodes get taken down.
Because the load balancer distributes requests evenly across many front-end nodes, clients are unlikely to experience a reset like this if they retry the request. We are still looking into ways we can proactively detect and mitigate these nodes in the short term. Longer term, we will have a permanent fix to prevent this issue from happening.

Resolution

  1. We have decreased the reboot frequency on the impacted storage scale units, spreading the reboots further apart to reduce the impact.
  2. We are further investigating validating that all requests finish draining before we reboot the front-end nodes.

@mehdibenfeguir

Any update on this issue?
I'm experiencing the exact same issue as @mosheavni.

@Crapshit

@mehdibenfeguir as a workaround we are using the "--image-fs-extract-retry" flag in Kaniko 1.7.0.
I have had no new reports of these connection resets since.
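
For reference, a minimal sketch of the issue's original executor invocation with this workaround flag added (the retry value of 5 is illustrative):

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --push-retry=4 \
  --image-fs-extract-retry=5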

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

Do you mean this image:
gcr.io/kaniko-project/executor:v1.7.0-debug ?

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

This is the result with --image-fs-extract-retry 5:

@mehdibenfeguir

The argument worked, but retrying gives the same result:

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

@mehdibenfeguir

mehdibenfeguir commented Jan 27, 2022

INFO[0012] Unpacking rootfs as cmd RUN mkdir -p /t && cp -r /twgl/common-service/.gradle /t requires it. 
WARN[0037] Retrying operation after 1s due to read tcp MY_FIRST_IP:38536->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0061] Retrying operation after 2s due to read tcp MY_FIRST_IP:38960->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0089] Retrying operation after 4s due to read tcp MY_FIRST_IP:39254->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0123] Retrying operation after 8s due to read tcp MY_FIRST_IP:39662->MY_SECOND_IP:443: read: connection reset by peer 
WARN[0157] Retrying operation after 16s due to read tcp MY_FIRST_IP:40112->MY_SECOND_IP:443: read: connection reset by peer 
error building image: error building stage: failed to get filesystem from image: read tcp MY_FIRST_IP:40540->MY_SECOND_IP:443: read: connection reset by peer
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // withCredentials
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
[Bitbucket] Notifying commit build result
[Bitbucket] Build result notified
Finished: FAILURE

@mehdibenfeguir

@Crapshit could you please help?

@Crapshit

We are not getting connection resets that often in our environment.
And with the mentioned flag I don't see any issues anymore.
I could even see the resets with Docker itself, but Docker retries 5 times by default, so it never failed in CI/CD pipelines...

@pierreyves-lebrun

The argument worked, but retrying gives the same result:

error building image: error building stage: failed to get filesystem from image: read tcp "MY_IP_HERE":34650->"MY_IP_HERE":443: read: connection reset by peer

Experiencing the same issue here; --image-fs-extract-retry 5 didn't seem to help at all.

@lzd-1230

lzd-1230 commented Nov 5, 2022

I've run into this problem now, and I have tested on different machines with different networks.
One of them hits this problem every time; the error output is below:

INFO[0032] Unpacking rootfs as cmd COPY package*.json ./ requires it.
error building image: error building stage: failed to get filesystem from image: read tcp 172.17.0.3:60130->104.18.123.25:443: read: connection reset by peer

I'm confused about why the COPY instruction in the Dockerfile would trigger a network connection to 104.18.123.25:443 (I don't fully understand how kaniko works internally), and it seems to be a network-related error.
I've tried many times, both via container exec and in the CI pipeline, and it always gets stuck at this COPY instruction.
Besides that, in the CI pipeline I got the following errors:

INFO[0034] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 7; INTERNAL_ERROR

and

INFO[0014] Unpacking rootfs as cmd COPY package*.json ./ requires it. 
error building image: error building stage: failed to get filesystem from image: stream error: stream ID 3; INTERNAL_ERROR; received from peer

The only suggestions I've found in GitHub issues are --image-fs-extract-retry and --push-retry.
I would appreciate it if you could point me to a way to find the root cause, or show me how to debug it!
/(ㄒoㄒ)/~~

@Crapshit

Crapshit commented Nov 5, 2022

We had the same issue.
A simple Dockerfile with a FROM and a COPY statement was enough to trigger it.
My interpretation is that COPY requires extracting the filesystem of the FROM image, and the download of that image failed because of the connection reset.
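
A minimal sketch of such a Dockerfile (the base image and file names are illustrative, not taken from this thread); the COPY step is what forces kaniko to unpack the base image's rootfs, and that is where the layer download and the connection reset occur:

# Dockerfile
FROM node:16-alpine
WORKDIR /app
# COPY needs the base image filesystem unpacked, so kaniko downloads and extracts it here
COPY package*.json ./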

@lzd-1230

lzd-1230 commented Nov 6, 2022

We had the same issue. A simple Dockerfile with a FROM and a COPY statement was enough to trigger it. My interpretation is that COPY requires extracting the filesystem of the FROM image, and the download of that image failed because of the connection reset.

Thanks for your prompt reply. I've solved it by using a local image registry to host the FROM image.
It was probably getting stuck when kaniko pulled the image from the official registry, which is slow to access from my region.
Additionally, can I configure kaniko to change the default image registry from docker.io to a mirror registry when pulling such FROM images?
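
For reference (not confirmed in this thread): recent kaniko releases provide a --registry-mirror flag that redirects pulls of docker.io images to a mirror. A minimal sketch, reusing the executor invocation from this issue with an illustrative mirror hostname:

$ /kaniko/executor \
  --context . \
  --dockerfile Dockerfile \
  --destination $DOCKER_BUILD_IMAGE:$DOCKER_PROD_BUILD_TAG \
  --registry-mirror mirror.example.com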
