Remote Cache Warning GRPC #765

Open
Ryang20718 opened this issue Jul 29, 2024 · 3 comments

Comments

@Ryang20718

DEADLINE_EXCEEDED: deadline exceeded after 59.999913100s. [closed=[], open=[[buffered_nanos=33620, ....]]

Hi 👋 We've been using this remote cache backed by S3 (blob-based S3 storage) and have recently been seeing gRPC timeouts. We're running 3 instances of the service with 4 vCPUs each, on version 2.3.9.

CPU and memory usage peak at only around 80% and 30% respectively. I don't have a reliable repro for this, but I was wondering if you had any insight into what could be going wrong here?
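For reference, the ~60 second deadline in that error line matches bazel's default --remote_timeout of 60 seconds. If the cache is just slow to respond rather than failing outright, raising the client-side timeout in .bazelrc is one knob to try; the value below is only an illustrative sketch, not something prescribed in this thread:

# Illustrative .bazelrc lines; 120 seconds is an arbitrary example value.
build --remote_timeout=120
coverage --remote_timeout=120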

@mostynb
Collaborator

mostynb commented Jul 31, 2024

Do you see any errors in bazel-remote's logs when the client shows these timeouts?

Do you have iowait monitoring on the bazel-remote machine and on the S3 storage (if it's something you're running locally)? If you see iowait spikes, then maybe your storage bandwidth is saturated.

@Ryang20718
Author

@mostynb, yeah, we see a lot of GRPC BYTESTREAM READ FAILED errors.

Our request volume increased when we added:

coverage --experimental_fetch_all_coverage_outputs
coverage --experimental_split_coverage_postprocessing

Do you have iowait monitoring on the bazel-remote machine and on the s3 storage

I don't believe so, but I can take a look here.

In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS.

@mostynb
Collaborator

mostynb commented Aug 3, 2024

yeah, we see a lot of GRPC BYTESTREAM READ FAILED

Are there any more details provided in the logs besides bytestream read failed and the resource/blob name? If so, could you share a few of them here?

In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS

The REAPIv2 cache service has strong coherence requirements, and bazel doesn't behave nicely when those assumptions fail, e.g. a bazel build can fail if it makes a request to one cache server and then a request to another server (with a different set of blobs) during the same build. This makes horizontal scaling risky, unless you arrange things in such a way that a given client only talks to a single cache server during a single build.
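One way to satisfy that constraint, as a rough sketch (the hostname is hypothetical, and this assumes each client is pointed at a fixed instance rather than a load balancer that spreads requests across replicas):

# Hypothetical .bazelrc snippet: pin builds to a single bazel-remote instance,
# so one build never mixes responses from different replicas.
build --remote_cache=grpc://bazel-cache-1.internal.example.com:9092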

Evicting items from bazel-remote's proxy backends can also break these assumptions. To avoid this we would need to figure out a way to implement some sort of LRU-like eviction for S3 (but I don't have an AWS account to do this work myself).
