Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AC cache misses when action result exists on S3 backend #767

Open
nmattia opened this issue Aug 9, 2024 · 9 comments
Open

AC cache misses when action result exists on S3 backend #767

nmattia opened this issue Aug 9, 2024 · 9 comments

Comments

@nmattia
Copy link

nmattia commented Aug 9, 2024

I have a bazel-remote v2.4.3 instance that serves 404 on Action Results that exists in its configured S3 backend. Other instances connected to the same S3 backend, with the same configuration, are able to serve those Action Results.

One action cache entry that surfaces the issue is 6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560. I've tried querying the bazel-remote instance with curl to reproduce the issue, and the implementation seems close enough to the gRPC implementation for this to be relevant (both go through GetValidatedActionResult). I'm definitely seeing the same behavior when going through gRPC/bazel.

This is the request I'm sending:

GET /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

The first log line I see appears to be a hit on the S3 backend for that entry:

11:25:28 S3 DOWNLOAD bucket ac/6e/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560 OK

However the line after that suggests the artifact is missing and bazel-remote returned a 404 (curl also shows a 404):

11:25:28  GET 404 42.42.42.42 /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

What's slightly surprising is that the second time I run the curl command, I do not see the instance querying the S3 backend, and instead logs a 404 directly:

11:34:54  GET 404 42.42.42.42 /ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560

I can get the AC entry from another instance connected to the same backend:

curl <other instance>/ac/6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560 | protoc --decode_raw
protobuf output 2 { 1: "package.cache" 2 { 1: "903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa" 2: 2693 } 4: 1 } 2 { 1: "libHSprimitive-0.8.0.0.a" 2 { 1: "6d7227f669ed2c506033c2a5c4556b387b125f58d77c80770c97819acfc40a20" 2: 1170436 } 4: 1 } ... 6 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } 8 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } ... }

and for each CAS blob returned by the (working) instance, I'm able to fetch them from the original/failing instance:

GET /cas/903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa
-> 200

and yet the /ac/ call still returns 404, even after all the CAS entries were curled.

Any idea what might be happening here? The instance hasn't been upgraded to 2.4.4 in case something needs to be reproduced, though happy to close this issue if this has been fixed in 2.4.4.

@mostynb
Copy link
Collaborator

mostynb commented Aug 9, 2024

Hmm, interesting. I don't think any of the changes between 2.4.3 and 2.4.4 would be relevant to this.

On the problematic bazel-remote instance, are you saying that you request this AC entry with curl, it shows the S3 DOWNLOAD OK log line, then the next log line is a GET 404 for the same AC entry?

Are you running this bazel-remote instance with --max_blob_size or --max_proxy_blob_size and if so do you know if this AC entry is larger than either of those values?

@nmattia
Copy link
Author

nmattia commented Aug 12, 2024

you request this AC entry with curl, it shows the S3 DOWNLOAD OK log line, then the next log line is a GET 404 for the same AC entry?

Yes, that's it! and when requesting the same AC entry it later on, there's no S3 line, it directly logs GET 404.

Are you running this bazel-remote instance with --max_blob_size or --max_proxy_blob_size and if so do you know if this AC entry is larger than either of those values?

We run with the following:

--max_size=200 --experimental_remote_asset_api=true --enable_endpoint_metrics

do note that other instances with the same configuration are able to serve the AC entry fine. Also, the AC entry is ~5MiB.

@mostynb
Copy link
Collaborator

mostynb commented Aug 12, 2024

On the first request, the AC entry is downloaded from S3 and stored in the local disk cache. The second request gets the AC entry directly from the local disk cache.

If you're getting "not found" then I suspect there's a validation failure of the entry. If the entry was validated successfully, then I think you would see log lines for each of the CAS blobs that are checked. I will try to improve the logging in this case, I might have a test build for you to try tomorrow.

@nmattia
Copy link
Author

nmattia commented Aug 13, 2024

I might have a test build for you to try tomorrow.

Thanks! if additionally we can migrate without wiping the disk cache that would be ideal, as we're more likely to be able to reproduce the issue with that AC entry

@mostynb
Copy link
Collaborator

mostynb commented Aug 13, 2024

I haven't been found any interesting places to add extra logging yet, unfortunately. Are you using the http or grpc interface?

Can you try upgrading the problematic instance to v2.4.4? You should not need to wipe the disk cache when doing this.

@nmattia
Copy link
Author

nmattia commented Aug 13, 2024

Are you using the http or grpc interface?

Using grpc with Bazel but I used the HTTP interface to reproduce the issue with curl. It looked like both were similar enough that debugging on the HTTP interface would help eventually resolve the grpc issue.

Can you try upgrading the problematic instance to v2.4.4? You should not need to wipe the disk cache when doing this.

Will do! I'll report back and let you know if the problem persists. Thanks for the quick feedback!

@nmattia
Copy link
Author

nmattia commented Aug 15, 2024

@mostynb Updated to 2.4.4 and still seeing the issue. Here are the logs for an instance showing the issue (note that this is now happening on a different artifact, on a different instance, and most of the instances still work as expected):

2024/08/15 12:08:33 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:08:33 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:20:00 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:20:00 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:21:22 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:21:56 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:23:52 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:27:14 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:27:26 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:27:26 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
2024/08/15 12:27:48 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 12:43:18 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 13:40:51 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
2024/08/15 13:41:24 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND

Note how at 2024/08/15 12:20:00 there's a successful S3 DOWNLOAD but then we get NOT FOUND. There's something odd here though, I'm surprised that both a CAS and an AC entries have the same hash, but maybe that something Bazel does? Can't be a coincidence though

@mostynb
Copy link
Collaborator

mostynb commented Aug 25, 2024

By convention bazel uses the hash of the Action (the command to run) as the lookup key for the ActionResult that it creates. Bazel is probably uploading the Action for debugging purposes (or maybe for consistency with the remote execution setup), even though it is not strictly required when only using remote caching. This probably explains why you see the same hash used for both an AC entry and a CAS entry.

There are two relevant functions to look at, diskCache.GetValidatedActionResult and diskCache.get.

The calling code looks at the values returned by GetValidatedActionResult, if the *pb.ActionResult is nil and the error is nil then a "not found" log is emitted. That happens in these places:

  1. If the ActionResult was not found, or the size returned by diskCache.Get was <= 0. We know the action result was found, because we got it from the S3 backend. But maybe diskCache.Get erroneously returned a non-positive size here?
    return nil, nil, nil // aka "not found"
  2. If one of the OutputDirectories was not found.
    return nil, nil, err // aka "not found", or an err if non-nil
  3. If findMissingCasBlobsInternal found missing blobs
    return nil, nil, nil // aka "not found"

I added some extra debug logs to the places above in this branch in my fork:
https://github.com/mostynb/bazel-remote/tree/issue_767_debugging
Can you make a build, and try this out? Do you see any of the new warnings in the logs when this error occurs?

If not, we can look into diskCache.Get: diskCache.get (called by the diskCache.Get wrapper) returns a nil io.ReadCloser and nil error, then the caller treats that blob as not found. This happens in these places:

  1. If the blob is not in the disk cache, and there is no backend proxy. This can't be the case, because you are using a backend proxy.
    return nil, -1, nil
  2. If the blob was not in the disk cache, and not in the backend proxy. I don't think this would be the case, because we don't see any other S3 logs besides the ActionResult (or did you not include them in the report?)
    return nil, -1, nil
  3. If the blob was found in the proxy, but it was larger than the maximum proxy blob size- do you set the --max_proxy_blob_size flag?
    return nil, -1, nil
  4. If there was a size mismatch between the blob found in the proxy backend and the expected size
    return nil, -1, nil

@nmattia
Copy link
Author

nmattia commented Aug 26, 2024

Thanks a lot! I'm seeing a lot of WARNING 3 though it may be another issue. When reproducing this particular issue, I see a WARNING 2.

Here's what's happening:

  • First, we see a cache miss because the artifact has not been built yet.
  • Then, we see the upload for the artifact (S3 UPLOAD, GRPC AC PUT).
  • Then, we see a couple instances successfully make use of the artifact.
  • Then, instance-31 shows S3 DOWNLOAD OK, WARNING 2, and then NOT FOUND
  • Then, other instances can still successfully make use of the artifact.
instance-19 12:29:11 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-19 12:29:11 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-07 12:29:23 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-07 12:29:23 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-10 12:29:23 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-10 12:29:23 GRPC BYTESTREAM WRITE COMPLETED: uploads/786723c7-6baf-4487-b8e3-310a7f5c2d53/compressed-blobs/zstd/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6/148
instance-10 12:29:23 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:29:23 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:29:23 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-24 12:30:12 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-24 12:30:12 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-10 12:30:19 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-10 12:30:19 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:30:31 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-32 12:30:31 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-23 12:30:31 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-23 12:30:31 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:30 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:30 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-13 12:31:35 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-13 12:31:35 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:40 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-25 12:31:45 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-31 12:31:49 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-31 12:31:49 WARNING 2: ActionResult b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 has a missing CAS object in OutputDirectories b431ba3377fc1d2bbd2eef8f448ebeae50a5b8d47a317c6e0d7556d11f48c834
instance-31 12:31:49 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 NOT FOUND
instance-06 12:32:01 S3 CONTAINS bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb cas.v2/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-06 12:32:01 GRPC CAS HEAD b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-27 12:32:01 GRPC AC PUT b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-27 12:32:01 S3 UPLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-22 12:34:20 S3 DOWNLOAD bazel-remote-aefbcafa-b6e1-4cec-a6d3-bbce88b294bb ac/b4/b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK
instance-22 12:34:20 GRPC AC GET b4f89b82de72199c09bd91369dd2d56a2cc49506450bfd04283cfc989d9a5dc6 OK

All instances have the exact same configuration:

--max_size=200 --experimental_remote_asset_api=true --enable_endpoint_metrics

Let me know if you need more logs/info/etc!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants