storage: High spontaneous latencies #753
Comments
No one else has mentioned this problem (yet). Could you instrument shouldRetry to see if your theory is correct?
@jba I just deployed our service with an instrumented shouldRetry.
@jba As already suspected, it is indeed caused by spontaneous, excessive amounts of REFUSED_STREAM errors. Do you have ideas as to why that is the case, or how it could be solved?
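For reference, a minimal sketch of how such instrumentation might classify these errors, assuming the REFUSED_STREAM resets surface as golang.org/x/net/http2 stream errors. The helper name is made up, and in practice the error may arrive wrapped (e.g. inside a *url.Error), so it is unwrapped first:

```go
package instrument

import (
	"net/url"

	"golang.org/x/net/http2"
)

// isRefusedStream reports whether err is an HTTP/2 stream error carrying
// the REFUSED_STREAM error code. Hypothetical helper, not part of the
// storage library.
func isRefusedStream(err error) bool {
	// The HTTP client often wraps transport errors in *url.Error.
	if ue, ok := err.(*url.Error); ok {
		err = ue.Err
	}
	if se, ok := err.(http2.StreamError); ok {
		return se.Code == http2.ErrCodeRefusedStream
	}
	return false
}
```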
Bringing it up internally. Will keep you posted.
We are seeing similar problems with uploads: a failure rate that can spike up to 30% (but probably hovers around 1-5%). Our timeout is long (2 min), but we still see failures. Our code looks something like:

ctx, _ := context.WithTimeout(context.Background(), time.Duration(config.GcsTimedContextSec)*time.Second)
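For context, a minimal sketch of that upload-under-timeout pattern with the Go storage client; the bucket and object names and the 2-minute timeout are placeholders, not the poster's actual code:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Long (2 min) per-request timeout, yet uploads still fail intermittently.
	wctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()

	w := client.Bucket("my-bucket").Object("my-object").NewWriter(wctx)
	if _, err := w.Write([]byte("payload")); err != nil {
		log.Fatal(err)
	}
	// Close flushes the remaining data and reports the final upload error.
	if err := w.Close(); err != nil {
		log.Fatalf("upload failed: %v", err)
	}
}
```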
Is there any update or progress on this? We are still seeing a very high rate of uploads failing with REFUSED_STREAM, and very high latencies, as suggested above.
@chasmastin We "solved" the issue in the interim by keeping a running average of failed GCS requests. If that value gets too high, we forcefully instantiate a new connection pool. Of course, this did not solve the underlying issue.
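Roughly, that workaround could look like the following sketch: track the failure rate and rebuild the storage client (and thus its underlying connection pool) when the rate crosses a threshold. The threshold, window size, and type names are illustrative, not taken from the original code:

```go
package gcspool

import (
	"context"
	"sync"

	"cloud.google.com/go/storage"
)

type Pool struct {
	mu        sync.Mutex
	client    *storage.Client
	total     int
	failed    int
	threshold float64 // e.g. 0.2 == 20% failures
}

func New(ctx context.Context, threshold float64) (*Pool, error) {
	c, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	return &Pool{client: c, threshold: threshold}, nil
}

// Record notes the outcome of a request and recreates the client when the
// observed failure rate is too high, forcing fresh connections.
func (p *Pool) Record(ctx context.Context, failed bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.total++
	if failed {
		p.failed++
	}
	if p.total >= 100 && float64(p.failed)/float64(p.total) > p.threshold {
		if c, err := storage.NewClient(ctx); err == nil {
			p.client.Close()
			p.client = c
		}
		p.total, p.failed = 0, 0
	}
}

// Client returns the current client for issuing requests.
func (p *Pool) Client() *storage.Client {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.client
}
```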
We have seen somewhat similar things with the bigtable library, which uses gRPC. We have been struggling with occasional latency for ~30 minutes at a time, where our upper percentiles time out. We often see this across client hosts, but have occasionally seen it against one particular host. We have been heavily instrumenting the connection logic and the request logic, but it is very hard to pinpoint what is actually going on. For Bigtable, our fear was either large rows or hot keys, but this does not appear to be the case. Our current main suspect is the gRPC connection. I wonder if we are seeing similar issues manifested in different APIs.
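As an illustration, a minimal sketch of per-request timing around a single Bigtable read, the kind of request-level instrumentation mentioned above. Project, instance, table, and row key are placeholders, and the 500ms threshold is arbitrary:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	tbl := client.Open("my-table")

	start := time.Now()
	_, err = tbl.ReadRow(ctx, "some-row-key")
	elapsed := time.Since(start)
	if err != nil {
		log.Printf("ReadRow failed after %v: %v", elapsed, err)
		return
	}
	// Log slow requests to correlate latency spikes with hosts and time windows.
	if elapsed > 500*time.Millisecond {
		log.Printf("slow ReadRow: %v", elapsed)
	}
}
```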
Interesting fact: storage uses JSON/HTTP, while bigtable uses gRPC. So if these are related, they are at a deeper level: the Go networking stack, perhaps, or something in Google's network infrastructure. It would be useful to know if these same issues crop up in other languages' clients.
Off-the-cuff observation: if the BigTable client maintains long-lived connections, then those spikes are probably the GFE killing connections every hour (WAI). redacted: gRPC should handle connection closures gracefully and with minimal downtime. Never mind. :)
@jba At my company we solved the initial issue above by recreating the GCS client, and subsequently its connection pool, whenever the failure rate gets too high. But we do in fact have more issues like that with other services, for which we haven't found any easy solution yet. For instance, sometimes the PubSub clients of our low-traffic services simply stop receiving any messages at all for up to multiple minutes at a time - far more than the client-side timeout of 60s.

(That's probably not related, and I hope I'm not mistaken, but... every single day at some point each database in our Spanner instances increases its CPU usage. Which is understandable, since they're probably running some kind of DB compaction etc. But what's interesting is that every single day, at some consistently random other time, Spanner just "decides" to have a 3-10x (not a typo) increase in latency for a few minutes, without any increase in OPS or CPU usage. Strange stuff. 😄)
At a high level, I feel like these issues are best discussed with the product teams via their support channels. Latency is pretty hard to debug, and I don't think the problems are specific to Go. I think we should have good documentation, in a centralized place, about who to talk to about performance issues. I do think we ought to close this issue, since there isn't anything actionable for the client team.

@jadekler, we did some research on the Cloud Bigtable side. There's an interesting client-side issue with a subtle interaction between connection pooling, low-QPS clients, connection refreshes, and retries with exponential backoff. That issue ought to be discussed offline, and in the context of performance testing and fixes.
I suspect this issue is another manifestation of golang/go#27208, which we discovered in #1211. In short, it's an HTTP/2 bug, which has since been fixed and will arrive in Go 1.11.3. I'm going to close this. If the problem appears again with Go 1.11.3, feel free to re-open.
FYI: The (supposed) fix for this issue was recently released in Go 1.11.4 instead. 🙂 |
We're currently facing issues with high latencies on GCS operations for 10-30 minutes at a time, multiple times a day, causing timeouts on our APIs. Most issues surprisingly seem to happen with delete operations, even though they should be the easiest to process.
Our upload/delete code is similar to this:
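(A minimal sketch of what such upload and delete calls look like with the Go storage client; the bucket/object names and timeouts are placeholders, not the actual code from the report.)

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	client, err := storage.NewClient(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	obj := client.Bucket("my-bucket").Object("my-object")

	// Upload under a per-request timeout.
	wctx, cancelW := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancelW()
	w := obj.NewWriter(wctx)
	if _, err := w.Write([]byte("payload")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}

	// Delete: conceptually the cheapest call, yet the one that most often
	// shows the latency spikes described above.
	dctx, cancelD := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancelD()
	if err := obj.Delete(dctx); err != nil {
		log.Fatal(err)
	}
}
```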
Are you aware of anyone else having these troubles? Or do you have a solution for this?
I personally suspect these issues to stem from my request for #701, which - due to retrying on REFUSED_STREAM - causes those high latencies (tenfold even!). Could that be the case? And if so: could it be solved somehow?