Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: Fix load reporting when pick first is used for locality-routing. #11495

Merged
merged 6 commits into from
Aug 31, 2024

Conversation

DNVindhya
Copy link
Contributor

This fixes Incorrect load reporting when using pick first as locality-routing policy.

Cause: ClusterImplLoadBalancer used to assume all the address are in the same locality and used first address in the list of addresses to determine locality. This is not always true leading to incorrect load reporting.

Fix: Get locality from address connected by the subchannel.

Closes #11434.

…mption that all addresses for a subchannel are in the same locality.
Copy link
Contributor

@larry-safran larry-safran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recommended a number of minor changes to comments

/**
* An internal class. Do not use.
*
* <p>An interface to provide the address connected by subchannel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says it provides the address, but the method provides address attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the interface name to InternalSubchannelAddressAttributes.

@Override
public void shutdown() {
if (localityStats != null) {
localityStats.release();
if (localityAtomicReference.get() != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't possibly be null AFAICT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Removed the null check.

// attributes with its locality, including endpoints in LOGICAL_DNS clusters.
// In case of not (which really shouldn't), loads are aggregated under an empty
// locality.
if (locality == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be, but is possible that locality is set but locality name is null. You should probably have an else clause that does a null check on localityName.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create() than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be "" when unset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting to move the locality name null check to io.grpc.xds.client.Locality.create()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see now Larry's comment wasn't about the Locality struct, but instead the attribute. That had been discussed when the code was originally introduced:
#11133 (comment)

This PR is likely to be reverted (in some way) later, once the old PF policy goes away. So changes to the existing code are likely to be lost.

@@ -447,4 +480,33 @@ public void onLoadReport(MetricReport report) {
stats.recordBackendLoadMetricStats(report.getNamedMetrics());
}
}

/**
* Represents the locality attributes of a cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually the stats and the name. Name could be considered an attribute, the the stats aren't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. PTAL

@@ -91,7 +91,7 @@ private synchronized void releaseClusterDropCounter(
String cluster, @Nullable String edsServiceName) {
checkState(allDropStats.containsKey(cluster)
&& allDropStats.get(cluster).containsKey(edsServiceName),
"stats for cluster %s, edsServiceName %s not exits", cluster, edsServiceName);
"stats for cluster %s, edsServiceName %s not exist", cluster, edsServiceName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/not exist/do not exist/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// attributes with its locality, including endpoints in LOGICAL_DNS clusters.
// In case of not (which really shouldn't), loads are aggregated under an empty
// locality.
if (locality == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is existing code, and it seems in general code assumes the values to be non-null. I'd much rather we handle that centrally in io.grpc.xds.client.Locality.create() than lots of random not-possible-to-trigger checks. The data is primarily coming from a proto, so it will be "" when unset.


// This clusterLocality ideally should not be utilized. We derive locality
// information from the first address, even prior to the subchannel becoming ready.
// This is primarily for the purpose of facilitating load reporting in the pre-READY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no load reporting pre-READY. But an LB API could return the subchannel even before it is ready, and PF does this. We generally won't end up reporting load for such picks because the channel will ignore the selected (not-ready) subchannel, but we needed to handle the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, updated the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you miss pushing the update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, we both are able to see the updated comment.

xds/src/main/java/io/grpc/xds/ClusterImplLoadBalancer.java Outdated Show resolved Hide resolved
Comment on lines 82 to 83
private static final Attributes.Key<AtomicReference<ClusterLocality>>
ATTR_CLUSTER_LOCALITY =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put ATTR_CLUSTER_LOCALITY on the same line as the type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@DNVindhya DNVindhya merged commit 1dae144 into grpc:master Aug 31, 2024
15 checks passed
@DNVindhya DNVindhya deleted the pick_first_load_report branch August 31, 2024 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xds: Incorrect load reporting when using pick first as locality-routing policy
3 participants