[RFC] API for decommissioning/recommissioning zone and weighted zonal search request routing policy #3639

Closed
imRishN opened this issue Jun 21, 2022 · 30 comments
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes

Comments

@imRishN
Member

imRishN commented Jun 21, 2022

Is your feature request related to a problem? Please describe.
#3402 aims to build support for decommissioning and recommissioning a zone based on the value assigned to the zone awareness attribute. Similarly, #2859 aims to build support for weighted zonal search requests using a weighted round-robin mechanism. We need to finalise a consistent and precise API structure for both.

The scope of this issue is limited to finalising the API structure for zonal decommission/recommission and weighted zonal search requests.

Describe the solution you'd like
Below are the API structures that we can use for zonal decommission/recommission and weighted zonal search requests.


Zone Decommission

PUT /_cluster/decommission/awareness/<zone>/<zone-a>
{
      "acknowledged": true
}

Zone Recommission

DELETE /_cluster/decommission/awareness
{
      "acknowledged": true
}

Get Zone Decommission Status

GET /_cluster/decommission/awareness/zone/_status
{
     "zone-1": "INIT | DRAINING | IN_PROGRESS | SUCCESSFUL | FAILED"
}

Weighted Round Robin for search request

PUT /_cluster/routing/awareness/<attribute>/weights
{ 
      "zone_1": "1", 
      "zone_2": "1", 
      "zone_3": "0"
}
{
     "acknowledged": true
}

Get weight for a local node

GET /_cluster/routing/awareness/<attribute>/weights?local
{ 
    "zone_1": "1", 
    "zone_2": "1", 
    "zone_3": "0",
    "node_weight" : "0"
}

Get Weight

GET /_cluster/routing/awareness/<attribute>/weights
{
      "zone_1": "1", 
      "zone_2": "1", 
      "zone_3": "0"
}
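
To make the effect of these weights concrete, here is a minimal Python sketch of weighted round robin over zones. The function name `weighted_round_robin` and the "smooth" variant of the algorithm are assumptions chosen for illustration, not the OpenSearch implementation. Weights are parsed from strings because the payloads above use string values.

```python
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, str]) -> Iterator[str]:
    """Yield zone names in proportion to their weights.

    Hypothetical helper for illustration only. Zones with weight 0
    are never yielded, i.e. traffic is weighed away from them.
    """
    parsed = {zone: float(w) for zone, w in weights.items()}
    active = {zone: w for zone, w in parsed.items() if w > 0}
    if not active:
        raise ValueError("at least one zone must have a positive weight")
    total = sum(active.values())
    current = {zone: 0.0 for zone in active}
    while True:
        # Every round each zone gains its weight; the current leader is
        # picked and pays back the total, which spreads picks evenly.
        for zone, w in active.items():
            current[zone] += w
        chosen = max(current, key=lambda z: current[z])
        current[chosen] -= total
        yield chosen
```

With the example weights above, `zone_3` would receive no search requests, while `zone_1` and `zone_2` would share them equally.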

The PUT /_cluster/decommission/awareness/<zone>/<zone-a> call would modify the weights to weigh traffic away from the given zone, and would also check that there is no incoming HTTP or search traffic to the weighed-away zone. If there is traffic, it moves the status to DRAINING; once incoming HTTP and search traffic has drained, the decommission is executed.
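
The drain-then-decommission flow described above could be sketched as a small state machine. The status names come from the status API earlier in this description. The transition order, the `advance` function, and the `removal_ok` flag are assumptions made for illustration and are not confirmed by this RFC.

```python
from enum import Enum


class Status(Enum):
    INIT = "INIT"
    DRAINING = "DRAINING"
    IN_PROGRESS = "IN_PROGRESS"
    SUCCESSFUL = "SUCCESSFUL"
    FAILED = "FAILED"


def advance(status: Status, has_traffic: bool, removal_ok: bool = True) -> Status:
    """Return the next status (assumed transition order, for illustration)."""
    if status in (Status.INIT, Status.DRAINING):
        # Stay in (or move to) DRAINING while HTTP/search traffic still arrives.
        return Status.DRAINING if has_traffic else Status.IN_PROGRESS
    if status is Status.IN_PROGRESS:
        # Hypothetical outcome flag: did removing the zone's nodes succeed?
        return Status.SUCCESSFUL if removal_ok else Status.FAILED
    return status  # SUCCESSFUL and FAILED are treated as terminal
```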

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@imRishN imRishN added enhancement Enhancement or improvement to existing feature or request untriaged labels Jun 21, 2022
@imRishN imRishN changed the title API for decommissioning/recommissioning zone and weighted zonal search request routing policy [RFC] API for decommissioning/recommissioning zone and weighted zonal search request routing policy Jun 21, 2022
@kotwanikunal kotwanikunal added RFC Issues requesting major changes and removed untriaged labels Jun 21, 2022
@saikaranam-amazon
Member

@imRishN Could we please add req(/res) for obtaining current status of Recommission/Decommission as well?

@imRishN
Member Author

imRishN commented Jun 26, 2022

@saikaranam-amazon updated the req/res for both update/get calls for recommission/decommission

@saikaranam-amazon
Member

Thanks @imRishN

  • Can we store additional information to reflect the updated-at timestamp for the changes? (and this can be part of the response payload for all the GET calls)

@Bukhtawar
Collaborator

@saikaranam-amazon could you help elaborate on why that might be needed?

@saikaranam-amazon
Member

saikaranam-amazon commented Jun 28, 2022

As we have two APIs, one to update the weights of the search traffic and one to decommission an entire zone (without any additional checks or wait time), the latter operation might cause instability for some of the in-flight operations.
With an updated-at field, the customer can make a conscious decision (to decommission a zone) based on the timestamp of the current weights for search traffic, and it can further help with automation as well.

  • Also, do all the Update APIs support conditional updates based on the current weights passed in the payload and the updated values in the current cluster state? (for de(/re)commission)
  • And in the absence of a recommission/decommission state, will the APIs return an empty response? Could we instead have the decommission state (bool) as part of the payload itself for the GET calls?
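
To illustrate what a conditional update with an updated-at timestamp might look like, here is a minimal compare-and-set sketch in Python. Everything in it (`WeightsState`, `conditional_update`, the `updated_at` field) is hypothetical and only models the idea raised in this comment. It is not an API proposed or implemented in this RFC.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WeightsState:
    """Hypothetical stand-in for the weights held in cluster state."""
    weights: Dict[str, str] = field(default_factory=dict)
    updated_at: float = 0.0


def conditional_update(
    state: WeightsState,
    expected: Dict[str, str],
    new: Dict[str, str],
    now: Optional[float] = None,
) -> bool:
    """Apply `new` only if the stored weights still equal `expected`."""
    if state.weights != expected:
        return False  # someone else changed the weights; the caller should re-read
    state.weights = dict(new)
    state.updated_at = time.time() if now is None else now
    return True
```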

@saikaranam-amazon
Member

@saikaranam-amazon could you help elaborate on why that might be needed?

sure @Bukhtawar
Updated above.

@gbbafna
Collaborator

gbbafna commented Jul 15, 2022

Thanks @imRishN.

Should we make the Zone Decommission API synchronous? The Get commission status API will essentially output the state preserved in the cluster state. By making the API synchronous, the need for a status API can be avoided.

If users want to know the status of a decommission, _cat/nodes, _cat/master and _cluster/health should be able to provide the granular details.

@Bukhtawar
Collaborator

When we decommission the zone, it is possible that traffic hasn't been DRAINED, in which case it might take longer and calls might time out. The GET API can perform more exhaustive checks on traffic drain, and in future there could be more graceful checks around ongoing snapshots, shard relocation, etc., which would take time to complete. Having a GET API would help extend to other cases as well.
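
As a rough illustration of how a client could use an asynchronous decommission together with a status GET, here is a minimal polling sketch in Python. The `get_status` callable is a placeholder for whatever performs the GET and extracts the status string. The helper name, the timeout defaults, and the choice of SUCCESSFUL and FAILED as terminal states are assumptions, not part of the proposed API.

```python
import time
from typing import Callable

# Assumed terminal states, taken from the status values listed in this RFC.
TERMINAL = {"SUCCESSFUL", "FAILED"}


def wait_for_decommission(
    get_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_status` until a terminal status is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"decommission still {status} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` parameter is injectable only so the helper can be exercised in tests without real waiting.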

@gbbafna
Collaborator

gbbafna commented Jul 15, 2022

@Bukhtawar: This makes sense. But do you think we can start without it for now and iterate on it later based on need?

In the case where traffic hasn't been DRAINED, we would return from the API call with the reasons for the same. A user can call the APIs with a lower timeout and see the reasons why it is getting stalled. As and when we add more checks around snapshots and shard relocation, those will automatically get added to the reasons as well.

@elfisher

Can we add labels for "roadmap" and the version of OpenSearch this is targeting? I can add it to the overall project roadmap in the right column once that is done.

@saikaranam-amazon
Member

Regarding the Put call for Decommission API

{
      "status": "PROCESSING | DRAINING | COMPLETED",
      "awareness": {
            "zone": "zone-A"
      }
}

How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)

@saikaranam-amazon
Member

Regarding the Get call for commission status

{
      "awareness": {
            "zone": "zone-A",
            "zone": "zone-B",
            "zone": "zone-C"
      }
}

Can we have the list of values under zones as key? - and are we not foreseeing any async operations that can be captured under the status field similar to Decommission call?

@imRishN
Member Author

imRishN commented Jul 28, 2022

How are we responding to failures in executing the call? (Maybe let's track FAILED as a status value.)

Updated the API contract above, including FAILED as a status value

Can we have the list of values under zones as key? - and are we not foreseeing any async operations that can be captured under the status field similar to Decommission call?

It makes sense to have a list under the same key. Updated the details.

@anshu1106
Contributor

Should we change _local to local in GET _cluster/shard_routing/weights?_local to keep it consistent with other _cluster/* APIs like _cluster/state?

@imRishN
Member Author

imRishN commented Aug 9, 2022

Updated the API structures above

@imRishN
Member Author

imRishN commented Aug 10, 2022

@reta @dblock @elfisher any thoughts on above APIs?

@reta
Collaborator

reta commented Aug 15, 2022

Thanks for summarizing the API design @imRishN, I personally see a large disconnect between the existing routing awareness and the suggested decommissioning / recommissioning API.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values":["zoneA", "zoneB"]
  }
}

Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.

The weights could be modeled in a similar fashion using a "cluster.routing.allocation.awareness.force.zone.weights" : [1, 0, 0] setting.

I have nothing against introducing dedicated APIs, but it is going to be difficult and confusing to maintain the API/settings split. Also, one important thing to keep in mind is that cluster settings could be persistent or transient. I believe this applies equally to decommissioning / recommissioning: for example, when people rescale clusters within the same zones, the settings could be set to temporarily exclude some zones and reintroduce them after restart.

Does it make sense, or have I completely derailed the conversation?

@imRishN
Member Author

imRishN commented Aug 17, 2022

@reta, thanks for taking a look at the RFC. The cluster settings that you mentioned above are more of a shard allocation strategy based on the awareness attribute set on the cluster.

As part of decommissioning an awareness attribute, we intend to remove the nodes from the cluster during zonal outages, as the zone might be operating in a degraded manner and impacting the overall cluster's availability. Today, any write request requires a response from all the shard copies before the request is acknowledged. During zonal outages, this model can impact writes to the cluster, as any slow copy or impairment will slow down the writes significantly. The API design gives the user the flexibility to remove the nodes in an impacted zone from the cluster and mark the shards there as unavailable.

Logically, it looks to me that decommissioning == remove zone from cluster.routing.allocation.awareness.force.zone.values whereas recommissioning == reintroduce zone to cluster.routing.allocation.awareness.force.zone.values.

We don't need to remove the zone from the force zone values, as that might trigger a storm of shard recoveries, impacting latencies due to additional CPU and network consumption. We will let shards stay in the UNASSIGNED state after decommissioning the zone. During recovery, the user can decide to recommission the zone.

More details on recommission and decommissioning a zone can be found here #3402

@reta
Collaborator

reta commented Aug 18, 2022

@imRishN aha, I see, thanks for the clarification. I think I have even more questions, this time regarding the API:

PUT /_cluster/decommission/awareness/<zone>/<zone-a>
DELETE /_cluster/decommission/awareness/<zone>/<zone-a>

The <zone> seems to be off here, what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way:

PUT /_cluster/decommission/awareness/<attribute>/<value>
DELETE /_cluster/decommission/awareness/<attribute>/<value>

And in case of zone attribute:

PUT /_cluster/decommission/awareness/zone/<zone-a>
DELETE /_cluster/decommission/awareness/zone/<zone-a>

Regarding weights, why are we introducing shard_routing?

GET /_cluster/shard_routing/weights?local
PUT /_cluster/shard_routing/weights

The terminology we settled upon is just routing, so I think we should stick to that?

GET /_cluster/routing/weights?local
PUT /_cluster/routing/weights

An even better (arguably) approach is to follow the decommission/recommission pattern and design something like this:

GET /_cluster/routing/awareness/<attribute>/weights?local
PUT /_cluster/routing/awareness/<attribute>weights

WDYT?
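
The generalized paths suggested in this comment could be built on the client side roughly as follows. This is a small Python sketch. The helper names `decommission_path` and `weights_path` are made up for illustration, and percent-encoding the path segments is an added precaution rather than something specified in this thread.

```python
from urllib.parse import quote


def decommission_path(attribute: str, value: str) -> str:
    """Build the decommission path for an awareness attribute value."""
    # safe="" also encodes "/" so a value cannot inject extra path segments.
    return (
        "/_cluster/decommission/awareness/"
        f"{quote(attribute, safe='')}/{quote(value, safe='')}"
    )


def weights_path(attribute: str, local: bool = False) -> str:
    """Build the weights path, optionally with the ?local flag."""
    path = f"/_cluster/routing/awareness/{quote(attribute, safe='')}/weights"
    return f"{path}?local" if local else path
```

For the zone attribute, `decommission_path("zone", "zone-a")` gives the same path as the PUT/DELETE example for zone above.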

@imRishN
Member Author

imRishN commented Aug 19, 2022

@reta,

The <zone> seems to be off here, what we need is the awareness attribute (which could be zone), so the APIs could be generalized this way:
PUT /_cluster/decommission/awareness/{attribute}/{value}
DELETE /_cluster/decommission/awareness/{attribute}/{value}

That's correct. The API will take in the awareness attribute set on the cluster by the setting cluster.routing.allocation.awareness.attributes and will validate it against the value of this setting. zone was just an example here. I have created a draft PR implementing this: #4261
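
The validation described here might look roughly like the following Python sketch. The function name and the flat settings dictionary are hypothetical, and it assumes the setting holds a comma-separated list of attribute names. The actual check lives in the draft PR and may differ.

```python
from typing import Mapping

AWARENESS_SETTING = "cluster.routing.allocation.awareness.attributes"


def validate_awareness_attribute(attribute: str, settings: Mapping[str, str]) -> None:
    """Raise ValueError if `attribute` is not a configured awareness attribute."""
    configured = settings.get(AWARENESS_SETTING, "")
    # Assumption: multiple attributes are given as a comma-separated string.
    allowed = {name.strip() for name in configured.split(",") if name.strip()}
    if attribute not in allowed:
        raise ValueError(
            f"'{attribute}' is not one of the configured awareness attributes: "
            f"{sorted(allowed)}"
        )
```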

@imRishN
Member Author

imRishN commented Aug 22, 2022

@reta, thanks for the suggestions. I have updated the API path for weights as well. Let me know if this looks good to you?

@reta
Collaborator

reta commented Aug 22, 2022

@reta, thanks for the suggestions. I have updated the API path for weights as well. Let me know if this looks good to you?

Thanks @imRishN , it looks concise to me (minor typo with missed slash, PUT /_cluster/routing/awareness/<attribute>weights -> PUT /_cluster/routing/awareness/<attribute>/weights)

@imRishN
Member Author

imRishN commented Aug 22, 2022

@reta, thanks for pointing out. Updated above.

@shuklas

shuklas commented Aug 25, 2022

Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?

@sachinpkale
Member

@imRishN
I am not completely sure about the URL pattern for decommission. We are decommissioning nodes from the cluster as part of this API, but that is not clear from the URL.

This is what I think it should be:

PUT /_nodes/awareness/<zone>/<zone-a>/_decommission

@reta
Collaborator

reta commented Aug 26, 2022

@sachinpkale this is another way to look at it, but I believe we aim to decommission an awareness attribute value (zone in this case), which means - everything related to this zone, including nodes (which is probably the only thing we decommission). But I see your point, it has valid concerns

@anshu1106
Contributor

Default looks like - {"msg":"Weights are not set"}. Shouldn't we just return an empty object {}?

Will make the response an empty object.

@imRishN
Member Author

imRishN commented Sep 2, 2022

@sachinpkale Although the attribute key/value is a node property, awareness in general is a cluster property. This is how we set the awareness attribute on the cluster: "cluster.routing.allocation.awareness.attributes": "zone".

Also, I feel /_nodes/awareness/<zone>/<zone-a>/_decommission will create confusion when we extend this feature to decommission only a set of nodes from a particular zone, or for that matter with any such extension. I'll also create another issue for implementing node decommission irrespective of awareness attribute. If we pursue that further, /_cluster/decommission can simply be used as the path prefix. Let me know if this clears your doubt or if you still see concerns with the above APIs.

@imRishN
Member Author

imRishN commented Oct 17, 2022

Closing this issue as ALL the API PRs are merged now

@dblock
Member

dblock commented Aug 22, 2024

I tried to build a test following our documentation to exercise these APIs and failed. If someone wants to either pick up adding/completing the specs in the API specification for this, or at least can show me how to get a simple cluster to a state where these APIs can be called, starting with the code in the description of opensearch-project/opensearch-api-specification#524, that would be great.
