
Add terraform.ProviderScheduler #178

Merged: 4 commits, merged on Mar 27, 2023

Conversation

@ulucinar (Collaborator) commented Mar 22, 2023

Description of your changes

Related to: crossplane-contrib/provider-upjet-aws#325

This PR adds the terraform.ProviderScheduler interface and three implementations of it: terraform.NoOpProviderScheduler, a no-op implementation; terraform.SharedProviderScheduler, which shares Terraform provider processes among multiple reconciliation loops with a configured TTL; and terraform.WorkspaceProviderScheduler, which shares a Terraform provider process between the CLI invocations made in the context of a single reconciliation loop. Providers may opt to re-enable the shared gRPC server runtime based on these schedulers, which properly isolate the forked Terraform providers to prevent some of the external resource leakage issues we had observed in the past with the shared server runtime. We have also performed a set of external resource leakage tests with this runtime, which are discussed below.
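
Below is a rough, hypothetical sketch of what such a scheduler abstraction could look like. It is illustrative only (the package name, method set, and signatures are assumptions, not upjet's exact API), but it captures the two ideas the description relies on: a handle identifying a provider configuration, and reserve/release semantics around a shared native plugin process.

// Hypothetical sketch only; not the exact upjet API.
package scheduler

// ProviderHandle identifies a distinct Terraform provider configuration
// (in this PR, it is derived from a hash of the provider configuration block).
type ProviderHandle string

// ProviderScheduler hands out, and later reclaims, shared native provider
// plugin processes.
type ProviderScheduler interface {
	// Start reserves a native plugin process for the given handle and returns
	// the address the Terraform CLI should attach to.
	Start(ProviderHandle) (string, error)
	// Stop releases the caller's reservation so the process may be replaced.
	Stop(ProviderHandle) error
}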

The max ttl configuration for terraform.SharedProviderScheduler puts a limit on the lifetime of a forked native plugin process. The TTL of a forked plugin process is incremented each time the Terraform CLI is invoked against it, so the TTL basically holds the number of times the process has been used to handle requests from the Terraform client. After a plugin process expires, the scheduler attempts to replace it. The scheduler also keeps track of whether a plugin process is actively in use, and replacements are not allowed while the plugin process is in use. If the scheduler finds that an expired plugin process is in use, it allows new reuse requests for a grace period (measured as a percentage of the max ttl). After this grace period, if the plugin process has still not been replaced, any new reuse attempts are denied.
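
The reuse decision described above can be summarized with a minimal sketch; the field and parameter names (invocationCount, inUse, maxTTL, ttlMargin) are assumptions for illustration and not necessarily the PR's exact identifiers.

// Minimal sketch of the TTL accounting described above; illustrative only.
type pluginProcess struct {
	invocationCount int // incremented for every Terraform CLI invocation served
	inUse           int // number of active reservations on the forked process
}

// mayReuse reports whether a new CLI invocation may still be scheduled onto
// this process: always within the TTL; past the TTL only while the process is
// still busy and within the grace margin; otherwise the caller must wait for
// the scheduler to replace the process.
func (p *pluginProcess) mayReuse(maxTTL int, ttlMargin float64) bool {
	switch {
	case p.invocationCount < maxTTL:
		return true
	case p.inUse > 0 && float64(p.invocationCount) < float64(maxTTL)*(1+ttlMargin):
		return true
	default:
		return false
	}
}

With the values discussed later in this thread (ttl 100, margin 0.1), such a check would deny reuse once the invocation count passes 110, which is consistent with the "invocationCount: 113, ttl: 100" error reported below.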

terraform.WorkspaceProviderScheduler has a higher isolation level: it shares a plugin process between the multiple Terraform CLI invocations, and the multiple gRPC requests made by those invocations, only during a single managed resource lifecycle event such as an observe, create, update, or delete. It's meant to be used with provider configurations for which the shared scheduler would cause race conditions in the native provider.
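
As an illustration of that isolation level, and reusing the hypothetical ProviderScheduler shape sketched earlier (so this is not the PR's actual code), a workspace-scoped reservation would be opened at the start of a single lifecycle event and released when that event's CLI invocations finish:

// Illustrative only: reserve per lifecycle event, release when it completes.
func runLifecycleEvent(s ProviderScheduler, h ProviderHandle, runCLI func(pluginAddr string) error) error {
	addr, err := s.Start(h) // fork (or attach to) the plugin process for this event only
	if err != nil {
		return err
	}
	defer s.Stop(h) // release the reservation once the event completes
	return runCLI(addr)
}

Because the reservation never outlives one reconciliation of one MR, nothing is shared across managed resources, at the cost of a fresh plugin process per lifecycle event.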

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested?

We are still evaluating the performance characteristics of this runtime, and we are collecting feedback from community members who have tried the provider packages consuming this runtime. Relevant issues are here:

We have also performed two long-running tests (one 2d 16h long, the other 4d 16h long) with a modified https://github.com/upbound/platform-ref-aws Crossplane configuration package (modified to depend on an upbound/provider-aws package that consumes this runtime). No leaks were observed. We also did not observe external resource leaks during an experiment with this runtime involving 210 cognitoidp.UserPool resources spanning 7 AWS regions, with 30 MRs per region. In this experiment, we also configured the poll interval to 1m to increase the likelihood of a race condition; the experiment lasted for 19h. Finally, we ran another set of experiments involving 2 AWS regions at the Terraform layer (not involving a Crossplane provider, upjet, or the schedulers being discussed here) to stress test the isolation principle we've implemented; after 12h, no external resource leaks were observed.

…erRunners

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@sergenyalcin (Member) left a comment


Thanks @ulucinar, I left a few comments to make sure I understand some parts of the PR.

// ApplyAsync makes a terraform apply call without blocking and calls the given
// function once that apply call finishes.
func (w *Workspace) ApplyAsync(callback CallbackFn) error {
	if !w.LastOperation.MarkStart("apply") {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}
	ctx, cancel := context.WithDeadline(context.TODO(), w.LastOperation.StartTime().Add(defaultAsyncTimeout))
	w.providerInUse.Increment()

Member:

As far as I can see, for async ops we call the providerInUse.Increment() function in the ...Async func body, while for sync ops we call the Increment function in runTF. I want to ask to be sure that I'm not missing anything: what is the reason for this difference?

Collaborator Author:

In the async execution cases, we increment the in-use counter from the reconciler goroutine to make sure that the reservations are actually performed by the reconciler goroutine. Workspace.runTF might be executed by the reconciler goroutine (either when the execution mode for an MR is sync, or when a Terraform CLI command is always executed in sync mode, e.g., terraform apply -refresh-only invocations), or it might be executed asynchronously by a worker goroutine. In the async cases, we would like the goroutine that drives the scheduling decisions (i.e., the goroutine that drives the scheduler implementation) to also be the goroutine responsible for making the reservations. Incrementing the in-use count makes a reservation on the shared plugin process; the scheduler is not allowed to replace that process as long as it's actively in use.
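
The pattern described here can be sketched in a few lines. This is a simplified, self-contained illustration under assumed names (inUseCounter, runApply), not the actual Workspace code: the reservation is made on the caller's (reconciler's) goroutine before the worker goroutine is spawned, and released by the worker when the asynchronous CLI call finishes.

package workspace

import "sync/atomic"

// inUseCounter is a stand-in for the in-use counter discussed above.
type inUseCounter struct{ n int64 }

func (c *inUseCounter) Increment() { atomic.AddInt64(&c.n, 1) }
func (c *inUseCounter) Decrement() { atomic.AddInt64(&c.n, -1) }

type workspace struct {
	providerInUse inUseCounter
}

// applyAsync increments on the reconciler goroutine (making the reservation on
// the shared plugin process) and decrements on the worker goroutine once the
// asynchronous apply has finished.
func (w *workspace) applyAsync(runApply func() error, callback func(error)) error {
	w.providerInUse.Increment() // reservation made by the reconciler goroutine
	go func() {
		defer w.providerInUse.Decrement() // release when the async work is done
		callback(runApply())
	}()
	return nil
}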

@@ -57,6 +55,48 @@ type ProviderRequirement struct {
// ProviderConfiguration holds the setup configuration body
type ProviderConfiguration map[string]any

// ToProviderHandle converts a provider configuration to a handle
// for the provider scheduler.
func (pc ProviderConfiguration) ToProviderHandle() (ProviderHandle, error) {

Member:

I think we are using the ProviderConfig hash to decide whether we need a new tf-provider process, right? This way, we will resolve the resource leak issues that we observed before.

Collaborator Author:

Yes, correct. ProviderHandle is basically a hash of the Terraform provider configuration block. Clients pass a ProviderHandle to the scheduler while making reservation requests to identify their requests.
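
A minimal, self-contained sketch of deriving such a handle is shown below. It is illustrative only and not necessarily how upjet's ToProviderHandle is implemented: it hashes a canonical JSON encoding of the configuration (encoding/json sorts map keys, so equal configurations yield equal handles); the function name toProviderHandle and the choice of SHA-256 are assumptions.

package config

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

type ProviderConfiguration map[string]any
type ProviderHandle string

// toProviderHandle derives a stable handle from the provider configuration by
// hashing its canonical JSON form.
func toProviderHandle(pc ProviderConfiguration) (ProviderHandle, error) {
	b, err := json.Marshal(pc)
	if err != nil {
		return "", err
	}
	return ProviderHandle(fmt.Sprintf("%x", sha256.Sum256(b))), nil
}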

@sergenyalcin (Member) commented Mar 27, 2023

@ulucinar We have a ttl value of 100, and the ttlBudget is 0.1. As I understand it, this means the native provider plugin process will be restarted after every 110 usages. For clusters with a low number of MRs, this means the process will survive for a long time. Considering the memory leak problem on the Terraform side, how would it be to set a timeout for the relevant process? We can cover this in other iterations as well.

@sergenyalcin (Member) commented Mar 27, 2023

nit: What about using ttlMargin instead of ttlBudget to avoid ambiguity? We may also consider changing this in the next iteration.

@ulucinar (Collaborator Author) replied:

> @ulucinar We have a ttl value of 100, and the ttlBudget is 0.1. As I understand it, this means the native provider plugin process will be restarted after every 110 usages. For clusters with a low number of MRs, this means the process will survive for a long time. Considering the memory leak problem on the Terraform side, how would it be to set a timeout for the relevant process? We can cover this in other iterations as well.

Thanks for the suggestions. The assumption in the current implementation is that the native plugin process leaks memory as it responds to client requests, and that memory leakage (if any) will be minimal while the process is idle, i.e., not replying to client requests. So we currently base the replacement decisions on the ttl.

In further iterations, we can discuss setting timeouts, and making replacement decisions based on memory consumption or other criteria. We will also need to define a cap on the number of native plugin processes the shared scheduler forks. For instance, what happens if there are a thousand AWS accounts actively used in a cluster? The current implementation will attempt to fork processes without any limit. We will be addressing similar issues in the next iterations.

@ulucinar (Collaborator Author) replied:

> nit: What about using ttlMargin instead of ttlBudget to avoid ambiguity? We may also consider changing this in the next iteration.

Done.

@sergenyalcin (Member) left a comment


Thanks @ulucinar, LGTM! This is an important milestone in the context of addressing the performance issues.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@przysiadZeSztanga commented:

Testing this with 500 MRs and settings like --max-reconcile-rate=5: CPU usage is significantly lower than on provider versions 0.31 and 0.30, but I see tons of errors like:

observe failed: cannot schedule a native provider during observe: xxxxxxxxxxxxxxx: cannot schedule native Terraform provider process: native provider reuse budget has been exceeded: invocationCount: 113, ttl: 100'

Also, the max reconcile rate does not seem to affect the process count; I see lots of terraform apply invocations running in parallel:

18948 1 2000 S 817m 5% 7 0% terraform apply -auto-approve -input=false -lock=false -json
19247 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
18718 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
18757 1 2000 S 817m 5% 5 0% terraform apply -auto-approve -input=false -lock=false -json
19409 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19001 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
19076 1 2000 S 884m 6% 2 0% terraform apply -auto-approve -input=false -lock=false -json
18887 1 2000 S 818m 5% 7 0% terraform apply -auto-approve -input=false -lock=false -json
18926 1 2000 S 817m 5% 0 0% terraform apply -auto-approve -input=false -lock=false -json
18797 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19300 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19655 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
19105 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19699 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18978 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19210 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19592 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19170 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19552 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19759 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19726 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18836 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19683 1 2000 S 817m 5% 0 0% terraform apply -auto-approve -input=false -lock=false -json
19044 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18689 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19103 1 2000 S 817m 5% 5 0% terraform apply -auto-approve -input=false -lock=false -json
18628 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json

@przysiadZeSztanga commented Mar 30, 2023

Some CPU and memory observations from the same test scenario:
0 -> 500 MRs (route53 records) in 1 claim
Node dedicated to the AWS provider: c6i.2xlarge with 8 cores and 16 GB memory

provider version: 0.32.0-rc.1

Limit/request settings:
Limits: cpu: 6, memory: 12Gi
Requests: cpu: 6, memory: 12Gi
provider configuration params:

 extraArgs:
  - --max-reconcile-rate=5
  - --poll=30min

0.31.0 provider: [screenshot: CPU and memory usage]
0.32.0-rc.1 provider: [screenshot: CPU and memory usage]

  • creation of MRs takes much longer
  • the fix made some improvement, but it rather looks like a fuse to avoid killing the provider during large bulk creations/updates
  • still high (full) CPU usage for Terraform operations on ready/synced/created MRs
