Complex repo setup example #89

Open

tbondarchuk opened this issue Oct 18, 2022 · 16 comments

@tbondarchuk

TL;DR - are there any other examples of a complex flux config repo with multiple clusters/apps/envs/tenants?

I really like the FluxCD documentation - guides, APIs - everything you need to get going and improve along the way. But one extremely frustrating missing piece is a complex flux repo example. Both flux2-kustomize-helm-example and flux2-multi-tenancy are just POCs to be honest - nice to try locally on a kind cluster but not really suitable to run on a live cluster imo. All examples seem to focus on either single-cluster usage or multiple but fully/almost identical clusters. I've tried to find something more complex, but not much luck so far. Well, there is bootstrap-repo, but it seems to be an enhanced version of the current repo.

Don't get me wrong - it all works perfectly, but as soon as you start to scale - add clusters, apps, tenants, etc. - it becomes quite cumbersome.

Like this repo's readme:

├── clusters
│   ├── production
│   └── staging
├── infrastructure
│   ├── kyverno
│   └── kyverno-policies
└── tenants
    ├── base
    ├── production
    └── staging

it assumes both clusters will use the exact same version of the infrastructure - for example the same values for helm releases. Of course we can add base/overlays to infrastructure, but as it scales out it becomes unwieldy. We need a monitoring stack - prometheus, grafana, loki, promtail. Private/public ingress, can't live without cert-manager and external-dns, then EKS is no good without aws-load-balancer-controller and karpenter/autoscaler, add some kubernetes-dashboard, weave-gitops, etc - and here you are with 10 to 20 helm releases just to get the cluster ready for an actual app deployment. (Then deploy a single-pod app and proudly watch all that machinery running a static website with an enormous 50Mb/100m resource consumption :))

Having infrastructure/base with 20 helm releases plus their helm repo source files isn't that bad, but in my case, with multiple EKS clusters in different accounts, I need different values for most helm releases (IRSA roles, increased replicas for prod, etc), so it results in the infrastructure/dev folder having 20 values/secrets files and a long list of kustomize configMapGenerators.
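
For reference, the per-cluster overlay I'm describing boils down to something like this (just a sketch, names are illustrative; the kustomizeconfig part is the name-reference trick from flux2-kustomize-helm-example so generated ConfigMap names get wired into HelmRelease valuesFrom):

# infrastructure/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
configurations:
  - kustomizeconfig.yaml
configMapGenerator:
  - name: ingress-nginx-values
    files:
      - values.yaml=ingress-nginx-values.yaml
  - name: external-dns-values
    files:
      - values.yaml=external-dns-values.yaml
  # ...and so on, one generator per release that needs cluster-specific values

# infrastructure/dev/kustomizeconfig.yaml
nameReference:
  - kind: ConfigMap
    version: v1
    fieldSpecs:
      - path: spec/valuesFrom/name
        kind: HelmRelease

while each HelmRelease in base pulls its values via:

  valuesFrom:
    - kind: ConfigMap
      name: ingress-nginx-values
      valuesKey: values.yaml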

I guess a possible solution is to use variable substitution and put all the values inside the helm releases, keeping just one or a few configmaps with per-cluster variables, but I've found that keeping a plain values/secret file per helm release is really useful when maintainers release a breaking change in a patch version and you need to quickly run helm template -f values.yaml to see what's changed. (I really like Flux's helm chart auto-update feature, but sometimes reading the alerts channel in the morning after nightly updates is no fun at all.)
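
To be concrete, the substitution variant would look roughly like this (a sketch with made-up names; the apiVersion may differ depending on your Flux version):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/base
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars   # one per cluster: IRSA role ARNs, replica counts, domain, etc.

with HelmRelease values written as ${irsa_role_arn}, ${ingress_replicas:=1} and so on - which is exactly what makes a quick helm template -f values.yaml check awkward.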

The second issue/inconvenience is dependencies, or rather the lack of dependencies between a HelmRelease and a Kustomization at the moment (hope it will be possible to implement this soon/ever; subscribed to the existing issue already). For example, in order to deploy cert-manager from a helm release and a cluster-issuer from a kustomization, I need to wrap the HR into a Kustomization and then set a dependency between the two KS resources. So deploying from a single large folder full of helm releases, kustomizations and plain manifests is just not possible, unless you want to spend some time during the cluster's bootstrap manually reconciling and suspending/resuming.
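
The wrapping looks roughly like this (sketch, hypothetical paths):

# clusters/dev/infrastructure/system.yaml - wraps the cert-manager HelmRelease
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: system
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/system/dev
  prune: true
  wait: true                 # don't report Ready until the HRs are healthy
  sourceRef:
    kind: GitRepository
    name: flux-system
---
# clusters/dev/infrastructure/cert-issuer.yaml - plain ClusterIssuer manifests
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: cert-issuer
  namespace: flux-system
spec:
  dependsOn:
    - name: system           # KS-to-KS dependency, since KS-to-HR isn't possible
  interval: 10m
  path: ./infrastructure/cert-issuer/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system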

And "single folder" approach does not work well with plain/kustmized manifests either - recently tried approach with multiple apps in base folder and then single dev/kustomization.yaml deploying them all - quickly realized that one failing deployment blocks all others since wrapper Kustomization is unhealthy. Plus you'd need to suspend everything even if single deployment needs maintenance/etc.

So I ended up with a "multiple proxy kustomizations, multiple components" setup, which worked quite well until I had to explain it to somebody else :)

Flux-fleet repo
.
├── clusters
│   └── dev
│       ├── flux-system
│       │   ├── gotk-components.yaml
│       │   ├── gotk-sync.yaml
│       │   └── kustomization.yaml
│       ├── infrastructure
│       │   ├── cert-issuer.yaml
│       │   ├── ingress.yaml
│       │   ├── kustomization.yaml
│       │   ├── monitoring.yaml
│       │   └── system.yaml
│       └── kustomization.yaml
└── infrastructure
    ├── cert-issuer
    │   ├── base
    │   │   ├── cluster-issuer.yaml
    │   │   └── kustomization.yaml
    │   └── dev
    │       └── kustomization.yaml
    ├── ingress
    │   ├── base
    │   │   ├── hr-ingress-nginx-private.yaml
    │   │   ├── hr-ingress-nginx-public.yaml
    │   │   ├── kustomization.yaml
    │   │   ├── namespace.yaml
    │   │   └── source-ingress-nginx.yaml
    │   └── dev
    │       ├── ingress-nginx-private-values.yaml
    │       ├── ingress-nginx-public-values.yaml
    │       ├── kustomization.yaml
    │       └── kustomizeconfig.yaml
    ├── monitoring
    │   ├── base
    │   │   ├── hr-grafana.yaml
    │   │   ├── hr-kube-prometheus-stack.yaml
    │   │   ├── hr-loki.yaml
    │   │   ├── hr-promtail.yaml
    │   │   ├── kustomization.yaml
    │   │   ├── namespace.yaml
    │   │   ├── source-grafana.yaml
    │   │   └── source-prometheus-community.yaml
    │   └── dev
    │       ├── grafana-secrets.yaml
    │       ├── grafana-values.yaml
    │       ├── kube-prometheus-stack-secrets.yaml
    │       ├── kube-prometheus-stack-values.yaml
    │       ├── kustomization.yaml
    │       ├── kustomizeconfig.yaml
    │       ├── loki-values.yaml
    │       └── promtail-values.yaml
    └── system
        ├── base
        │   ├── hr-aws-load-balancer-controller.yaml
        │   ├── hr-cert-manager.yaml
        │   ├── hr-external-dns.yaml
        │   ├── hr-metrics-server.yaml
        │   ├── kustomization.yaml
        │   ├── namespace.yaml
        │   ├── source-eks.yaml
        │   ├── source-external-dns.yaml
        │   ├── source-jetstack.yaml
        │   └── source-metrics-server.yaml
        └── dev
            ├── aws-load-balancer-controller-values.yaml
            ├── cert-manager-values.yaml
            ├── external-dns-values.yaml
            ├── kustomization.yaml
            └── kustomizeconfig.yaml

Where clusters/dev/infrastructure is a bunch of Flux Kustomizations referencing the dev folders in infrastructure/, grouped roughly by purpose (monitoring together, ingress together).
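
For example, monitoring.yaml there is roughly this (a sketch; the dependsOn targets are illustrative):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: monitoring
  namespace: flux-system
spec:
  dependsOn:
    - name: system      # cert-manager & co. first
    - name: ingress     # grafana ingress needs the controller
  interval: 10m
  path: ./infrastructure/monitoring/dev
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system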

Got a very similar setup for the tenants' app repos as well (a sketch of the webpage.yaml proxy Kustomization follows the tree):

Flux-tenant repo
├── apps
│   └── webpage
│       ├── base
│       │   ├── deployment.yaml
│       │   ├── ingress.yaml
│       │   ├── kustomization.yaml
│       │   ├── service-account.yaml
│       │   └── service.yaml
│       └── dev
│           ├── ingress.yaml
│           └── kustomization.yaml
└── clusters
    └── dev
        └── apps
            ├── kustomization.yaml
            └── webpage.yaml
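
And webpage.yaml is just another proxy Kustomization pointing at the tenant overlay, roughly like this (a sketch; the GitRepository name, namespace and service account are illustrative):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: webpage
  namespace: apps
spec:
  interval: 10m
  path: ./apps/webpage/dev
  prune: true
  serviceAccountName: apps-reconciler   # tenant SA when multi-tenancy lockdown is enabled
  sourceRef:
    kind: GitRepository
    name: flux-apps
  targetNamespace: apps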

I guess "if it works don't touch it" but those long chains of KS => kustomization => KS => kustomization => HR are becoming hard to keep track of. Plus all of official examples seems to be completely opposite to this. I have a strong feeling I've overengineered this but can't seem to find a way to do a simpler but still flexible setup.

So I'm looking for a different example of a "prod-like" complex setup tested on live clusters - anybody willing to share? :)
Any suggestions are much appreciated.

@torpare

torpare commented Oct 29, 2022

I agree - the examples presented in this repo cannot be followed step-by-step. I've spent hours trying to follow the instructions but it simply doesn't work. I'm pretty sure steps are missing or are listed in the wrong order. Possibly some steps have errors :(

@kingdonb
Member

kingdonb commented Oct 30, 2022

The main difference between the Flux-designed examples and the bootstrap-repo that I use for my demos is that the Flux docs suggest an architecture of "separating apps from infrastructure" without prescribing that every app must be separated from every other app, and every infrastructure component from every other infrastructure component. You can put more apps in that "apps" Kustomization without creating more, as long as they don't need to be applied in a dependency order.

In my example repo I am frequently breaking things, and I have lots of complex interconnected dependencies, so the bootstrap-repo example that you linked @aliusmiles is intended to break those apart, so that independent components can fail independently and not cause surprises through overlap.

I think the problem people trip over is that they try to design a multi-tenant system when they don't have real multi-tenant requirements, so they start skipping steps. This example works by using a "dev-team" branch, which is meant to stand in for a separate repo that belongs to the dev team. Does stating that clearly help / would it help at all if we made that clearer in the doc? @torpare I'm certain the example will work if you don't try to translate it to GitLab on your first time running it.

If you cloned the repo and pushed it up, but didn't also clone the dev-team branch and push it up, you would have problems. In a real production environment you would want this to be a separate repo, so that dev-team doesn't have a repo where write permissions are shared with infrastructure. It's baked in as an expectation here that the "bootstrap repo" becomes a hub for the Platform team to manage many clusters, and it would not make sense for "dev-team" to manage their own infrastructure in a different branch of that repo. That might be the most confusing part of the example, but which is worse - this, or spreading the example across multiple repos? I'm not sure. If you have more than one definition for infrastructure, you can keep them in /infrastructure/a and /infrastructure/b instead of writing them as one definition, and let them inherit the common things from a common base in /infrastructure/bases.
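
A minimal sketch of what I mean by inheriting from a common base (paths are illustrative):

# infrastructure/a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../bases/cert-manager
  - ../bases/ingress-nginx
patches:
  - path: ingress-nginx-values-patch.yaml   # overrides specific to the clusters that use "a"

# infrastructure/b/kustomization.yaml keeps its own resource list and patches,
# while the shared definitions live once under infrastructure/bases/.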

There is also this doc on repository structure; nobody has mentioned it so far, so I wonder if it has been missed:

https://fluxcd.io/flux/guides/repository-structure/#repo-per-environment

There are at least four different approaches to repo structure in that doc; maybe one of them makes more sense for you. It is difficult to offer prescriptive guidance for everyone because of Conway's Law - the structure of your systems will resemble the communication structure of your organization, and every organization is likely to have a different, unique structure of its own.

@tbondarchuk
Author

@kingdonb,

Regarding the dev-team branch - I guess it would be useful to have at least a small note stating that in this repo the branch is used to emulate a team repository or something similar, because currently you need to click on the tenant repository link to realize that there is another branch right under your nose. Kind of non-obvious and confusing for new users I believe, though not an issue after spending some time reading around.

The repository structure doc is hard to miss, really 😀 But compared to other guides, like image-update, it lacks details. I mean, it's an excellent "design overview", but when you start looking at the example repos for implementation details you suddenly find out that for live usage you need to do it completely from scratch all by yourself.

Like you said:

an architecture of "separating apps from infrastructure" without prescribing that every app must be separated from every other app, and every infrastructure component from every other infrastructure component

this is exactly one of the problems I have with the repository-structure guide and all the example repos - they do not take into account dependencies between apps/infra components. Another is that they do not scale well. Let me illustrate (monorepo example for simplicity):

Example repo
.
├── apps
│   ├── base
│   │   ├── backend
│   │   │   ├── configmap.yaml
│   │   │   ├── deployment.yaml
│   │   │   ├── kustomization.yaml
│   │   │   ├── secret.yaml
│   │   │   └── service.yaml
│   │   ├── db
│   │   │   ├── hr-postgres.yaml
│   │   │   └── kustomization.yaml
│   │   └── frontend
│   │       ├── configmap.yaml
│   │       ├── deployment.yaml
│   │       ├── ingress.yaml
│   │       ├── kustomization.yaml
│   │       └── service.yaml
│   ├── prod
│   │   ├── backend-configmap.yaml
│   │   ├── backend-secret.yaml
│   │   ├── frontend-configmap.yaml
│   │   ├── frontend-ingress.yaml
│   │   ├── kustomization.yaml
│   │   └── values-db.yaml
│   └── stage
│       ├── backend-configmap.yaml
│       ├── backend-secret.yaml
│       ├── frontend-configmap.yaml
│       ├── frontend-ingress.yaml
│       ├── kustomization.yaml
│       └── values-db.yaml
└── infrastructure
    ├── base
    │   ├── hr-aws-load-balancer-controller.yaml
    │   ├── hr-cert-manager.yaml
    │   ├── hr-external-dns.yaml
    │   ├── hr-ingress-nginx.yaml
    │   ├── ks-cert-issuer.yaml
    │   ├── kustomization.yaml
    │   ├── namespaces.yaml
    │   ├── source-cert-manager.yaml
    │   ├── source-eks.yaml
    │   ├── source-external-dns.yaml
    │   └── source-ingress-nginx.yaml
    ├── prod
    │   ├── kustomization.yaml
    │   ├── values-aws-load-balancer-controller.yaml
    │   ├── values-external-dns.yaml
    │   └── values-ingress-nginx.yaml
    └── stage
        ├── kustomization.yaml
        ├── values-aws-load-balancer-controller.yaml
        ├── values-external-dns.yaml
        └── values-ingress-nginx.yaml

Adding more apps/infrastructure will result in a long kustomization file and quite a lot of values/patches files, but ok, it works - as long as you don't have any HR => KS or vice versa dependencies. I guess it'll be much easier to follow this setup once Flux supports cross-resource dependencies, but there is another potential issue - all apps under a single Flux Kustomization means a) suspend one - suspend all, and b) one fails - the whole KS is failed.
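
What I ended up with instead is one Flux Kustomization per app (which also implies per-app overlay folders rather than the single apps/stage overlay above) - a sketch with made-up names:

# clusters/stage/apps/db.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: db
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/db/stage
  prune: true
  wait: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
# clusters/stage/apps/backend.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: backend
  namespace: flux-system
spec:
  dependsOn:
    - name: db        # backend waits for postgres, frontend can depend on backend, etc.
  interval: 10m
  path: ./apps/backend/stage
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

That way one failing app only marks its own Kustomization as not ready, and you can suspend it alone.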

I completely understand that a "one size fits all" example repo is impossible to create, but still, all the examples look like they were created just for a short demo on minikube (as I guess they were). But come on - nobody installs, let's say, ingress-nginx with the same configuration on all clusters. Or uses a single app without dependencies. Having "more meat on the bones" in the example repos would be very nice I believe.

Surely the Flux dev team must've used Flux internally at a large scale, not only on kind/minikube local clusters? 😀 So would you mind perhaps sharing some "battle tested" configs in the form of example repos? It's just that all the other Flux docs are the "guide by the hand" type for all the small components, but when you try to put all the things together - you are on your own 🤷 And when you search for a more complex example - you can't find anything usable, at least for my case of multiple EKS clusters. Though I have a feeling that the "EKS" part is what's causing me trouble - it looks like GKE for example requires much less infrastructure to install before actual usage (EKS needs about 15 services if you want ingress/monitoring/etc). And to be honest I should start by making my own setup public, even though I'm still not sure it's a good one.

P.S. Sorry for the rant, it's just that it was a bit of a frustrating experience trying to figure it all out by yourself.
P.P.S. Changed username from @aliusmiles => @tbondarchuk, hence the broken mentions.

@chenditc

@tbondarchuk

As you mentioned, some apps will need a different config for each cluster, just like nginx ingress. If we have n clusters and m such apps, we will need to manage n*m kustomization overlay files. That sounds impossible to me without some scripting.

Did you manage to make your multi-tenant flux setup public? I'd like to see where you finally ended up.

@vterdunov

vterdunov commented Jan 26, 2023

If you don't use Helm it becomes even harder with pre-deploy + deploy dependencies, e.g. for DB migrations: 2x kustomizations.
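
Roughly two Kustomizations chained like this (a sketch with hypothetical paths - the migrations one has to be healthy before the app is applied):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp-migrations
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/myapp/migrations   # the migration Job manifests
  prune: true
  wait: true                      # only Ready once the Job reports completion
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  dependsOn:
    - name: myapp-migrations
  interval: 10m
  path: ./apps/myapp/deploy
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system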

Some sort of scripting or automation would help. At the moment I'm pinning my hopes on CUE lang.

@tuxillo

tuxillo commented Mar 8, 2023

Surely the Flux dev team must've used Flux internally at a large scale, not only on kind/minikube local clusters? 😀 So would you mind perhaps sharing some "battle tested" configs in the form of example repos? It's just that all the other Flux docs are the "guide by the hand" type for all the small components, but when you try to put all the things together - you are on your own 🤷 And when you search for a more complex example - you can't find anything usable, at least for my case of multiple EKS clusters. Though I have a feeling that the "EKS" part is what's causing me trouble - it looks like GKE for example requires much less infrastructure to install before actual usage (EKS needs about 15 services if you want ingress/monitoring/etc). And to be honest I should start by making my own setup public, even though I'm still not sure it's a good one.

Did you eventually make your own setup public? How did you solve it?

Another (unrelated to Flux) thing to ask is: where do you guys place the Terraform files, if that's what you use to provision clusters?

@freimer

freimer commented Apr 12, 2023

I have a setup I need to get working with about 400 clusters. Any guidance on that kind of scale with FluxCD and GitLab?

@mrad-bilel

@stefanprodan can you help us please?

@tbondarchuk
Author

tbondarchuk commented May 25, 2023

And I've finally got some time to put together an example of what I use: https://github.com/tbondarchuk/flux-fleet

It's all sanitized of sensitive info, but the basic setup is the same as the live one.

It's lacking proper documentation I'm afraid, but I'm happy to answer any questions. I will add a flux-apps example repo to show them working together, but a bit later.

It's the third or so iteration since I started with flux - the best I can come up with. Bundling related services together in groups, like monitoring, is really more convenient for me than just putting all the HRs and values files together, given all the services EKS needs just to function properly (not counting extras like cnpg or dex, etc).

It's a setup best suited to a small scale, like 3-4 clusters. It would be a total mess with 10 or more, I guess.

And it would be a bit simpler if we could have a dependency between a Kustomization and a HelmRelease - then all those "wrapper" kustomizations would be unnecessary.

Update: a few words on the HR updates setup: renovate creates PRs with new versions for stage and other lower envs, while prod has versions locked by patches in kustomization.yaml. I had too many surprises from flux auto-updating a chart to a minor or patch version that included breaking changes, and waking up to a slack channel full of overnight failed-update notifications. Syncing updates to prod manually is getting annoying, though, but I haven't decided on a better solution yet.
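
The prod pin itself is just a kustomize patch over the base HelmRelease, something like this (sketch, illustrative chart name and version):

# infrastructure/monitoring/prod/kustomization.yaml (excerpt)
patches:
  - target:
      kind: HelmRelease
      name: kube-prometheus-stack
    patch: |
      - op: replace
        path: /spec/chart/spec/version
        value: "45.2.0"   # prod stays here until the stage bump has soaked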

@tbondarchuk
Author

@chenditc you are completely right, and I don't think my setup would really work with hundreds of clusters. The maximum I have at the moment is 4 clusters per flux-fleet repo.

@tbondarchuk
Author

@tuxillo on terraform - personally I just keep everything in a dedicated terraform repo, separated by env/component (dev/eks, prod/acm, etc), using https://github.com/terraform-aws-modules whenever possible and simple copy-pasted resources in simpler cases.

And I run apply manually since it's all small scale anyway. There are all sorts of automations for terraform, including https://github.com/weaveworks/tf-controller, but in all my experience terraform tends to fail three times: on validate, then plan, then apply, so I would be really hesitant to rely on automatic apply without at least reviewing the plan, especially for complex setups like EKS and supporting infra.

@tuxillo

tuxillo commented Jun 1, 2023

@tuxillo on terraform - personally I just keep everything in a dedicated terraform repo, separated by env/component (dev/eks, prod/acm, etc), using https://github.com/terraform-aws-modules whenever possible and simple copy-pasted resources in simpler cases.

I'm thinking of having a production repo where I place all the infrastructure-related stuff and keep all the flux repos outside. I personally don't like the env/component approach and lean more towards something service-oriented, perhaps separated by provider - I still don't know. If I come up with something usable I'll share it.

@Sturgelose

Sturgelose commented Nov 10, 2023

I know this issue is a bit old, but it's still relevant (I subscribed to it a year ago).

So, I wrote a post on Hackernoon that provides an opinionated structure for scaling clusters in any cloud (as long as you have K8s and Flux).
It is the first part of a series and will tackle multi-tenancy soonish, but I guess you can get some insights and ideas from it: https://hackernoon.com/how-to-structure-your-k8s-gitops-repository-at-scale-part-1

Feedback is welcome! (And sorry for the shameless self-promotion 😅)

@kitforbes

And I've finally got some time to put together an example of what I use: https://github.com/tbondarchuk/flux-fleet

@tbondarchuk Thanks for sharing, it was great to see how others approach Flux in a "real world" scenario. I've had a similar issue getting my head around how to organise my Flux repository for our needs, and I've found the official examples to be a little sparse.

@Aubermean

Facing all the same issues you have mentioned; it's a shame @stefanprodan hasn't chimed in yet, as I see this as the most fundamental reason why many, including big orgs, end up going with Argo instead! First impressions are everything...

@stefanprodan
Member

You can find a reference architecture for using Flux on multi-tenant / multi-cluster setups here:
