
Accumulating new AuthRequest objects can cause a denial of service on storage. #1292

Open
pjmichael opened this issue Sep 7, 2018 · 29 comments


@pjmichael

pjmichael commented Sep 7, 2018

We are running dex in a Kubernetes cluster using Kubernetes CRDs as our storage implementation. We recently came across a scenario in our testing in which a misconfiguration resulted in the storage of a large number of authorization requests.

The implications of this scenario for our implementation were:

  • Each authorization request would result in the creation of a new AuthRequest object, hard-coded with an expiry of 24 hours (see: https://github.com/dexidp/dex/blob/master/server/handlers.go#L163)
  • Even though garbage collection ran every 5 minutes, AuthRequest objects continued to accumulate for 24 hours (i.e. until they expired)
  • Garbage collection lists all AuthRequest objects, without pagination or filtering, at the start of each execution
  • Eventually, the list AuthRequests call began to time out, causing garbage collection to fail
  • By the time AuthRequest objects began to expire, they were not being cleaned up, as garbage collection could no longer list AuthRequests successfully
  • The flow-on effect was a degradation of the Kubernetes apiserver (i.e. the storage implementation)

Do you have any thoughts as to how this issue could be avoided?

Some possible ideas I thought might help:

  • making AuthRequest expiry configurable to reduce the period in which they could accumulate prior to being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects (see the sketch below)
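For illustration, in-process rate limiting ahead of AuthRequest creation could look roughly like the sketch below, using golang.org/x/time/rate. The handler, limits, and package layout are hypothetical, not Dex's actual code.

```go
// Hypothetical sketch: reject excessive requests per client IP before any
// AuthRequest object is written to storage.
package main

import (
	"log"
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// ipLimiter hands out one token-bucket limiter per client IP.
type ipLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newIPLimiter() *ipLimiter {
	return &ipLimiter{limiters: make(map[string]*rate.Limiter)}
}

// get returns the limiter for an IP, creating one that allows 1 request/second
// with a burst of 5 if none exists yet.
func (l *ipLimiter) get(ip string) *rate.Limiter {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.limiters[ip]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1), 5)
		l.limiters[ip] = lim
	}
	return lim
}

// limitAuth wraps a handler and answers 429 once an IP exceeds its budget.
func limitAuth(next http.Handler) http.Handler {
	limiters := newIPLimiter()
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if !limiters.get(ip).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/auth", limitAuth(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("auth endpoint placeholder\n"))
	})))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

A real implementation would also need to evict idle entries from the per-IP map and to respect X-Forwarded-For when Dex sits behind a proxy.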
@srenatus
Contributor

Thanks for reporting this, and including all that information. 👍

  • making AuthRequest expiry configurable to reduce the period in which they could accumulate prior to being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

Looking at the GC code, I think this is an issue specific to the k8s CRD usage -- the SQL query is likely much more efficient. Also, I know next to nothing about CRDs, so I don't know if there are ways to tweak the garbage collection code in any way... as you mention, pagination could be one thing (no idea if that's possible).
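For reference, the Kubernetes list API does support chunking via the limit and continue parameters, so a paginated GC pass is at least possible. A rough client-go sketch, illustrative only (Dex's CRD storage uses its own REST client, and the resource name here is assumed from the dex.coreos.com CRD group):

```go
// Illustrative sketch: list AuthRequest custom resources in pages of 500
// instead of asking the apiserver for the whole set in a single LIST call.
package gcsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var authRequestGVR = schema.GroupVersionResource{
	Group:    "dex.coreos.com",
	Version:  "v1",
	Resource: "authrequests",
}

// forEachAuthRequest walks all AuthRequest objects page by page, calling fn
// for each one (e.g. to delete it if it has expired).
func forEachAuthRequest(ctx context.Context, client dynamic.Interface, namespace string, fn func(unstructured.Unstructured) error) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.Resource(authRequestGVR).Namespace(namespace).List(ctx, opts)
		if err != nil {
			return err
		}
		for _, item := range list.Items {
			if err := fn(item); err != nil {
				return err
			}
		}
		// An empty continue token means there are no further pages.
		if list.GetContinue() == "" {
			return nil
		}
		opts.Continue = list.GetContinue()
	}
}
```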

I think some kind of rate limiting would be good, even regardless of this specific bug. (However, I also wonder if this wouldn't often be handled by some reverse proxy sitting in front of Dex.)

For the AuthRequest generation, the issue referenced above that totally arbitrary value seems related, too: #646. If fewer AuthRequests were generated, that could relieve the pressure, too.

@ericchiang
Contributor

I think the crux of the issue is "prevent unauthenticated users from creating too much backend state."

Rate limiting would be good, but overall we probably want to tie the auth request to an IP or something. The proposal in #646 is one way, but doesn't stop a motivated attacker from messing with your cluster.

Removing "help wanted" for now. I don't think there's a clear enough solution to warrant the label.

@mxey
Contributor

mxey commented Dec 11, 2018

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

TBH I would prefer a band-aid over downtime.

@srenatus
Contributor

@mxey help wanted or not, exposing that as a configurable seems like a welcome PR to me. 😃

srenatus added a commit that referenced this issue Dec 13, 2018
Make expiry of auth requests configurable

This is a band-aid against #1292

I did not change the default from 24h, but I think it should be much lower for safety.
@ericchiang
Contributor

I bet there's a way we can get away from storing state until we know a user is authenticated, by using a cookie. The naive way would be to serialize the AuthRequest, sign it using dex's signing keys, and shove it into the user's session:

https://godoc.org/github.com/coreos/dex/storage#AuthRequest

Then we'd only persist the AuthRequest once we get a valid response from the backend (e.g. the LDAP server), at which point we'd fill in the Claims and ConnectorData.

Overall, I think breaking up the AuthRequest into different classes of data would help move that effort forward. Even just adding comments to the struct about which fields are generated by the initial request, which are filled in once the user has authenticated with a backend, and what other data is accumulated along the way. It was probably wrong to lump all that data together in the first place.
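A minimal sketch of the sign-and-store-in-a-cookie idea, using only the Go standard library. The authRequest struct, cookie name, and helper functions below are illustrative, not Dex's actual types:

```go
// Illustrative sketch (not Dex's actual code): serialize the auth request,
// MAC it, and keep it in the browser instead of in the storage backend.
package statelessauth

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"net/http"
	"strings"
	"time"
)

// authRequest holds only the fields needed to resume the flow; field names
// here are illustrative, not Dex's storage.AuthRequest.
type authRequest struct {
	ClientID    string    `json:"client_id"`
	RedirectURI string    `json:"redirect_uri"`
	Scopes      []string  `json:"scopes"`
	State       string    `json:"state"`
	Expiry      time.Time `json:"expiry"`
}

// sign computes an HMAC-SHA256 tag over the serialized request.
func sign(key, payload []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// setAuthRequestCookie stores the signed request in the user's browser,
// so nothing is persisted server-side for unauthenticated users.
func setAuthRequestCookie(w http.ResponseWriter, key []byte, req authRequest) error {
	payload, err := json.Marshal(req)
	if err != nil {
		return err
	}
	value := base64.RawURLEncoding.EncodeToString(payload) + "." + sign(key, payload)
	http.SetCookie(w, &http.Cookie{
		Name:     "dex_auth_request", // hypothetical cookie name
		Value:    value,
		Path:     "/",
		HttpOnly: true,
		Secure:   true,
		Expires:  req.Expiry,
	})
	return nil
}

// authRequestFromCookie verifies the MAC and expiry before trusting anything
// the browser sent back; only then would the request be persisted.
func authRequestFromCookie(r *http.Request, key []byte) (authRequest, error) {
	var req authRequest
	c, err := r.Cookie("dex_auth_request")
	if err != nil {
		return req, err
	}
	parts := strings.SplitN(c.Value, ".", 2)
	if len(parts) != 2 {
		return req, errors.New("malformed auth request cookie")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return req, err
	}
	if !hmac.Equal([]byte(sign(key, payload)), []byte(parts[1])) {
		return req, errors.New("bad auth request signature")
	}
	if err := json.Unmarshal(payload, &req); err != nil {
		return req, err
	}
	if time.Now().After(req.Expiry) {
		return req, errors.New("auth request expired")
	}
	return req, nil
}
```

Nothing touches the storage backend until the callback verifies the MAC and expiry, so an unauthenticated request costs only a signature check rather than a stored object.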

@srenatus
Contributor

by using a cookie.

One thing to consider here is that cookies are per browser, not per tab. With the current way of redirecting with a req=.. parameter, you can have two concurrent login flows in one browser; using cookies, this could become more complicated.

As an alternative, we might be able to pass the authreq as a signed JWT instead -- so it wouldn't be ?req=auth_req_id but ?req=auth_req_as_jwt. We wouldn't have to store anything in the database, and we'd keep things working "per-tab". A potential caveat is header size (redirecting using a URL that has the auth req JWT in it) -- but since the auth req contains little more than the connector ID, some expiry, a header, and a signature, I don't think that'll become an issue.

@mxey
Contributor

mxey commented Dec 18, 2018

With the current way of redirecting with a req=.. parameter, you can have two concurrent login flows in one browser; using cookies, this could become more complicated.

I would expect the application you are logging in to with OIDC to be using session cookies anyway.

@srenatus
Contributor

@mxey indeed -- we do the same. But I'm a little uneasy with imposing this on everyone...

On a related note, I don't think my previous proposal (JWT authreq) makes the need for rate limiting go away: right now, unauthenticated requests cause database entries; with the proposal, they'd cause some calculations (checksums for the JWT signatures). So, it's still worth limiting the impact of what an unauthenticated peer can do... 🤔

@forsberg

We're being hit by this now and then due to Apache Flink Admin UIs we have put behind authentication with oauth2proxy and dex, with a SAML IdP. The Admin UI will auto-refresh, and once all the authentication expires, it will continue to auto-refresh and cause a new authrequest every time it does so, eventually filling up etcd :-(

Just my two cents on ways this bug makes life harder.

@gavinmcnair

Cookies also would not prevent a deliberate filling of etcd by unauthenticated users. In our case, our own blackbox exporter had created over 500,000 auth entries in less than 24 hours.

mmrath pushed a commit to mmrath/dex that referenced this issue Sep 2, 2019

Make expiry of auth requests configurable

This is a band-aid against dexidp#1292

I did not change the default from 24h, but I think it should be much lower for safety.
pieterlange added a commit to pieterlange/kubernetes-failure-stories that referenced this issue Oct 23, 2019
Slides for my failure story about the default [dex](https://github.com/dexidp/dex/) configuration of storing authrequests as CustomResources, and its potential for nuking your kubernetes control plane.

Ref dexidp/dex#1292
Shared at https://www.meetup.com/Dutch-Kubernetes-Meetup/events/262313920/
hjacobs pushed a commit to hjacobs/kubernetes-failure-stories that referenced this issue Oct 24, 2019

* Add dex/CR/bad defaults failure story
* Move fullstaq story to top of the failure stack
@primeroz

We just got hit by this issue pretty badly using the kubernetes storage backend.

In one of our clusters we ended up with 260K objects for the authrequests.dex.coreos.com CRD, causing loads of problems as you can imagine.

@IliaGershenzon

We have reached 500K authrequest objects and the issues have started.
Is there an update to the Dex GC code to limit the "list" of authrequests so it can at least clean them up slowly?
How did you clean up all the objects when the GC was dead?

@primeroz

primeroz commented Jan 8, 2020

@IliaGershenzon

The only way to clear the objects was to act directly on the storage layer. In our case, since it was Kubernetes, we just deleted all the authrequests.dex.coreos.com resources; if you are using a different backend, just find those objects and delete them.

In our case we had a "misconfiguration" in the healthcheck of an OIDC proxy that was causing every healthcheck to create an authrequest.
Reaching 500K authrequest objects means something is creating login requests and never finishing them... this must be a misconfiguration somewhere (or a DDoS? :) )
If you look at the actual content of those authrequests you might see the client / path and understand what is creating them.

Make sure you find the source of your issue, or cleaning up will be pointless.
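For the Kubernetes backend, the cleanup described above can look something like this; the namespace is an assumption, so adjust it to wherever Dex's custom resources live:

```console
# Count accumulated auth requests.
kubectl -n dex get authrequests.dex.coreos.com --no-headers | wc -l

# Bulk-delete them once the source of the requests has been fixed.
kubectl -n dex delete authrequests.dex.coreos.com --all
```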

@IliaGershenzon

@primeroz
Yeah, found the source of the mess... It was a proxy someone deployed on top of Kiali (Istio). Dex should add "?limit=500" to the list call in the GC code :0
Thanks!

@githubrotem

Is there any real solution for this issue?
Right now anyone can make a curl loop and bring the cluster down

@primeroz

Not a real solution, but we decided to move the storage away from Kubernetes to its own backend. This way the worst that can happen is that the dex service gets DDoSed, not the Kubernetes cluster.

@pieterlange

FWIW I took similar precautions, moving storage to a dedicated etcd cluster (of 1 node) to constrain the impact of this bug to that etcd cluster instead of the main k8s API. This has worked out for me so far.

@Kyrremann

Does anyone have tips for locating what's hammering our dex? I tried shutting down the services I thought were the culprit, but it didn't help.

@primeroz

@Kyrremann if you describe / look into the authrequests, you will see information about the client that created the request; that might help you identify the offender.

good luck!

@Kyrremann

Kyrremann commented Jan 31, 2020

I've tried that, but maybe I'm misunderstanding something, or is kubeflow-authservice-oidc the offender?

$ k describe authrequest zzvnxk7ejxydqnsbnvpusrm5k
Name:         zzvnxk7ejxydqnsbnvpusrm5k
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  dex.coreos.com/v1
Claims:
  Email:           
  Email Verified:  false
  User ID:         
  Username:        
Client ID:         kubeflow-authservice-oidc
Expiry:            2020-01-31T16:50:52.200156942Z
Kind:              AuthRequest
Logged In:         false
Metadata:
  Creation Timestamp:  2020-01-30T16:50:52Z
  Generation:          1
  Resource Version:    61443521
  Self Link:           /apis/dex.coreos.com/v1/namespaces/kubeflow/authrequests/zzvnxk7ejxydqnsbnvpusrm5k
  UID:                 11eb53c7-0772-4242-b1eb-145226daa34c
Redirect URI:          https://kubeflow.some.site/login/oidc
Response Types:
  code
Scopes:
  openid
  profile
  email
  groups
State:   BX15XBrv
Events:  <none>

I do have a lot of

2020/01/31 10:13:30 xxx.xxx.xxx.xx /notebook/opendata-dev/opendata-dev/api/sessions?1580465610725 Cookie not set, redirecting to login.

in the logs from authservice.

@primeroz

Yeah, the redirect URI identifies, more or less, the client that initiated the auth request; it's the URI an authenticated request will be redirected to.

@Kyrremann

I'm wondering if the issue is that we don't have an internal load balancer, so we're redirecting between two ports externally when logging in (if I remember/understand our network correctly). So it may be the refresh token, or something similar, that triggers all these calls.

@marcjimz

Is this still an active issue with the latest Dex code?

@aaron-arellano

aaron-arellano commented Aug 18, 2022

We are still seeing this issue with v2.24.0.

@Dentrax

Dentrax commented Dec 27, 2022

Hey, is it still the case? Anyone working on this?

CC'ing @sagikazarmark for awareness.

@bella-wmi

Was there any update or fix for this issue? Or maybe a workaround? I'm facing a similar issue.

@nabokihms
Member

The problem can only occur due to a high rate of authentication requests. The best way to solve the issue is to protect the authentication endpoint with a rate-limiting solution, e.g. a reverse proxy with that feature.

Here is an example of how you can protect your authentication endpoint with ingress-nginx in Kubernetes:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex-auth
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/limit-rpm: "20"
    # Works only with ingress-nginx controllers >=0.40. It is here so the annotation is not forgotten after upgrading the ingress controller.
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /auth
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443

One more thing: do not forget to set the lifetime for auth requests (expiry.authRequests) to something like 10m. However, this will affect how long users can stay on the login page.
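For example, a Dex config fragment along these lines (the value is illustrative):

```yaml
# Shorten how long pending auth requests are kept in storage (the default is 24h).
expiry:
  authRequests: "10m"
```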

@yurkoff-mv

@nabokihms, thanks for the proposed solution. Let me ask a few questions about this.

  1. Why two separate Ingresses? I don't really understand the purpose of the first one.
  2. Why port 443 and not 5556?
  3. Do I need to use certificates: create a ClusterIssuer and add the line cert-manager.io/cluster-issuer: cert-issuer to the annotations in Ingress?

@nabokihms
Member

You need to use the port of your service. Two Ingresses are required to distinguish auth requests from other requests. As for certificates, it is up to you 🙂
