
Accumulating new AuthRequest objects can cause a denial of service on storage. #1292

Open
pjmichael opened this issue Sep 7, 2018 · 29 comments


@pjmichael

pjmichael commented Sep 7, 2018

We are running dex in a Kubernetes cluster using Kubernetes CRDs as our storage implementation. We recently came across a scenario in our testing in which a misconfiguration resulted in the storage of a large number of authorization requests.

The implications of this scenario for our implementation were:

  • Each authorization request would result in the creation of a new AuthRequest object, hard-coded with an expiry of 24 hours (see: https://github.com/dexidp/dex/blob/master/server/handlers.go#L163)
  • Even though garbage collection ran every 5 minutes, AuthRequest objects continued to accumulate for 24 hours (i.e. until they expired)
  • Garbage collection lists all AuthRequest objects, without pagination or filtering, at the start of each execution
  • Eventually, the list AuthRequests call began to time out, causing garbage collection to fail
  • By the time AuthRequest objects began to expire, they were not being cleaned up, as garbage collection could no longer list AuthRequests successfully
  • The flow-on effect was a degradation of the Kubernetes apiserver (i.e. the storage implementation)

Do you have any thoughts as to how this issue could be avoided?

Some possible ideas I thought might help:

  • making AuthRequest expiry configurable to reduce the period in which they could accumulate prior to being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects (see the sketch below)
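For illustration, in-process rate limiting ahead of AuthRequest creation could look roughly like the sketch below, using golang.org/x/time/rate. The handler, limits, and package layout are hypothetical, not Dex's actual code.

```go
// Hypothetical sketch: reject excessive requests per client IP before any
// AuthRequest object is written to storage.
package main

import (
	"log"
	"net"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// ipLimiter hands out one token-bucket limiter per client IP.
type ipLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newIPLimiter() *ipLimiter {
	return &ipLimiter{limiters: make(map[string]*rate.Limiter)}
}

// get returns the limiter for an IP, creating one that allows 1 request/second
// with a burst of 5 if none exists yet.
func (l *ipLimiter) get(ip string) *rate.Limiter {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.limiters[ip]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(1), 5)
		l.limiters[ip] = lim
	}
	return lim
}

// limitAuth wraps a handler and answers 429 once an IP exceeds its budget.
func limitAuth(next http.Handler) http.Handler {
	limiters := newIPLimiter()
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}
		if !limiters.get(ip).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/auth", limitAuth(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("auth endpoint placeholder\n"))
	})))
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

A real implementation would also need to evict idle entries from the per-IP map and to respect X-Forwarded-For when Dex sits behind a proxy.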
@srenatus
Contributor

Thanks for reporting this, and including all that information. 👍

  • making AuthRequest expiry configurable to reduce the period in which they could accumulate prior to being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

Looking at the GC code, I think this is an issue specific to the k8s CRD usage -- the SQL query is likely much more efficient. Also, I know next to nothing about CRDs, so I don't know if there are ways to tweak the garbage collection code in any way... as you mention, pagination could be one thing (no idea if that's possible).
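For reference, the Kubernetes list API does support chunking via the limit and continue parameters, so a paginated GC pass is at least possible. A rough client-go sketch, illustrative only (Dex's CRD storage uses its own REST client, and the resource name here is assumed from the dex.coreos.com CRD group):

```go
// Illustrative sketch: list AuthRequest custom resources in pages of 500
// instead of asking the apiserver for the whole set in a single LIST call.
package gcsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var authRequestGVR = schema.GroupVersionResource{
	Group:    "dex.coreos.com",
	Version:  "v1",
	Resource: "authrequests",
}

// forEachAuthRequest walks all AuthRequest objects page by page, calling fn
// for each one (e.g. to delete it if it has expired).
func forEachAuthRequest(ctx context.Context, client dynamic.Interface, namespace string, fn func(unstructured.Unstructured) error) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := client.Resource(authRequestGVR).Namespace(namespace).List(ctx, opts)
		if err != nil {
			return err
		}
		for _, item := range list.Items {
			if err := fn(item); err != nil {
				return err
			}
		}
		// An empty continue token means there are no further pages.
		if list.GetContinue() == "" {
			return nil
		}
		opts.Continue = list.GetContinue()
	}
}
```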

I think some kind of rate limiting would be good, even regardless of this specific bug. (However, I also wonder if this wouldn't often be handled by some reverse proxy sitting in front of Dex.)

For the AuthRequest generation, the issue referenced above that totally arbitrary value seems related, too: #646. If fewer AuthRequests were generated, that could relieve the pressure, too.

@ericchiang
Contributor

I think the crux of the issue is "prevent unauthenticated users from creating too much backend state."

Rate limiting would be good, but overall we probably want to tie the auth request to an IP or something. The proposal in #646 is one way, but doesn't stop a motivated attacker from messing with your cluster.

Removing "help wanted" for now. I don't think there's a clear enough solution to warrant the label.

@mxey
Contributor

mxey commented Dec 11, 2018

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

TBH I would prefer a band-aid over downtime.

@srenatus
Contributor

@mxey help wanted or not, exposing that as a configurable seems like a welcome PR to me. 😃

srenatus added a commit that referenced this issue Dec 13, 2018
Make expiry of auth requests configurable

This is a band-aid against #1292

I did not change the default from 24h, but I think it should be much lower for safety.
@ericchiang
Contributor

I bet there's a way we can get away from storing state until we know a user is authenticated, by using a cookie. The naive way would be to serialize the AuthRequest, sign it using dex's signing keys, and shove it into the user's session:

https://godoc.org/github.com/coreos/dex/storage#AuthRequest

Then we'd only persist the AuthRequest once we get a valid response from the backend (e.g. the LDAP server), at which point we'd fill in the Claims and ConnectorData.

Overall, I think breaking up the AuthRequest into different classes of data would help move that effort forward. Even just adding comments to the struct about which fields are generated by the initial request, which are filled in once the user has authenticated with a backend, and what other data is accumulated along the way. It was probably wrong to lump all that data together in the first place.
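A minimal sketch of the sign-and-store-in-a-cookie idea, using only the Go standard library. The authRequest struct, cookie name, and helper functions below are illustrative, not Dex's actual types:

```go
// Illustrative sketch (not Dex's actual code): serialize the auth request,
// MAC it, and keep it in the browser instead of in the storage backend.
package statelessauth

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"net/http"
	"strings"
	"time"
)

// authRequest holds only the fields needed to resume the flow; field names
// here are illustrative, not Dex's storage.AuthRequest.
type authRequest struct {
	ClientID    string    `json:"client_id"`
	RedirectURI string    `json:"redirect_uri"`
	Scopes      []string  `json:"scopes"`
	State       string    `json:"state"`
	Expiry      time.Time `json:"expiry"`
}

// sign computes an HMAC-SHA256 tag over the serialized request.
func sign(key, payload []byte) string {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// setAuthRequestCookie stores the signed request in the user's browser,
// so nothing is persisted server-side for unauthenticated users.
func setAuthRequestCookie(w http.ResponseWriter, key []byte, req authRequest) error {
	payload, err := json.Marshal(req)
	if err != nil {
		return err
	}
	value := base64.RawURLEncoding.EncodeToString(payload) + "." + sign(key, payload)
	http.SetCookie(w, &http.Cookie{
		Name:     "dex_auth_request", // hypothetical cookie name
		Value:    value,
		Path:     "/",
		HttpOnly: true,
		Secure:   true,
		Expires:  req.Expiry,
	})
	return nil
}

// authRequestFromCookie verifies the MAC and expiry before trusting anything
// the browser sent back; only then would the request be persisted.
func authRequestFromCookie(r *http.Request, key []byte) (authRequest, error) {
	var req authRequest
	c, err := r.Cookie("dex_auth_request")
	if err != nil {
		return req, err
	}
	parts := strings.SplitN(c.Value, ".", 2)
	if len(parts) != 2 {
		return req, errors.New("malformed auth request cookie")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return req, err
	}
	if !hmac.Equal([]byte(sign(key, payload)), []byte(parts[1])) {
		return req, errors.New("bad auth request signature")
	}
	if err := json.Unmarshal(payload, &req); err != nil {
		return req, err
	}
	if time.Now().After(req.Expiry) {
		return req, errors.New("auth request expired")
	}
	return req, nil
}
```

Nothing touches the storage backend until the callback verifies the MAC and expiry, so an unauthenticated request costs only a signature check rather than a stored object.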

@srenatus
Contributor

by using a cookie.

One thing to consider here is that cookies are per browser, not per tab. With the current way of redirecting with a req=.. parameter, you can have two concurrent login flows in one browser; using cookies, this could become more complicated.

As an alternative, we might be able to pass the authreq as a signed JWT instead -- so it wouldn't be ?req=auth_req_id but ?req=auth_req_as_jwt. We wouldn't have to store anything in the database, and we'd keep things working "per-tab". A potential caveat is header size (redirecting using a URL that has the auth req JWT in it) -- but since the auth req contains little more than the connector ID, some expiry, a header, and a signature, I don't think that'll become an issue.

@mxey
Contributor

mxey commented Dec 18, 2018

With the current way of redirecting with a req=.. parameter, you can have two concurrent login flows in one browser; using cookies, this could become more complicated.

I would expect the application you are logging in to with OIDC to be using session cookies anyway.

@srenatus
Contributor

@mxey indeed -- we do the same. But I'm a little uneasy with imposing this on everyone...

On a related note, I don't think my previous proposal (JWT authreq) makes the need for rate limiting go away: right now, unauthenticated requests cause database entries; with the proposal, they'd cause some calculations (checksums for the JWT signatures). So, it's still worth limiting the impact of what an unauthenticated peer can do... 🤔

@forsberg

We're being hit by this now and then due to Apache Flink Admin UIs we have put behind authentication with oauth2proxy and dex, with a SAML IdP. The Admin UI will auto-refresh, and once all the authentication expires, it will continue to auto-refresh and cause a new authrequest every time it does so, eventually filling up etcd :-(

Just my two cents on ways this bug makes life harder.

@gavinmcnair

Cookies also would not prevent a deliberate filling of etcd by unauthenticated users. In our case, our own blackbox exporter had created over 500,000 auth entries in less than 24 hours.

mmrath pushed a commit to mmrath/dex that referenced this issue Sep 2, 2019

Make expiry of auth requests configurable

This is a band-aid against dexidp#1292

I did not change the default from 24h, but I think it should be much lower for safety.
pieterlange added a commit to pieterlange/kubernetes-failure-stories that referenced this issue Oct 23, 2019
Slides for my failure story about the default [dex](https://github.com/dexidp/dex/) configuration of storing authrequests as CustomResources, and its potential for nuking your kubernetes control plane.

Ref dexidp/dex#1292
Shared at https://www.meetup.com/Dutch-Kubernetes-Meetup/events/262313920/
hjacobs pushed a commit to hjacobs/kubernetes-failure-stories that referenced this issue Oct 24, 2019

* Add dex/CR/bad defaults failure story
* Move fullstaq story to top of the failure stack
@primeroz

We just got hit by this issue pretty badly using the kubernetes storage backend.

In one of our clusters we ended up with 260K objects for the authrequests.dex.coreos.com CRD, causing loads of problems as you can imagine.

@IliaGershenzon

We have reached 500K authrequest objects and the issues have started.
Is there an update to the Dex GC code to limit the "list" of authrequests so it can at least clean them up slowly?
How did you clean up all the objects when the GC was dead?

@primeroz

primeroz commented Jan 8, 2020

@IliaGershenzon

The only way to clear the objects was to act directly on the storage layer. In our case, since it was Kubernetes, we just deleted all the authrequests.dex.coreos.com resources; if you are using a different backend, just find those objects and delete them.

In our case we had a "misconfiguration" in the healthcheck of an OIDC proxy that was causing every healthcheck to create an authrequest.
Reaching 500K authrequest objects means something is creating login requests and never finishing them... this must be a misconfiguration somewhere (or a DDoS? :) )
If you look at the actual content of those authrequests you might see the client / path and understand what is creating them.

Make sure you find the source of your issue, or cleaning up will be pointless.
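For the Kubernetes backend, the cleanup described above can look something like this; the namespace is an assumption, so adjust it to wherever Dex's custom resources live:

```console
# Count accumulated auth requests.
kubectl -n dex get authrequests.dex.coreos.com --no-headers | wc -l

# Bulk-delete them once the source of the requests has been fixed.
kubectl -n dex delete authrequests.dex.coreos.com --all
```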

@IliaGershenzon

@primeroz
Yeah, found the source of the mess... It was a proxy someone deployed on top of Kiali (Istio). Dex should add "?limit=500" to the list call in the GC code :0
Thanks!

@githubrotem

Is there any real solution for this issue?
Right now anyone can make a curl loop and bring the cluster down

@primeroz

Not a real solution, but we decided to move the storage away from Kubernetes to its own backend. This way the worst that can happen is that the dex service gets DDoSed, not the Kubernetes cluster.

@pieterlange

FWIW I took similar precautions, moving storage to a dedicated etcd cluster (of 1 node) to constrain the impact of this bug to that etcd cluster instead of the main k8s API. This has worked out for me so far.

@Kyrremann

Does anyone have tips for locating what's hammering our dex? I tried shutting down the services I thought were the culprit, but it didn't help.

@primeroz

@Kyrremann if you describe / look into the authrequests, you will see information about the client that created the request; that might help you identify the offender.

good luck!

@Kyrremann

Kyrremann commented Jan 31, 2020

I've tried that, but maybe I'm misunderstanding something, or is kubeflow-authservice-oidc the offender?

$ k describe authrequest zzvnxk7ejxydqnsbnvpusrm5k
Name:         zzvnxk7ejxydqnsbnvpusrm5k
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  dex.coreos.com/v1
Claims:
  Email:           
  Email Verified:  false
  User ID:         
  Username:        
Client ID:         kubeflow-authservice-oidc
Expiry:            2020-01-31T16:50:52.200156942Z
Kind:              AuthRequest
Logged In:         false
Metadata:
  Creation Timestamp:  2020-01-30T16:50:52Z
  Generation:          1
  Resource Version:    61443521
  Self Link:           /apis/dex.coreos.com/v1/namespaces/kubeflow/authrequests/zzvnxk7ejxydqnsbnvpusrm5k
  UID:                 11eb53c7-0772-4242-b1eb-145226daa34c
Redirect URI:          https://kubeflow.some.site/login/oidc
Response Types:
  code
Scopes:
  openid
  profile
  email
  groups
State:   BX15XBrv
Events:  <none>

I do have a lot of

2020/01/31 10:13:30 xxx.xxx.xxx.xx /notebook/opendata-dev/opendata-dev/api/sessions?1580465610725 Cookie not set, redirecting to login.

in the logs from authservice.

@primeroz

Yeah, the redirect URI identifies, more or less, the client that initiated the auth request; it's the URI an authenticated request will be redirected to.

@Kyrremann

I'm wondering if the issue is that we don't have an internal load balancer, so we're redirecting between two ports externally when logging in (if I remember/understand our network correctly). So it may be the refresh token, or something similar, that triggers all these calls.

@marcjimz

Is this still an active issue with the latest Dex code?

@aaron-arellano

aaron-arellano commented Aug 18, 2022

We are still seeing this issue with v2.24.0.

@Dentrax

Dentrax commented Dec 27, 2022

Hey, is it still the case? Anyone working on this?

CC'ing @sagikazarmark for awareness.

@bella-wmi

Was there any update or fix for this issue? Or maybe a workaround? I'm facing a similar issue.

@nabokihms
Member

The problem can only occur due to a high rate of authentication requests. The best way to solve the issue is to protect the authentication endpoint with a rate-limiting solution, e.g. a reverse proxy with that feature.

Here is an example of how you can protect your authentication endpoint with ingress-nginx in Kubernetes:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex-auth
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/limit-rpm: "20"
    # Works only with ingress-nginx controllers >=0.40. It is here so the annotation is not forgotten after upgrading the ingress controller.
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /auth
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443

One more thing: do not forget to set the lifetime for auth requests (expiry.authRequests) to something like 10m. However, this will affect how long users can stay on the login page.
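For example, a Dex config fragment along these lines (the value is illustrative):

```yaml
# Shorten how long pending auth requests are kept in storage (the default is 24h).
expiry:
  authRequests: "10m"
```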

@yurkoff-mv

@nabokihms, thanks for the proposed solution. Let me ask a few questions about this.

  1. Why two separate Ingresses? I don't really understand the purpose of the first one.
  2. Why port 443 and not 5556?
  3. Do I need to use certificates: create a ClusterIssuer and add the line cert-manager.io/cluster-issuer: cert-issuer to the annotations in Ingress?

@nabokihms
Member

You need to use the port of your service. Two Ingresses are required to distinguish auth requests from other requests. As for certificates, it is up to you 🙂
