
Autoscaling selenium grid on kubernetes #1714

Conversation

prashanth-volvocars
Contributor

Description

Autoscale Selenium browser nodes running in Kubernetes based on the requests pending in the session queue, using KEDA. Toggle autoscaling on/off with the 'autoscalingEnabled' option in the Helm charts.
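
As a rough illustration, a values override to turn this on could look like the sketch below. The key names follow this PR's diff and are assumed to sit under each node section; verify against the released chart.

```yaml
# Hypothetical values override -- key names follow this PR's diff, not a verified final layout
chromeNode:
  autoscalingEnabled: true      # let KEDA scale Chrome nodes based on the session queue
  maxReplicaCount: 8            # upper bound for scaled-out browser pods
  hpa:
    url: http://selenium-hub.default:4444/graphql   # GraphQL endpoint KEDA polls for pending sessions
    browserName: chrome
```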

Motivation and Context

Autoscaling Selenium Grid was a problem that had been pending for a long time, so I took it up when a requirement came up at my current workplace. KEDA seemed to be the best candidate for the job, and I wrote a new scaler for Selenium Grid a year ago. I would like to have this enabled by default in our charts so everyone can use it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@prashanth-volvocars
Contributor Author

@diemol @jamesmortensen As discussed earlier, I have raised two separate PRs for autoscaling and video recording.

Autoscale selenium browser nodes running in kubernetes
based on the request pending in session queue using KEDA.
Toggle autoscaling on/off using 'autoscalingEnabled' option
in helm charts.
@jamesmortensen
Member

Hi @prashanth-volvocars, thank you for splitting them up. We'll take a look as soon as we can.

Member

@diemol left a comment


Thanks for this PR, @prashanth-volvocars!

I left some comments.

@@ -29,6 +29,11 @@ helm install selenium-grid docker-selenium/selenium-grid --version <version>
helm install selenium-grid --set ingress.hostname=selenium-grid.k8s.local docker-selenium/chart/selenium-grid/.
```

## Enable Selenium Grid Autoscaling
Selenium Grid has the ability to autoscale browser nodes up/down based on the requests pending in session queue. You can enable it setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the [instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.
Member


Suggested change
Selenium Grid has the ability to autoscale browser nodes up/down based on the requests pending in session queue. You can enable it setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the [instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.
Selenium Grid has the ability to autoscale browser nodes up/down based on the pending requests in the
session queue.
You can enable it by setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the
[instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.

## Enable Selenium Grid Autoscaling
Selenium Grid has the ability to autoscale browser nodes up/down based on the requests pending in session queue. You can enable it setting 'autoscalingEnabled' to `true`. You need to install KEDA by following the [instructions](https://keda.sh/docs/2.8/deploy/#helm) in order for autoscaling to work.

The hpa.url value is configured to work for grid installed in `default` namespace. If you are installing the grid in some other namespace make sure to update the value of hpa.url accordingly.
Member


Suggested change
The hpa.url value is configured to work for grid installed in `default` namespace. If you are installing the grid in some other namespace make sure to update the value of hpa.url accordingly.
The `hpa.url` value is configured to work for Grid when installed in the `default` namespace. If you are installing
the Grid in some other namespace make sure to update the value of `hpa.url` accordingly.
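
For example, if the Grid were installed into a hypothetical `selenium` namespace, the override might look like this (sketch only; the Service name `selenium-hub` follows the value shown in this diff):

```yaml
hpa:
  # in-cluster DNS is <service>.<namespace>, so the namespace segment has to match where the Grid runs
  url: http://selenium-hub.selenium:4444/graphql
```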

@@ -363,7 +364,7 @@ chromeNode:
# Custom annotations for service
annotations: {}
# Size limit for DSH volume mounted in container (if not set, default is "1Gi")
dshmVolumeSizeLimit: 1Gi
dshmVolumeSizeLimit: 2Gi
Member


Why? This would affect people who are already using the chart, their clusters have been provisioned with the existing values.

Contributor Author

@prashanth-volvocars Nov 15, 2022


We had some issues while running it along with video recording, which vanished as soon as we increased the limit. I think it shouldn't be part of this PR. Will revert this.

@@ -375,17 +376,17 @@ chromeNode:
# failureThreshold: 120
# periodSeconds: 5
# Time to wait for pod termination
terminationGracePeriodSeconds: 30
terminationGracePeriodSeconds: 3600
Member


Why?

Contributor Author


If the autoscaler chooses to kill a pod that is currently running a test, we would have at most one hour for the running test to complete rather than 30 seconds.
If the test doesn't complete within the configured grace period, it would result in a failure, so giving it the maximum allowed time helps reduce these failures.

Member


Wouldn't this affect the setup of users who already have this configuration? Isn't there an alternative to this hard-coded value? One of the reasons we added the draining-after-X-sessions feature was to enable this use case, so the pod could live until the session is done.

Contributor


To expand on what @diemol writes:

Wouldn't it be more straightforward to use ScaledJob combined with draining after one session?
I wrote about it here:

SeleniumHQ/selenium#9845 (comment)

Contributor Author


@diemol Killing after x sessions wouldn't help. Consider this scenario: you create two sessions, one running for 60 minutes and the other for 1 minute. As soon as the first is done, the HPA will try to bring down one of the pods irrespective of the configured x sessions. We have no control over which pod the HPA chooses. This is not the current flow, but it will happen when KEDA scales down.

Also, I don't understand the implication that increasing it would have, even for current users. No one would want their tests to be interrupted. Also, this value has no effect unless the preStop lifecycle hook is enabled. Even if that is enabled, it only delays pod termination until the test session is completed, which I think is the desired behaviour.

Member


@prashanth-volvocars draining after X sessions will make the container exit and ideally the pod should stop, am I missing something?

Contributor


I'd say that for most users ScaledJobs would be the best fit so this should be the default in the helm chart. Creating a pod is normally not expensive in Kubernetes; Fargate is a special case. But sure, you can skip that and I guess it's me or @diemol that will implement it.
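
For reference, a hypothetical KEDA ScaledJob variant (not part of this PR) could pair one Job per pending session with a node that drains itself after a single session, so KEDA never has to pick a busy pod to remove:

```yaml
# Sketch only -- resource names, image tag and env values are assumptions, not this chart's output
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-chrome-node
spec:
  maxReplicaCount: 8
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: selenium-node-chrome
            image: selenium/node-chrome:latest   # pin to a concrete tag in practice
            env:
              - name: SE_EVENT_BUS_HOST
                value: selenium-hub
              - name: SE_EVENT_BUS_PUBLISH_PORT
                value: "4442"
              - name: SE_EVENT_BUS_SUBSCRIBE_PORT
                value: "4443"
              - name: SE_NODE_MAX_SESSIONS
                value: "1"
              - name: SE_DRAIN_AFTER_SESSION_COUNT   # node exits on its own after one session
                value: "1"
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-hub.default:4444/graphql'
        browserName: 'chrome'
```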

Contributor Author


@diemol yes, we no longer need X sessions as the pod can handle sessions as long as there are pending sessions in the queue.

Contributor Author


@msvticket The default should be what works for every environment, not for a specific few. Jobs work for a few, not all, but pods work for all. If you think a ScaledJob suits your needs best, it's available as an option that you can enable, but making it the default would break it entirely for specific environments. Hope you understand.

Contributor


What option? I don't see any mention of it in the chart.

Comment on lines +381 to +389
lifecycle:
preStop:
exec:
command:
- bash
- -c
- |
curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
while curl 127.0.0.1:5555/status; do sleep 1; done;
Member


Shouldn't this preStop be enabled only if autoscalingEnabled is true?

Contributor Author


Makes sense. Will add the check.
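
A hypothetical way to gate the hook in the node template (sketch only; the actual value path and indentation used by this chart may differ):

```yaml
# Render the drain hook only when autoscaling is turned on for this node type
{{- if .Values.chromeNode.autoscalingEnabled }}
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
          while curl 127.0.0.1:5555/status; do sleep 1; done;
{{- end }}
```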

autoscalingEnabled: false
maxReplicaCount: 8
hpa:
url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
Member


Can `selenium-hub.default` be retrieved from the Hub or Router service or pod name?

Contributor Author


I had the same thought too, but created the PR in the excitement of sharing it😁. Will add this too.
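
One possible way to derive it in the chart's ScaledObject template instead of hard-coding it in values (sketch only; the resource names and value paths are assumptions, and values.yaml itself is not template-rendered, so the substitution has to happen in a template file or via tpl):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-chrome-node
spec:
  scaleTargetRef:
    name: selenium-chrome-node              # assumed Deployment name
  maxReplicaCount: {{ .Values.chromeNode.maxReplicaCount }}
  triggers:
    - type: selenium-grid
      metadata:
        # derive the hub address from the release namespace instead of hard-coding "default"
        url: 'http://selenium-hub.{{ .Release.Namespace }}:4444/graphql'
        browserName: 'chrome'
```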

@@ -475,7 +485,7 @@ firefoxNode:
# Custom annotations for service
annotations: {}
# Size limit for DSH volume mounted in container (if not set, default is "1Gi")
dshmVolumeSizeLimit: 1Gi
dshmVolumeSizeLimit: 2Gi
Member


Same comment as above.

url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
browserName: chrome
# browserVersion: '91.0' # Optional. Only required when supporting multiple versions of browser in your Selenium Grid.
unsafeSsl : 'true' # Optional
Member


What does this mean? unsafeSsl?


@diemol https://keda.sh/docs/2.8/scalers/selenium-grid-scaler/

"Skip certificate validation when connecting over HTTPS."

@@ -487,17 +497,17 @@ firefoxNode:
# failureThreshold: 120
# periodSeconds: 5
# Time to wait for pod termination
terminationGracePeriodSeconds: 30
terminationGracePeriodSeconds: 3600
Member


Same comment as above.

Comment on lines +502 to +510
lifecycle:
preStop:
exec:
command:
- bash
- -c
- |
curl -X POST 127.0.0.1:5555/se/grid/node/drain --header 'X-REGISTRATION-SECRET;' && \
while curl 127.0.0.1:5555/status; do sleep 1; done;
Member


Same comment as above.

maxReplicaCount: 8
hpa:
url: http://selenium-hub.default:4444/graphql # Replace with your http graphql url here
browserName: MicrosoftEdge

If you want Edge to work, you will also need to set "sessionBrowserName" to "msedge".

Docs: https://keda.sh/docs/2.8/scalers/selenium-grid-scaler/
Issue regarding this: kedacore/keda#2709
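
So the Edge trigger metadata would need both names, roughly like this (sketch following the linked KEDA docs):

```yaml
triggers:
  - type: selenium-grid
    metadata:
      url: 'http://selenium-hub.default:4444/graphql'
      browserName: 'MicrosoftEdge'     # capability requested by the tests
      sessionBrowserName: 'msedge'     # browser name reported by a running Edge session
```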

@qalinn

qalinn commented Jan 30, 2023

@prashanth-volvocars Any news regarding this PR?

@QATew

QATew commented May 1, 2023

@prashanth-volvocars @diemol Any updates on this?

@prashanth-volvocars
Contributor Author

@diemol I have added/addressed all the concerns raised. Please take a look.

@diemol
Member

diemol commented May 22, 2023

I will have a look after releasing 4.10.0, so in June.

@jcputney

@diemol any updates on this and #1854? It seems to add a lot of needed features to this PR.
