Dynamic Selenium 4 grid on kubernetes #9845

Closed
gazal-k opened this issue Sep 19, 2021 · 95 comments
Labels
C-grid C-java help wanted Issues looking for contributions I-enhancement

Comments

@gazal-k

gazal-k commented Sep 19, 2021

🚀 Feature Proposal

Just like dynamic Selenium 4 grid using docker, having a similar k8s "pod factory" (or something similar) would be nice.

https://github.com/zalando/zalenium does that. Perhaps some of that can be ported to grid 4

@ghost ghost added the needs-triaging label Sep 19, 2021
@diemol diemol added C-grid help wanted Issues looking for contributions I-enhancement and removed needs-triaging labels Sep 21, 2021
@diemol
Member

diemol commented Sep 21, 2021

We are happy to discuss approaches. What do you have in mind, @gazal-k?

@gazal-k
Author

gazal-k commented Sep 22, 2021

Sorry, I'm not really familiar with the selenium grid codebase. I imagine this: https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/grid/node/docker/DockerSessionFactory.java has some of the logic to dynamically create browser nodes to join the grid. It would be nice to have something similar to create k8s Pods so that the kubernetes selenium 4 grid scales based on the test as opposed to creating a static number of browser nodes.

Again, sorry that I don't have something more solid to contribute.

@sahajamit

I have attempted to build something similar for Kubernetes with Selenium Grid3.
More details here: https://link.medium.com/QQMCXLqQMjb

@pearj

pearj commented Sep 23, 2021

I have some thoughts about how the Kubernetes support could be implemented. I remember having a look at the Grid 4 codebase in December 2018 and I wrote up my thoughts in this ticket over in Zalenium when someone asked if we planned to support Grid 4: zalando/zalenium#1028 (comment)
This was largely based on my ideas on how to add High-Availability support to Zalenium for Kubernetes: zalando/zalenium#484 (comment) from early 2018.

So assuming the grid architecture is still the same as it was in 2018 (i.e. router, sessionMap and distributor), I think my original ideas are still valid.

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much easier.

So it means you could run multiple copies of the router, and you probably just want one distributor as you'd get into race conditions when creating new selenium pods. The sessionMap would end up just being a shared module/library that the router and distributor used to talk to the Kubernetes API server.
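A minimal sketch of the annotation-backed sessionMap idea, assuming a plain dictionary as a stand-in for the Kubernetes API server. The class name and the annotation key are hypothetical and not part of Selenium or Kubernetes; a real version would read and patch pod metadata through a Kubernetes client instead.

```python
from typing import Dict, Optional

# Hypothetical annotation key, chosen for illustration only.
SESSION_ANNOTATION = "example.com/selenium-session-id"


class AnnotationSessionMap:
    """Session lookup that keeps its state only in pod annotations."""

    def __init__(self, pods: Dict[str, Dict[str, str]]):
        # pods maps pod name -> annotations; it stands in for the API server.
        self._pods = pods

    def add(self, session_id: str, pod_name: str) -> None:
        # Record which pod owns the session by annotating that pod.
        self._pods[pod_name][SESSION_ANNOTATION] = session_id

    def find_pod(self, session_id: str) -> Optional[str]:
        # Scan annotations instead of consulting any in-process state.
        for name, annotations in self._pods.items():
            if annotations.get(SESSION_ANNOTATION) == session_id:
                return name
        return None

    def remove(self, session_id: str) -> None:
        # Drop the annotation once the session has ended.
        pod = self.find_pod(session_id)
        if pod is not None:
            del self._pods[pod][SESSION_ANNOTATION]
```

Because every instance reads the same pod metadata, a second router constructed over the same store sees a session that the first one registered, which is the property the proposal relies on.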

@LukeIGS

LukeIGS commented Sep 28, 2021

If we wanted a more pure k8s solution, if there were metrics exposed around how many selenium sessions are in queue, or even how long they've been waiting, maybe even rate of queue processing, it would be possible to configure a horizontal pod autoscaler (HPA) around the node deployment itself to target a given rate of message processing.
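As a rough illustration of what such an autoscaler would compute, the sketch below applies the ratio rule described in the Kubernetes HPA documentation (desired = ceil(current replicas × current metric / target metric)) and clamps the result. The metric (for example queued sessions per node) and the bounds are assumptions for illustration; the real HPA also applies a tolerance and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count from the HPA ratio rule, clamped to the given bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the current replica count by how far the metric is from its target.
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, raw))
```

With two nodes, a target of 2 queued sessions per node and an observed value of 6, this asks for six nodes.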

@Warxcell

Warxcell commented Dec 2, 2021

There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes. It works fine; the problem is with tearing down a node. Since it doesn't keep track of which node is busy, it could kill a test in progress, and it seems the Chrome node doesn't handle that gracefully.

@MissakaI

I tried another approach: implementing an application that intercepts the docker-engine calls from the selenium node-docker component, translates them into k8s calls, and then calls the Kubernetes API. It works properly for creating and stopping browser nodes in response to the calls from node-docker. But it has a major problem, because node-docker doesn't support concurrency: it can only create a single browser node, run the test, destroy the node, and then move on to the next. (I will be creating a separate issue for the node-docker concurrency problem.)

From what I noticed, node-docker binds those browser nodes to itself and exposes each one to the distributor as a session of node-docker. So all the distributor sees is node-docker, not the browser node. I don't think this approach is appropriate for concurrent execution, as I feel it is a single point of failure that would end all the sessions routed through node-docker.

Therefore I think KEDA Selenium-Grid-AutoScaler is a much better approach.

@MissakaI

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much easier.

There is slight issue with this as this will make the Grid 4 dependent on Kubernetes. It would result in two different implementations of Grid: one that is specific to K8s and one that is not dependent on Kubernetes. I think a much better approach is to make the Grid HA by other means, such as sharing the current state across all instances of a particular grid component type.

@quarckster

quarckster commented Jan 16, 2022

There is slight issue with this as this will make the Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

@qalinn

qalinn commented Jan 17, 2022

@MissakaI I have tested the KEDA Selenium-Grid-AutoScaler; it scales up as many nodes as you need based on the session queue, and that works fine. The problem is the video part, which doesn't work in Kubernetes. I have managed to deploy the video container in the same pod, but the video file is not saved unless the video container is stopped gracefully. Also, you cannot set the name of the video for each test; it records continuously until the container is closed.

@LukeIGS

LukeIGS commented Jan 18, 2022

There is slight issue with this as this will make the Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

The selenium repository is currently dependent on Ruby, Python, dotnet, and quite a few other things that it probably shouldn't be, there's certainly an argument for a lot of stuff to be split out into separate modules, but that's probably a conversation for another issue.

@tomkerkhove

We had a note in the standup meeting of KEDA to see if we can help with Selenium & video.
Is the person who added it part of this thread? If so, please open a discussion on how we can help: https://github.com/kedacore/keda/discussions/new

@LukeIGS

LukeIGS commented Jan 18, 2022

Will do, issue in question is
#10018

These two are pretty intertwined.

@qalinn

qalinn commented Jan 19, 2022

@tomkerkhove I am the one who added the note to your standup meeting. Please also see this issue: #10018

@tomkerkhove

Tracking this in kedacore/keda#2494

@msvticket

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete pod that is still executing a test.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord along with its subprocesses.

One good thing with this approach is that combined with the video feature you get one video per session. Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container. So I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container.

For shutting down supervisord, you can use the unix_http_server and supervisorctl features of supervisord. That works between containers in the pod as well.
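A minimal sketch of that shutdown order, assuming both supervisord instances expose a unix_http_server socket and that the video container's socket is reachable from the node container (for instance through a shared volume). The function names and socket paths are hypothetical; only the supervisorctl --serverurl option and the shutdown command come from supervisord itself.

```python
import subprocess
from typing import Callable, List


def shutdown_command(socket_path: str) -> List[str]:
    """Build the supervisorctl call that stops supervisord behind a Unix socket."""
    return ["supervisorctl", "--serverurl", f"unix://{socket_path}", "shutdown"]


def shutdown_in_order(video_socket: str,
                      node_socket: str,
                      run: Callable[..., object] = subprocess.run) -> None:
    """Stop the video container's supervisord first, then the node's own."""
    for socket_path in (video_socket, node_socket):
        # check=False: keep going even if the first supervisord is already gone.
        run(shutdown_command(socket_path), check=False)
```

The run parameter exists only so the order can be checked without a real supervisord; an event listener would call shutdown_in_order with its actual socket paths once the node process has exited.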

I've also been thinking about how to have the video file uploaded to s3 (or similar) automatically. The tricky part is supplying the pod with the url to upload the file to. I have some ideas, but that has to wait until the basic solution is implemented.

@MissakaI

I have managed to deploy the video container in the same pod, but the video file is not saved unless the video container is stopped gracefully. Also, you cannot set the name of the video for each test; it records continuously until the container is closed.

I think this case should be followed up in the thread dedicated to it, which was mentioned by @LukeIGS:

Will do, issue in question is
#10018

@MissakaI

Also, we need a way to implement liveness and readiness probes, because I ran into a few instances where the selenium process was killed but the pod continued to run, so Kubernetes never terminated the crashed pod and started a new one.
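One possible shape for such probes, sketched as a small script that an exec probe could run. It assumes the node answers on http://localhost:5555/status with a WebDriver-style body of the form {"value": {"ready": ..., "message": ...}}; the function names are hypothetical. It is also an assumption worth verifying that a node reports ready as false while all its slots are busy, which is why the liveness check below only requires a well-formed answer and leaves the ready flag to the readiness check.

```python
import json
import urllib.request


def is_alive(status_body: str) -> bool:
    """Liveness: the server answered with a well-formed status document."""
    try:
        payload = json.loads(status_body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and isinstance(payload.get("value"), dict)


def is_ready(status_body: str) -> bool:
    """Readiness: the status document explicitly reports ready == true."""
    return is_alive(status_body) and json.loads(status_body)["value"].get("ready") is True


def probe(check=is_alive, url: str = "http://localhost:5555/status",
          timeout: float = 2.0) -> int:
    """Return 0 if the check passes and 1 otherwise, for use as an exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
    except (OSError, ValueError):
        # Connection refused, timeout, or undecodable body: probe fails.
        return 1
    return 0 if check(body) else 1
```

A liveness probe would exit with probe(is_alive) and a readiness probe with probe(is_ready), so a node whose Java process has died fails liveness and gets restarted.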

@MissakaI

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which tells the replication controller to delete the pod with the lowest cost first and leave the others.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation=org.openqa.selenium.grid.node.k8s.OneShotNode

Also can you point me to where this was included in the Selenium Documentation if it was documented.

@msvticket

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting. But you then need to make sure that the container exits when it's done with a session. On the other hand you don't have the problem with Kubernetes trying to delete pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which tells the replication controller to delete the pod with the lowest cost first and leave the others.

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

There is a recent proposal for Kubernetes to let the pod inform Kubernetes on which pods to delete through a probe: kubernetes/kubernetes#107598. Until something like that is implemented either the node itself or maybe the distributor would need to update the annotation.
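To illustrate what "updating the annotation" could look like, the sketch below builds a JSON merge patch that marks a pod as expensive to delete while it runs a session and cheap again afterwards. The annotation name is the real Kubernetes one, and lower costs are preferred for deletion on a best-effort basis; the function name and the values 1000 and 0 are assumptions for illustration.

```python
import json

DELETION_COST_ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"


def deletion_cost_patch(busy: bool, busy_cost: int = 1000, idle_cost: int = 0) -> str:
    """Return a JSON merge patch that sets the pod's deletion cost."""
    cost = busy_cost if busy else idle_cost
    # Annotation values must be strings, even though the cost is an integer.
    patch = {"metadata": {"annotations": {DELETION_COST_ANNOTATION: str(cost)}}}
    return json.dumps(patch)
```

The resulting string could be sent with kubectl patch pod <name> --type merge -p '<patch>' or through a Kubernetes client, by whichever component knows when a session starts and ends.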

To make a node exit after a session is done you need to add a property to the node section of config.toml:
implementation=org.openqa.selenium.grid.node.k8s.OneShotNode

Also can you point me to where this was included in the Selenium Documentation if it was documented.

I haven't found anything about it in the documentation. I stumbled upon org.openqa.selenium.grid.node.k8s.OneShotNode when I was looking in the selenium code. It then took a while for me to find out how to make use of the class. That's implemented here:

return config.getClass(NODE_SECTION, "implementation", Node.class, DEFAULT_IMPL);

On the other hand I haven't tested it, so who knows if OneShotNode still works...

This is where it should be documented: https://www.selenium.dev/documentation/grid/configuration/toml_options/

@MissakaI

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

I was intending to either write an application that will monitor the test sessions along with their respective pods, or write a custom KEDA scaler that will do what I mentioned previously.

@msvticket

There is an issue about shutting down the node container when the node server has exited: SeleniumHQ/docker-selenium#1435

@MissakaI

MissakaI commented Jan 21, 2022

On the other hand I haven't tested it, so who knows if OneShotNode still works...

It seems like even though the code is available in the repo it causes ClassNotFoundException after adding it to the config.toml. Extracting the selenium-server-4.1.1.jar revealed that the k8s folder is completely removed.

java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.openqa.selenium.grid.Bootstrap.runMain(Bootstrap.java:77)
        at org.openqa.selenium.grid.Bootstrap.main(Bootstrap.java:70)
Caused by: org.openqa.selenium.grid.config.ConfigException: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:115)
        at org.openqa.selenium.grid.node.config.NodeOptions.getNode(NodeOptions.java:148)
        at org.openqa.selenium.grid.node.httpd.NodeServer.createHandlers(NodeServer.java:127)
        at org.openqa.selenium.grid.node.httpd.NodeServer.asServer(NodeServer.java:183)
        at org.openqa.selenium.grid.node.httpd.NodeServer.execute(NodeServer.java:230)
        at org.openqa.selenium.grid.TemplateGridCommand.lambda$configure$4(TemplateGridCommand.java:129)
        at org.openqa.selenium.grid.Main.launch(Main.java:83)
        at org.openqa.selenium.grid.Main.go(Main.java:57)
        at org.openqa.selenium.grid.Main.main(Main.java:42)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at org.openqa.selenium.grid.config.ClassCreation.callCreateMethod(ClassCreation.java:35)
        at org.openqa.selenium.grid.config.MemoizedConfig.lambda$getClass$4(MemoizedConfig.java:100)
        at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1737)
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:95)
        ... 14 more

The docker image that I used was selenium/node-firefox.

@msvticket

msvticket commented Jan 21, 2022

Well, the selenium project is a bit confusing. Apparently the selenium build system excludes the package org.openqa.selenium.grid.node.k8s from selenium-server.jar. Here I found bazel build configurations for building docker images:
https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/docker

The firefox_node and chrome_node images are declared there to include a layer (called one-shot) that includes a library with that class. But these images and the library don't seem to be published publicly anywhere.

In https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/k8s you can see how that library is utilized:

- name: firefox-node
  image: "selenium/firefox-node:latest"
  imagePullPolicy: Always
  command:
    - /opt/selenium/bin/selenium.sh
    - --ext
    - /opt/selenium/libk8s.jar
    - node
    - -p
    - "5555"
    - --max-sessions
    - "1"
    - --stereotype
    - '{"browserName":"firefox"}'
  ports:
    - containerPort: 5555
    - containerPort: 5900 # VNC
  env:
    - name: EVENTS_BIND
      value: "false"
    - name: EVENTS_PUBLISH
      value: "tcp://event-bus:4442"
    - name: EVENTS_SUBSCRIBE
      value: "tcp://event-bus:4443"
    - name: NODE_IMPLEMENTATION
      value: "org.openqa.selenium.grid.node.k8s.OneShotNode"

It seems like the idea is that you check out the code, then build and deploy these images and the k8s manifest to your local infrastructure.

@diemol
Member

diemol commented Jan 21, 2022

Thank you all for sharing your thoughts and offering paths to move forward. I will reply to the comments below.

elgatov pushed a commit to elgatov/selenium that referenced this issue Jun 27, 2022
This will help to enable a Dynamic Grid in Kubernetes, as
one can create a container with a single session which
will shut down on its own after the session is completed.

Helps with SeleniumHQ#9845 and SeleniumHQ/docker-selenium#1514
elgatov pushed a commit to elgatov/selenium that referenced this issue Jun 27, 2022
@prashanth-volvocars

prashanth-volvocars commented Aug 5, 2022

Hi all,

Is anyone still facing issues related to KEDA implementation? I was the one who originally added the scaler to KEDA. I wasn't following much about it due to my other assignments. We have now started the setup of grid in EKS on Fargate and it seems to work fine for us. I am yet to work on retrieving the browser console logs and network logs. Any help on that would be greatly appreciated.

@prashanth-volvocars

With regards to video recording, the problem we face is that it records a single video for the whole lifetime of the pod, so if multiple sessions are handled by the same pod there is just one video for all of them. Also, even if the pod handles just a single session, the video keeps recording until the pod is killed, which takes about 300 seconds by default in HPA. So even for a test that runs for just a few seconds we get a video that's 5 minutes or longer. Is there a way to control this behaviour?

@msvticket

My idea for solving that (which I haven't tested yet) is to use the scaling jobs feature in KEDA to run selenium nodes. These nodes should then be configured with DRAIN_AFTER_SESSION_COUNT=1. After the session has finished the selenium container will then exit. The remaining problem is to make the video container exit. This could be solved by harnessing features of supervisord:

If the supervisord of the video container has enabled unix_http_server, then the supervisord of the selenium container could use supervisorctl to stop the video container, in a similar way as here: SeleniumHQ/docker-selenium@281e5c4

A somewhat tricky part would be making that supervisorctl call only when there actually is a video container to stop.

@NickWemekamp

NickWemekamp commented Aug 17, 2022

An alternative would be to have the selenium node container kill the pod that it belongs to via the kube API server in the pre-stop hook (stackoverflow delete pod). I have not tested this. The pre-stop hook of the video container can then upload the video to a remote storage. The problem is then that the video container does not know the session identifier of the last test run of the selenium node container, which would be a practical filename in the remote storage.

@prashanth-volvocars

I solved it by adding ffmpeg directly to the browser node docker image and recording video directly for every session. It works great. I will soon share the whole setup.

@qalinn

qalinn commented Aug 19, 2022

@prashanth-volvocars Hi! Great to hear this. Please don't forget to share with us the setup.
Thank you!

@josesimon

@prashanth-volvocars we would love to receive your feedback :)

@prashanth-volvocars

My setup is more oriented towards AWS, but it has worked great for us so far. I need some help with sharing it. I have made some changes to NodeBase and added a new script to upload the videos and logs directly to S3. So would it be OK to have this be part of this repo, or should I just share it in a separate repo since it's more oriented towards AWS?

@gazal-k
Author

gazal-k commented Sep 21, 2022

My setup is more oriented towards AWS, but it has worked great for us so far. I need some help with sharing it. I have made some changes to NodeBase and added a new script to upload the videos and logs directly to S3. So would it be OK to have this be part of this repo, or should I just share it in a separate repo since it's more oriented towards AWS?

I think using S3 as opposed to block storage was an excellent choice. Perhaps parts of that logic can be made generic using something like https://github.com/google/go-cloud in the future. But for a lot of us who would want to set up a selenium 4 grid on AWS, I think your contribution would be excellent. Perhaps it can be turned on based on some env params?

@prashanth-volvocars

Hey all,

Apologies for the delay in sharing it. I was unsure of how to do it. But just taking a first step now. https://github.com/prashanth-volvocars/docker-selenium/tree/auto-scaling/charts/selenium-grid

Remember that you need to install KEDA before installing the chart. The chart is configured to work with the default namespace. If you are installing it in another namespace, make sure to update hpa.url. Please direct any questions to me here or on Slack.

@prashanth-volvocars

prashanth-volvocars commented Sep 21, 2022

You can grab all information about the setup here

In a nutshell, it can

  1. Auto Scale browser nodes up and down.
  2. Record videos and store them named under session id
  3. Capture logs and store them named under session id
  4. Upload captured logs and videos to S3

@krmahadevan
Contributor

@diemol - Do you think that this was one of the use-cases for building the reference implementation of org.openqa.selenium.grid.node.k8s.OneShotNode ? Maybe we could consider exposing this as maven artifact so that we can perhaps pass in a reference to this implementation via --node-implementation ?

@diemol
Member

diemol commented Nov 13, 2022

@krmahadevan there is a solution in a PR in the docker-selenium project, have you checked them? @prashanth-volvocars was kind enough to submit them.

@krmahadevan
Contributor

@diemol - No I wasn't aware of the PR. I went back and checked SeleniumHQ/docker-selenium#1714

Even though I don't understand a lot of the k8s lingo yet, I kind of got the idea of what it is doing, and it looks like that should suffice for the k8s requirement of an autoscaling grid.

@diemol
Member

diemol commented Nov 14, 2022

Yes, when we merge that, we can close this issue.

@msvticket

I have made the new PR SeleniumHQ/docker-selenium#1854 (based on SeleniumHQ/docker-selenium#1714). It has a few more features, including automatic installation of KEDA and autoscaling with jobs.

I have also supplied a helm repo where you can get the chart to test this before the PR is merged.

@aaron070596

Hello team, I would like to know if this implementation will include a solution for the video recording feature in a distributed k8s deployment, and if there is any ETA for when we would be able to use these new components.

@subin-krishna-test

You can grab all information about the setup here

In a nutshell, it can

  1. Auto Scale browser nodes up and down.
  2. Record videos and store them named under session id
  3. Capture logs and store them named under session id
  4. Upload captured logs and videos to S3

@prashanth-volvocars

What if I want to upload the videos with a specific name instead of <session_id>.mp4 to the S3 bucket? Or how can I identify which is the corresponding video file for a test?

@msvticket

In your test you know the session id so therefore you also know the file name. Telling what file name to use is not possible with this solution.
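As a small illustration, assuming the uploader names each file <session_id>.mp4 as described in this thread, a test can derive the object name from its own session id. The helper name is hypothetical; with the Python bindings the id would come from driver.session_id.

```python
def video_object_name(session_id: str, extension: str = "mp4") -> str:
    """Return the file name the recording is expected to be stored under."""
    if not session_id:
        raise ValueError("session_id must not be empty")
    # The naming scheme <session_id>.<extension> is taken from the thread.
    return f"{session_id}.{extension}"
```

A test would call video_object_name(driver.session_id) before quitting the driver and log the result next to the test name, which gives the mapping from test to video.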

@subin-krishna-test

In your test you know the session id so therefore you also know the file name. Telling what file name to use is not possible with this solution.

@msvticket
Thanks for your response.
I am using selenium-side-runner to connect the remote selenium grid and pass the .side file as an argument to side runner. How can I get the correct session id for each test if I am running multiple tests simultaneously?

@tppalani

I have attempted to build something similar for Kubernetes with Selenium Grid3. More details here: https://link.medium.com/QQMCXLqQMjb

Hi @sahajamit, I have seen your post on Medium about configuring selenium grid inside an EKS cluster.

As per your instructions I have created the selenium grid hub and I am able to access it via the ingress controller, but when I try to configure the chrome node it is not able to register with the selenium hub.

In your post you mentioned k8s_host. What value are you referring to: is it the EKS cluster endpoint url or something else?

@diemol
Member

diemol commented Dec 21, 2023

We now have KEDA integrated in the chart, and there is also video there. Closing this.

@diemol diemol closed this as completed Dec 21, 2023

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 21, 2024