core: Add command to reset mon quorum #61
Conversation
added some initial reviews
Force-pushed from d14be36 to efe3543
Force-pushed from 32b6a5c to 35d4e29
Force-pushed from 4c71c98 to 58d5906
```shell
KUBECTL_NS_CLUSTER wait --for=delete pod/"$deployment_pod" --timeout=60s
set -e
if [ "$deployment_pod" != "" ]; then
  # scale the deployment to 0
```
Scaling the deployment to 0 could still be moved outside the if statement.
If there is no pod, is there any reason to still scale it to 0? Are you just saying it's a precaution to still scale it down?
Yes, as a precaution. A pod can be absent for certain intervals (e.g. while restarting), but the deployment may still exist scaled up.
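To make the discussion concrete, here is a hypothetical sketch of that precautionary flow (the function name and pod label are illustrative, not from the PR; `KUBECTL_NS_CLUSTER` is assumed to wrap `kubectl` for the cluster namespace as elsewhere in the script):

```shell
# Sketch: scale the deployment down even when no pod is currently present,
# since the pod may just be restarting while the deployment is still scaled up.
scale_down_deployment() {
  local deployment_name="$1"
  local deployment_pod
  set +e
  deployment_pod=$(KUBECTL_NS_CLUSTER get pod -l app="$deployment_name" \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  set -e

  # unconditional precautionary scale-down, outside the if statement
  KUBECTL_NS_CLUSTER scale deployments "$deployment_name" --replicas=0

  # only wait for pod deletion if a pod was actually found
  if [ "$deployment_pod" != "" ]; then
    KUBECTL_NS_CLUSTER wait --for=delete pod/"$deployment_pod" --timeout=60s
  fi
}
```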
kubectl-rook-ceph.sh
Outdated
```shell
# Check for the existence of the toolbox
info_msg "Start the toolbox if it is not yet running"
wait_for_pod_of_deployment_to_be_running rook-ceph-tools
```
Maybe we could just say `wait_for_pod_to_run`?
`rook-ceph-tools` is the name of a deployment, not a pod. How about `wait_for_deployment_to_run`?
Looks good
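A hypothetical sketch of what the renamed helper could look like, assuming `KUBECTL_NS_CLUSTER` wraps `kubectl` for the cluster namespace as elsewhere in the script (the actual implementation in the PR may differ):

```shell
# Sketch: wait until a deployment's pods are up and ready.
wait_for_deployment_to_run() {
  local deployment_name="$1"
  # "rollout status" blocks until the deployment is fully rolled out
  KUBECTL_NS_CLUSTER rollout status deployment "$deployment_name" --timeout=90s
}
```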
Force-pushed from 83da8c4 to 038a02e
kubectl-rook-ceph.sh
Outdated
```shell
info_msg "Mon quorum was successfully restored to mon $good_mon"

prompt_to_continue_or_cancel "Start up the operator and expand to full mon quorum again?" "yes"
```
Since the mons are successfully restored (as we log above), why are we asking before starting the operator up again? Shouldn't we do that every time once the mons are up?
It's a good question, it just felt like a good thing to pause and say "I did my job to restore quorum to a single mon, are you sure you're ready to start the operator up again?"
If I understand correctly, at this point we have a single monitor running, right? If the user cancels this prompt, they'll know they're taking a risk running with a single mon; maybe we should add a warning here in case they end up with a cluster running a single monitor.
And in that situation they would then need to start and scale up the operator manually, right?
Yeah we don't want them to risk staying with a single mon. I think the "press any key to continue" will solve this.
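The "press any key to continue" idea could be sketched roughly as below (function name is hypothetical; `ROOK_PLUGIN_SKIP_PROMPTS` is the CI bypass variable introduced later in this PR):

```shell
# Sketch: pause until the admin acknowledges, but stay non-interactive in CI.
press_any_key_to_continue() {
  local message="$1"
  if [ "${ROOK_PLUGIN_SKIP_PROMPTS:-false}" = "true" ]; then
    echo "skipping prompt: $message"
    return 0
  fi
  # -n 1 returns after a single keypress, -s suppresses echoing it
  read -r -s -n 1 -p "$message (press any key to continue) "
  echo
}
```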
kubectl-rook-ceph.sh
Outdated
```shell
if [ "$INPUT_VAR" = "$proceed_answer" ]; then
  info_msg "proceeding"
else
  warn_msg "cancelled"
  exit 1
```
I have a question about this, maybe I'm overthinking it: since we check strictly for `"$INPUT_VAR" = "$proceed_answer"`, if the user makes a typo or just presses Enter (which elsewhere means the default value, or `n` in a `y/n` case), we `exit 1`, which exits the script in the middle of a step. Is that okay for the cluster? Maybe we are doing the steps in a debug-mode pod, so we are fine?
Also, if the user gives no input or the wrong input in the prompt and the script exits, should we delete the debug pod deployment in that case?
These are negative-case questions; we can also fix them in a follow-up based on user feedback.
There are currently prompts in two places and I think it's fine if they exit the script:
- All the info is gathered, and we prompt if the user wants to proceed. No debug pod has been started yet.
- The mon restore is completed and we just have to scale up the operator. They could scale the operator manually if they exit.
The second one could be annoying that you have to scale up the operator. Perhaps a better approach for the 2nd one is to "press any key to continue when you're ready to scale up the operator".
```shell
@@ -423,15 +630,16 @@ function run_start_debug() {
[[ -z "${REMAINING_ARGS[0]:-""}" ]] && fail_error "Missing mon or osd deployment name"
deployment_name="${REMAINING_ARGS[0]}" # get deployment name
REMAINING_ARGS=("${REMAINING_ARGS[@]:1}") # remove deploy name from remaining args
set +u
```
Is `set +u` here because it turns off treating unset variables as errors, so as to avoid failing?
Correct. I wonder if this was only necessary on Mac. I was hitting the same error for the main `debug start` command even though it was working in the CI.
Okay, got it.
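For readers unfamiliar with the option, here is a minimal self-contained illustration (not from the PR) of what `set +u` changes:

```shell
#!/usr/bin/env bash
# Under "set -u" (nounset), expanding an unset variable is a fatal error.
set -u

# Probe in a subshell so the nounset failure doesn't kill this script:
if ( : "$not_defined_anywhere" ) 2>/dev/null; then
  echo "expansion succeeded"
else
  echo "nounset tripped on the unset variable"
fi

set +u   # back off: unset variables now silently expand to the empty string
echo "now fine: '${not_defined_anywhere}'"
```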
Here is the output of a test run in minikube, with verbose rocksdb output removed. See attached for the full output.
LGTM, just a few small nits
README.md
Outdated
```
@@ -58,6 +58,7 @@ These are args currently supported:
- `rbd <args>` : Call a 'rbd' CLI command with arbitrary args

- `mons` : Print mon endpoints
- `restore-quorum <mon-name>` : Restore the mon quorum to a single mon since quorum was lost with the other mons
```
Is this wording right: "to a single mon"?
I'll rephrase it for clarity
kubectl-rook-ceph.sh
Outdated
```shell
@@ -445,13 +661,20 @@ function run_start_debug() {
echo "setting debug command to main container"
deployment_spec=$(update_deployment_spec_command "$deployment_spec")

# scale down the daemon pod if it's running
set +e
echo "get pod for deployment $deployment_name"
```
Suggested change:
```diff
- echo "get pod for deployment $deployment_name"
+ info_msg "get pod for deployment $deployment_name"
```
?
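The actual `info_msg`/`warn_msg` helpers live in kubectl-rook-ceph.sh and may differ; this hypothetical sketch just shows the shape such helpers usually take, and why reviewers prefer them over bare `echo` (a consistent, greppable prefix and sensible stream routing):

```shell
# Sketch only; not the PR's implementation.
info_msg() {
  echo "Info: $*"
}

warn_msg() {
  # warnings go to stderr so they survive stdout redirection
  echo "Warning: $*" >&2
}
```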
kubectl-rook-ceph.sh
Outdated
```shell
set -e
if [ "$deployment_pod" != "" ]; then
  # scale the deployment to 0
  echo "scale down the deployment $deployment_name"
```
Suggested change:
```diff
- echo "scale down the deployment $deployment_name"
+ info_msg "scale down the deployment $deployment_name"
```
kubectl-rook-ceph.sh
Outdated
```shell
KUBECTL_NS_CLUSTER scale deployments "$deployment_name" --replicas=0

# wait for the deployment pod to be deleted
echo "waiting for the deployment pod \"$deployment_pod\" to be deleted"
```
Suggested change:
```diff
- echo "waiting for the deployment pod \"$deployment_pod\" to be deleted"
+ info_msg "waiting for the deployment pod \"$deployment_pod\" to be deleted"
```
kubectl-rook-ceph.sh
Outdated
```shell
@@ -465,6 +688,8 @@ function run_start_debug() {
spec:
$deployment_spec
EOF
echo "ensure the debug deployment $deployment_name is scaled up"
```
Suggested change:
```diff
- echo "ensure the debug deployment $deployment_name is scaled up"
+ info_msg "ensure the debug deployment $deployment_name is scaled up"
```
```shell
@@ -49,6 +49,15 @@ jobs:
sleep 5
kubectl rook_ceph -o test-operator -n test-cluster rbd ls replicapool

# test the mon restore to restore to mon a, delete mons b and c, then add d and e
export ROOK_PLUGIN_SKIP_PROMPTS=true
```
Is this variable used just for the CI?
Correct, it's for the CI. Or if someone else wanted to avoid the prompts they could set it, though I wouldn't expect it for normal use.
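A hypothetical sketch of how a prompt helper can honor such a bypass variable; only `ROOK_PLUGIN_SKIP_PROMPTS` itself is taken from the PR, everything else is illustrative:

```shell
# Sketch: strict-match prompt with a CI bypass.
prompt_to_continue_or_cancel() {
  local message="$1" proceed_answer="$2" INPUT_VAR
  if [ "${ROOK_PLUGIN_SKIP_PROMPTS:-false}" = "true" ]; then
    echo "skipping prompt: $message"
    return 0
  fi
  read -r -p "$message [type '$proceed_answer' to continue] " INPUT_VAR
  if [ "$INPUT_VAR" = "$proceed_answer" ]; then
    echo "proceeding"
  else
    # anything else, including a bare Enter, cancels
    echo "cancelled"
    return 1
  fi
}
```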
LGTM
When quorum is lost, restoring quorum to a single mon is currently a complex manual process. Now with this krew command the admin can with less risk reset the mon quorum and restore the cluster again in disaster scenarios.
Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
Also tested on openshift with three nodes and PVC and non-PVC
core: Add command to reset mon quorum
Signed-off-by: parth-gr <paarora@redhat.com>
Resolves #19