feat(dre): qualification via command #649

Merged: NikolaMilosa merged 46 commits into main from nm-qualification-via-dre-command on Jul 23, 2024

@NikolaMilosa NikolaMilosa commented Jul 20, 2024

Previously, qualification was run with Python scripts that lived in a different repository and were hard to maintain. This change brings us a couple of steps closer to a better solution for qualification.

Reasons why this is good:

  1. Less code duplication
  2. Everything lives in one place and uses one technology
  3. Allows running qualification in a similar or better way

To run everything you need:

  1. A network (any network apart from mainnet, which the tool does not allow)
  2. A neuron (with a majority, so its proposals get adopted instantly)
  3. A Prometheus instance, used for scraping the metrics of the xnet and workload tests
  4. A deployment name (staging, ...), used for the ic label

Since this command is rarely meant to be used manually, it provides a lot of options. To run it, one could use the following script:

#!/bin/bash

# To obtain all these URLs one can spin up a farm testnet from the ic repo
# Go to the root and run ./gitlab-ci/container/container-run.sh
# From there, run:
# ict testnet create --lifetime-mins 180 --from-ic-config-path <(cat <<EOF
# {
#  "subnets": [
#    {
#      "subnet_type": "application",
#      "num_nodes": 4
#    },
#    {
#      "subnet_type": "application",
#      "num_nodes": 4
#    },
#    {
#      "subnet_type": "system",
#      "num_nodes": 4
#    }
#  ],
#  "num_unassigned_nodes": 0,
#  "initial_version": "7dee90107a88b836fc72e78993913988f4f73ca2"
# }
# EOF
# )

# NOTE! Farm testnets need some time to deploy, and once they are up you HAVE TO KEEP
# THE TERMINAL SESSION open and not exit the command

NNS_URL_1="http://[system-subnet-ipv6-1]:8080"
NNS_URL_2="http://[system-subnet-ipv6-2]:8080"
NNS_URL_3="http://[system-subnet-ipv6-3]:8080"
NNS_URL_4="http://[system-subnet-ipv6-4]:8080"
PROM_URL="http://prometheus.from-config--<fill-in>.testnet.farm.dfinity.systems/api/v1/query"
DEPLOYMENT_NAME="from-config--<fill-in>"
# On farm testnets this is the neuron with the most voting power so you can leave it as is
TEST_NEURON_ID=449479075714955186 
# To obtain the identity.pem you can create a simple rust binary, 
# add ic-nervous-system-common-test-keys = { workspace = true } as a dependency
# and write a simple function in main that contains => println!("{}", TEST_NEURON_1_OWNER_KEYPAIR.to_pem());
# and store it to the following path
TEST_NEURON_KEY="$HOME/.config/dfx/identity/test_neuron_1/identity.pem"
rm -rf ~/.cache/ic-registry-cache/configured-testnet

# At the time of writing, cargo on Linux machines doesn't compile because
# canister code is in the same workspace as other tools, so use bazel
bazel run //rs/cli:dre -- --yes --network configured-testnet \
   --nns-urls "$NNS_URL_1" --nns-urls "$NNS_URL_2" --nns-urls "$NNS_URL_3" \
   --nns-urls "$NNS_URL_4" --neuron-id "$TEST_NEURON_ID" \
   --private-key-pem "$TEST_NEURON_KEY" qualify execute \
   --version a3831c87440df4821b435050c8a8fcb3745d86f6 \
   --prometheus-endpoint "$PROM_URL" --deployment-name "$DEPLOYMENT_NAME"
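
The script above ships with placeholders (`<fill-in>` and `[system-subnet-ipv6-N]`) that have to be replaced by hand, so a small guard before the bazel invocation may save a failed run. The following is only a sketch: `require_filled` is a hypothetical helper, not part of dre, and the URL in the example call is made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of dre): fail if a value is empty or still
# contains one of the placeholders used in the script above.
# $1 = variable name (for the error message), $2 = its value
require_filled() {
  case "$2" in
    ""|*"<fill-in>"*|*"[system-subnet-ipv6-"*)
      echo "error: $1 is empty or still a placeholder: '$2'" >&2
      return 1
      ;;
  esac
  return 0
}

# Example values; the first URL is an assumption for illustration only.
require_filled PROM_URL "http://prometheus.example/api/v1/query" \
  && echo "PROM_URL looks filled in"
require_filled DEPLOYMENT_NAME "from-config--<fill-in>" \
  || echo "DEPLOYMENT_NAME still needs editing"
```

In the real script one would call it as `require_filled PROM_URL "$PROM_URL" || exit 1` for each variable before running dre.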

The idea is that this gives us the driving power to qualify on any network (apart from mainnet). There will be one more tool, meant to automate what the script does right now. That tool will run some more checks as well and will call dre as a library instead of as a separate binary, which again gives us less code duplication, plus compile-time errors if something in the CLI args changes.

Across multiple test runs, a full qualification took around 58 minutes, which is roughly what a run on staging takes (give or take).

One more feature of the tool is the ability to specify the steps. Only those steps will be executed (when used with the execute command; list only prints the plan).
One can run:

dre qualify list --step-range 1..5

The output looks like this:

bazel run //rs/cli:dre -- qualify list --step-range 1..2 
INFO: Invocation ID: 01c269a0-ae7a-4a47-869c-458b21493117
INFO: Analyzed target //rs/cli:dre (26 packages loaded, 188 targets configured).
INFO: Found 1 target...
Target //rs/cli:dre up-to-date:
  bazel-bin/rs/cli/dre
INFO: Elapsed time: 0.301s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
 INFO  dre > Running version 0.4.3-5b3ca130-dirty
| Index | Will run | Name                                 | Help                                                                                     |
| ..... | ........ | .................................... | ........................................................................................ |
|   0   |  false   | ensure_blessed_revision              | This step runs the check to determine if all versions are blessed                        |
|   1   |   true   | update_deployment_canisters          | This step ensures that deployment canisters match the version of nns deployment canister |
|   2   |   true   | upgrade_application_subnet_version   | This step upgrade all the application subnets to the desired version                     |
|   3   |  false   | upgrade_system_subnet_version        | This step upgrade all the system subnets to the desired version                          |
|   4   |  false   | upgrade_unassigned_nodes_version     | This step upgrade all the unassigned nodes to the desired version                        |
|   5   |  false   | workload_test                        | This step runs the workload test on one app subnet.                                      |
|   6   |  false   | xnet_test                            | This step runs xnet test on a subnet                                                     |
|   7   |  false   | retire_blessed_versions              | Ensure that versions are retired                                                         |
|   8   |  false   | ensure_blessed_revision              | This step runs the check to determine if all versions are blessed                        |
|   9   |  false   | downgrade_application_subnet_version | This step downgrade all the application subnets to the desired version                   |
|  10   |  false   | downgrade_system_subnet_version      | This step downgrade all the system subnets to the desired version                        |
|  11   |  false   | downgrade_unassigned_nodes_version   | This step downgrade all the unassigned nodes to the desired version                      |
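
Judging only by the listing above, `--step-range 1..2` marks steps 1 and 2 as "Will run", so the range appears to be inclusive at both ends. A minimal sketch of that reading, where `expand_step_range` is a hypothetical helper and not how dre actually parses the flag:

```shell
#!/bin/sh
# Hypothetical helper: expand "START..END" into the step indices it covers,
# assuming both ends are inclusive (inferred from the listing, not from dre's source).
expand_step_range() {
  start=${1%%..*}   # text before ".."
  end=${1##*..}     # text after ".."
  i=$start
  out=""
  while [ "$i" -le "$end" ]; do
    out="$out${out:+ }$i"
    i=$((i + 1))
  done
  echo "$out"
}

expand_step_range 1..2   # prints "1 2", the rows marked true in the listing
```

Under the same assumption, `--step-range 1..5` would cover update_deployment_canisters through workload_test.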

I am also attaching a log of a successful qualification (it is quite long):
output.log

@NikolaMilosa NikolaMilosa requested a review from a team as a code owner July 20, 2024 02:29
rs/cli/src/ctx.rs (outdated, resolved)
rs/cli/Cargo.toml (outdated, resolved)
@NikolaMilosa NikolaMilosa enabled auto-merge (squash) July 23, 2024 12:27
@NikolaMilosa NikolaMilosa merged commit 35303f9 into main Jul 23, 2024
4 checks passed
@NikolaMilosa NikolaMilosa deleted the nm-qualification-via-dre-command branch July 23, 2024 12:28
3 participants