Add test to re-execute specified range of mainnet C-Chain blocks #4019

Open
wants to merge 98 commits into master

Conversation


@aaronbuchwald aaronbuchwald commented Jun 17, 2025

Why this should be merged

This PR adds a configurable C-Chain benchmark that can be run locally with Taskfile or via GitHub Actions. The workflow currently includes pull request and manual GH Action triggers, but once testing is done these should be changed to a cron job on master plus the manual GH Action trigger before merge.

The test clones a block database and a current-state database so that it can execute a range of blocks starting from the state at an arbitrary height N.
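
Conceptually, the re-execution loop then looks like the following sketch (hypothetical names; the actual implementation streams blocks through a channel, as shown in executeSequence later in this conversation):

// Execute blocks (start, end] on top of the cloned state at height start,
// parsing, verifying, and accepting each block just as consensus would.
for height := start + 1; height <= end; height++ {
    blk, err := vm.ParseBlock(ctx, blockBytes[height])
    if err != nil {
        return fmt.Errorf("failed to parse block %d: %w", height, err)
    }
    if err := blk.Verify(ctx); err != nil {
        return fmt.Errorf("failed to verify block %d: %w", height, err)
    }
    if err := blk.Accept(ctx); err != nil {
        return fmt.Errorf("failed to accept block %d: %w", height, err)
    }
}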

The CI workflow copies the data from an S3 bucket in the Ava Labs Experimental account in us-east-2 and requires AWS credentials to be provided. It can also be pointed at local filesystem directory paths for use on Snoopy/Linus (dedicated machines with large SSDs used for Firewood testing).

How this works

The test constructs an EVM instance that mimics the params passed in from AvalancheGo, including a metrics registry with the expected prefix and chain labels attached, so that it can be used directly with the existing Grafana dashboards used with tmpnet rather than needing to maintain an alternative set without the prefixes/labels.
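
A minimal sketch of that registry wrapping, assuming the "avalanche_snowman" prefix and chain="C" label mentioned in the review discussion below (the exact wiring in the test may differ):

// Mimic the chain manager: attach the chain label and the metric-name
// prefix so the output matches what the Grafana dashboards expect.
registry := prometheus.NewRegistry()
chainRegisterer := prometheus.WrapRegistererWith(
    prometheus.Labels{"chain": "C"},
    prometheus.WrapRegistererWithPrefix("avalanche_snowman_", registry),
)
// chainRegisterer is then handed to the EVM in place of the registerer
// AvalancheGo would normally provide.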

How this was tested

This has been tested by running it in CI and by running through the steps documented in the README. To review this PR, please walk through the steps in the benchmark README.

Need to be documented in RELEASES.md?

No.

Follow Ups to this PR

  • Remove the PR trigger in favor of a weekly cron job (the PR trigger is currently in place to exercise the workflow while testing this PR)
  • Identify the best place to run this job, e.g. ARC, a self-hosted runner, Blacksmith, etc.
  • Integrate GitHub Action Benchmark performance tracking on master by setting up a gh-pages branch or a separate repo (demo of what this will look like on my fork of Canoto here)

Potential Future Work

  • Identify a micro-benchmark that can be run on every PR and provides a reliable signal of whether a change causes a performance improvement or regression
  • Implement a job that re-executes the entire C-Chain history for full coverage, using multiple snapshots to execute multiple subranges in parallel (e.g. [0, 5m], [5m, 10m], etc.)
  • Implement a task/job to archive all blocks over configured ranges to S3 by fetching directly from the network
  • Add an optional step at the end of the re-execution job to post the resulting state to S3 (useful for the manual GH Action entry point to create new current-state snapshots)
  • Add support for running long-lived feature branches or alternative configurations as a cron job in addition to the default configuration on the master branch of AvalancheGo
  • Support multiple configurations (archive/full) and targets (EBS, SSD, etc.) to run the benchmark against
  • Support all VMs (P-Chain and X-Chain) for complete test coverage and for benchmarking the other VMs' performance
  • Provide a "How-To Guide" for custom VM testing using the same approach

@aaronbuchwald aaronbuchwald changed the title Aaronbuchwald/cchain reexecute range test test(cchain): Add test to re-execute specified range of mainnet blocks Jun 17, 2025
@aaronbuchwald aaronbuchwald changed the title test(cchain): Add test to re-execute specified range of mainnet blocks Add test to re-execute specified range of mainnet C-Chain blocks Jun 17, 2025
@aaronbuchwald
Collaborator Author

Note: running locally exports the metrics to Grafana correctly, but when using Linus/Snoopy (or in general), care must be taken to ensure that Prometheus is running correctly.

I ran into an issue where

fullCmd := "nohup " + cmdName + " " + args + " > " + logFilename + " 2>&1 & echo -n \"$!\" > " + pidPath

silently failed to start exporting the metrics because the address was already occupied by the Prometheus instance configured on those nodes.

Since StartPrometheus is not currently configurable, it may be desirable to make it configurable and to ensure that it fails loudly rather than silently; it was not obvious why metrics were being exported locally but not when running on Linus/Snoopy.
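
One way to fail loudly would be a pre-flight check along these lines (a sketch with a hypothetical helper; StartPrometheus does not currently do this):

import (
    "fmt"
    "net"
)

// ensureAddressFree fails fast if another process (e.g. an existing
// Prometheus instance) is already bound to the address.
func ensureAddressFree(addr string) error {
    ln, err := net.Listen("tcp", addr)
    if err != nil {
        return fmt.Errorf("address %s already in use: %w", addr, err)
    }
    return ln.Close()
}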

Comment on lines 63 to 73
- name: Download Previous Benchmark Result
  uses: actions/cache@v4
  with:
    path: ./cache
    key: ${{ runner.os }}-reexecute-cchain-range-benchmark.json
- name: Compare Benchmark Results
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: 'go'
    output-file-path: $GITHUB_WORKSPACE/reexecution-data/reexecute-cchain-range.txt
    external-data-json-path: ./cache/${{ runner.os }}-reexecute-cchain-range-benchmark.json
Collaborator Author

Writing each run's result to the cache and comparing against the last entry enables detecting regressions between runs.

This should be changed so that only the master branch writes to the cache, and other triggers compare against the latest baseline set by master.

@aaronbuchwald aaronbuchwald marked this pull request as ready for review July 16, 2025 18:53
@aaronbuchwald aaronbuchwald requested a review from maru-ava as a code owner July 16, 2025 18:53
@aaronbuchwald
Collaborator Author

Trying out running on ARC and Blacksmith here:

start-block:
  description: 'The start block for the benchmark.'
  required: false
  default: '100'
Contributor

Rather than duplicating the defaults here and in the job, maybe use env vars e.g.

env: 
  START_BLOCK: '100'
  ...

on:
  pull_request:
  workflow_dispatch:
    inputs:
      start-block:
        description: 'The start block for the benchmark.'
        required: false
        default: ${{ env.START_BLOCK }}
        ...

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_S3_READ_ONLY_ROLE }}
Contributor

Since this secret won't be available from fork branches, you'll want to make job execution conditional (as per the example of run-monitored-tmpnet-cmd) so that the job doesn't fail on fork-branch PRs.

- name: Set task env via GITHUB_ENV
  id: set-params
  run: |
    if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
Contributor

Since github.event.inputs.* will be undefined when the job is not triggered by workflow_dispatch, a simpler way of setting each variable would be using || to supply a default:

echo "START_BLOCK=${{ github.event.inputs.start-block || env.START_BLOCK }}" >> $GITHUB_ENV

Note that the echoed text needs to be appended to $GITHUB_ENV to affect the environment - the current form would only result in log output.

artifact_prefix: c-chain-reexecute-range
prometheus_username: ${{ secrets.PROMETHEUS_ID || '' }}
prometheus_password: ${{ secrets.PROMETHEUS_PASSWORD || '' }}
grafana_dashboard_id: 'c-chain-processing-custom-v8/c-chain-processing'
Contributor

Is this the correct dashboard ID now that you've incorporated the target metrics into the main C-Chain dashboard?

uses: ./.github/actions/run-monitored-tmpnet-cmd
with:
  run: ./scripts/run_task.sh reexecute-cchain-range-with-copied-data EXECUTION_DATA_DIR=${{ github.workspace }}/reexecution-data BENCHMARK_OUTPUT_FILE=${{ github.workspace }}/reexecute-cchain-range-benchmark-res.txt
  artifact_prefix: c-chain-reexecute-range
Contributor

This can be omitted since it is only relevant if inputs.runtime is set to one of process or kube. If there is an artifact to collect, the job will need to call actions/upload-artifact itself.

// This default delay is set above the default scrape interval used by StartPrometheus.
time.Sleep(tmpnet.NetworkShutdownDelay)

r.NoError(server.Stop())
Contributor

If this fails, it will prevent the removal of sdConfigFilePath. Ensure that both cleanup steps will always execute by registering them separately with tb.Cleanup.
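
A minimal sketch of that restructuring, using the names from the snippet above (cleanups run in LIFO order, so the server is stopped before its config file is removed):

tb.Cleanup(func() {
    // Registered first, runs last: remove the SD config regardless of
    // whether stopping the server succeeded.
    require.NoError(tb, os.Remove(sdConfigFilePath))
})
tb.Cleanup(func() {
    // Allow one final scrape before shutdown; this delay is set above the
    // default scrape interval used by StartPrometheus.
    time.Sleep(tmpnet.NetworkShutdownDelay)
    require.NoError(tb, server.Stop())
})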

func getCounterMetricValue(tb testing.TB, registry prometheus.Gatherer, query string) (float64, error) {
    metricFamilies, err := registry.Gather()
    r := require.New(tb)
    r.NoError(err)
Contributor

In a given function, please avoid mixing testify assertions with a returned error. In this case, maybe just return the error given that the caller is already using testify to check it?
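
A minimal sketch of the suggested shape, dropping the testify usage from the helper (the original's query semantics may be richer than the exact name match assumed here):

import "fmt"

func getCounterMetricValue(registry prometheus.Gatherer, query string) (float64, error) {
    metricFamilies, err := registry.Gather()
    if err != nil {
        return 0, fmt.Errorf("failed to gather metrics: %w", err)
    }
    for _, mf := range metricFamilies {
        if mf.GetName() != query {
            continue
        }
        for _, m := range mf.GetMetric() {
            if c := m.GetCounter(); c != nil {
                // Return the first matching counter's value.
                return c.GetValue(), nil
            }
        }
    }
    return 0, fmt.Errorf("counter metric %q not found", query)
}

The caller can then keep its existing r.NoError(err) check at the call site.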

sdConfigFilePath, err = tmpnet.WritePrometheusSDConfig(name, tmpnet.SDConfig{
    Targets: []string{server.Address()},
    Labels:  labels,
}, true)
Contributor

Please try to avoid the use of magic values like this one in favor of either a descriptively-named variable (e.g. withGithubLabels) or a comment (e.g. true /* withGithubLabels */).
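
For example, a sketch of the first suggestion applied to the snippet above:

// A descriptively-named variable in place of the bare boolean literal.
withGithubLabels := true
sdConfigFilePath, err = tmpnet.WritePrometheusSDConfig(name, tmpnet.SDConfig{
    Targets: []string{server.Address()},
    Labels:  labels,
}, withGithubLabels)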

// "avalanche_snowman" and the chain label (ex. chain="C") that would be handled
// by the [chain manager](../../../chains/manager.go).
func newConsensusMetrics(registry prometheus.Registerer) (*consensusMetrics, error) {
    errs := wrappers.Errs{}
Contributor

Why is this suggested for a single potential error?
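
For comparison, a plain-error version might look like this (a sketch assuming a single gauge; the actual fields of consensusMetrics may differ):

func newConsensusMetrics(registry prometheus.Registerer) (*consensusMetrics, error) {
    m := &consensusMetrics{
        lastAcceptedHeight: prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "last_accepted_height",
            Help: "height of the last accepted block",
        }),
    }
    // A single registration has a single failure point, so a plain error
    // return suffices; wrappers.Errs pays off only when accumulating
    // errors across several registrations.
    if err := registry.Register(m.lastAcceptedHeight); err != nil {
        return nil, err
    }
    return m, nil
}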

func (e *vmExecutor) executeSequence(ctx context.Context, blkChan <-chan blockResult, executionTimeout time.Duration) error {
    blkID, err := e.vm.LastAccepted(ctx)
    if err != nil {
        return fmt.Errorf("failed to get last accepted block: %w", err)
Contributor

(No action required) Maybe differentiate the error from the subsequent one e.g. block -> block id?
