
Triggering a subset of the pre-submit CI (ci.yaml) #255

Open
terrykong opened this issue Sep 20, 2023 · 3 comments · May be fixed by #775
@terrykong
Contributor
The pre-submit CI ci.yaml is extremely helpful for testing a change E2E, but in many cases we don't need to run all of the CI to test every change. Addressing this will speed up our development velocity, since we won't have to wait for a very long CI pipeline to pass in order to merge in a change.

Here is a summary of some of our options:

Option 1: minimal pre-submit; reverts on failures

@yhtang initially proposed the idea of slimming down our pre-submit CI so that only minimal CI is run for PRs, and then the full ci.yaml is run once the commit is in main. To handle the case where commits in main cause failures, we can add logic that reverts the PR if ci.yaml fails.

Option 2: selectively running tests

We can add another "command" similar to /assistant where members of JAX-Toolbox can enumerate the tests needed. So, assuming there are four stages:

base -> jax -> paxml
            |
            -> t5x

We can comment on a PR something like:

/ci jax

which would run the jax build and test steps and any descendant jobs (t5x/paxml).

/ci t5x paxml

which would run only t5x and paxml.

/ci
/ci base

would run everything.
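The command resolution described above can be sketched as a small traversal over the stage graph. This is a minimal illustration, not JAX-Toolbox code; the function names and graph are hypothetical, mirroring the four stages in the diagram.

```python
# Hypothetical sketch of resolving a "/ci <stage ...>" comment into the
# set of stages to run, under the "run the stage and its descendants"
# semantics proposed in this issue.
from collections import deque

# Stage graph from the diagram, expressed as parent -> children.
STAGE_CHILDREN = {
    "base": ["jax"],
    "jax": ["paxml", "t5x"],
    "paxml": [],
    "t5x": [],
}

def descendants(stage):
    """Return the stage plus every stage downstream of it."""
    seen = {stage}
    queue = deque([stage])
    while queue:
        for child in STAGE_CHILDREN[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def resolve_ci_command(comment):
    """Map a '/ci [stage ...]' comment to the set of stages to run."""
    args = comment.split()[1:]  # drop the leading '/ci'
    if not args:
        return set(STAGE_CHILDREN)  # bare '/ci' runs everything
    stages = set()
    for stage in args:
        stages |= descendants(stage)
    return stages
```

With this graph, `/ci jax` resolves to {jax, paxml, t5x}, while `/ci` and `/ci base` both resolve to all four stages.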

@ko3n1g
Contributor

ko3n1g commented Oct 10, 2023

I would be curious whether using the GitHub workflows paths filter, together with a Docker caching strategy, could help you in this regard?

Since you have quite clearly separated dependencies, I suppose you could quite easily (after re-arranging some files into folders) condition the execution of a workflow on changes to a specified set of files.
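As a concrete sketch of this idea, a workflow could use an `on.pull_request.paths` filter so it only runs when relevant files change. The paths below are illustrative, not the repository's actual layout.

```yaml
# Hypothetical paths filter: only trigger this workflow when t5x-related
# files (or the workflow itself) change. Paths are assumptions.
on:
  pull_request:
    paths:
      - 't5x/**'
      - '.github/workflows/ci.yaml'
```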

At that point Docker caching might not even be relevant anymore, but using the inline cache together with cache-from could still speed up your CI pipeline in some situations. To still pick up security updates, you could build the image nightly/weekly without using the cache.
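The inline-cache idea could look roughly like the following build step, where a previously pushed image is reused as a cache source via BuildKit's inline cache. The image name is illustrative; a nightly/weekly job would add `--no-cache` instead to pick up security updates.

```yaml
# Sketch of a PR build step reusing a pushed image as cache.
# The registry/image reference is a placeholder, not a real tag.
- name: Build image with registry cache
  run: |
    docker build \
      --build-arg BUILDKIT_INLINE_CACHE=1 \
      --cache-from ghcr.io/example/jax-toolbox/jax:latest \
      -t jax:pr-test .
```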

Having said that, those custom commands you have implemented are admittedly quite cool (plus of course provide the benefit of having explicit control over the execution of your test suite).

@ko3n1g
Contributor

ko3n1g commented Apr 23, 2024

I would like to share my perspective on this issue:

Option 1, i.e. a triage mode, initially sounds like the quicker solution, but I'm afraid of some edge cases where it could backfire. The simplest one I can think of: multiple PRs get merged between two nightly runs. The auto-triage would revert all of these PRs even though potentially only a single one was at fault. While we can definitely merge PRs into main more quickly, I am not sure the 24-hour delay induced by nightly workflows will allow us to increase our velocity.

Option 2, i.e. a way of triggering CI runs of selected stages, sounds very interesting. I have some remarks and comments about this approach, in which we - one way or another - trigger sub-selected workflow runs via /ci <stage>:

which would run the jax build and test steps and any descendant jobs (t5x/paxml).

If I understand you correctly, the argument <stage> expresses the lowest-level stage to start the tests from, which will automatically include all higher-level stages. Example: stage=jax will trigger the paxml, t5x, and maxtext tests too. Given that higher-level containers depend on lower-level containers, I think we want this mechanism to function the other way around: define <stage> as the cutoff point, i.e. the highest-level container, and only test everything below it. Example: stage=jax will build & test only jax:base and jax:jax. That way, we can debug from bottom to top in a Pareto-efficient way.
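The cutoff semantics described here invert the earlier proposal: instead of running a stage's descendants, run the stage plus everything it depends on. A minimal sketch, with the stage graph and function name being illustrative only:

```python
# Hypothetical "cutoff" resolution: run <stage> plus all of its
# (transitive) dependencies, per ko3n1g's proposed semantics.
STAGE_PARENTS = {
    "base": [],
    "jax": ["base"],
    "paxml": ["jax"],
    "t5x": ["jax"],
}

def ancestors(stage):
    """Return the stage plus every stage it transitively depends on."""
    seen = {stage}
    stack = [stage]
    while stack:
        for parent in STAGE_PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Under these semantics, `stage=jax` resolves to {base, jax}, whereas the earlier descendant semantics would resolve it to {jax, paxml, t5x}.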

Last thought: we have two options for implementing either of these approaches. We can do it via a "PR CLI" as @terrykong suggested, or we can go via a GHA workflow_dispatch. Personally, I have no strong feeling one way or the other. However, I could imagine that a GHA workflow_dispatch is less complex and thus easier to understand & maintain.
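The workflow_dispatch alternative could be sketched as a manually triggered workflow taking the stage as an input. The choice values mirror the stages discussed above; the default and description are assumptions.

```yaml
# Hypothetical manual trigger: pick the stage from the Actions UI
# (or via the gh CLI) instead of a PR comment command.
on:
  workflow_dispatch:
    inputs:
      stage:
        description: 'Stage to build and test'
        required: false
        default: 'all'
        type: choice
        options: [all, base, jax, paxml, t5x]
```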

//cc @yhtang

@terrykong
Contributor Author

terrykong commented Apr 23, 2024

If I understand you correctly, the argument <stage> expresses the lowest-level stage to start the tests from, which will automatically include all higher-level stages. Example: stage=jax will trigger the paxml, t5x, and maxtext tests too. Given that higher-level containers depend on lower-level containers, I think we want this mechanism to function the other way around: define <stage> as the cutoff point, i.e. the highest-level container, and only test everything below it. Example: stage=jax will build & test only jax:base and jax:jax. That way, we can debug from bottom to top in a Pareto-efficient way.

So I guess both of these serve different purposes. Here are some scenarios that have happened in the past:

  1. Someone makes a change to stage=jax

In this case, we're interested in how that change affects all of the leaf frameworks that depend on jax.

  2. Someone makes a change to stage=paxml

In this case, we're interested in testing just paxml and its descendants, since we assume the same base, e.g., jax.

Example: stage=jax will build & test only jax:base and jax:jax.

I think this is helpful, as you said, for debugging, since it only tests the jax change so you can move quickly; but for the presubmit, I think we'd still want to know how an ancestor's change affects its descendants, since that simulates what will happen in the nightly when the jax stack gets bumped.

@ko3n1g ko3n1g linked a pull request Apr 30, 2024 that will close this issue
@ko3n1g ko3n1g self-assigned this Apr 30, 2024