Add ability to re-run single jobs #432

Closed
tvdeyen opened this issue Apr 17, 2020 · 75 comments
Labels
Service Feature Feature scope to the pipelines service and launch app

Comments

@tvdeyen

tvdeyen commented Apr 17, 2020

Please add the ability to re-run single jobs of a workflow. This is such a basic feature.

alchemy_cms@5d8ae3b 2020-04-17 09-22-32

Please keep the environment in mind while prioritizing features.

🙏

@tvdeyen tvdeyen added the enhancement New feature or request label Apr 17, 2020
@TingluoHuang
Member

@chrispat from the product team for this feedback. 😄

@TingluoHuang TingluoHuang added service Service Feature Feature scope to the pipelines service and launch app and removed enhancement New feature or request service labels Jun 8, 2020
@ygj6

ygj6 commented Dec 2, 2020

Any updates?

@fuesec

fuesec commented Jan 7, 2021

would love to see this feature

@schw4rzlicht

Any updates on this? In a parallelized Workflow, we always spend 100 build minutes even if one of the 1-minute-jobs fails.

@rathboma

rathboma commented Mar 18, 2021

💯 this is a much needed addition.

I maintain Beekeeper Studio, and random timeouts cause 1/5 jobs to fail fairly regularly. Being able to re-run only the jobs that failed would save us so much time.

I would also love to be able to retry a single step, and to have other jobs not abort when a single job fails.

@Ferroin

Ferroin commented Mar 19, 2021

This is a crucial feature for anyone doing multi-arch package builds and deployments with tooling that cannot build multiple architectures in parallel. Without this, much more complicated workflows are required to ensure packages don’t get deployed multiple times just because one of the jobs failed.

@sindrijo

This is a much-needed feature for my team, which is building a CI pipeline for building, testing, and deploying multiple packages for multiple platforms. Not having this feature seems very wasteful in both time and electrons.

@prettycoder

have to chime in. My matrix generates 61 jobs and one of them usually fails because of where it collects data from. The next time I re-run 61 jobs another one fails ...

@tbarbugli

Our team is currently exploding 1 workflow into many jobs. This is a terrible hack to get retries to work, because you can no longer make much sense of the status of a branch/commit (the Actions UI does not do any grouping at that point).

@OmgImAlexis

It's been a year, @TingluoHuang is there any status update on this?

Has it been at least considered?

@maragunde93

maragunde93 commented Apr 13, 2021

Is there any update on this? We have migrated from GitLab CI to GitHub Actions and I am already regretting it. We have a pipeline that deploys infrastructure and takes around 1 hour; if something fails, retrying the whole pipeline is a big loss.

@johanhelsing-attensi

johanhelsing-attensi commented May 7, 2021

I'm honestly quite disappointed that this doesn't seem to have been prioritized, and no response as to when/if it can be expected (just that it was "definitely on the backlog" back in 2019).

In order to provide some substance, and not just spam everyone with complaints and a +1, here's a summary about what I found with respect to workarounds:

https://github.community/t/re-run-jobs/16145/11

> I think I found a workaround here. I have split my matrix build to multiple yaml files, each one with a different name but the same trigger. Each of them contain only a single run. As it looks, this enables me to re-run jobs individually. I do so by selecting “re-run all jobs”, where “all” is now always exactly one.
>
> The price you need to pay is that you have some code duplication and cannot use the matrix feature. For me, this is acceptable as the CI is actually done by a script and shared code outside of that script is relatively minimal.

Unfortunately, this is a bit clunky to use, as at least last I checked neither includes nor yaml anchors are supported in workflows, so code reuse and maintainability across projects will be a pain. Also, I don't really understand how I could make rules like "deploy to staging once all builds pass".

The matrix feature is also really nice, and I'd hate to lose it.

There is another interesting hack, which stores the last run status in cache and then skips jobs based on that status:

How they used it:

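    # Write a default status; a cache hit below overwrites this file with the status saved by an earlier attempt.
    - name: Set default run status
      run: echo "::set-output name=last_run_status::default" > last_run_status

    # Restore the status file from a previous attempt of this matrix entry, if there is one.
    # Note: steps.date is assumed to come from an earlier step that is not part of this excerpt.
    - name: Restore last run status
      id: last_run
      uses: actions/cache@v2
      with:
        path: |
          last_run_status
        key: ${{ github.run_id }}-${{ matrix.os }}-${{ matrix.node-version }}-${{ matrix.webpack }}-${{ steps.date.outputs.date }}
        restore-keys: |
          ${{ github.run_id }}-${{ matrix.os }}-${{ matrix.node-version }}-${{ matrix.webpack }}-

    # Printing the file emits the stored ::set-output command, exposing last_run_status as a step output.
    - name: Set last run status
      id: last_run_status
      run: cat last_run_status

    - name: Checkout ref
      uses: actions/checkout@v2
      with:
        ref: ${{ github.event.workflow_dispatch.ref }}

    # Skip this step when the previous attempt of this job already succeeded.
    - name: Use Node.js ${{ matrix.node-version }}
      if: steps.last_run_status.outputs.last_run_status != 'success'
      uses: actions/setup-node@v1
      with:
        node-version: ${{ matrix.node-version }}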

@btjones-me

+1 we would be grateful for this feature.

@gergo-papp

+1 We are evaluating different CI providers right now (after potentially migrating from Travis) and I'm sure this is a really important feature for many other developers as well

@Sarga

Sarga commented May 27, 2021

+1 we need this feature.

@rr-nick-tan

looking for this feature too, otherwise, have to split the workflow into multiple ones

@marcelwa

marcelwa commented Jun 1, 2021

+1 please save the planet!

@domdfcoding

To avoid spamming everyone with notifications please use GitHub's reaction buttons instead of commenting "+1 we want this". Thanks 😃

@Omzig

Omzig commented Feb 1, 2022

Did you know that *.visualstudio.com can do this in devops?

@dylanbhughes

Hope to see it this quarter 🙏

@ethomson
Contributor

ethomson commented Mar 1, 2022

Hello everyone! I strongly agree that this is a thing we need - and in fact this is a thing that we're working on. However, this is a part of Actions itself, it's not a part of the runner application (meaning: the software that's in this repository).

In order to keep things tidy for the runner team - the developers who are working on this application - I'm going to close this issue where it will stay off of their bug list.

This is being tracked in our feedback repository which is where you can request features in GitHub Actions. Thanks for all the feedback, everyone, and I hope to see you in our feedback repo.

@ethomson ethomson closed this as completed Mar 1, 2022
@bartlettroscoe

bartlettroscoe commented Mar 9, 2022

NOTE: Avoiding rerunning jobs that have already passed is about more than just saving computing cycles. It is also critical for limiting the cumulative probability of failure across jobs, which can significantly increase the number of testing iterations needed to get all jobs passing. For example, if you have a GitHub Actions setup with seven independent jobs that test a PR, and each of the seven PR builds has a 20% chance of a random failure, then the chance that at least one of the PR builds fails jumps to 1 - (1 - 0.2)^7 = 0.79, or roughly 80%! And if any job fails, you have to rerun all of the jobs, and the probability of failure the next time is still about 80%, and so on. The result is that it can take many PR testing iterations to get all of the jobs to pass.
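
As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes the jobs fail independently with the same probability; the function names are made up for illustration.

```python
def p_any_failure(n_jobs: int, p_fail: float) -> float:
    """Probability that at least one of n independent jobs fails."""
    return 1.0 - (1.0 - p_fail) ** n_jobs


def expected_full_runs(n_jobs: int, p_fail: float) -> float:
    """Expected number of whole-workflow runs until one run is fully green.

    Each full run is an independent trial that succeeds with probability
    (1 - p_fail) ** n_jobs, so the number of runs is geometrically distributed.
    """
    return 1.0 / (1.0 - p_fail) ** n_jobs
```

For the seven-job example, `p_any_failure(7, 0.2)` comes out at about 0.79, and `expected_full_runs(7, 0.2)` at about 4.8 full runs on average.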

This occurs relatively frequently, for example, in the Trilinos PR testing system (which currently uses a custom PR testing system which also lacks the ability to rerun individual jobs and where each job has a non-trivial random probability of failure).

What this means is that if you can't rerun single jobs that fail, then you just can't effectively scale to a large number of testing jobs. As another example, if you have 100 GitHub Actions jobs with just a 1% chance each of experiencing a failure (which is about the frequency of failing to fetch dependencies in a GitHub Actions job), then the cumulative probability of failure across these 100 jobs is 1 - (1 - 0.01)^100 = 0.63, or 63%! But if you can rerun individual jobs, the number of job runs needed to get everything passing goes way down, and getting a set of passing jobs becomes much more probable after that first run with its 63% cumulative probability of failure. If just a single job failed in the first run of all of the GHA jobs (due to a random failure), then rerunning that one job would have just a 1% chance of failing, or a 99% chance of passing. That reduces wasted computing resources and speeds up the testing cycle wall-clock time.
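
To put rough numbers on the comparison above, the following sketch estimates the expected number of job executions under each rerun policy. It assumes independent failures with a fixed per-job probability and unlimited retries; both function names are hypothetical.

```python
def expected_executions_rerun_all(n_jobs: int, p_fail: float) -> float:
    """Every attempt runs all n jobs; attempts repeat until one is fully green."""
    p_all_pass = (1.0 - p_fail) ** n_jobs
    expected_attempts = 1.0 / p_all_pass  # mean of a geometric distribution
    return n_jobs * expected_attempts


def expected_executions_rerun_failed(n_jobs: int, p_fail: float) -> float:
    """Each job is retried on its own until it passes."""
    expected_runs_per_job = 1.0 / (1.0 - p_fail)
    return n_jobs * expected_runs_per_job
```

With 100 jobs at a 1% failure rate, rerunning everything costs about 273 job executions on average, while rerunning only the failed jobs costs about 101.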

This is a big deal for projects that need many testing jobs and have a higher probability of failure in any individual job.

@piotrekkr

piotrekkr commented Mar 16, 2022

Seems like it is live now and we can rerun single jobs. I'm really grateful to the devs for implementing this 🙏 🎉

And now some tiny rant 😅

It's kinda broken when using job matrix and Cypress parallel tests...

Here is how it worked, and why it does not work well with the failed-job rerun feature:

  1. in the "setup" job we generated a unique ID for the Cypress test run
  2. next we used a job matrix to generate three workers that ran the Cypress tests in parallel using the generated ID
  3. when some jobs in the matrix failed, we reran the full workflow, which generated a new ID and ran the whole matrix again

Why does it not work when rerunning only one matrix job? Because the unique ID stays the same, Cypress considers this run finished and does not run the tests again. I did not find a way to force running them again with the same ID. What can be done about this:

  1. rerun the whole workflow again (old way)
  2. rerun failed jobs only (creates a new workflow run attempt with only the failed matrix jobs and the same unique ID)
  3. manually rerun all failed jobs one by one (there is no way to manually rerun the whole matrix again)
  4. rerun the "setup" job, which generates a new ID and also triggers all dependent jobs

The first approach is slow since everything needs to be rerun. The second approach seems best at first because we could use UNIQUE_ID-RUN_ATTEMPT as the Cypress ID, but it can be problematic when only 1 of 10 matrix jobs failed and one runner has to handle all e2e tests again (no parallelization). The third approach is not good either, since we cannot select multiple jobs to rerun manually in the same attempt, so again no parallelization. The last approach is what we use now and it works OK, but developers need to remember to rerun this setup job instead of just rerunning failed jobs.
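
The UNIQUE_ID-RUN_ATTEMPT idea from the second approach could look roughly like the sketch below. It assumes the script runs inside a GitHub Actions job, where the default environment variables `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` are set, and it uses the run ID as a stand-in for the ID generated in the "setup" job. The function name `cypress_build_id` is hypothetical, and how the value is handed to Cypress is left out.

```python
import os
from typing import Mapping


def cypress_build_id(env: Mapping[str, str] = os.environ) -> str:
    """Build an ID that changes on every re-run attempt of the same workflow run.

    GITHUB_RUN_ID stays the same across re-runs, while GITHUB_RUN_ATTEMPT
    increases with each attempt, so the combination is unique per attempt.
    """
    run_id = env["GITHUB_RUN_ID"]
    # Fall back to "1" (an assumption) if the attempt variable is missing.
    attempt = env.get("GITHUB_RUN_ATTEMPT", "1")
    return f"{run_id}-{attempt}"
```

As noted above, this only gives Cypress a fresh ID; it does not restore parallelization when a single matrix job is rerun.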

So to sum up

Maybe we could add some flag to mark whole matrix as failed when one of jobs inside matrix fails? When we rerun all failed jobs it would rerun whole matrix again. What do you think?

Thanks

@tvdeyen
Author

tvdeyen commented Mar 16, 2022

Wow. Finally. Two years of wasting precious resources later it finally shipped. Thanks to everyone involved.

alchemy_cms@1715663 2022-03-16 11-42-17

@Jolg42

Jolg42 commented Mar 16, 2022

Can confirm it's here! 🎊
Screen Shot 2022-03-16 at 11 22 22
Screen Shot 2022-03-16 at 11 22 58

Looks like Santa was early this year 🎅🏼

@willyt150

I saw the release announcement for supporting re-running single jobs, is this being released in phases or something? The GitHub Enterprise repos I'm working on still do not have any ability to re-run individual jobs.

I thought maybe it just wouldn't work with old runs, so I kicked off new ones and still nothing, just the re-run all jobs option.

@chrispat
Member

It is currently available on github.com only and is slated to ship in the next update to GitHub enterprise. In addition there are still some issues related to reusable workflows that we are ironing out.

@davegallant

davegallant commented Mar 21, 2022

> It is currently available on github.com only and is slated to ship in the next update to GitHub enterprise. In addition there are still some issues related to reusable workflows that we are ironing out.

This is amazing work. Not seeing the option to re-run failed jobs for reusable workflows. Wasn't sure if it's because the call to the reusable we're using is dependent upon another job or not.

EDIT: For more context: the first job is reading configuration and then passing the config to the reusable workflow call that starts several jobs in a matrix.

@chrispat
Member

We have temporarily disabled the feature for any run that references a reusable workflow while we iron out the issues. We hope to have those resolved towards the end of this week or early next week.

@debugger24

debugger24 commented Mar 26, 2022

Unable to rerun a single job when some jobs are pending deployment review.

Here, I want to rerun build_3 before approving build_4.

image

@piotrekkr

piotrekkr commented Mar 27, 2022

@debugger24 This is only my guess, but this is probably by design. Rerunning any job creates a whole new run attempt for the whole workflow. All jobs that do not depend on the job you want to rerun are "cloned" into the new run attempt. But to clone a job you need its result first, so you need to wait for all jobs to finish.

@Drowze

Drowze commented Mar 28, 2022

Found another unexpected behaviour:

  • given a job that submits a manual status check (e.g. via API) that has passed
  • and given a different job that has failed
  • when I retry only failed jobs, the resulting check group will not have the manual check (submitted by the job that has passed on the first try)

This is a problem for us: we have a manual check called "Rubocop" (submitted manually using reviewdog) that is required for a pull request to be merged. If we retry the workflow, we have all jobs passing, but the manual check is missing, so a PR can't be merged.

Screenshots of such case (1st with failed jobs, then 2nd re-ran, but without the manual rubocop status check)
Screenshot 2022-03-28 at 14 16 59Screenshot 2022-03-28 at 14 16 44

@mrmike

mrmike commented Apr 5, 2022

> We have temporarily disabled the feature for any run that references a reusable workflow while we iron out the issues. We hope to have those resolved towards the end of this week or early next week.

Do you have any public issue opened for this case? I'd like to track progress of this issue

@hugovk
Contributor

hugovk commented Apr 8, 2022

This is now working, thanks!

@madhavajay

Is it possible to re-run a failed job before the others finish? We have quite long running jobs which means the wait to retry a failed test due to some weird external issue is a really long time.
Screen Shot 2022-05-06 at 9 52 18 am

@janpio

janpio commented Nov 1, 2023

It is not yet possible, @madhavajay, so I created a feedback discussion to suggest it: https://github.com/orgs/community/discussions/73156 Leave an upvote or reaction over there! (Ideally also the 43 other people who upvoted the previous comment 😆)

@abhilash1in

I don't see "Re-run failed jobs" option as a dropdown.

I also don't see an option to re-run individual jobs when I hover over them.

Is this a bug or am I doing something wrong?

bug

@piotrekkr

> I don't see "Re-run failed jobs" option as a dropdown.
>
> I also don't see an option to re-run individual jobs when I hover over them.
>
> Is this a bug or am I doing something wrong?

@abhilash1in Are you sure that all jobs inside workflow are finished? If they are not done yet there will be no option to rerun. GitHub requires for full workflow to finish before it can be rerun.

@abhilash1in

> > I don't see "Re-run failed jobs" option as a dropdown.
> > I also don't see an option to re-run individual jobs when I hover over them.
> > Is this a bug or am I doing something wrong?
>
> @abhilash1in Are you sure that all jobs inside workflow are finished? If they are not done yet there will be no option to rerun. GitHub requires for full workflow to finish before it can be rerun.

Erm, okay all jobs had not finished running when I was looking for the re-run failed jobs button.

But also, that doesn't make sense. If I see failed jobs, I should be able to re-run them individually without having to wait for all the jobs to finish.

@piotrekkr

piotrekkr commented Apr 18, 2024

> > > I don't see "Re-run failed jobs" option as a dropdown.
> > > I also don't see an option to re-run individual jobs when I hover over them.
> > > Is this a bug or am I doing something wrong?
> >
> > @abhilash1in Are you sure that all jobs inside workflow are finished? If they are not done yet there will be no option to rerun. GitHub requires for full workflow to finish before it can be rerun.
>
> Erm, okay all jobs had not finished running when I was looking for the re-run failed jobs button.
>
> But also, that doesn't make sense. If I see failed jobs, I should be able to re-run them individually without having to wait for all the jobs to finish.

Yeah, it would be nice to be able to do this. However, I think that GitHub needs to store the state of the whole workflow run before you can rerun parts of it. Some jobs depend on the results of other jobs (even if those jobs failed). They probably wait for all jobs to finish (skipped, failed, cancelled or successful), store the workflow run state somewhere, and only then know which job states to "copy" from the previous run and which jobs to rerun.
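
The guess above can be illustrated with a small Python sketch. This is a toy model of the idea, not GitHub's actual implementation: the function name `plan_rerun`, the status strings, and the shape of the `needs` mapping are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Terminal job states named in the comment above (the exact strings are assumed).
FINISHED = {"success", "failure", "cancelled", "skipped"}


def plan_rerun(needs: Dict[str, List[str]],
               results: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Split jobs into (rerun, copied) for a "re-run failed jobs" attempt.

    needs maps each job to the jobs it depends on; results maps each job
    to its final state in the previous attempt.
    """
    unfinished = sorted(j for j, r in results.items() if r not in FINISHED)
    if unfinished:
        # Without a final state there is nothing to copy into the new attempt.
        raise ValueError(f"jobs still running: {unfinished}")

    # Start from the jobs that did not succeed...
    rerun = {j for j, r in results.items() if r in ("failure", "cancelled")}
    # ...then keep adding jobs that depend on something being rerun.
    changed = True
    while changed:
        changed = False
        for job in results:
            if job not in rerun and any(d in rerun for d in needs.get(job, [])):
                rerun.add(job)
                changed = True

    copied = set(results) - rerun
    return rerun, copied
```

In this model a pending job blocks the whole plan, which matches the behaviour described in the thread: the rerun option only appears once every job has reached a final state.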

@Omzig

Omzig commented Apr 18, 2024

in azure, i have to wait for the jobs to finish before i can rerun them.............
