Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incident postmortem: Cargo cannot compile crates depending on indirect optional host dependencies #12064

Closed
weihanglo opened this issue Apr 30, 2023 · 3 comments

Comments

@weihanglo
Copy link
Member

weihanglo commented Apr 30, 2023

Summary

The open source Cargo project is the cornerstone of the Rust programming language. Cargo is both the community package manager interacting with the public and private package registries, and also the build system most Rust developers use daily.

At 01:52 UTC, 2022-10-28, pull request 11183 was merged into the master branch of the open source project Cargo. The pull request introduced a bug, causing Cargo to fail to build packages that use a particular, but very common dependency setup. However, not a single report of impact was raised because projects rarely depend on the Cargo master branch directly. The Cargo executable from the master branch is only publicly available once a maintainer synchronizes it into the main Rust project repository, which then queues it up for inclusion in the next “nightly” release of Rust along with any other changes on the Rust project’s master branch. The nightly releases are mainly used by Rust developers who want to try out new and unstable features.

At 09:24 UTC, 2022-11-03, the open source project Rust merged pull request 103860, which synchronised updates from the Cargo master branch. This propagated the bug such that it would be promoted to the next Rust nightly release, at the time scheduled for 15 hours later.

Two open source community members spotted an issue in a report from a bot post-merge, and then discussed on Zulip. The result of the discussion was to revert the merge (i.e., roll back) with another PR. Given that it takes 3 hours to complete a revert, and that it was kicked off 4 hours before midnight (when nightly releases are cut), everyone involved assumed sufficient action had been taken. However, there was a timeout in the first attempt of the revert that went unnoticed until 10pm. There was no longer enough time to kick off a new revert, so the upstream release team had to halt the nightly auto-release procedure and instead wait to perform a manual release once the revert was successfully completed. This step is rarely taken due to the high-touch nature of a manual release — the community has a policy that defines when such a manual intervention is appropriate.

Had we not acted quickly, the nightly release of Rust would have included the bug, which would have made nearly 30k public packages unbuildable for anyone attempting to use the latest nightly release.

Metrics / Graphs

The bug affects all Rust packages that take an optional dependency on a package that itself has an optional “host dependency”. A host dependency is one that is used to help in building a package, such as a code generation tool or one that discovers the location of native dependencies, rather than providing Rust code for use in the dependent package.

# An example of dependency tree
my-lib v0.1.0
└── serde v0.1.0
    └── serde_derive v0.1.0 (proc-macro) 
           ↑↑↑
           (a host dependency of a dependency of the package `my-lib`)

An example of a prominent host dependency is serde_derive; a package with more than 100 million downloads on crates.io (the official Rust community’s package registry, similar to PyPI to Python or npm to Node.js). It is an optional dependency of serde, one of the most widely used Rust packages, and satisfies the criteria for triggering the bug. This means that the bug would impact nearly every Rust project, as such a vast majority of them depend on serde. As of 2022-12-05, serde is a direct dependency for at least 22k packages on crates.io, and an indirect dependency for more than 30k packages. It is downloaded over 5 million times per month.

User Impact

There was no report of user issues on rust-lang/cargo, mainly because the bug landed on Cargo master branch. We don’t have a way to count the number of users that directly depend on the Cargo master branch, but the Cargo master branch does not generally guarantee stability, so the main users are developers who work on Cargo itself.

The sync up pull request landed on rust-lang/rust at 09:24 UTC, 2022-11-03, and got reverted at 00:29 UTC, 2022-11-04. Some contributors to Rust itself may have been affected during this period, but only the small number of individuals working specifically on Rust’s integration with Cargo, since Rust development builds more broadly use artifacts from a previous Rust release.

The impact would have been much broader had the change made it into the nightly release, but this was avoided. The main impact to the broader community, or at least those specifically waiting for the next nightly release (primarily eager developers), was that the nightly release was 2 hours late that night (the manual release was kicked off around 02:00 UTC, 2022-11-04).

flowchart LR
   PR[Cargo PR] 
   -- "merge\n~1h" --> 
   cargo[rust-lang/cargo]
   -- "merge\n~3h" --> 
   rust[rust-lang/rust]
   -- "every\nnight" -->
   nightly[nightly]
   -- every\n6 weeks --> 
   beta[beta]
   -- every\n6 weeks --> 
   stable[stable]
   
   classDef affected fill:#f96
   class PR,cargo,rust affected
Loading

The orange part represents the affected phases. The issue was mitigated before it got into the next nightly release.

Incident Response Analysis

  • How was the event detected (e.g. our alarm, another team’s alarm, manual)?
    • The Rust project has automation bots to analyze code changes. One such bot for performance benchmarking pulls a random cross-section of real world projects, compiles them and checks for any performance regression. The bot failed to compile some packages, and reported an “ACTION NEEDED” comment. Two Rust open source community members from the performance working group then chimed in and pinged the Cargo Team for further investigation.
  • How could time to detection be improved? As a thought exercise, how would you have cut the time in half?
    • There are several existing mechanisms to ensure a patch is still valid in the real world. Crater as an example, compiles packages across parts of or all of the Rust ecosystem. cargotest is also a mechanism inside the Rust project acting as this role. The Cargo project could:
      • First introduce a manual check when doing a sync up pull request against rust-lang/rust main repository. If there is something touching dependency resolution, it should trigger a crater run for a subset of packages, say, top 100 packages on crates.io.
      • Have its own test suite independent from normal test suites, acting as a smoke test. This “smoke test suite” checks if keys packages in the community compile and fails fast if not. This is similar to what the existing cargotest CI step does, but at an earlier stage.
    • The time between the bot spotted the regression and a community member triaged that error was 30 minutes. If it had been triaged sooner, that would have reduced the time to detection. Improving this metric in an open-source, volunteer-based project is difficult. One option would be to employ on-call staff to monitor the Rust project for these kinds of errors.
  • How did you reach the point where you knew how to mitigate the impact?
    • The bot that reported a regression reported it specifically on the PR that synchronized the buggy Cargo into the Rust project. Thus, the path to mitigation was immediately clear once the event was detected: revert, and ensure that revert lands before the next release. The Rust infrastructure team is the primary contact for anything related to the infrastructure of the Rust project, and they were looped in once the revert was in place.
  • How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?
    • We should have monitored for the timeout of the pipeline that ensured it didn’t get stuck before the PR got merged, as there was not an incident notification system in the community at that time. The timeout wasn’t discovered until 10pm, but a revert build would take ~3 hours.
    • We could make the performance bot ping the corresponding maintainers to get in the process at an earlier stage before the release.

Post Incident Analysis

  • How was the root cause diagnosed?
    • The performance bot reported a compilation failure.
    • One member of the Rust infrastructure team looked into it and recognized it as a bug in Cargo. They then pinged the Cargo team on a Zulip public channel one hour after the compilation failure.
    • One Cargo maintainer was the author of a recent change of dependency resolution, so the maintainer had fresh memory of what they'd done and realized there wasn't a single integration test exercising the specific but common scenario.
  • How could time to diagnosis be improved? As a thought exercise, how would you have cut the time in half?
    • In terms of diagnosing the bug before merge, one of the most impactful approaches would be to put more checks in the CI pipelines. Either a subset of crater runs or cargotest runs could help diagnose the error before merge.
    • To improve the diagnosis time, the Rust community could parallelize the build process more, in order to break down CI tasks into smaller pieces and distribute them to more machines for the CI pipeline. So that each build can be complete in less than 3 hours, making people more likely to track the build and diagnose.
    • It was lucky that one of the maintainers is the author. If it was a bug introduced by a huge pull request by a random contributor, we could’ve spent more time on diagnosis. There are at least two ways to improve the diagnosis time. One is having Cargo become more modularized, so everyone can read the codebase easily without being a expert. The other is having more tests to cover each scenario, so contributors can make changes without fear.
  • Did you have an existing backlog item or ticket which, if addressed, would’ve prevented or greatly reduced impact of this event?
    • No. The scenario is so common that everyone thought it has been covered since day one.

Timeline

5 Whys

  • Why did Cargo's behavior regress?
    There was a bug introduced in pull request 11183.
    • Why was a change made to this logic in the first place?
      The change in pull request 11183 was an attempt to fix issue 10526, which regarded to the feature resolution in the Cargo project.
    • Why was the change erroneous?
      The bug detected activated dependencies in Cargo in a different way. Some packages that should be activated are de-activated during the feature resolution. The blast radius is all packages with some indirect optional host dependencies, which is nearly all packages due to key packages in the ecosystems making use of this feature.
      • Why did the contributor fail to find the bug during development?
        The contributor was relatively new to the feature resolution part, knowing only the documented rationale but edge cases were overlooked.
    • Why didn’t the Cargo team spot the bug during the review in Cargo?
      The contributor explained the change in a convincing way. In the Cargo team, only the team lead has the complete knowledge of Cargo. The team lead is a human. There was no second pair of eyes to help. Also, the feature resolution of Cargo is one of the convoluted parts in Cargo, which should have had a higher test coverage but didn’t.
      • Why was there a limited review capacity of Cargo?
        Unlike enterprise projects, open source projects do not have a formal way to recruit new maintainers. It is also not easy to earn trusts from existing maintainers, as maintainers always seek people that are more likely to make a long-term commitment on projects. It's a somewhat chicken-egg problem to have new joiners. Cargo is not modular enough because only a few people maintain it, and Cargo has so few maintainers because it is not modular enough. An approach could be initiated from existing members to start documenting more on Cargo internals and mentoring more people in the world to make Cargo more sustainable.
  • Why did the error make it all the way into Rust's master branch without being caught?
    There is no sufficient test mechanism, either automatic or manual, to catch bugs in real world projects.
    • Why did the error make it into rust-lang/cargo?
      The current test suite didn’t cover the scenario of the regression.
      • Why was there insufficient integration test to catch this bug prior to merging?
        Cargo does have lots of integration tests, around 2500 as of 2022-12-05. However, those integration tests are sophisticatedly crafted in the style of unit testing; the project does not have formal verification or property tests, nor broader “test-build real projects” smoke tests. In the real world there are too many combination of those features and Cargo test suite hardly covers them all. Although this case is common within the ecosystem, it is still a relatively complex test scenario when comparing to other existing tests.
      • Why aren’t there formal verification or property tests in Cargo?
        There are some property testing around dependency resolver in Cargo. For other parts, such as the feature resolver where the bug resided, there are not. There is a barrier to understand property testing, not to say formal verification. The Cargo Team can first adopt property testing on areas with well-defined inputs, such as the feature resolver.
      • Why doesn't Cargo compile common packages as part of CI to test itself?
        The CI time has suffered for a while. It takes around 1 hour to complete. In order to add a new CI workflow, one should try not to stack up the total CI time.
      • Why doesn't Cargo run a crater run for every landed PR?
        A crater run usually takes a couple of hours to complete, depending on how many packages are included. It's not practical to make it a routine for every landed PR, since the Cargo project merges ~20 PRs a week. Doing so would over burden the already-limited CI resources. It could, however, enable the ability to run crater on-demand if Cargo maintainers want to.
    • Why did the error make it into rust-lang/rust?
      The test suite cargotest, which tests against real world projects, didn’t catch the bug before the merge.
      • Why did the rust-lang/rust PR-checking not catch the regression when building common packages?
        The test suite cargotest didn’t make it to catch the bug; packages it tested against were determined in 2016 and might not reflect the state of the ecosystem in 2022.
        In addition, the maintainer who merged it thought it was unnecessary for a crater run, when is very likely to catch the regression.
        • Why didn’t cargotest include popular packages in its testsuite?
          There was a community proposal to modernize cargotest. Although, the author was laid-off immediately after the proposal came out. That could be the reason why the proposal was abandoned. The other reason could be cargotest running tests in serial, which leads to long CI time when more tests are added.
      • Why does the performance bot (which highlighted the bug) run after merging?
        The performance bot is not made for functional regression, so it should not block a PR being merged.
      • Why doesn't rust-lang/rust run a crater run for every landed PR?
        Same as running crater in rust-lang/cargo.
  • Why did Cargo need a revert instead of a real patch to the issue the bug was trying to fix?
    The Cargo release model is on a daily basis. At every UTC 00:00, an automated job runs that kicks off the next nightly release. If we want to eliminate bugs in a to-release master branch, the best way is to do a revert. If we rush in a fix and ship it within few hours, it might introduce more unexpected behavior than we thought. It also takes more time to merge into Cargo master branch (1 hour) and then do a Cargo version bump in rust-lang/rust repository (3 hours).
    • Why do we need a manual revert instead of automatic one?
      Realistically, the Rust infrastructure team could’ve chosen to do a nightly release on time but from an earlier commit. However, this might introduce a lot of infrastructure complexity and still involves humans in the loop, in practice.
  • Why couldn’t the Rust infra team rollout a nightly release on time?
    There was a timeout in CI pipeline but no one noticed it at first.
    • Why was there a timeout in CI pipeline but nobody noticed at first?
      There was no a mechanism to notify relevant people except GitHub Notifications and emails at that time.
    • Why didn’t Cargo maintainers get notified automatically?
      The benchmark report automation is run independently for each pull request in rust-lang/rust repository. Only the compiler-performance working group gets notifications when there is a failure or performance loss after the pull request is merged. The performance report could probably detect which submodule was changed and notify corresponding groups.

Lessons Learned

  • When contributing to open source projects, keep ownership in mind for your contributions. Take care of them even after the release for a period of time.
  • When doing a sync up pull request from rust-lang/cargo to rust-lang/rust, decide if it needs a crater run before merging.
  • When filing a change that touches a relatively complicated area, try to reach out to people in advance that can help validate the change, and — if an incident happens — address the incident.

Action items

  • HIGH Recommend to add additional integration tests to prevent this specific regression in the Cargo project.
  • MEDIUM Recommend to add tests to build a set of key packages in the Rust ecosystem in the pipeline of the Cargo project before merging.
  • MEDIUM Recommend to invest in the cargotest enhancement proposal, particularly for covering more real world scenarios.
  • LOW Recommend to make property-based testing or similar approaches covering more modules in the Cargo project.
  • LOW Recommend to make Rust automation bots alert corresponding teams of the pull request when a build is failing.
  • LOW Recommend to make Rust automation bots alert corresponding teams of the pull request earlier when a build timed out.
  • LOW Recommend to make the Cargo project more modularized to reduce the complexity of contributions.
  • LOW Recommend to document more design rationales and expected behavior of the core part of the Cargo project codebase.
@weihanglo
Copy link
Member Author

Close as it serves as a reference, not an issue tracking stuff.

@Amleto
Copy link

Amleto commented May 6, 2023

The performance bot is not made for functional regression, so it should >>not<< block a PR being merged.

Significant missing word @weihanglo

@teor2345
Copy link

  • It also takes more time to merge into Cargo master branch (1 hour) than do a Cargo version bump in rust-lang/rust repository (3 hours).

Significant misspelling: "than" should be "then"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants