Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

offset: allow zero-byte offset on arbitrary pointers #117329

Merged
merged 2 commits into from
May 22, 2024

Conversation

RalfJung
Copy link
Member

@RalfJung RalfJung commented Oct 28, 2023

As per prior @rust-lang/opsem discussion and FCP:

  • Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
  • Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
  • offset_from on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The offset_from change is needed to maintain the equivalence with offset: if let ptr2 = ptr1.offset(N) is well-defined, then ptr2.offset_from(ptr1) should be well-defined and return N. Now consider the case where N is 0 and ptr1 dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes #65108
Tracking issue | T-lang summary

Cc @nikic

@rustbot
Copy link
Collaborator

rustbot commented Oct 28, 2023

r? @cuviper

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 28, 2023
@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@RalfJung
Copy link
Member Author

Uh, what...? Removing the inbounds leads to a segfault...?

@nikic
Copy link
Contributor

nikic commented Oct 28, 2023

Dropping inbounds changes optimization behavior in a really major way, which is likely to expose latent issues. Keeping inbounds for all LLVM versions is a lot less risky.

@RalfJung
Copy link
Member Author

@nikic That's fine for me, I just wonder whether it can cause issues on its own when people start relying on offset(0) being allowed. LLVM was changed in two places to make that legal:

Could one write Rust code that is legal under the new model but miscompiled by old versions of LLVM?

How fixed is our LLVM support policy -- is it always the last 3 versions, or once LLVM 18 is released, could we say that we require LLVM 17?

@nikic
Copy link
Contributor

nikic commented Jan 11, 2024

Could one write Rust code that is legal under the new model but miscompiled by old versions of LLVM?

I think the second patch is not relevant to rust. The first one could in theory be, but you'd need some quite exotic, artificially constructed code for it. I don't think there is any chance that this would affect "real" rust code.

How fixed is our LLVM support policy -- is it always the last 3 versions, or once LLVM 18 is released, could we say that we require LLVM 17?

The earliest we can drop LLVM 16 support is when Fedora 38 goes EOL, which would be in May.

@apiraino
Copy link
Contributor

If I read the prev comment correctly, this PR can switch back to the author.

@rustbot author

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 25, 2024
@RalfJung RalfJung marked this pull request as draft February 5, 2024 21:14
@RalfJung RalfJung added T-lang Relevant to the language team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 19, 2024
@RalfJung
Copy link
Member Author

RalfJung commented Feb 19, 2024

Nominating for t-lang discussion. This implements the t-opsem consensus from rust-lang/opsem-team#10, rust-lang/unsafe-code-guidelines#472 to generally allow zero-sized accesses on all pointers. Also see the tracking issue.

  • Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
  • Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
  • offset_from on two pointers derived from the same allocation is always allowed when they have the same address

This means the following function is safe to be called on any pointer:

fn test_ptr(ptr: *mut ()) { unsafe {
    // Reads and writes.
    let mut val = *ptr;
    *ptr = val;
    ptr.read();
    ptr.write(());
    // Memory access intrinsics.
    // - memcpy (1st and 2nd argument)
    ptr.copy_from_nonoverlapping(&(), 1);
    ptr.copy_to_nonoverlapping(&mut val, 1);
    // - memmove (1st and 2nd argument)
    ptr.copy_from(&(), 1);
    ptr.copy_to(&mut val, 1);
    // - memset
    ptr.write_bytes(0u8, 1);
    // Offset.
    let _ = ptr.offset(0);
    let _ = ptr.offset(1); // this is still 0 bytes
    // Distance.
    let ptr = ptr.cast::<i32>();
    ptr.offset_from(ptr);
} }

Some specific concerns warrant closer scrutiny.

Null pointers

t-opsem decided to allow zero-sized reads and writes on null pointers. This is mostly for consistency: we definitely want to allow zero-sized offsets on null pointers (ptr::null::<T>().offset(0)), since this is allowed in C++ (and a proposal is being made to allow it in C) and there's no reason for us to have more UB than C++ here. But if we allow this, and therefore consider the null pointer to have a zero-sized region of "inbounds" memory, then it would be inconsistent to not allow reading from / writing to that region.

offset_from

We want to maintain the property that if let ptr2 = ptr1.offset(N) is well-defined, then ptr2.offset_from(ptr1) should be well-defined and return N. Now consider the case where N is 0 and ptr1 dangles: the pointers are derived from the same allocated object, but they are not in bounds of anything. So we need a special case saying that if they point to the same address, offset_from is allowed even if they dangle.

For now we maintain the requirement that they must be derived from the same allocation (I think we'll have to change that in the future, but let's disentangle that from the rest of this PR).

@RalfJung RalfJung marked this pull request as ready for review February 19, 2024 08:58
@rustbot
Copy link
Collaborator

rustbot commented Feb 19, 2024

The Miri subtree was changed

cc @rust-lang/miri

Some changes occurred to the CTFE / Miri engine

cc @rust-lang/miri

@RalfJung RalfJung added the I-lang-nominated Nominated for discussion during a lang team meeting. label Feb 19, 2024
@RalfJung
Copy link
Member Author

r? @oli-obk since by far the biggest chunk of this now is interpreter changes

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 21, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request May 22, 2024
…cottmcm

offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang#65108
[Tracking issue](rust-lang#117945) | [T-lang summary](rust-lang#117329 (comment))

Cc `@nikic`
@bors
Copy link
Contributor

bors commented May 22, 2024

⌛ Testing commit 9526ce6 with merge 11392c9...

@rust-log-analyzer
Copy link
Collaborator

A job failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
  SDKROOT: /Applications/Xcode_14.3.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk
  AR: ar
##[endgroup]
curl: (28) Operation too slow. Less than 100 bytes/sec transferred the last 5 seconds
Error: Failure while executing; `/usr/bin/env /usr/local/Homebrew/Library/Homebrew/shims/shared/curl --disable --cookie /dev/null --globoff --user-agent Homebrew/4.3.0\ \(Macintosh\;\ Intel\ Mac\ OS\ X\ 13.6.6\)\ curl/8.4.0 --header Accept-Language:\ en --fail --progress-bar --silent --remote-time --output /Users/runner/Library/Caches/Homebrew/api/formula.jws.json --location --disable --cookie /dev/null --globoff --show-error --user-agent Homebrew/4.3.0\ \(Macintosh\;\ Intel\ Mac\ OS\ X\ 13.6.6\)\ curl/8.4.0 --header Accept-Language:\ en --fail --progress-bar --silent --compressed --speed-limit 100 --speed-time 5 https://formulae.brew.sh/api/formula.jws.json` exited with 28. Here's the output:
##[error]Failure while executing; `/usr/bin/env /usr/local/Homebrew/Library/Homebrew/shims/shared/curl --disable --cookie /dev/null --globoff --user-agent Homebrew/4.3.0\ \(Macintosh\;\ Intel\ Mac\ OS\ X\ 13.6.6\)\ curl/8.4.0 --header Accept-Language:\ en --fail --progress-bar --silent --remote-time --output /Users/runner/Library/Caches/Homebrew/api/formula.jws.json --location --disable --cookie /dev/null --globoff --show-error --user-agent Homebrew/4.3.0\ \(Macintosh\;\ Intel\ Mac\ OS\ X\ 13.6.6\)\ curl/8.4.0 --header Accept-Language:\ en --fail --progress-bar --silent --compressed --speed-limit 100 --speed-time 5 https://formulae.brew.sh/api/formula.jws.json` exited with 28. Here's the output:
curl: (28) Operation too slow. Less than 100 bytes/sec transferred the last 5 seconds

##[error]Process completed with exit code 1.
Post job cleanup.

@bors
Copy link
Contributor

bors commented May 22, 2024

💔 Test failed - checks-actions

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 22, 2024
@RalfJung
Copy link
Member Author

@bors retry dist-apple: curl: (28) Operation too slow. Less than 100 bytes/sec transferred the last 5 seconds

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 22, 2024
@bors
Copy link
Contributor

bors commented May 22, 2024

⌛ Testing commit 9526ce6 with merge 5d328a1...

@bors
Copy link
Contributor

bors commented May 22, 2024

☀️ Test successful - checks-actions
Approved by: oli-obk,scottmcm
Pushing 5d328a1 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label May 22, 2024
@bors bors merged commit 5d328a1 into rust-lang:master May 22, 2024
7 checks passed
@rustbot rustbot added this to the 1.80.0 milestone May 22, 2024
@RalfJung RalfJung deleted the offset-by-zero branch May 22, 2024 17:41
@rust-timer
Copy link
Collaborator

Finished benchmarking commit (5d328a1): comparison URL.

Overall result: ❌ regressions - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
0.9% [0.8%, 1.2%] 6
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results (primary -3.7%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
-3.7% [-3.7%, -3.7%] 1
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) -3.7% [-3.7%, -3.7%] 1

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 672.314s -> 671.339s (-0.15%)
Artifact size: 315.50 MiB -> 315.69 MiB (0.06%)

github-actions bot pushed a commit to rust-lang/miri that referenced this pull request May 23, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
@apiraino apiraino removed the to-announce Announce this issue on triage meeting label May 24, 2024
flip1995 pushed a commit to flip1995/rust-clippy that referenced this pull request May 24, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
@joshlf joshlf mentioned this pull request May 26, 2024
81 tasks
bors added a commit to rust-lang/rust-analyzer that referenced this pull request Jun 20, 2024
offset: allow zero-byte offset on arbitrary pointers

As per prior `@rust-lang/opsem` [discussion](rust-lang/opsem-team#10) and [FCP](rust-lang/unsafe-code-guidelines#472 (comment)):

- Zero-sized reads and writes are allowed on all sufficiently aligned pointers, including the null pointer
- Inbounds-offset-by-zero is allowed on all pointers, including the null pointer
- `offset_from` on two pointers derived from the same allocation is always allowed when they have the same address

This removes surprising UB (in particular, even C++ allows "nullptr + 0", which we currently disallow), and it brings us one step closer to an important theoretical property for our semantics ("provenance monotonicity": if operations are valid on bytes without provenance, then adding provenance can't make them invalid).

The minimum LLVM we require (v17) includes https://reviews.llvm.org/D154051, so we can finally implement this.

The `offset_from` change is needed to maintain the equivalence with `offset`: if `let ptr2 = ptr1.offset(N)` is well-defined, then `ptr2.offset_from(ptr1)` should be well-defined and return N. Now consider the case where N is 0 and `ptr1` dangles: we want to still allow offset_from here.

I think we should change offset_from further, but that's a separate discussion.

Fixes rust-lang/rust#65108
[Tracking issue](rust-lang/rust#117945) | [T-lang summary](rust-lang/rust#117329 (comment))

Cc `@nikic`
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jul 15, 2024
…oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jul 15, 2024
Rollup merge of rust-lang#124921 - RalfJung:offset-from-same-addr, r=oli-obk

offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang#124921 (comment)).
github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Jul 16, 2024
offset_from: always allow pointers to point to the same address

This PR implements the last remaining part of the t-opsem consensus in rust-lang/unsafe-code-guidelines#472: always permits offset_from when both pointers have the same address, no matter how they are computed. This is required to achieve *provenance monotonicity*.

Tracking issue: rust-lang/rust#117945

### What is provenance monotonicity and why does it matter?

Provenance monotonicity is the property that adding arbitrary provenance to any no-provenance pointer must never make the program UB. More specifically, in the program state, data in memory is stored as a sequence of [abstract bytes](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#abstract-byte), where each byte can optionally carry provenance. When a pointer is stored in memory, all of the bytes it is stored in carry that provenance. Provenance monotonicity means: if we take some byte that does not have provenance, and give it some arbitrary provenance, then that cannot change program behavior or introduce UB into a UB-free program.

We care about provenance monotonicity because we want to allow the optimizer to remove provenance-stripping operations. Removing a provenance-stripping operation effectively means the program after the optimization has provenance where the program before the optimization did not -- since the provenance removal does not happen in the optimized program. IOW, the compiler transformation added provenance to previously provenance-free bytes. This is exactly what provenance monotonicity lets us do.

We care about removing provenance-stripping operations because `*ptr = *ptr` is, in general, (likely) a provenance-stripping operation. Specifically, consider `ptr: *mut usize` (or any integer type), and imagine the data at `*ptr` is actually a pointer (i.e., we are type-punning between pointers and integers). Then `*ptr` on the right-hand side evaluates to the data in memory *without* any provenance (because [integers do not have provenance](https://rust-lang.github.io/rfcs/3559-rust-has-provenance.html#integers-do-not-have-provenance)). Storing that back to `*ptr` means that the abstract bytes `ptr` points to are the same as before, except their provenance is now gone. This makes  `*ptr = *ptr`  a provenance-stripping operation  (Here we assume `*ptr` is fully initialized. If it is not initialized, evaluating `*ptr` to a value is UB, so removing `*ptr = *ptr` is trivially correct.)

### What does `offset_from` have to do with provenance monotonicity?

With `ptr = without_provenance(N)`, `ptr.offset_from(ptr)` is always well-defined and returns 0. By provenance monotonicity, I can now add provenance to the two arguments of `offset_from` and it must still be well-defined. Crucially, I can add *different* provenance to the two arguments, and it must still be well-defined. In other words, this must always be allowed: `ptr1.with_addr(N).offset_from(ptr2.with_addr(N))` (and it returns 0). But the current spec for `offset_from` says that the two pointers must either both be derived from an integer or both be derived from the same allocation, which is not in general true for arbitrary `ptr1`, `ptr2`.

To obtain provenance monotonicity, this PR hence changes the spec for offset_from to say that if both pointers have the same address, the function is always well-defined.

### What further consequences does this have?

It means the compiler can no longer transform `end2 = begin.offset(end.offset_from(begin))` into `end2 = end`. However, it can still be transformed into `end2 = begin.with_addr(end.addr())`, which later parts of the backend (when provenance has been erased) can trivially turn into `end2 = end`.

The only alternative I am aware of is a fundamentally different handling of zero-sized accesses, where a "no provenance" pointer is not allowed to do zero-sized accesses and instead we have a special provenance that indicates "may be used for zero-sized accesses (and nothing else)". `offset` and `offset_from` would then always be UB on a "no provenance" pointer, and permit zero-sized offsets on a "zero-sized provenance" pointer. This achieves provenance monotonicity. That is, however, a breaking change as it contradicts what we landed in rust-lang/rust#117329. It's also a whole bunch of extra UB, which doesn't seem worth it just to achieve that transformation.

### What about the backend?

LLVM currently doesn't have an intrinsic for pointer difference, so we anyway cast to integer and subtract there. That's never UB so it is compatible with any relaxation we may want to apply.

If LLVM gets a `ptrsub` in the future, then plausibly it will be consistent with `ptradd` and [consider two equal pointers to be inbounds](rust-lang/rust#124921 (comment)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. merged-by-bors This PR was explicitly merged by bors. relnotes Marks issues that should be documented in the release notes of the next release. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ptr::offset should explicitly clarify 0-sized offset semantics