Fix an issue with `sync::Barrier` leader selection with reuse #102
Conversation
I ended up writing this test because I wasn't sure about the case where `1 < barrier_size < num_threads`; I think it subsumes some of yours:
```rust
// Test that any combination of threads can be chosen as leader of an epoch.
fn barrier_test_nondeterministic_leader(batch_size: usize, num_threads: usize) {
    // Don't test cases that might deadlock; every thread needs to be released eventually
    assert!(num_threads % batch_size == 0);
    let seen_leaders = Arc::new(std::sync::Mutex::new(HashSet::new()));
    {
        let seen_leaders = Arc::clone(&seen_leaders);
        check_dfs(
            move || {
                let barrier = Arc::new(Barrier::new(batch_size));
                let leaders = Arc::new(std::sync::Mutex::new(vec![]));
                let handles = (0..num_threads)
                    .map(|i| {
                        let barrier = Arc::clone(&barrier);
                        let leaders = Arc::clone(&leaders);
                        thread::spawn(move || {
                            let result = barrier.wait();
                            if result.is_leader() {
                                leaders.lock().unwrap().push(i);
                            }
                        })
                    })
                    .collect::<Vec<_>>();
                for thd in handles {
                    thd.join().unwrap();
                }
                let leaders = Arc::try_unwrap(leaders).unwrap().into_inner().unwrap();
                seen_leaders.lock().unwrap().insert(leaders);
            },
            None,
        );
    }
    let seen_leaders = Arc::try_unwrap(seen_leaders).unwrap().into_inner().unwrap();
    let num_batches = num_threads / batch_size;
    let fact = |i: usize| (2..i + 1).reduce(|acc, i| acc * i).unwrap_or(1);
    // Every execution should have one leader per batch
    assert!(seen_leaders.iter().all(|l| l.len() == num_batches));
    // There should be `num_threads permute num_batches` different leader vectors
    assert_eq!(seen_leaders.len(), fact(num_threads) / fact(num_threads - num_batches));
}

#[test]
fn barrier_test_nondeterministic_leader_1() {
    barrier_test_nondeterministic_leader(1, 4);
}

#[test]
fn barrier_test_nondeterministic_leader_2() {
    barrier_test_nondeterministic_leader(2, 4);
}

#[test]
fn barrier_test_nondeterministic_leader_4() {
    barrier_test_nondeterministic_leader(4, 4);
}
```
src/sync/barrier.rs (outdated)
```rust
// Mark the task as requesting an explicit yield and schedule a thread switch, otherwise
// some schedulers would nearly always pick to continue executing the current task.
// Requesting a yield allows other threads to be the first one to wake up and grab the
// leader token.
ExecutionState::request_yield();
```
Unlike #101, I think we don't want to yield in this case. We needed it there for fairness because `park` gets used in spinloops, which are doomed by unfair schedules. But barriers aren't used like that; threads generally do useful work once unblocked by a barrier. (I do wish we had better rules for thinking about this, but it's all random in the end anyway, so it gets a bit hand-wavy.)
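For context, the `park`-in-a-spinloop pattern referenced here looks roughly like this (a hypothetical sketch, not the actual code from #101): `park` can wake spuriously, so the condition is re-checked in a loop, and a schedule that unfairly keeps running the parked thread never lets the unparker make progress.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let flag2 = Arc::clone(&flag);

    let consumer = thread::spawn(move || {
        // park() may return spuriously, so the condition is re-checked in a
        // loop. Under an unfair schedule that keeps picking this thread, it
        // spins here forever unless the scheduler is nudged (via a yield) to
        // run the thread that sets the flag and unparks us.
        while !flag2.load(Ordering::Acquire) {
            thread::park();
        }
    });

    flag.store(true, Ordering::Release);
    consumer.thread().unpark();
    consumer.join().unwrap();
    println!("released");
}
```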
My intention here was to prevent an unfair schedule from making it so that it's (almost) always the last thread to reach the barrier that becomes the leader, by giving every other thread a chance to wake up first and become the leader instead. In most situations I can think of, though, it's probably already random enough which thread happens to call `wait` last, so the "extra" fairness here won't help much. I'll remove the `request_yield` here.
src/sync/barrier.rs (outdated)
```diff
  // Block current thread
- assert!(state.waiters.insert(me)); // current thread shouldn't already be in the set
+ // The current thread shouldn't already be in `waiters`.
+ assert!(state.waiters.insert(me));
```
I think I marginally preferred the existing code that didn't needlessly add the current thread to the waiters set, so that the invariant was that all waiters are blocked. It ends up having roughly the same number of special cases for `me`, just different ones. But we should still assert here that `me` isn't already a waiter.
> It ends up having roughly the same number of special cases for me, just different ones.

What special cases for `me` do you mean? Assuming I kill `assert_eq!(t.blocked(), tid != me)` (which is really just for debugging), then the only use of `me` in the function is in the line you commented on here, which I could just inline as:

```rust
assert!(state.waiters.insert(ExecutionState::me()));
```
> [...] the invariant was that all waiters are blocked.

Arguably, that still is an invariant. The barrier's `state` is never released while this isn't true (and every thread in `waiters` is always one that called `wait`).
src/sync/barrier.rs, `@@ -22,9 +22,19 @@ impl BarrierWaitResult`
Could you write a comment like the one we have for `Condvar` explaining how/why the implementation works? (Line 40 in 94b9d8b:)

```rust
// We implement `Condvar` by tracking the `CondvarWaitStatus` of each thread currently waiting on
```

It should probably be a doc comment on `BarrierState`? (The `Condvar` one should probably have been a doc comment on `CondvarState`.)
Added a doc comment to `BarrierState` in the latest revision.
The Rust standard library documentation for `sync::Barrier` (https://doc.rust-lang.org/std/sync/struct.Barrier.html) states the following about `Barrier::wait`:

> Blocks the current thread until all threads have rendezvoused here. Barriers are re-usable after all threads have rendezvoused once, and can be used continuously. A single (arbitrary) thread will receive a BarrierWaitResult that returns true from BarrierWaitResult::is_leader() when returning from this function, and all other threads will receive a result that will return false from BarrierWaitResult::is_leader().

This documentation is a bit ambiguous. It says that "a single thread" becomes the leader after returning from `wait`, but it isn't clear whether that means a single thread for the existence of the `Barrier`, or a single thread for each "batch" of threads released by the barrier. The latter interpretation seems to make the most sense given the context of being able to reuse barriers, and it seems more useful to be able to assume that a leader is set for each batch. That is in fact the behavior implemented by the standard library, as evidenced by the implementation (https://doc.rust-lang.org/src/std/sync/barrier.rs.html#126-143) and this unit test, which reuses a barrier on a new thread:

```rust
let barrier = Barrier::new(1);
assert!(barrier.wait().is_leader());
let thd = thread::spawn(move || {
    assert!(barrier.wait().is_leader());
});
thd.join().unwrap();
```

Let's assume this behavior is correct, and that the ambiguity in the Rust documentation would be resolved in favor of the existing implementation. Then the Shuttle behavior for `Barrier` is incorrect and causes this test to fail, because there is only a single, global `leader` field on the barrier, which is set once the first time it's used.

This commit fixes the problem by adding the notion of an "epoch" to the `Barrier` implementation, which allows there to be different leaders for each batch / group of threads released by the barrier. (This is similar to the standard library implementation's notion of generation.)

The commit also adds some new unit tests to capture the issue with the old behavior and prevent regressions. It also tweaks `Barrier::wait` to become an explicit thread yield point, in order to vary which thread becomes the leader rather than just choosing the first one that happens to call `wait`.

Co-authored-by: James Bornholt <bornholt@amazon.com>
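The epoch idea can be illustrated with a minimal sketch. This is hypothetical code, not Shuttle's actual implementation: Shuttle routes blocking and leader choice through its scheduler (so the leader is nondeterministic), whereas this sketch, like the standard library's generation counter, simply makes the last arriver the leader. The names `BarrierState.count`/`epoch` here are illustrative.

```rust
use std::sync::{Condvar, Mutex};

// Minimal epoch-based barrier sketch. The epoch distinguishes successive
// batches, so each released batch gets its own leader and the barrier is
// safely reusable.
struct BarrierState {
    count: usize, // threads that have arrived in the current epoch
    epoch: usize, // incremented every time a full batch is released
}

pub struct Barrier {
    state: Mutex<BarrierState>,
    cvar: Condvar,
    num_threads: usize,
}

impl Barrier {
    pub fn new(n: usize) -> Self {
        Barrier {
            state: Mutex::new(BarrierState { count: 0, epoch: 0 }),
            cvar: Condvar::new(),
            num_threads: n,
        }
    }

    /// Returns true for exactly one thread per released batch (the "leader").
    pub fn wait(&self) -> bool {
        let mut state = self.state.lock().unwrap();
        let epoch = state.epoch;
        state.count += 1;
        if state.count == self.num_threads {
            // Last arriver releases the batch and is its leader here.
            // (Shuttle instead picks the leader nondeterministically.)
            state.count = 0;
            state.epoch += 1;
            self.cvar.notify_all();
            true
        } else {
            // The epoch bump is what distinguishes a wakeup for *this*
            // batch from a wakeup for a later reuse of the barrier.
            while state.epoch == epoch {
                state = self.cvar.wait(state).unwrap();
            }
            false
        }
    }
}
```

With this structure, the reuse test from the commit message passes: `Barrier::new(1)` makes every `wait` call a full batch, so each call sees a fresh epoch and elects a fresh leader.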
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.