simln-lib/refactor: fully deterministic produce events #277
Conversation
Opening in draft; still need to fix some issues.
btw don't worry about fixups until this is out of draft - when review hasn't properly started it's okay to just squash em!
Direction looking good here! Main comment is that I think we need to have a step where we replenish our heap by calling generate_payments again?
- If payment_count().is_none(), we return a single payment from generate_payments.
- For RandomPaymentActivity, this means we'll do one payment per node and then shut down?

Related to this, we possibly don't want to queue up tons of events when payment_count is defined (say we want a million payments: we'll queue up a million items, which is a bit of a memory waste). This probably isn't a big deal, because I'd imagine this use case is primarily for smaller numbers, but it's something to keep in mind as we address the above requirement.
Also would be good to rebase this early on to get to a more current base 🙏
The idea would be to generate all the payments at once, so the master task would dispatch the events.
Yes, in this case, only one payment is generated
Yes, right now it is working in this mode 🤔
Yes, you are right, maybe it would be better to create some batches of payments. I am going to try to come up with an alternative to reduce the memory waste. 🤔
Force-pushed from b06a289 to 1b3a21f.
Hi @carlaKC , I've developed a new approach for the event generation system. The core idea is to centralize the random number generation to ensure deterministic outcomes for our simulations. Here's a breakdown of the design:
This design ensures that the wait times and final destinations are entirely deterministic across simulation runs. However, there is a new challenge with the non-deterministic order of thread execution.

The Determinism Challenge
While the values generated (wait times, destinations) are fixed if the random number generator is seeded, the order in which the executor threads request these values is not guaranteed. For example, if we have ex1 and ex2 executors:

This means that even though the sequence of random numbers from the central manager is the same, which executor consumes which number from that sequence is left to the operating system's scheduler, leading to variations in the overall simulation flow.

Proposed Solution for Execution Order
To achieve full simulation determinism, including the order of execution, I'm considering adding a tiny, randomized initial sleep time before each executor thread begins its main loop. While seemingly counter-intuitive, this jitter can effectively "break ties" in thread scheduling in a controlled, reproducible way when combined with a seeded random number generator. This would allow us to deterministically influence which thread acquires the next available random number from the central manager. WDYT?
Deleted previous comment - it had some misunderstandings. Why can't we keep the current approach of generating a queue of events and then replenish the queue when we run out of events? By generating all of our payment data in one place, we don't need to worry about thread execution order. I think that this can be as simple as pushing a new event to the queue every time we pop one? We might need to track some state for payment count (because we'll need to remember how many we've had), but for random activity it should be reasonable.
Rough sketch of what I was picturing:
- Queue up initial set of events:
- Read from heap:
- Instinct about this is:
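The replenish-on-pop idea described above could look roughly like this. A minimal, self-contained model in plain seconds; `Activity`, `run`, and the field names are illustrative stand-ins, not the PR's actual `ExecutorKit`/`PaymentGenerator` types:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Illustrative per-node activity: a fixed wait between payments and an
// optional payment_count bound, plus how many payments we've sent so far.
struct Activity {
    name: &'static str,
    wait_secs: u64,
    payment_count: Option<u64>,
    sent: u64,
}

/// Pop events one at a time; every time we pop an event, push the same
/// activity's next event back onto the heap, unless its payment_count
/// has been reached. The heap is keyed by accumulated fire time.
fn run(mut activities: Vec<Activity>, max_events: usize) -> Vec<&'static str> {
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = activities
        .iter()
        .enumerate()
        .map(|(i, a)| Reverse((a.wait_secs, i)))
        .collect();
    let mut fired = Vec::new();
    while fired.len() < max_events {
        let Some(Reverse((t, i))) = heap.pop() else { break };
        fired.push(activities[i].name);
        activities[i].sent += 1;
        // Replenish: one new event per pop, so the heap stays small.
        match activities[i].payment_count {
            Some(c) if activities[i].sent >= c => continue,
            _ => heap.push(Reverse((t + activities[i].wait_secs, i))),
        }
    }
    fired
}

fn main() {
    // Node "a" pays every 5s but only twice; node "b" every 10s, unbounded.
    let acts = vec![
        Activity { name: "a", wait_secs: 5, payment_count: Some(2), sent: 0 },
        Activity { name: "b", wait_secs: 10, payment_count: None, sent: 0 },
    ];
    println!("{:?}", run(acts, 4));
}
```

Because every event is generated and consumed in one place, thread scheduling never influences the payment order, and memory stays bounded by the number of activities rather than the total payment count.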
Force-pushed from a99bbff to 2beccfa.
hi @carlaKC I think that now it is working as expected 💪
Some relative timestamp issues that we need to address - didn't occur to me earlier in design phase.
I noticed this when testing out against a toy network, would be good to have some unit tests to assert that we get the payment sequence that we're expecting (should be reasonable to do with defined activities with set wait times / counts).
simln-lib/src/lib.rs
Outdated
log::info!(
    "Payment count has been met for {}: {c} payments. Stopping the activity."
    , executor.source_info);
return Ok(());
Shouldn't we continue here? If one activity hits its payment_count, it doesn't mean that we're finished with all other payments?
simln-lib/src/lib.rs
Outdated
payment_generator: Box<dyn PaymentGenerator>,
}

struct PaymentEvent {
    wait_time: Duration,
We're going to have to use an absolute time here - my bad, should have realized that earlier in the design process.
Since we process and sleep all in one queue, a relative value doesn't work because we're always starting from zero when we pop an item off.
For example: say we have two activities, one executes every 5 seconds, the other every 10.
- We push two items to the heap, one in 5 seconds one in 10 seconds
- We sleep 5 seconds, then pop the 5 second event and fire it
- We re-generate another event which should sleep 5 seconds
- We'll push that onto the heap, and it's the next soonest event
- We then sleep another 5 seconds, then pop the next 5 second event
We'll continue like this forever, never hitting the 10 second event. If we have absolute times, that won't happen because the 10 second event will eventually bubble up.
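The starvation described above is easy to reproduce with a toy model (illustrative code, not the PR's types): keying the heap by relative wait re-inserts the 5-second event ahead of the 10-second one forever, while keying by absolute fire time lets the 10-second event bubble up.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simulate pops from a min-heap keyed either by relative wait (buggy) or
/// by absolute fire time (fixed). Returns which activity fires at each step.
fn fire_order(use_absolute: bool, steps: usize) -> Vec<&'static str> {
    let activities = [("five", 5u64), ("ten", 10u64)];
    let mut heap = BinaryHeap::new();
    for (i, &(_, wait)) in activities.iter().enumerate() {
        heap.push(Reverse((wait, i))); // first event for each activity
    }
    let mut fired = Vec::new();
    for _ in 0..steps {
        let Reverse((key, i)) = heap.pop().unwrap();
        fired.push(activities[i].0);
        let wait = activities[i].1;
        // Buggy: re-key from zero every pop. Fixed: accumulate absolute time.
        let next = if use_absolute { key + wait } else { wait };
        heap.push(Reverse((next, i)));
    }
    fired
}

fn main() {
    // The relative keying never reaches the 10-second activity.
    println!("relative: {:?}", fire_order(false, 4));
    println!("absolute: {:?}", fire_order(true, 4));
}
```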
Oh!! Right, good catch.
Looking at the latest version - isn't relative time still an issue here?
No, actually - in the latest version, absolute_time is used for heap ordering, whereas wait_time is only for sleeping.
I added two unit tests, one for the defined activities with a defined payment count and the other for testing the random Rng, and the behaviour is as expected 🤔
No, really. In the latest version, absolute_time is used for heap ordering, whereas wait_time is only for sleeping.
Seems like unnecessary complexity to duplicate these values when we could just track absolute_time and sleep for abs time - now?
In each tick of the loop, the now time changes, and the difference between the abs time and now is not the same as the expected wait_time 🤔
Don't we want this? The amount of time that we want to wait for each payment is the time between payments for that node - not the amount of time since we popped it off our heap?
Eg: starting at t0, with the following payments in the heap:
- abs_time = t5
- abs_time = t7

- We start at t0
- Pop t5 off the heap - now = t0
- Wait = t5 - t0 = 5 seconds - dispatch payment after 5 seconds, now = t5
- Pop t7 off the heap
- Wait = t7 - t5 = 2 seconds - dispatch payment after 2 seconds

Here, we've waited a total of 7 seconds to dispatch the payment that was set to go off at absolute time t7. If we tracked the time that we added it to calculate the total wait time, then we would have waited for 7 seconds plus the sum of all the payment waits before it?
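The walkthrough above can be written out as a small model (plain integer seconds; the `dispatch` helper is hypothetical, not the PR's code): on each pop the sleep is `abs_time - now`, clamped at zero, so total elapsed time tracks the absolute schedule.

```rust
/// Walk the t0/t5/t7 example: for each scheduled absolute time, sleep for
/// (abs_time - now) and dispatch. Returns (wait, dispatch_time) pairs.
fn dispatch(mut now: u64, abs_times: &[u64]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for &abs in abs_times {
        // saturating_sub mirrors clamping a negative wait to zero
        // when we're already past the scheduled time.
        let wait = abs.saturating_sub(now);
        now += wait; // "sleep", then dispatch at the absolute target time
        out.push((wait, now));
    }
    out
}

fn main() {
    // Starting at t0 with events scheduled for t5 and t7:
    // waits of 5s then 2s, dispatching at t5 and t7.
    println!("{:?}", dispatch(0, &[5, 7]));
}
```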
You are right @carlaKC, I think I found the problem. Right now, the next payment is generated before the sleep; therefore, the difference between abs time and now is, on some occasions, negative. If the next payment is generated after the sleep part, the difference is always positive.
Using the new approach, there are still problems with negative times, but on the order of -847.822µs, which should be zero.
.. quite difficult dealing with time 😅
Using the new approach, there are still problems with negative times, but in the order of -847.822µs, which should be zero.
We might want to just add some sort of quality check in here that this isn't drifting too far? If we're lagging too far behind then the expected number of payments we simulate will be off. We can just pick a magic number and error out if it's too big.
Happy to leave this for a follow up!
This is the part that is dealing with it https://github.com/bitcoin-dev-project/sim-ln/pull/277/files#diff-887dd37ec76113778ea02ac50b44aa1ee99a96b73e917f2fa1d3e42358792f43R1104
For now, if the times are negative, it will always return 0s.
I also found another issue related to how time measurements are done in Rust. For example, instead of getting 4s, the result will be 3.99324s, and summing up those differences at the end would break the deterministic aspect that we want to achieve.
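A sketch of the clamp-and-round approach being discussed here (an assumed shape; the PR's actual code works from `SystemTime`/`duration_since` rather than plain floats):

```rust
use std::time::Duration;

/// Compute the sleep before dispatching an event scheduled at `abs_secs`
/// when the clock already reads `now_secs`. Clamps negative differences
/// (e.g. -847µs of lag) to zero and rounds to whole seconds so that a
/// measured 3.99324s counts as the intended 4s.
fn sleep_for(abs_secs: f64, now_secs: f64) -> Duration {
    let diff = (abs_secs - now_secs).max(0.0); // never sleep a negative time
    Duration::from_secs(diff.round() as u64)   // round away measurement jitter
}

fn main() {
    println!("{:?}", sleep_for(10.0, 6.00676));   // ~3.99324s rounds up to 4s
    println!("{:?}", sleep_for(10.0, 10.000847)); // slightly late clamps to 0s
}
```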
simln-lib/src/lib.rs
Outdated
} else {
    generate_payments(&mut heap, executor, current_count + 1)?;
}
nit: don't need the else branch if we're continuing
simln-lib/src/lib.rs
Outdated
@@ -1561,6 +1610,31 @@ async fn track_payment_result(
    Ok(())
}

fn generate_payments(
nit: generate_payment? we're only adding one.
simln-lib/src/lib.rs
Outdated
tasks.spawn(async move {
    let source = executor.source_info.clone();
    generate_payments(&mut heap, executor, 0)?;
Looking at this again - I think we can actually just store the payment event in the heap and keep a hashmap of executors / payment counts. When we pop an event off, we just look up the stuff we need in the hashmap and increment the count there, rather than clogging up the heap with large structs.
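A minimal model of this suggestion (all names illustrative; `source` stands in for a `PublicKey`): the heap holds only small `Copy` events, while the heavier per-node state lives in a single map keyed by the event's source.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

// Small, cheap-to-copy heap entry: just the fire time and the source key.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PaymentEvent {
    abs_time: u64,
    source: &'static str, // stand-in for a PublicKey
}

// Heavier per-node state lives in one map, not in every heap entry.
struct NodeState {
    wait_secs: u64,
    payments_completed: u64,
    payment_count: Option<u64>,
}

fn run(mut nodes: HashMap<&'static str, NodeState>, steps: usize) -> Vec<&'static str> {
    let mut heap = BinaryHeap::new();
    // Sort the keys so the initial heap fill is deterministic.
    let mut keys: Vec<_> = nodes.keys().copied().collect();
    keys.sort();
    for k in keys {
        heap.push(Reverse(PaymentEvent { abs_time: nodes[k].wait_secs, source: k }));
    }
    let mut fired = Vec::new();
    for _ in 0..steps {
        let Some(Reverse(ev)) = heap.pop() else { break };
        // Look up what we need in the map and increment the count there.
        let state = nodes.get_mut(ev.source).unwrap();
        state.payments_completed += 1;
        fired.push(ev.source);
        if state.payment_count.map_or(true, |c| state.payments_completed < c) {
            heap.push(Reverse(PaymentEvent {
                abs_time: ev.abs_time + state.wait_secs,
                source: ev.source,
            }));
        }
    }
    fired
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert("alice", NodeState { wait_secs: 5, payments_completed: 0, payment_count: Some(1) });
    nodes.insert("bob", NodeState { wait_secs: 10, payments_completed: 0, payment_count: None });
    println!("{:?}", run(nodes, 3));
}
```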
Force-pushed from 6d7aa41 to aea34ab.
Still thinking about how to add tests to assert that we get the payment sequence that we're expecting 🤔
simln-lib/src/lib.rs
Outdated
if c == payload.current_count {
    log::info!(
        "Payment count has been met for {}: {c} payments. Stopping the activity."
        , executor.source_info);
nit: just to be aware of, cargo fmt breaks in selects for some ungodly reason, so it's worth trying to do a little manual formatting.
Force-pushed from 4b07890 to 3f23420.
I'm finding the design decisions in this PR quite difficult to follow. To me the obvious approach is to queue cheap-to-copy payment events and track node stats/executors in one big hashmap. Happy to hear arguments for this way, but currently not convinced.
simln-lib/src/lib.rs
Outdated
let payment_amount = executor.payment_generator.payment_amount(payload.capacity);
let amount = match payment_amount {
    Ok(amt) => {
        if amt == 0 {
            log::debug!(
                "Skipping zero amount payment for {source} -> {}.", payload.destination
            );
            generate_payment(&mut heap, executor, payload.current_count, payment_event_payloads.clone()).await?;
            continue;
        }
        amt
    },
    Err(e) => {
        return Err(SimulationError::PaymentGenerationError(e));
    },
let payment_amount = executor
.payment_generator
.payment_amount(payload.capacity)
.map_err(SimulationError::PaymentGenerationError)?;
generate_payment(&mut heap, executor, payload.current_count + 1, payment_event_payloads.clone()).await?;
if payment_amount == 0 {
log::debug!(
"Skipping zero amount payment for {source} -> {}.",
payload.destination,
);
continue;
}
Somehow with this change, Rust is not able to infer the Err type and forces me to add return Ok::<(), SimulationError>(()) 🤔
simln-lib/src/lib.rs
Outdated
payment_generator: Box<dyn PaymentGenerator>,
}

struct PaymentEvent {
    wait_time: Duration,
No, really. In the latest version, absolute_time is used for heap ordering, whereas wait_time is only for sleeping.
Seems like unnecessary complexity to duplicate these values when we could just track absolute_time and sleep for abs time - now?
Thanks for the feedback, using one big hashmap for the executors is a much better design.
Force-pushed from c404905 to 8d6dd4a.
Haven't reviewed the tests yet, but looking semantically correct now 👍
No real need for fixups here since this is one commit - feel free to squash what you have and just force push changes, there isn't a commit structure to preserve in this case.
simln-lib/src/lib.rs
Outdated
payment_generator: Box<dyn PaymentGenerator>,
}

struct PaymentEvent {
    pubkey: PublicKey,
nit: pubkey -> source?
simln-lib/src/lib.rs
Outdated
capacity: Option<u64>,
}

struct PaymentEventPayload {
This isn't very intuitively named.
simln-lib/src/lib.rs
Outdated
// Next, we'll spin up our actual producers that will be responsible for triggering the configured activity.
// The producers will use their own TaskTracker so that the simulation can be shutdown if they all finish.
let producer_tasks = TaskTracker::new();
match self
-    .dispatch_producers(activities, consumer_channels, &producer_tasks)
+    .dispatch_producers(activities, event_sender.clone(), &producer_tasks)
unnecessary clone
simln-lib/src/lib.rs
Outdated
@@ -995,18 +1020,17 @@ impl<C: Clock + 'static> Simulation<C> {
    active_nodes.insert(node_info.pubkey, (node_info, capacity));
}

// Create a network generator with a shared RNG for all nodes.
let network_generator = Arc::new(Mutex::new(
    // in order to choose a deterministic destination is necessary to sort the nodes by its public key
nit: Comments should start with capital and end with .
simln-lib/src/lib.rs
Outdated
NetworkGraphView::new(
-    active_nodes.values().cloned().collect(),
+    sorted_actives_nodes,
Here the caller somewhat assumes understanding of the implementation of the network generator.
I think that we should move this sort into NetworkGraphView and then add a note on the trait that implementations should be deterministic if called with the same seed.
simln-lib/src/lib.rs
Outdated
struct PaymentEventPayload {
    executor: ExecutorKit,
    current_count: u64,
Thoughts on making this an Option? Since we don't need it for payments that aren't count-bounded.
I think it is not necessary. It will make the code unnecessarily more verbose. Right now, the counter is only used here.
if let Some(c) = payment_tracker.executor.payment_generator.payment_count() {
if c == payment_tracker.payments_completed {
log::info!( "Payment count has been met for {}: {c} payments. Stopping the activity.", payment_tracker.executor.source_info);
continue
}
}
And it is also used in get_payment_delay, which does not accept an Option.
simln-lib/src/lib.rs
Outdated
    }
},
None => {
    break
Let's add a log here - can see it being useful in debugging
simln-lib/src/lib.rs
Outdated
log::debug!(
    "Skipping zero amount payment for {source} -> {}.", destination
);
generate_payment(&mut heap, pubkey, payload.current_count, &mut payment_event_payloads, SystemTime::now()).await?;
None of the ? errors will be caught now, because we're spawning this in a task? I believe the loop will exit but we won't see the error.
I think that we should pull this out into a function, and then the spawn callsite is responsible for checking whether it exits with an error and triggering shutdown appropriately if it does (like the tasks spawned in internal_run).
The generate_payment is not part of the task.spawn; it is located just before.
As I understand it, the select! macro does not spawn tasks 🤔
I mean that we spawn all of this code in a task on line 1088. Any ? that we use in the block of code that is spawned in the task is just swallowed, so we can silently fail rather than printing out the error.
To try this yourself, add a return(Err()) anywhere in this select - the program shuts down but we can't see the error.
Oh right, I see it now 😅
simln-lib/src/lib.rs
Outdated
// the output of duration_since is not perfectly round affecting the waiting time
// and as consequence the results are not deterministic; for this reason
// it is necessary to round the output.
Duration::from_secs(elapsed.as_secs_f64().round() as u64)
Motivation for using seconds here?
as_millis_f64() is also available, but it is not stable yet:
#[unstable(feature = "duration_millis_float", issue = "122451")]
#[must_use]
#[inline]
#[rustc_const_unstable(feature = "duration_consts_float", issue = "72440")]
pub const fn as_millis_f64(&self) -> f64 {
...
No need then!
simln-lib/src/lib.rs
Outdated
// Only proceed with a payment if the amount is non-zero, otherwise skip this round. If we can't get
// a payment amount something has gone wrong (because we should have validated that we can always
// generate amounts), so we exit.
let amount = payload.executor.payment_generator.payment_amount(capacity).map_err(SimulationError::PaymentGenerationError)?;
Why not generate this along with the rest of the details we push onto the heap?
Doesn't necessarily have to change, just interested in motivation.
🤔 no particular motivation. I am going to move it along with the other details so it is more concise.
Force-pushed from 8d6dd4a to 25cb506.
simln-lib/refactor: fully deterministic produce events
Force-pushed from 25cb506 to 767fe9c.
Description
The goal of this PR is to achieve fully deterministic runs to get reproducible simulations.
Changes
- nodes: HashMap<PublicKey, Arc<Mutex<dyn LightningNode>>>: swap the HashMap for a BTreeMap. A HashMap does not maintain an order, which has an impact when the simulation is running, making the results unpredictable. Using a BTreeMap, the order of the nodes is always the same.
- dispatch_producers acts as a master task, generating all the payments of the nodes, getting the random destination, and only then spawning a thread for producing the events (produce_events).

Addresses #243
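The BTreeMap point above can be seen in a tiny example: iterating a BTreeMap always visits keys in sorted order, whereas HashMap's iteration order depends on a randomized hasher and can change between runs. The keys and values here are illustrative placeholders, not real node pubkeys.

```rust
use std::collections::BTreeMap;

/// Insert nodes in arbitrary order; iteration comes back sorted by key,
/// so any per-node work driven by iteration is reproducible across runs.
fn deterministic_order() -> Vec<&'static str> {
    let mut nodes = BTreeMap::new();
    nodes.insert("03aa...", "carol");
    nodes.insert("02ff...", "bob");
    nodes.insert("02ab...", "alice");
    nodes.values().copied().collect()
}

fn main() {
    // Sorted by the (truncated, made-up) pubkey keys on every run.
    println!("{:?}", deterministic_order());
}
```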