Improving blob propagation post-PeerDAS with Decentralized Blob Building #6268

Open
wants to merge 14 commits into
base: unstable

jimmygchen
Member

@jimmygchen jimmygchen commented Aug 16, 2024

Issue Addressed

Built on top of #5829, an optimization proposed by @michaelsproul that fetches blobs from the EL to reduce the delay to block import.

This PR goes further by publishing the blobs to the network, which improves network resiliency by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self-building proposer bandwidth issue discussed on R&D Discord and described in @dankrad's post here.

The benefits of this proposal are:

  • Reduces block import latency: nodes can retrieve blobs from EL without waiting for them from gossip, hence making blocks attestable earlier.
  • Improves blob propagation and network resiliency: blob propagation work is spread out from one node to the entire network, which reduces the likelihood of missed blocks due to delays in propagation.
  • Allows scaling without sacrificing decentralization: nodes with more resources will participate in blob building and propagation, allowing nodes with limited bandwidth to continue to produce blocks post-PeerDAS.

Proposed Changes

  • Deneb: fetch_blobs_and_publish is triggered after a node has processed a gossip / RPC block and is still missing blob components. Once the node has fetched the blobs from the EL, it publishes to the network the remaining blobs that it hasn't seen on gossip.
  • PeerDAS: Same trigger as above, however only supernodes will publish data columns that are unseen on gossip to the network.
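
The Deneb selection step described above can be sketched roughly as follows. This is only an illustration, not the PR's implementation: the function name `blobs_to_publish` and the use of plain `u64` blob indices are assumptions made for the example.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: given the blob indices the block commits to, the
/// indices returned by the EL, and the indices already seen on gossip,
/// return the indices this node would publish.
fn blobs_to_publish(
    expected: &BTreeSet<u64>,
    fetched_from_el: &BTreeSet<u64>,
    seen_on_gossip: &BTreeSet<u64>,
) -> Vec<u64> {
    fetched_from_el
        .iter()
        .copied()
        // Only consider blobs that belong to this block...
        .filter(|index| expected.contains(index))
        // ...and that have not already been observed on gossip.
        .filter(|index| !seen_on_gossip.contains(index))
        .collect()
}
```

With four expected blobs, all four returned by the EL, and indices 1 and 3 already seen on gossip, the sketch would select indices 0 and 2 for publishing.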

Next steps:

  • To maintain low bandwidth for smaller stakers (single-validator BN), we could allow some optimisation of block publish behaviour for these nodes only. There are some strategies proposed by @cskiraly to bring the outbound bandwidth requirements for a 32-blob block to the same level as Deneb (6 blobs). However, this wouldn't be recommended for nodes with enough bandwidth.
  • Collect some realistic metrics for a network with 32 blobs per block.

Challenges:

  • Current KZG libraries (c-kzg-4844 and rust-eth-kzg) may struggle with constructing a large number of cells and proofs at once due to the current memory allocation approach.
  • Even if we are able to reduce the bandwidth usage on the CL side, the bandwidth challenge remains on the EL side, as the node still needs to pull the blob transactions into its mempool, though to a lesser extent, because:
    • it's dealing with raw blobs (4096 KB for 32 blobs) rather than erasure-coded blobs
    • it's pull-based (eth/68), hence doesn't incur the same gossip amplification cost (8x) as on the CL.
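
The arithmetic behind the two sub-points can be written out as a back-of-the-envelope sketch. The blob size (4096 field elements of 32 bytes) comes from EIP-4844 and the 8x amplification is the figure quoted above; the 2x erasure-coding factor and the choice to simply multiply the factors together are assumptions of this sketch, which also ignores proofs and protocol overhead.

```rust
/// Bytes per blob: 4096 field elements of 32 bytes each (EIP-4844).
const BYTES_PER_BLOB: u64 = 4096 * 32;
/// Assumed extension factor for erasure-coded blobs (not stated in this PR).
const ERASURE_CODING_FACTOR: u64 = 2;
/// Gossip amplification factor quoted in the PR description.
const GOSSIP_AMPLIFICATION: u64 = 8;

/// Raw blob data (in KiB) the EL pulls into its mempool for one block.
fn el_raw_blob_kib(blob_count: u64) -> u64 {
    blob_count * BYTES_PER_BLOB / 1024
}

/// Rough upper bound (in KiB) on CL gossip traffic for the same block,
/// obtained by multiplying the raw size by both factors above.
fn cl_gossip_upper_bound_kib(blob_count: u64) -> u64 {
    el_raw_blob_kib(blob_count) * ERASURE_CODING_FACTOR * GOSSIP_AMPLIFICATION
}
```

Under these assumptions, `el_raw_blob_kib(32)` gives the 4096 figure quoted above, while the CL-side bound is sixteen times larger.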

TODO before merging

Reference:

@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress das Data Availability Sampling labels Aug 16, 2024
@jimmygchen jimmygchen marked this pull request as ready for review August 16, 2024 07:47
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Aug 16, 2024
@jimmygchen
Member Author

jimmygchen commented Aug 19, 2024

Some early testing results:

  • Proposers withholding all blobs propose blocks with blobs with 100% success rate
  • No outbound bandwidth spike for the full nodes with limited upload

Bandwidth-limited fullnode (cl-01) vs supernode (cl-02):

image
(Thanks to @KatyaRyazantseva for the dashboard above☺️ )

This shows that EL inbound traffic (fetching blobs from peers) isn't too bad with a maximum of 6 blobs.
The outbound traffic for the EL is less relevant here because it includes sending blobs to the CL.

image

Next steps:

  • Add more metrics
    • Blocks made available via EL blobs
    • Number of blobs / data columns from EL blobs published
    • EL blob fetch timing
    • Compute cells and proof time
  • Make MAX_BLOBS_PER_BLOCK configurable
  • Try 32 blobs per block
    • EL gas constant update
    • Potential update on derived configs
    • Potential batching of KZG computation to avoid overflow
  • Add --limit-blob-publish (single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns

@jimmygchen
Member Author

This was originally intended to be experimental, but it's been pretty stable (800 epochs on devnet with 100% participation), and I think we can merge this.
I'll address the review comments, thanks @dapplion for the review 🙏

@dapplion
Collaborator

Add --limit-blob-publish (single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns

Bring the attack to prod


let trusted_setup: TrustedSetup = serde_json::from_reader(TRUSTED_SETUP_BYTES)
    .map_err(|e| format!("Unable to read trusted setup file: {}", e))
    .expect("should have trusted setup");
let kzg = Kzg::new_from_trusted_setup(trusted_setup).expect("should create kzg");

let blob_refs = blobs.iter().collect::<Vec<_>>();
Contributor

I guess it might make sense for KZG libraries to take in an impl Iterator<Item = Blob> to possibly avoid these allocations
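
A minimal sketch of what such an iterator-accepting signature could look like is shown below. The `Blob` type and `total_blob_bytes` are stand-ins invented for this example, not the API of c-kzg-4844 or rust-eth-kzg, and the sketch iterates over references rather than owned blobs so that nothing needs to be cloned.

```rust
/// Stand-in for a blob type; real KZG libraries define their own.
struct Blob(Vec<u8>);

/// Hypothetical function: accepts anything that can be iterated as blob
/// references, so callers do not have to collect a `Vec<&Blob>` first.
/// Summing lengths stands in for the real per-blob computation.
fn total_blob_bytes<'a>(blobs: impl IntoIterator<Item = &'a Blob>) -> usize {
    blobs.into_iter().map(|blob| blob.0.len()).sum()
}
```

A caller holding a `Vec<Blob>` could then pass `&blobs` or `blobs.iter()` directly, without building an intermediate vector of references.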

Member Author

@michaelsproul and I made an attempt today, but it touches too much existing code; we might refactor this in a separate PR.

@kevaundray
Contributor

@jimmygchen Any issues with the kzg libraries?

@jimmygchen
Member Author

Bring the attack to prod

🙈 Trying to not mention the flag I use for testing. Luckily git blame on that line shows someone else's name haha

Just realized we need the execution-apis PR to be merged before merging this, although there's probably no harm in getting this one merged as soon as the inclusion is confirmed, as the fetch_blobs function handles this gracefully.

beacon_node/beacon_chain/src/fetch_blobs.rs (outdated, resolved)
beacon_node/beacon_chain/src/fetch_blobs.rs (outdated, resolved)
beacon_node/beacon_chain/src/fetch_blobs.rs (outdated, resolved)
beacon_node/beacon_chain/src/fetch_blobs.rs (resolved)
beacon_node/beacon_chain/src/beacon_chain.rs (outdated, resolved)
beacon_node/beacon_chain/src/fetch_blobs.rs (outdated, resolved)
beacon_node/beacon_chain/src/fetch_blobs.rs (outdated, resolved)
beacon_node/network/src/network_beacon_processor/mod.rs (outdated, resolved)
consensus/types/src/beacon_block_body.rs (resolved)
@jimmygchen
Member Author

@jimmygchen Any issues with the kzg libraries?

Nope all good so far! I was just flagging the potential challenges that I could imagine when we increase blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks ☺️

@jimmygchen jimmygchen added spec_change A change related to the Eth2 spec optimization Something to make Lighthouse run more efficiently. labels Aug 20, 2024
@jimmygchen jimmygchen mentioned this pull request Aug 21, 2024
51 tasks
@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Aug 22, 2024
@mergify mergify bot deleted the branch sigp:unstable August 27, 2024 04:10
@mergify mergify bot closed this Aug 27, 2024
@michaelsproul michaelsproul reopened this Aug 27, 2024
@michaelsproul michaelsproul changed the base branch from das to unstable August 27, 2024 04:21
@jimmygchen
Member Author

engine_getBlobsV1 now implemented in

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Aug 29, 2024
# Conflicts:
#	beacon_node/beacon_chain/src/beacon_chain.rs
#	beacon_node/beacon_chain/src/block_verification.rs
#	beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
# Conflicts:
#	beacon_node/beacon_chain/src/beacon_chain.rs
@@ -3551,11 +3604,12 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
self: &Arc<Self>,
slot: Slot,
availability: Availability<T::EthSpec>,
recv: Option<Receiver<DataColumnSidecarList<T::EthSpec>>>,
Collaborator

Can you add docs for this parameter?

// Store the block and its state, and execute the confirmation batch for the intermediate
// states, which will delete their temporary flags.
// If the write fails, revert fork choice to the version from disk, else we can
// end up with blocks in fork choice that are missing from disk.
// See https://github.com/sigp/lighthouse/issues/2028
let (_, signed_block, blobs, data_columns) = signed_block.deconstruct();
let custody_columns_count = self.data_availability_checker.get_custody_columns_count();
let data_columns = data_columns.filter(|columns| columns.len() >= custody_columns_count);
Collaborator

Why is this filter necessary? Why would we have too many columns? If that is the case why are the first columns in the vec more important than the latter?
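
For context on the question, the snippet below sketches what a filter of this shape does, assuming `data_columns` is an `Option` wrapping a list, as the closure suggests. The function name and the `u64` stand-in for a column are hypothetical.

```rust
/// Mirrors the shape of the filter under review, with `u64` standing in
/// for a data column.
fn keep_if_enough_columns(
    columns: Option<Vec<u64>>,
    custody_columns_count: usize,
) -> Option<Vec<u64>> {
    // `Option::filter` returns `Some(v)` unchanged when the predicate holds
    // and `None` otherwise; it never truncates or reorders `v`.
    columns.filter(|cols| cols.len() >= custody_columns_count)
}
```

Under that assumption, the filter would not favour the first columns over later ones; it would either keep the whole list or discard it entirely when there are fewer columns than the custody count.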

}
}
pub fn is_available(&self, custody_column_count: usize) -> bool {
self.num_expected_blobs()
Collaborator

Pedantic, but should we rename this to block_commitments_count? Technically, we don't expect blobs in PeerDAS.

.map_or(false, |num_expected_blobs| {
    let all_blobs_received = num_expected_blobs == self.num_received_blobs();
    let all_columns_received = num_expected_blobs == 0
        || custody_column_count == self.num_received_data_columns();
Collaborator

beacon_chain code seems to imply that we can have too many columns, should this be

self.num_received_data_columns() >= custody_column_count
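
A simplified sketch of the column check with the suggested `>=` comparison is given below. `PendingColumns` and its fields are hypothetical stand-ins for the real pending-components state, and only the column half of the availability check is modelled.

```rust
/// Hypothetical, simplified stand-in for the pending-components state.
struct PendingColumns {
    num_expected_blobs: Option<usize>,
    num_received_data_columns: usize,
}

impl PendingColumns {
    /// Column check using `>=`, so that receiving more columns than the
    /// custody count still counts as "all columns received".
    fn all_columns_received(&self, custody_column_count: usize) -> bool {
        self.num_expected_blobs.map_or(false, |num_expected_blobs| {
            num_expected_blobs == 0
                || self.num_received_data_columns >= custody_column_count
        })
    }
}
```

With a strict `==`, a node holding more columns than its custody count would never be reported as available; the `>=` form avoids that case if surplus columns are indeed possible.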

pending_components.verified_data_columns.len() >= num_of_columns / 2;
is_super_node
&& has_missing_columns
&& is_reconstruction_possible
Collaborator

Member Author

Yeah, these two branches conflict a lot, and I'm planning to resolve the conflicts once one of them is merged.

Another conflict is at the trigger point. They're both triggered when we have missing components, and we'd need to decide in the same place whether to fetch from the EL or perform reconstruction. The benchmarks for both are very similar, but I think we should prioritise reconstruction if we have 50% of the columns, so that we don't have to try fetching from the EL first and then fall back to reconstruction.
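
The prioritisation described here could be sketched as a single decision function. The enum, function name, and parameters below are hypothetical, and the real trigger point would have more inputs; the conditions reuse the `is_super_node`, `has_missing_columns`, and `is_reconstruction_possible` checks from the snippet under review.

```rust
#[derive(Debug, PartialEq)]
enum RecoveryAction {
    /// Nothing is missing.
    NothingToDo,
    /// At least half of the columns are present: reconstruct locally.
    Reconstruct,
    /// Otherwise, try the EL.
    FetchFromEl,
}

/// Hypothetical decision at the shared trigger point.
fn choose_recovery_action(
    is_super_node: bool,
    received_columns: usize,
    total_columns: usize,
) -> RecoveryAction {
    let has_missing_columns = received_columns < total_columns;
    let is_reconstruction_possible = received_columns >= total_columns / 2;

    if !has_missing_columns {
        RecoveryAction::NothingToDo
    } else if is_super_node && is_reconstruction_possible {
        RecoveryAction::Reconstruct
    } else {
        RecoveryAction::FetchFromEl
    }
}
```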

"No blobs fetched from the EL";
"num_expected_blobs" => num_expected_blobs,
);
inc_counter(&metrics::BLOBS_FROM_EL_MISS_TOTAL);
Collaborator

The metric BLOBS_FROM_EL_MISS_TOTAL is being used with different meaning on each if branch. Here it means 0 fetched blobs, in peerdas it means != total.

Also, we could add another histogram to observe the count of missing blobs from the expected total.
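
One way to give the hit and miss counters a single meaning across both branches is to classify the fetch result first and derive the metrics from that. The enum and function below are a hypothetical sketch and are not part of the PR or of Lighthouse's metrics module.

```rust
#[derive(Debug, PartialEq)]
enum ElFetchOutcome {
    /// The EL returned no blobs at all.
    Miss,
    /// The EL returned some, but not all, expected blobs.
    Partial { missing: usize },
    /// The EL returned every expected blob.
    Hit,
}

/// Hypothetical classification of an EL blob fetch.
fn classify_el_fetch(num_expected_blobs: usize, num_fetched_blobs: usize) -> ElFetchOutcome {
    if num_fetched_blobs == 0 {
        ElFetchOutcome::Miss
    } else if num_fetched_blobs < num_expected_blobs {
        ElFetchOutcome::Partial {
            missing: num_expected_blobs - num_fetched_blobs,
        }
    } else {
        ElFetchOutcome::Hit
    }
}
```

The `missing` count in the partial case is the value that the suggested histogram would observe.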

return Ok(None);
}

inc_counter(&metrics::BLOBS_FROM_EL_HIT_TOTAL);
Collaborator

Again, very different meaning to the metric in peerdas

let batch_size =
    data_columns_to_publish.len() / SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES;

for batch in data_columns_to_publish.chunks(batch_size) {
Collaborator

Should de-duplicate this from beacon_node/beacon_chain/src/fetch_blobs.rs if we are following the same batch strategy
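
If the batching is shared, it could live in one small helper along the lines of the sketch below. The constant's value and the helper name are assumptions, since the PR's actual value is not shown in this thread. The clamp to 1 is there because `slice::chunks` panics when given a chunk size of 0, which integer division would produce if the list were ever shorter than the batch count.

```rust
/// Hypothetical shared constant; the PR's actual value is not shown here.
const PUBLICATION_BATCHES: usize = 4;

/// Hypothetical shared helper: split the items to publish into batches.
fn publication_batches<T>(items: &[T]) -> std::slice::Chunks<'_, T> {
    // Integer division can yield 0 for short lists, so clamp to at least 1.
    let batch_size = (items.len() / PUBLICATION_BATCHES).max(1);
    items.chunks(batch_size)
}
```

Note that dividing the length by the batch count and then chunking can yield more batches than the constant when the length is not an exact multiple of it.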

);
self.chain.recompute_head_at_current_slot().await;
}
AvailabilityProcessingStatus::MissingComponents(_, _) => {
Collaborator

Yes I think it's possible before peerdas if the EL returns more than 0 blobs but not all of them

);
}
},
Ok(None) => { /* No blobs fetched from the EL. Reasons logged separately. */ }
Collaborator

Is it worth it to still log a debug here?

@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Sep 19, 2024
Labels
das Data Availability Sampling optimization Something to make Lighthouse run more efficiently. spec_change A change related to the Eth2 spec waiting-on-author The reviewer has suggested changes and awaits their implementation.
4 participants