Improving blob propagation post-PeerDAS with Decentralized Blob Building #6268
base: unstable
Conversation
Some early testing results:
Bandwidth-limited fullnode (
This shows EL inbound traffic (fetching blobs from peers) isn't too bad for MAX 6 blobs.
Next steps:
This was originally intended to be experimental, but it's been pretty stable (800 epochs on devnet with 100% participation), and I think we can merge this.
Force-pushed from dbf4bb8 to 01b91cb
Bring the attack to prod
let trusted_setup: TrustedSetup = serde_json::from_reader(TRUSTED_SETUP_BYTES)
    .map_err(|e| format!("Unable to read trusted setup file: {}", e))
    .expect("should have trusted setup");
let kzg = Kzg::new_from_trusted_setup(trusted_setup).expect("should create kzg");
let blob_refs = blobs.iter().collect::<Vec<_>>();
I guess it might make sense for kzg libraries to take an impl Iterator<Item = Blob> to possibly avoid these allocations.
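As a rough sketch of what an iterator-taking API could look like (the `Blob` type and `compute_proofs` function below are invented for illustration, not the actual kzg crate API):

```rust
// Stand-in blob type, purely for illustration.
struct Blob([u8; 4]);

// Hypothetical API shape: accepting any iterator of blob references means
// callers don't need the intermediate `collect::<Vec<_>>()` allocation.
fn compute_proofs<'a>(blobs: impl Iterator<Item = &'a Blob>) -> usize {
    // Placeholder for per-blob proof computation: just count the blobs.
    blobs.count()
}

fn main() {
    let blobs = vec![Blob([0; 4]), Blob([1; 4])];
    // The call site passes `blobs.iter()` directly, no Vec of refs needed.
    let n = compute_proofs(blobs.iter());
    assert_eq!(n, 2);
}
```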
@michaelsproul and I made an attempt today, but it touches too much existing code; we might refactor this in a separate PR.
@jimmygchen Any issues with the kzg libraries?
🙈 Trying not to mention the flag I use for testing. Just realized we need the
Nope, all good so far! I was just flagging the potential challenges I could imagine when we increase the blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks!
Co-authored-by: Michael Sproul <michael@sigmaprime.io>
Force-pushed from bd24cdd to 662d6cf
# Conflicts:
#   beacon_node/beacon_chain/src/beacon_chain.rs
#   beacon_node/beacon_chain/src/block_verification.rs
#   beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

# Conflicts:
#   beacon_node/beacon_chain/src/beacon_chain.rs
@@ -3551,11 +3604,12 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
    self: &Arc<Self>,
    slot: Slot,
    availability: Availability<T::EthSpec>,
    recv: Option<Receiver<DataColumnSidecarList<T::EthSpec>>>,
Can you add docs for this parameter?
// Store the block and its state, and execute the confirmation batch for the intermediate
// states, which will delete their temporary flags.
// If the write fails, revert fork choice to the version from disk, else we can
// end up with blocks in fork choice that are missing from disk.
// See https://github.com/sigp/lighthouse/issues/2028
let (_, signed_block, blobs, data_columns) = signed_block.deconstruct();
let custody_columns_count = self.data_availability_checker.get_custody_columns_count();
let data_columns = data_columns.filter(|columns| columns.len() >= custody_columns_count);
Why is this filter necessary? Why would we have too many columns? And if that is the case, why are the first columns in the vec more important than the later ones?
    }
}

pub fn is_available(&self, custody_column_count: usize) -> bool {
    self.num_expected_blobs()
Pedantic, but should we rename this to block_commitments_count? We technically don't expect blobs in PeerDAS.
.map_or(false, |num_expected_blobs| {
    let all_blobs_received = num_expected_blobs == self.num_received_blobs();
    let all_columns_received = num_expected_blobs == 0
        || custody_column_count == self.num_received_data_columns();
beacon_chain code seems to imply that we can have too many columns, should this be
self.num_received_data_columns() >= custody_column_count
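A minimal standalone sketch of the suggested `>=` check (simplified free-function signature, not the PR's actual method):

```rust
// Simplified version of the column-availability check: with `>=`,
// receiving *more* than the custody count still counts as available,
// matching the reviewer's suggestion above.
fn all_columns_received(
    num_expected_blobs: usize,
    num_received_data_columns: usize,
    custody_column_count: usize,
) -> bool {
    num_expected_blobs == 0 || num_received_data_columns >= custody_column_count
}

fn main() {
    // Exactly the custody count: available.
    assert!(all_columns_received(3, 64, 64));
    // More than the custody count (the case `==` would wrongly reject).
    assert!(all_columns_received(3, 70, 64));
    // Fewer than the custody count: not available.
    assert!(!all_columns_received(3, 10, 64));
    // No blobs expected: trivially available.
    assert!(all_columns_received(0, 0, 64));
}
```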
pending_components.verified_data_columns.len() >= num_of_columns / 2;
is_super_node
    && has_missing_columns
    && is_reconstruction_possible
Yeah, these two branches conflict a lot and I'm planning to resolve them once one of them is merged.
Another conflict is at the trigger point. They're both triggered when we have missing components, and we'd need to decide whether to fetch from the EL or perform reconstruction in the same place. The benchmarks for both are very similar, but I think we should prioritise reconstruction if we have 50% of the columns, so we don't have to try fetching from the EL first and then fall back to reconstruction.
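The prioritisation described above could be sketched roughly like this (the enum and function names are invented for illustration, not code from either branch):

```rust
#[derive(Debug, PartialEq)]
enum RecoveryStrategy {
    Reconstruct,
    FetchFromEl,
}

// Prefer reconstruction when a supernode already holds at least half of
// the columns (reconstruction is then possible); otherwise try fetching
// blobs from the EL.
fn choose_strategy(
    is_super_node: bool,
    columns_received: usize,
    num_of_columns: usize,
) -> RecoveryStrategy {
    if is_super_node && columns_received >= num_of_columns / 2 {
        RecoveryStrategy::Reconstruct
    } else {
        RecoveryStrategy::FetchFromEl
    }
}

fn main() {
    assert_eq!(choose_strategy(true, 64, 128), RecoveryStrategy::Reconstruct);
    assert_eq!(choose_strategy(true, 63, 128), RecoveryStrategy::FetchFromEl);
    // Non-supernodes can't reconstruct, so they always go to the EL.
    assert_eq!(choose_strategy(false, 128, 128), RecoveryStrategy::FetchFromEl);
}
```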
"No blobs fetched from the EL"; | ||
"num_expected_blobs" => num_expected_blobs, | ||
); | ||
inc_counter(&metrics::BLOBS_FROM_EL_MISS_TOTAL); |
The metric BLOBS_FROM_EL_MISS_TOTAL is being used with a different meaning on each if branch: here it means 0 fetched blobs, while in PeerDAS it means != the expected total. Also, we could add another histogram to observe the count of missing blobs out of the expected total.
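A tiny helper for the histogram suggestion might look like this (hypothetical, not in the PR); the returned count would be what gets passed to a histogram's observe call:

```rust
// Number of expected blobs that were missing from the EL response.
// `saturating_sub` guards against the (unexpected) case of the EL
// returning more blobs than expected.
fn missing_blob_count(num_expected_blobs: usize, num_fetched_blobs: usize) -> usize {
    num_expected_blobs.saturating_sub(num_fetched_blobs)
}

fn main() {
    assert_eq!(missing_blob_count(6, 4), 2); // partial hit
    assert_eq!(missing_blob_count(6, 0), 6); // total miss
    assert_eq!(missing_blob_count(6, 6), 0); // full hit
}
```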
return Ok(None);
}

inc_counter(&metrics::BLOBS_FROM_EL_HIT_TOTAL);
Again, this has a very different meaning to the metric in PeerDAS.
let batch_size =
    data_columns_to_publish.len() / SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES;

for batch in data_columns_to_publish.chunks(batch_size) {
Should de-duplicate this from beacon_node/beacon_chain/src/fetch_blobs.rs if we are following the same batch strategy.
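If de-duplicated, the shared helper might look something like this (a hypothetical shape; the constant value and the `publish` callback are placeholders). One detail worth guarding in a shared version: `chunks(0)` panics, and the division yields 0 whenever the list is shorter than the number of batches, hence the `.max(1)`:

```rust
const SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES: usize = 4;

// Publish `items` in roughly SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES
// batches, invoking `publish` once per batch.
fn publish_in_batches<T>(items: &[T], mut publish: impl FnMut(&[T])) {
    // Clamp to 1 so `chunks()` never receives a zero batch size.
    let batch_size = (items.len() / SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES).max(1);
    for batch in items.chunks(batch_size) {
        publish(batch);
    }
}

fn main() {
    let mut batches = 0;
    publish_in_batches(&[1, 2, 3, 4, 5, 6, 7, 8], |_| batches += 1);
    assert_eq!(batches, 4); // 8 items / 4 batches = batch size 2

    let mut small = 0;
    publish_in_batches(&[1, 2], |_| small += 1);
    assert_eq!(small, 2); // batch size clamped to 1, no panic
}
```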
);
self.chain.recompute_head_at_current_slot().await;
}
AvailabilityProcessingStatus::MissingComponents(_, _) => {
Yes, I think it's possible before PeerDAS if the EL returns more than 0 blobs but not all of them.
    );
    }
},
Ok(None) => { /* No blobs fetched from the EL. Reasons logged separately. */ }
Is it worth it to still log a debug here?
Issue Addressed
Built on top of #5829, an optimisation proposed by @michaelsproul to fetch blobs from the EL to reduce the delay to block import.
This PR goes further to publish the blobs to the network, which helps improve the resiliency of the network by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self-building proposer bandwidth issue discussed on the R&D Discord and described in @dankrad's post here.
The benefits of this proposal are:
Proposed Changes
fetch_blobs_and_publish is triggered after a node has processed a gossip / RPC block and is still missing blob components. Once the node fetches the blobs from the EL, it then publishes the remaining blobs that it hasn't seen on gossip to the network.

Next steps:
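The "publish only what gossip hasn't delivered" step described above could be sketched as follows (types are simplified stand-ins; indices plus byte vectors stand in for blob sidecars):

```rust
// Given blobs fetched from the EL (keyed by blob index) and the indices
// already observed on gossip, keep only the ones the node hasn't seen.
fn blobs_to_publish(
    fetched_from_el: Vec<(u64, Vec<u8>)>,
    seen_on_gossip: &[u64],
) -> Vec<(u64, Vec<u8>)> {
    fetched_from_el
        .into_iter()
        .filter(|(index, _)| !seen_on_gossip.contains(index))
        .collect()
}

fn main() {
    let fetched = vec![(0, vec![0xaa]), (1, vec![0xbb]), (2, vec![0xcc])];
    let to_publish = blobs_to_publish(fetched, &[1]);
    // Blob 1 was already seen on gossip, so only 0 and 2 are published.
    assert_eq!(to_publish.len(), 2);
    assert_eq!(to_publish[0].0, 0);
    assert_eq!(to_publish[1].0, 2);
}
```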
Challenges:
- The KZG libraries (c-kzg-4844 and rust-eth-kzg) may struggle with constructing a large number of cells and proofs at once due to the current memory allocation approach.
- (eth/68) hence doesn't incur the same gossip amplification cost (8x) on the CL.

TODO before merging
- engine_getBlobsV1 ethereum/consensus-specs#3864
- engine_getBlobsV1 ethereum/execution-apis#559

Reference:
unstable
Get blobs from the EL's blob pool #5829