Improving blob propagation post-PeerDAS with Decentralized Blob Building #6268

jimmygchen · 2024-08-16T01:30:56Z

Issue Addressed

Built on top of #5829, an optimization proposed by @michaelsproul that fetches blobs from the EL to reduce the delay to block import.

This PR goes further and publishes the blobs to the network, which improves the resiliency of the network by having nodes with more resources contribute to blob propagation. This experimental solution is an attempt to solve the self-building proposer bandwidth issue discussed on R&D Discord and described in @dankrad's post here.

The benefits of this proposal are:

Reduces block import latency: nodes can retrieve blobs from the EL without waiting for them on gossip, making blocks attestable earlier.
Improves blob propagation and network resiliency: blob propagation work is spread out from 1 node to the entire network, which reduces the likelihood of missed blocks due to delays in propagation.
Allows scaling without sacrificing decentralization: nodes with more resources will participate in blob building and propagation, allowing nodes with limited bandwidth to continue to produce blocks post-PeerDAS.

Proposed Changes

Deneb: fetch_blobs_and_publish is triggered after a node has processed a gossip / rpc block and is still missing blob components. Once the node fetches the blobs from the EL, it publishes the remaining blobs that it hasn't seen on gossip to the network.
PeerDAS: Same trigger as above, however only supernodes will publish data columns that are unseen on gossip to the network.
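A minimal sketch of the Deneb publish-selection step described above, assuming a simplified blob type and a set of blob indices already observed on gossip. The names `FetchedBlob` and `blobs_to_publish` are hypothetical and are not taken from this PR.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a blob sidecar fetched from the EL.
#[derive(Debug, Clone, PartialEq)]
pub struct FetchedBlob {
    pub index: u64,
    pub data: Vec<u8>,
}

/// Returns the fetched blobs whose indices have not been seen on gossip yet.
/// Blobs already seen on gossip are dropped so they are not re-published.
pub fn blobs_to_publish(
    fetched: Vec<FetchedBlob>,
    seen_on_gossip: &HashSet<u64>,
) -> Vec<FetchedBlob> {
    fetched
        .into_iter()
        .filter(|blob| !seen_on_gossip.contains(&blob.index))
        .collect()
}
```

With blobs 0, 1 and 2 fetched from the EL and index 1 already seen on gossip, only blobs 0 and 2 would be published.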

Next steps:

To maintain low bandwidth for smaller stakers (single validator BN), we could allow some optimisation on block publish behaviour for these nodes only. There are some strategies proposed by @cskiraly to bring the outbound bandwidth requirements for a 32 blobs block to the same level as Deneb (6 blobs). However this wouldn't be recommended for nodes with enough bandwidth.
Collect some realistic metrics for a network with 32 blobs per block.

Challenges:

Current KZG libraries (c-kzg-4844 and rust-eth-kzg) may struggle with constructing a large number of cells and proofs at once due to the current memory allocation approach.
Even if we are able to reduce the bandwidth usage on the CL side, the bandwidth challenge remains on the EL side, as the node still needs to pull the blob transactions into its mempool, though to a lesser extent, because:
- it's dealing with raw blobs (4096 KiB for 32 blobs) rather than erasure-coded blobs
- it's pull-based (eth/68) and hence doesn't incur the same gossip amplification cost (8x) as on the CL.
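The 4096 figure above can be checked with a short calculation. This is a sketch only: the blob size constants follow EIP-4844, while `cl_gossip_estimate_kib` and its parameters are hypothetical, and multiplying the erasure-coding extension by the 8x amplification is an assumption made here for a rough comparison, not a measurement from this PR.

```rust
/// Field elements per blob and bytes per field element (EIP-4844).
pub const FIELD_ELEMENTS_PER_BLOB: u64 = 4096;
pub const BYTES_PER_FIELD_ELEMENT: u64 = 32;

/// Size of one raw blob in bytes (131072 bytes = 128 KiB).
pub const fn blob_size_bytes() -> u64 {
    FIELD_ELEMENTS_PER_BLOB * BYTES_PER_FIELD_ELEMENT
}

/// Raw blob payload for a block with `blob_count` blobs, in KiB.
pub const fn raw_blobs_kib(blob_count: u64) -> u64 {
    blob_count * blob_size_bytes() / 1024
}

/// Rough CL-side estimate in KiB (assumption): raw payload multiplied by an
/// erasure-coding extension factor and a gossip amplification factor, both
/// supplied by the caller.
pub const fn cl_gossip_estimate_kib(
    blob_count: u64,
    extension_factor: u64,
    amplification: u64,
) -> u64 {
    raw_blobs_kib(blob_count) * extension_factor * amplification
}
```

Under these assumptions, `raw_blobs_kib(32)` is 4096 KiB on the EL side, whereas `cl_gossip_estimate_kib(32, 2, 8)` would be 65536 KiB for a node that gossips everything itself.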

TODO before merging

Wait for spec to be agreed on by EL clients
- P2P clarifications when introducing engine_getBlobsV1 ethereum/consensus-specs#3864
- Define engine_getBlobsV1 ethereum/execution-apis#559

Reference:

@michaelsproul's original PR targeting unstable Get blobs from the EL's blob pool #5829
R&D Discord discussion thread
How to help self builders with blobs by @dankrad
Fetch blobs from EL pool by @dapplion

beacon_node/beacon_chain/src/beacon_chain.rs

beacon_node/beacon_chain/src/fetch_blobs.rs

consensus/types/src/blob_sidecar.rs

jimmygchen · 2024-08-19T11:41:38Z

Some early testing results:

Proposers withholding all blobs propose blocks with blobs with 100% success rate
No outbound bandwidth spike for the full nodes with limited upload

Bandwidth-limited fullnode (cl-01) vs supernode (cl-02):

(Thanks to @KatyaRyazantseva for the dashboard above☺️ )

This shows EL inbound traffic (fetching blobs from peers) isn't too bad for MAX 6 blobs
The outbound traffic for EL is less relevant here because it includes sending blobs to CL.

Next steps:

Add more metrics
- Blocks made available via EL blobs
- Number of blobs / data columns from EL blobs published
- EL blob fetch timing
- Compute cells and proof time
Make MAX_BLOBS_PER_BLOCK configurable
Try 32 blobs per block
- EL gas constant update
- Potential update on derived configs
- Potential batching of KZG computation to avoid overflow
Add --limit-blob-publish (single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns

jimmygchen · 2024-08-20T01:29:12Z

This was originally intended to be experimental, but it's been pretty stable (800 epochs on devnet with 100% participation), and I think we can merge this.
I'll address the review comments, thanks @dapplion for the review 🙏

dapplion · 2024-08-20T11:25:08Z

Add --limit-blob-publish (single validator only), which allows for lower mesh peers for data columns topics and withholding certain amount of data columns

Bring the attack to prod

kevaundray · 2024-08-20T11:40:07Z

beacon_node/beacon_chain/src/kzg_utils.rs


        let trusted_setup: TrustedSetup = serde_json::from_reader(TRUSTED_SETUP_BYTES)
            .map_err(|e| format!("Unable to read trusted setup file: {}", e))
            .expect("should have trusted setup");
        let kzg = Kzg::new_from_trusted_setup(trusted_setup).expect("should create kzg");

+        let blob_refs = blobs.iter().collect::<Vec<_>>();


I guess it might make sense for kzg libraries to take in an impl Iterator<Item = Blob> to possibly avoid these allocations

@michaelsproul and I made an attempt today, but it touches too much existing code; we might refactor this in a separate PR.
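To illustrate the allocation being discussed, here is a minimal sketch contrasting a slice-of-references signature with an iterator-based one. `Blob`, `total_len_from_refs` and `total_len_from_iter` are hypothetical stand-ins, not the actual c-kzg-4844 or rust-eth-kzg API, and the sketch uses `IntoIterator<Item = &Blob>` rather than the owned `Item = Blob` mentioned above.

```rust
/// Hypothetical blob type; real KZG libraries use fixed-size byte arrays.
pub struct Blob(pub Vec<u8>);

/// Slice-of-references style: the caller must first collect a `Vec<&Blob>`,
/// which is the extra allocation seen in the diff above.
pub fn total_len_from_refs(blobs: &[&Blob]) -> usize {
    blobs.iter().map(|b| b.0.len()).sum()
}

/// Iterator style: accepts anything that yields `&Blob`, so the caller can
/// pass `&vec` or `vec.iter()` without building an intermediate Vec.
pub fn total_len_from_iter<'a>(blobs: impl IntoIterator<Item = &'a Blob>) -> usize {
    blobs.into_iter().map(|b| b.0.len()).sum()
}
```

The second form lets the caller skip `blobs.iter().collect::<Vec<_>>()` entirely, at the cost of a generic signature in the library.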

kevaundray · 2024-08-20T11:40:21Z

@jimmygchen Any issues with the kzg libraries?

jimmygchen · 2024-08-20T11:49:41Z

Bring the attack to prod

🙈 Trying to not mention the flag I use for testing. Luckily git blame on that line shows someone else's name haha

Just realized we need the execution-apis PR to be merged before merging this, although there's probably no harm in getting this one merged as soon as the inclusion is confirmed, as the fetch_blobs function handles this gracefully.

beacon_node/beacon_chain/src/fetch_blobs.rs

beacon_node/beacon_chain/src/beacon_chain.rs

beacon_node/beacon_chain/src/fetch_blobs.rs

beacon_node/network/src/network_beacon_processor/mod.rs

beacon_node/network/src/network_beacon_processor/sync_methods.rs

consensus/types/src/beacon_block_body.rs

jimmygchen · 2024-08-20T12:03:16Z

@jimmygchen Any issues with the kzg libraries?

Nope all good so far! I was just flagging the potential challenges that I could imagine when we increase blob count, but I haven't actually run into any issues yet. Will keep you updated, thanks ☺️

jimmygchen · 2024-08-28T00:33:36Z

engine_getBlobsV1 now implemented in

Nethermind: Add engine_getBlobsV1 NethermindEth/nethermind#7322 (merged)
Reth: Implement engine_getBlobsV1 paradigmxyz/reth#9723 (open)


beacon_node/beacon_chain/src/fetch_blobs.rs

dapplion · 2024-09-18T06:07:44Z

beacon_node/beacon_chain/src/beacon_chain.rs

@@ -3551,11 +3604,12 @@ impl<T: BeaconChainTypes> BeaconChain<T> {
        self: &Arc<Self>,
        slot: Slot,
        availability: Availability<T::EthSpec>,
+        recv: Option<Receiver<DataColumnSidecarList<T::EthSpec>>>,


Can you add docs for this parameter?

dapplion · 2024-09-18T06:10:03Z

beacon_node/beacon_chain/src/beacon_chain.rs

        // Store the block and its state, and execute the confirmation batch for the intermediate
        // states, which will delete their temporary flags.
        // If the write fails, revert fork choice to the version from disk, else we can
        // end up with blocks in fork choice that are missing from disk.
        // See https://github.com/sigp/lighthouse/issues/2028
        let (_, signed_block, blobs, data_columns) = signed_block.deconstruct();
+        let custody_columns_count = self.data_availability_checker.get_custody_columns_count();
+        let data_columns = data_columns.filter(|columns| columns.len() >= custody_columns_count);


Why is this filter necessary? Why would we have too many columns? If that is the case why are the first columns in the vec more important than the latter?
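For reference, a small sketch of how `Option::filter` behaves, assuming `data_columns` is an `Option` wrapping a list as the diff suggests. Plain integers stand in for data column sidecars, and `keep_if_enough` is a hypothetical name.

```rust
/// Mirrors the shape of the filter in the diff above.
/// `Option::filter` keeps the whole list when the predicate holds and
/// replaces it with `None` otherwise; it never truncates the inner list.
pub fn keep_if_enough(
    columns: Option<Vec<u64>>,
    custody_columns_count: usize,
) -> Option<Vec<u64>> {
    columns.filter(|cols| cols.len() >= custody_columns_count)
}
```

Under that assumption, a list with fewer than `custody_columns_count` entries becomes `None`, while a list with the same number or more entries is kept in full.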

dapplion · 2024-09-18T06:12:06Z

beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

-            }
-        }
+    pub fn is_available(&self, custody_column_count: usize) -> bool {
+        self.num_expected_blobs()


Pedantic, but should we rename this to block_commitments_count? We technically don't expect blobs in peerdas

dapplion · 2024-09-18T06:12:53Z

beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

+            .map_or(false, |num_expected_blobs| {
+                let all_blobs_received = num_expected_blobs == self.num_received_blobs();
+                let all_columns_received = num_expected_blobs == 0
+                    || custody_column_count == self.num_received_data_columns();


beacon_chain code seems to imply that we can have too many columns, should this be

self.num_received_data_columns() >= custody_column_count
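A minimal sketch of the difference between the two comparisons, using free functions with hypothetical names instead of the real pending-components type. Only the column check is modelled; how it combines with the blob check is not shown here.

```rust
/// Column check as written in the diff: requires an exact match.
pub fn all_columns_received_strict(
    num_expected_blobs: usize,
    num_received_data_columns: usize,
    custody_column_count: usize,
) -> bool {
    num_expected_blobs == 0 || custody_column_count == num_received_data_columns
}

/// Column check with the suggested `>=`: receiving more columns than the
/// custody count still counts as complete.
pub fn all_columns_received_at_least(
    num_expected_blobs: usize,
    num_received_data_columns: usize,
    custody_column_count: usize,
) -> bool {
    num_expected_blobs == 0 || num_received_data_columns >= custody_column_count
}
```

The two only disagree when more columns than `custody_column_count` have been received, for example 5 received with a custody count of 4.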

dapplion · 2024-09-18T06:15:28Z

beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

+            pending_components.verified_data_columns.len() >= num_of_columns / 2;
+        is_super_node
+            && has_missing_columns
+            && is_reconstruction_possible


Conflicts with

Refactor data column reconstruction and avoid blocking processing #6403

Yeah, these two branches conflict a lot and I'm planning to resolve them once one of them is merged.

Another conflict is at the trigger point. They're both triggered when we have missing components, and we'd need to decide in the same place whether to fetch from the EL or perform reconstruction. The benchmarks for both are very similar, but I think we should prioritise reconstruction if we have 50% of the columns, so we don't have to try fetching from the EL first and then fall back to reconstruction.
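One way the shared trigger point could be sketched, assuming reconstruction is preferred whenever a supernode already holds at least half of the columns. `RecoveryAction`, `choose_recovery` and its parameters are hypothetical and not part of either PR.

```rust
#[derive(Debug, PartialEq)]
pub enum RecoveryAction {
    /// Nothing is missing, so no recovery is needed.
    NoAction,
    /// Enough columns are held locally to reconstruct the rest.
    Reconstruct,
    /// Try to fetch the blobs from the EL.
    FetchFromEl,
}

/// Decides what to do when a block has missing components.
/// `required` is the number of columns this node must hold (all of them for
/// a supernode); `num_of_columns` is the total number of columns.
pub fn choose_recovery(
    is_super_node: bool,
    received: usize,
    required: usize,
    num_of_columns: usize,
) -> RecoveryAction {
    if received >= required {
        return RecoveryAction::NoAction;
    }
    // Same threshold as in the diff above: at least half of all columns.
    let is_reconstruction_possible = received >= num_of_columns / 2;
    if is_super_node && is_reconstruction_possible {
        RecoveryAction::Reconstruct
    } else {
        RecoveryAction::FetchFromEl
    }
}
```

With 128 columns, a supernode holding 64 would reconstruct, while one holding 10 would go to the EL.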

dapplion · 2024-09-18T06:38:52Z

beacon_node/beacon_chain/src/fetch_blobs.rs

+                "No blobs fetched from the EL";
+                "num_expected_blobs" => num_expected_blobs,
+            );
+            inc_counter(&metrics::BLOBS_FROM_EL_MISS_TOTAL);


The metric BLOBS_FROM_EL_MISS_TOTAL is being used with a different meaning in each if branch. Here it means 0 fetched blobs; in peerdas it means != total.

Also, we could add another histogram to observe the count of missing blobs from the expected total.
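A sketch of one way to give both branches the same meaning: classify the fetch result once, then map each outcome to a metric. `ElFetchOutcome` and `classify_el_fetch` are hypothetical names, and the metric wiring is left out.

```rust
#[derive(Debug, PartialEq)]
pub enum ElFetchOutcome {
    /// The EL returned none of the expected blobs.
    Miss,
    /// The EL returned some blobs; `missing` could feed a histogram.
    Partial { missing: usize },
    /// The EL returned every expected blob.
    Hit,
}

/// Classifies an EL fetch by comparing fetched blobs against the expected total.
pub fn classify_el_fetch(num_expected_blobs: usize, num_fetched_blobs: usize) -> ElFetchOutcome {
    if num_fetched_blobs >= num_expected_blobs {
        ElFetchOutcome::Hit
    } else if num_fetched_blobs == 0 {
        ElFetchOutcome::Miss
    } else {
        ElFetchOutcome::Partial {
            missing: num_expected_blobs - num_fetched_blobs,
        }
    }
}
```

Both the Deneb and peerdas branches could then increment counters based on the same outcome, and the `missing` count could be observed in the suggested histogram.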

dapplion · 2024-09-18T06:40:20Z

beacon_node/beacon_chain/src/fetch_blobs.rs

+            return Ok(None);
+        }
+
+        inc_counter(&metrics::BLOBS_FROM_EL_HIT_TOTAL);


Again, very different meaning to the metric in peerdas

dapplion · 2024-09-18T06:41:54Z

beacon_node/network/src/network_beacon_processor/gossip_methods.rs

+                let batch_size =
+                    data_columns_to_publish.len() / SUPERNODE_DATA_COLUMN_PUBLICATION_BATCHES;
+
+                for batch in data_columns_to_publish.chunks(batch_size) {


Should de-duplicate this from beacon_node/beacon_chain/src/fetch_blobs.rs if we are following the same batch strategy
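A sketch of what a shared batching helper might look like, assuming both call sites split the columns into a fixed number of batches. `publication_batches` is a hypothetical name. Note that `slice::chunks` panics when given a chunk size of 0, which the integer division in the diff produces whenever there are fewer items than batches, so the sketch clamps the size to at least 1.

```rust
/// Splits `items` into batches of `items.len() / num_batches` elements.
/// The batch size is clamped to at least 1 because `chunks(0)` panics.
/// When the length is not a multiple of `num_batches`, the remainder ends up
/// in an extra, smaller batch.
pub fn publication_batches<T>(items: &[T], num_batches: usize) -> std::slice::Chunks<'_, T> {
    let batch_size = (items.len() / num_batches.max(1)).max(1);
    items.chunks(batch_size)
}
```

For 8 columns and 4 batches this yields 4 batches of 2; for 3 columns and 4 batches it yields 3 batches of 1 instead of panicking.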

dapplion · 2024-09-18T06:43:22Z

beacon_node/network/src/network_beacon_processor/mod.rs

+                    );
+                    self.chain.recompute_head_at_current_slot().await;
+                }
+                AvailabilityProcessingStatus::MissingComponents(_, _) => {


Yes I think it's possible before peerdas if the EL returns more than 0 blobs but not all of them

dapplion · 2024-09-18T06:44:04Z

beacon_node/network/src/network_beacon_processor/mod.rs

+                    );
+                }
+            },
+            Ok(None) => { /* No blobs fetched from the EL. Reasons logged separately. */ }


Is it worth it to still log a debug here?


jimmygchen added work-in-progress PR is a work-in-progress das Data Availability Sampling labels Aug 16, 2024

jimmygchen marked this pull request as ready for review August 16, 2024 07:47

jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Aug 16, 2024

dapplion reviewed Aug 16, 2024

beacon_node/beacon_chain/src/beacon_chain.rs

beacon_node/beacon_chain/src/beacon_chain.rs (outdated)

dapplion reviewed Aug 16, 2024

beacon_node/beacon_chain/src/fetch_blobs.rs

dapplion reviewed Aug 16, 2024

consensus/types/src/blob_sidecar.rs (outdated)

jimmygchen mentioned this pull request Aug 19, 2024

Reconstruct data columns without blocking processing and import #5990

Closed

jimmygchen force-pushed the das-fetch-blobs branch from dbf4bb8 to 01b91cb Compare August 20, 2024 02:21

kevaundray reviewed Aug 20, 2024


dapplion reviewed Aug 20, 2024


jimmygchen mentioned this pull request Aug 20, 2024

P2P clarifications when introducing engine_getBlobsV1 ethereum/consensus-specs#3864

Open

jimmygchen added spec_change A change related to the Eth2 spec optimization Something to make Lighthouse run more efficiently. labels Aug 20, 2024

jimmygchen mentioned this pull request Aug 21, 2024

DAS - Tracking Issue #4983

Open

51 tasks

jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Aug 22, 2024

mergify bot deleted the branch sigp:unstable August 27, 2024 04:10

mergify bot closed this Aug 27, 2024

michaelsproul reopened this Aug 27, 2024

michaelsproul changed the base branch from das to unstable August 27, 2024 04:21

michaelsproul mentioned this pull request Aug 27, 2024

Get blobs from the EL's blob pool #5829

Closed

3 tasks

jimmygchen mentioned this pull request Aug 28, 2024

Remove default target peer for proposer-only mode for PeerDAS. #6164

Open

Get blobs from EL.

662d6cf

Co-authored-by: Michael Sproul <michael@sigmaprime.io>

jimmygchen force-pushed the das-fetch-blobs branch from bd24cdd to 662d6cf Compare August 29, 2024 00:55

jimmygchen added 3 commits August 29, 2024 11:40

Avoid cloning blobs after fetching blobs.

0c23848

Address review comments and refactor code.

89dfaaa

Fix lint.

401231b

jimmygchen added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Aug 29, 2024

jimmygchen added 4 commits August 29, 2024 15:21

Move blob computation metric to the right spot.

2efc99b

Merge branch 'unstable' into das-fetch-blobs

db6318e

Merge branch 'unstable' into das-fetch-blobs

aa79ec6

# Conflicts: # beacon_node/beacon_chain/src/beacon_chain.rs # beacon_node/beacon_chain/src/block_verification.rs # beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

Merge branch 'unstable' into das-fetch-blobs

36b23b2

# Conflicts: # beacon_node/beacon_chain/src/beacon_chain.rs

jimmygchen commented Sep 12, 2024

beacon_node/beacon_chain/src/fetch_blobs.rs

michaelsproul and others added 4 commits September 13, 2024 13:49

Gradual publication of data columns for supernodes.

6bff4ab

Recompute head after importing block with blobs from the EL.

7977999

Fix lint

5e75527

Merge branch 'unstable' into das-fetch-blobs

3444281

dapplion reviewed Sep 18, 2024


jimmygchen added 2 commits September 19, 2024 15:04

Use blocking task instead of async when computing cells.

e76d21f

Merge branch 'das-fetch-blobs' of github.com:jimmygchen/lighthouse in…

4b2956f

…to das-fetch-blobs

jimmygchen added waiting-on-author The reviewer has suggested changes and awaits thier implementation. and removed ready-for-review The code is ready for review labels Sep 19, 2024
