Benchmarking wishlist #200

Open
adr1anh opened this issue Aug 12, 2024 · 0 comments

The Stwo talk at SBC presents their benchmarks in a very helpful way. In particular, the following pie chart gives an excellent overview of the time spent in the different phases of a proof.

[Image: pie chart from the Stwo SBC talk breaking down proving time by phase]

Numbers like these are essential for understanding where we should spend resources optimizing the performance of sphinx and loam. Below is a wishlist of numbers that we should track with each PR.

Arithmetization

Because we compile Lair down to AirBuilder constraints, we lose insight into how large our traces actually are. Just as we used to use expect-test to log how many constraints and witnesses we were using, we should have the same functionality in lair.

For each Chip, we should log the following information:

  • Total width of the chip
  • Total number of lookups
    • Number of byte operations
    • Number of provide/require lookups
    • Number of memory load/store operations
  • Number of constraints, and their "complexity", measured by counting the additions and multiplications necessary to evaluate them (this may require a "byte-code" representation of the constraints, which could be embedded in the verification key later on)
  • Number of selectors, which corresponds to the number of mutually exclusive branches that can be taken
    • For each branch, log the same information as above, as this can reveal expensive branches that would benefit from their own chip
  • Number of multiplicative inverses (each costing roughly 37 field multiplications), since we could potentially batch these using Montgomery's trick (a sketch follows below)
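For reference, here is a minimal sketch of Montgomery's batch-inversion trick in Rust. The signature and the `one`/`inverse` parameters are placeholders rather than the actual field API used in sphinx; the point is only that n inversions become one inversion plus roughly 3n multiplications.

```rust
/// Minimal sketch of Montgomery's batch-inversion trick.
/// `F` stands in for the prover's field type; `one` and `inverse` are passed
/// in explicitly because the real trait bounds live in the proving stack.
fn batch_inverse<F: Copy + std::ops::Mul<Output = F>>(
    values: &[F],
    one: F,
    inverse: impl Fn(F) -> F,
) -> Vec<F> {
    // prefix[i] holds the product values[0] * ... * values[i - 1].
    let mut prefix = Vec::with_capacity(values.len() + 1);
    prefix.push(one);
    for &v in values {
        let last = *prefix.last().unwrap();
        prefix.push(last * v);
    }
    // A single inversion of the running product...
    let mut acc = inverse(*prefix.last().unwrap());
    let mut out = vec![one; values.len()];
    // ...then peel off each individual inverse in reverse order.
    for i in (0..values.len()).rev() {
        out[i] = acc * prefix[i]; // equals 1 / values[i]
        acc = acc * values[i];    // remove values[i] from the accumulator
    }
    out
}
```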

The above list is exhaustive; with all of this information, we would be better able to pinpoint areas in our arithmetization that would benefit from "fine-tuning". Moreover, if this information is collected in a struct, it would allow us to more easily approximate the performance impact of future prover optimizations, such as an improved lookup protocol that bypasses the overflowing-multiplicities issue. We would be able to deduce the shape of a shard under such a new configuration and run a mock prover that accounts for the change in the number of columns and the potential extra rounds of interaction.
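As a rough illustration, such a struct could look like the following; all of the names here are hypothetical, not existing lair types.

```rust
/// Hypothetical per-chip arithmetization report. None of these fields exist in
/// lair today; the struct only illustrates what we would want to collect.
#[derive(Debug, Default, Clone)]
pub struct ChipStats {
    pub name: String,
    /// Total width of the chip (number of main trace columns).
    pub width: usize,
    /// Lookup counts, broken down by kind.
    pub byte_lookups: usize,
    pub provide_require_lookups: usize,
    pub memory_load_stores: usize,
    /// Constraint count and a rough complexity measure.
    pub num_constraints: usize,
    pub num_additions: usize,
    pub num_multiplications: usize,
    /// Multiplicative inverses that could be batched with Montgomery's trick.
    pub num_inverses: usize,
    /// One nested report per mutually exclusive branch.
    pub branches: Vec<ChipStats>,
}
```

Snapshotting a Debug dump of a report like this with expect-test would make any change in chip shape show up directly in a PR diff.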

Prover

Given the fixed Lurk VM arithmetization and a set of benchmark programs to run, we should log the same type of information that was given in the Stwo benchmark. Getting these numbers should only be a matter of adding tailored tracing to the prove_shard method (a sketch of where such spans could go follows the list below). We could then get CI to output the following numbers:

  • Program execution
    • How much time it takes to just evaluate the program to get the output and populate the record of all events to be proved
  • Sharding information
    • For each chip, describe how many events will be proved in it. For example, how many Poseidon2 hashes of each size are required, how many arithmetic operations are performed, and what our memory usage looks like.
    • We should do this for both sharded and unsharded traces, since we will want to figure out a better sharding strategy for Lurk, where execution is not sequential as it is in RISC-V.
  • Trace generation
    • For each chip with a given number of events, how much time it takes to generate just that trace. This could be a good indicator of whether we need to optimize data storage and witness generation. For example, we could store integers as their native types and only convert them to field elements during trace generation, rather than converting back and forth during execution.
    • It is also useful to just know the total time spent to generate all the traces, assuming proper parallelism. This is the part of the stack that can be optimized almost entirely in loam, as it does not depend on the proving strategy.
    • We also want to calculate the total trace generation time for the logUp columns, as these require the computation of multiplicative inverses which can be batched.
  • Constraint evaluation time
    • The Stwo prover spends about 30% of its proving time evaluating constraints in order to compute the quotient. This is where the prover calls Air::eval approximately 2n times, where n is the number of rows in the trace. This evaluation is always the same, so it should not exhibit any dynamic behavior and ideally should not touch the heap at all (it should only apply arithmetic operations over slices of the trace).
    • At the moment, we parse the constraints dynamically at every iteration, resulting in the same vector allocations and pointer chasing on every call. I suspect this has a strong negative impact on performance, though it may be that caching and branch prediction hide most of the cost of these redundant memory operations. It is critical that we benchmark this portion of our prover against SP1 to understand whether we need to put effort into optimizing our constraint evaluation.
    • One solution would be to build the constraint evaluation functions using macros or code generation, though this would add a lot of complexity.
  • Commitment
    • For all of the preprocessed, main, permutation, and quotient traces, we should log what proportion of the total time is spent performing iFFTs, FFTs, and Merkle tree commitment. Note that these numbers depend on the hash function used and on the width/height of the trace.
  • Polynomial Commitment opening
    • This phase should depend only on the shape of the shard and should be very similar to SP1 given traces of the same size.
  • Proof size
    • Number of hashes for authentication paths
    • Number of opening values
    • With these numbers, as well as estimates of the time required to verify hashes inside a circuit, we would be able to estimate the "recursion overhead" without needing to implement a fully fledged verifier just yet.
    • We would also be able to estimate the savings against a future STIR implementation.
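As a sketch of where the tailored tracing mentioned above could go, the spans below mirror the phases in this list. The `tracing` crate is only one possible choice, and the function body is illustrative, not the actual prove_shard implementation; with a timing-aware subscriber, CI could aggregate the per-span durations.

```rust
// Sketch of phase-level instrumentation using the `tracing` crate; span names
// and placement are illustrative, not the real prove_shard internals.
use tracing::info_span;

fn prove_shard_instrumented(/* shard, proving key, challenger, ... */) {
    let span = info_span!("execution").entered();
    // ... evaluate the program and populate the event record ...
    drop(span);

    let span = info_span!("trace_generation").entered();
    // ... generate main traces, then the logUp (permutation) columns ...
    drop(span);

    let span = info_span!("commit").entered();
    // ... iFFT, FFT, and Merkle commitment for each trace ...
    drop(span);

    let span = info_span!("quotient").entered();
    // ... evaluate constraints ~2n times to compute the quotient ...
    drop(span);

    let span = info_span!("pcs_opening").entered();
    // ... polynomial commitment (FRI) opening ...
    drop(span);
}
```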

Overall, we need to collect at least some of these numbers when comparing Lurk against SP1. We are doing some things very differently, and we need to understand where to focus our optimization efforts.

Verifier complexity

At the moment, we are not able to accurately benchmark Lurk in a "real-world" setting as we are not taking into account the complexity of the recursive verifier.

The following information will help draw a better picture:

  • Size of byte-code required for evaluating constraints
    • This translates into the number of instructions performed by the verifier when checking that the quotient was computed correctly
  • Number of columns
    • The prover opens two rows of every trace, and these rows are the leaves of the Merkle tree commitment. They are opened for every FRI query, so more columns lead to more hashes performed by the verifier.
  • Total number of hashes performed when verifying the proof (a rough estimation sketch follows below).
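A back-of-the-envelope way to estimate the Merkle-path portion of that hash count from the shard shape and the FRI parameters is sketched below. The names and the formula are a deliberate simplification: it ignores the FRI folding layers, proof-of-work grinding, and any batching of the opened columns.

```rust
/// Rough estimate of the compression hashes spent on leaf hashing and Merkle
/// authentication paths. Purely illustrative: ignores FRI folding layers and
/// any batching of opened columns.
fn estimate_path_hashes(
    num_queries: usize,
    // (log2 of the LDE height, number of opened columns) per commitment.
    commitments: &[(usize, usize)],
    // Field elements absorbed per hash invocation.
    hash_rate: usize,
) -> usize {
    commitments
        .iter()
        .map(|&(log_height, width)| {
            // Two opened rows per trace: hash each row into a leaf, then walk
            // one authentication path per row.
            let leaf_hashes = 2 * width.div_ceil(hash_rate);
            let path_hashes = 2 * log_height;
            num_queries * (leaf_hashes + path_hashes)
        })
        .sum()
}
```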

I suspect that including 3 variations of the Poseidon2 hash is the biggest contributor to the complexity of our constraints: it drastically increases both the number of columns and the number of arithmetic operations performed by the verifier. In almost every other VM, only a single instance of Poseidon2 with a width of 16 is used, and a special VM circuit is used to call the permutation over arrays of field elements. While this model may have a small impact on the hashing speed of Lurk data, it may make a big difference when it comes to the recursive verifier.

Moreover, we would also be able to estimate the savings in recursive verification due to STIR, or a potential implementation of an accumulation scheme that replaces FRI.

Finally, it would help us understand whether there is any benefit to modifying the prover to open only a single row in the case where there are no transition constraints (this would require the AirBuilder to expose a RowID selector that would replace our current "nonce").

In the meantime, we should also log the non-parallelized verification time of our proofs, as this would allow us to at least estimate how the changes we make would affect a future recursive verifier.

Conclusion

The above is an exhaustive list of numbers that would be very useful to track as we continue building Lurk on top of sphinx. Ideally, we can report some of these numbers directly as part of the CI pipeline when making changes to the VM or the backend.
Some of these numbers could also be embedded in the code directly with expect-test. Given the many different aspects that can affect proving performance, we need to understand where our efforts should be targeted.
