Measure acceptance rate #34
base: main
Conversation
My key concerns are that `ExperimentAcceptanceRate` doesn't follow the `_Experiment` interface and therefore fails on initialization, and that there are no tests to cover the bug or anything else. Please add tests, which should reveal the bug.
`dsi/offline/acceptance/experiment.py` (Outdated)
```python
ar_config = ConfigAcceptanteRate(
    model="lmsys/vicuna-7b-v1.3",
    draft_model="double7/vicuna-68m",
    dataset="cnn_dailymail",
    subset="2.0.0",
)
```
Please add all the configurations used in the paper to a new config object similar to the `Plots` class (see `dsi/configs/plot/plots.py`).
Same config (models, dataset) as Table 1. Please add the Table 1 config.
Please explain; I couldn't understand your comment.
What do you mean by "configurations used in the paper"? Don't you mean the configurations used to generate Table 1? Besides the AR column, where are the configs related to Table 1 defined?
`dsi/offline/acceptance/experiment.py` (Outdated)
```python
def main():
    ar_config = ConfigAcceptanteRate(
        model="lmsys/vicuna-7b-v1.3",
        draft_model="double7/vicuna-68m",
        dataset="cnn_dailymail",
        subset="2.0.0",
    )
    target_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    draft_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    mar = ExperimentAcceptanceRate(
        config=ar_config,
        gen_config=target_gen_config,
        draft_gen_config=draft_gen_config,
    )
    mar.run()
```
Please add this functionality to `dsi/main.py`. For example, you can add it under a new function `offline_acceptance`. Then, call the new function `offline_acceptance` from the existing `offline` function.
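For illustration, a minimal sketch of such wiring, assuming the single-config constructor proposed later in this review; the import paths and the body of `offline` are assumptions, not the repo's actual code:

```python
# dsi/main.py (sketch; import paths and wiring are illustrative only)
from dsi.offline.acceptance.experiment import (
    ConfigAcceptanteRate,
    ExperimentAcceptanceRate,
)


def offline_acceptance(config: ConfigAcceptanteRate) -> None:
    """Measure the acceptance rate offline."""
    experiment = ExperimentAcceptanceRate(config)
    experiment.run()


def offline() -> None:
    # ... existing offline experiments (plots, latency, etc.) ...
    offline_acceptance(ConfigAcceptanteRate())
```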
I guess the relevant method in `dsi/main.py` will generate Table 1. Please add the method and I will add code for the AR column.
It already exists.
I see the plots but not Table 1.
To avoid duplication, please see this previous comment.
```python
output_target = target_model.generate(**inputs, **target_gen_kwargs)
prompt_len = len(inputs.input_ids[0])

for i in range(prompt_len, len(output_target[0])):
```
What is `len(output_target[0])`? Why use `range(prompt_len, len(output_target[0]))`? Please document this part.
It is still unclear what `output_target` is and why we access it with `output_target[0]`. Please consider adding descriptive variables and/or functions to enhance readability. Documentation would also be helpful.
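For context (not part of the original review): for a decoder-only model like Vicuna, Hugging Face's `model.generate` returns by default a tensor of shape `(batch_size, sequence_length)` containing the prompt followed by the generated tokens. A documented version of this snippet might therefore look like the following sketch:

```python
# output_target has shape (batch_size, prompt_len + new_tokens); index 0 is the
# only sequence in the batch, so len(output_target[0]) is its total length,
# prompt included.
output_target = target_model.generate(**inputs, **target_gen_kwargs)
prompt_len = len(inputs.input_ids[0])

# Iterate over the positions of the tokens generated by the target,
# i.e., everything after the prompt.
for i in range(prompt_len, len(output_target[0])):
    ...
```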
inputs["attention_mask"] = torch.tensor( | ||
[[1] * i], device=draft_model.device | ||
) |
Why not use `torch.ones`?
inputs["attention_mask"] = torch.tensor( | |
[[1] * i], device=draft_model.device | |
) | |
inputs["attention_mask"] = torch.ones(i, device=draft_model.device) |
@jmamou, lmk when you want me to do another iteration by hitting the "Re-request review" button. I added a few comments meanwhile. Also, it seems like the tests only cover the new configuration object, and no tests run the experiment.
Thanks @jmamou. I added several comments. My key concern is that `test_single_repeat_all_match` passes, indicating a bug. Please fix and add more tests, especially for the core logic. Lmk when you want me to do another iteration by hitting the "Re-request review" button.
```python
def __init__(
    self,
    config: ConfigAcceptanteRate,
    gen_config: ConfigGen,
    draft_gen_config: ConfigGen,
):
    self.config: ConfigAcceptanteRate
    super().__init__(config, gen_config)
    self.draft_gen_config: ConfigGen = draft_gen_config
```
Please rebase the `main` branch. To follow the interface of `_Experiment`, please use the same `__init__` signature. That is:
```diff
-def __init__(
-    self,
-    config: ConfigAcceptanteRate,
-    gen_config: ConfigGen,
-    draft_gen_config: ConfigGen,
-):
-    self.config: ConfigAcceptanteRate
-    super().__init__(config, gen_config)
-    self.draft_gen_config: ConfigGen = draft_gen_config
+def __init__(self, config: ConfigAcceptanteRate):
+    self.config: ConfigAcceptanteRate
+    super().__init__(config)
```
The other arguments (`gen_config: ConfigGen` and `draft_gen_config: ConfigGen`) should be passed within `config: ConfigAcceptanteRate`. After rebasing `main`, you'll see that `ExperimentLatency` is now fixed similarly.
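A minimal sketch of what the config side could then look like, assuming pydantic models as in the rest of this PR; the `ConfigGen` stub only mirrors the fields used in `main()` above, and the exact field set of `ConfigAcceptanteRate` is an assumption:

```python
# Sketch only: the base config in the repo may already carry gen_config;
# draft_gen_config mirrors the Field suggestion further down in this review.
from pydantic import BaseModel, Field


class ConfigGen(BaseModel):
    do_sample: bool = False
    temperature: float = 1.0
    top_p: float = 1.0


class ConfigAcceptanteRate(BaseModel):
    model: str = "lmsys/vicuna-7b-v1.3"
    draft_model: str = "double7/vicuna-68m"
    dataset: str = "cnn_dailymail"
    subset: str = "2.0.0"
    gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the target",
    )
    draft_gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the drafter",
    )
```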
"""Includes all the parameters needed for measuring the acceptance rate | ||
of a (target, draft, dataset) triplet. | ||
""" | ||
|
```python
draft_gen_config: ConfigGen = Field(
    default_factory=ConfigGen,
    title="Configuration of the generation from the drafter",
)
```
```python
# Check if tokenizers are the same
if not self._are_tokenizers_same(target_tokenizer, draft_tokenizer):
    raise ValueError("The target and draft tokenizers are not the same.")
```
Could you define a custom exception in `dsi/types/exception.py` instead of using `ValueError`? Doing so aligns with best practices by enhancing the specificity and clarity of our error handling.
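For illustration, a minimal sketch of such an exception; the class name is hypothetical, not existing code in the repo:

```python
# dsi/types/exception.py (sketch; the class name is a suggestion)
class IncompatibleTokenizersError(Exception):
    """Raised when the target and draft tokenizers do not share the same vocabulary."""

    def __init__(self, target_model: str, draft_model: str) -> None:
        super().__init__(
            f"The target ({target_model}) and draft ({draft_model}) tokenizers"
            " are not the same."
        )
```

The check above would then read something like `raise IncompatibleTokenizersError(self.config.model, self.config.draft_model)`.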
```python
model = torch.compile(model) if self.config.draft_compile_model else model
return model, tokenizer

def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
```
Why not use a static method? Also, please rename the function to communicate that it only checks the tokens and their corresponding input ids. (The current function returns `True` for tokenizers with the same vocab, even if they encode an input id to different vectors.)
```diff
-def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
+@staticmethod
+def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
```
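A minimal sketch of such a check, assuming both arguments are Hugging Face tokenizers (whose `get_vocab()` returns the token-to-id mapping); this is an illustration, not the repo's actual implementation:

```python
@staticmethod
def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
    """Check only that both tokenizers map the same tokens to the same input ids."""
    return tokenizer1.get_vocab() == tokenizer2.get_vocab()
```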
```python
# iterate over the tokens generated by the target and
# check whether the draft produces the same token
for i in range(prompt_len, len(output_target[0])):
    inputs["input_ids"] = output_target[0, 0:i].view(1, -1)
```
It's not clear what `output_target` refers to and why we access it with `output_target[0, 0:i]`. To enhance readability and make the code easier to understand without needing to run debug mode or insert print statements, consider encapsulating this logic in a descriptive function. Here's a proposed function:
```python
def get_input_ids_prefix(output_seqs, tok_pos_last: int):
    """Returns a prefix of a sequence of input ids.

    Args:
        output_seqs: The output from Hugging Face transformers' `model.generate`
            function, expected to be a sequence-like data structure.
        tok_pos_last: The last token position to include in the returned sequence.
    """
    return output_seqs[0, 0:tok_pos_last]
```
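For example, the loop body above could then read (sketch):

```python
for i in range(prompt_len, len(output_target[0])):
    inputs["input_ids"] = get_input_ids_prefix(output_target, i).view(1, -1)
```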
```python
if output_draft[-1, i] == output_target[-1, i]:
    n_matches[-1] += 1
elif i < len(output_target[0]) - 1:  # new window
    n_matches.append(0)
```
Please add tests covering the experiment. It seems like this line makes `n_matches = [x, 0]`, then we extend `all_n_matches` such that `all_n_matches += [x, 0]`. At the beginning of the examples loop, we initialize a new `n_matches = [0]`. Is this a bug? Anyway, it is critical to add such tests.
```python
# Since all tokens match, acceptance rate should be 1
assert result.acceptance_rate[0] == 0.8
```
It seems like a discrepancy. The test passes, suggesting that there is a bug.
@keyboardAnt, for the paper we agreed to ignore the last matching window of each dataset input, since generation can stop because we reach EOS or the max length, not because of a bad draft-model prediction. I'm not sure we wrote this in the paper (please correct me). According to the paper's formula, AR is 0.8.
@jmamou, even when ignoring the test's 4th (and last) match, why is the returned acceptance rate < 1? The first 3 tokens match 100%.
@keyboardAnt, for the paper we agreed to ignore the last matching window, not the last matching token. In real scenarios we will never get AR = 1, since n cannot be infinite.
- What is the definition of the acceptance rate for such windows? How is it different from the acceptance rate for single tokens? (For single tokens, the acceptance rate is the probability of accepting the drafter's next-token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.)
- Why is ignoring the last matching token counted as a mismatch in `test_single_repeat_all_match`? Please correct me if I'm wrong here. In the test, the target outputs the sequence of four input ids `[0, 1, 2, 3]`. Then, the drafter correctly predicts all four, one by one. However, the calculated acceptance rate equals 0.8. Why isn't the acceptance rate 1?
@jmamou, as we discussed today over the phone, the acceptance rate has only one definition:
The acceptance rate is the probability of accepting the drafter's next-token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.
We estimate the acceptance rate as follows. Let `A` be the sum of the number of accepted drafts over all iterations, excluding the last iteration of each example, over all the examples in all datasets. Denote the total number of such (non-final) iterations by `N`. Note that `N` equals the total number of iterations minus the number of examples. If `N == 0`, the drafter predicted every example in a single iteration; in other words, we accepted all draft tokens without any rejections. In that case, we can raise a warning and return 1 (recommended) or raise an exception. Otherwise, we calculate the average number of accepted drafts per iteration, `n := A / N`. The estimated acceptance rate is then `1 - 1 / (1 + n)`, as mentioned in the paper. The issue with `test_single_repeat_all_match` is that `1 - 1 / (1 + n) != 1` for any finite `n`.
In practice, drafters may have a 100% acceptance rate. For example, an instance of the target running on faster hardware can serve as a drafter, with both the target and drafter sampled greedily. But in such cases the perfect acceptance rate is guaranteed, and I do not see a reason to test it using our experiment. On the contrary, in most practical settings the acceptance rate is < 1, and estimating it as == 1 means we haven't collected enough samples to reach the desired false discovery rate. In other words, we need to consider more examples. It might also indicate a bug. So this is why I think we should raise a warning or an exception.
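To make the estimator concrete, here is a minimal sketch (not the repo's code; the function name and input format are assumptions), where each inner list holds the number of accepted drafts per iteration for one example:

```python
import warnings


def estimate_acceptance_rate(n_matches_per_example: list[list[int]]) -> float:
    """Estimate the acceptance rate as 1 - 1 / (1 + n), ignoring the last
    iteration (matching window) of each example, which may end due to EOS or
    max length rather than a rejection."""
    accepted = 0  # A: accepted drafts over all non-final iterations
    non_final_iterations = 0  # N: number of non-final iterations
    for per_iteration in n_matches_per_example:
        accepted += sum(per_iteration[:-1])
        non_final_iterations += len(per_iteration) - 1
    if non_final_iterations == 0:
        warnings.warn("No rejections observed; estimating the acceptance rate as 1.")
        return 1.0
    n = accepted / non_final_iterations  # average accepted drafts per iteration
    return 1 - 1 / (1 + n)


# The test case discussed above: 4 accepted drafts, then a final window of 0,
# gives n = 4 and an estimated acceptance rate of 1 - 1/5 = 0.8.
print(estimate_acceptance_rate([[4, 0]]))  # 0.8
```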