Measure acceptance rate #34
base: main
Conversation
My key concerns are that `ExperimentAcceptanceRate` doesn't follow the `_Experiment` interface and therefore fails on initialization, and that there are no tests to cover the bug or anything else. Please add tests, which should reveal the bug.
`dsi/offline/acceptance/experiment.py` (Outdated)
```python
ar_config = ConfigAcceptanteRate(
    model="lmsys/vicuna-7b-v1.3",
    draft_model="double7/vicuna-68m",
    dataset="cnn_dailymail",
    subset="2.0.0",
)
```
Please add all the configurations used in the paper to a new config object similar to the `Plots` class (see `dsi/configs/plot/plots.py`).
Same config (models, dataset) as Table 1. Please add the Table 1 config.
Please explain; I couldn't understand your comment.
What do you mean by "configurations used in the paper"? Don't you mean the configurations used to generate Table 1? Besides the AR column, where are the configs related to Table 1 defined?
`dsi/offline/acceptance/experiment.py` (Outdated)
```python
def main():
    ar_config = ConfigAcceptanteRate(
        model="lmsys/vicuna-7b-v1.3",
        draft_model="double7/vicuna-68m",
        dataset="cnn_dailymail",
        subset="2.0.0",
    )
    target_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    draft_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    mar = ExperimentAcceptanceRate(
        config=ar_config,
        gen_config=target_gen_config,
        draft_gen_config=draft_gen_config,
    )
    mar.run()
```
Please add this functionality to `dsi/main.py`. For example, you can add it under a new function `offline_acceptance`. Then, call the new function `offline_acceptance` from the existing `offline` function.
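For illustration, a minimal sketch of such wiring, assuming the single-config constructor proposed later in this review; the import paths and the body of `offline` are assumptions, not the repo's actual code:

```python
# dsi/main.py (sketch; import paths and wiring are illustrative only)
from dsi.offline.acceptance.experiment import (
    ConfigAcceptanteRate,
    ExperimentAcceptanceRate,
)


def offline_acceptance(config: ConfigAcceptanteRate) -> None:
    """Measure the acceptance rate offline."""
    experiment = ExperimentAcceptanceRate(config)
    experiment.run()


def offline() -> None:
    # ... existing offline experiments (plots, latency, etc.) ...
    offline_acceptance(ConfigAcceptanteRate())
```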
I guess the relevant method in `dsi/main.py` will generate Table 1. Please add the method and I will add code for the AR column.
It already exists.
I see the plots but not Table 1.
To avoid duplication, please see this previous comment.
```python
output_target = target_model.generate(**inputs, **target_gen_kwargs)
prompt_len = len(inputs.input_ids[0])

for i in range(prompt_len, len(output_target[0])):
```
What is `len(output_target[0])`? Why use `range(prompt_len, len(output_target[0]))`? Please document this part.
It is still unclear what `output_target` is and why we access it with `output_target[0]`. Please consider adding descriptive variables and/or functions to enhance readability. Documentation would also be helpful.
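For context (not part of the original review): for a decoder-only model like Vicuna, Hugging Face's `model.generate` returns by default a tensor of shape `(batch_size, sequence_length)` containing the prompt followed by the generated tokens. A documented version of this snippet might therefore look like the following sketch:

```python
# output_target has shape (batch_size, prompt_len + new_tokens); index 0 is the
# only sequence in the batch, so len(output_target[0]) is its total length,
# prompt included.
output_target = target_model.generate(**inputs, **target_gen_kwargs)
prompt_len = len(inputs.input_ids[0])

# Iterate over the positions of the tokens generated by the target,
# i.e., everything after the prompt.
for i in range(prompt_len, len(output_target[0])):
    ...
```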
inputs["attention_mask"] = torch.tensor( | ||
[[1] * i], device=draft_model.device | ||
) |
Why not use `torch.ones`?
inputs["attention_mask"] = torch.tensor( | |
[[1] * i], device=draft_model.device | |
) | |
inputs["attention_mask"] = torch.ones(i, device=draft_model.device) |
@jmamou, lmk when you want me to do another iteration by hitting the "Re-request review" button. I added a few comments meanwhile. Also, it seems like the tests only cover the new configuration object, and no tests run the experiment.
Thanks @jmamou. I added several comments. My key concern is that `test_single_repeat_all_match` passes, indicating a bug. Please fix and add more tests, especially for the core logic. Lmk when you want me to do another iteration by hitting the "Re-request review" button.
```python
def __init__(
    self,
    config: ConfigAcceptanteRate,
    gen_config: ConfigGen,
    draft_gen_config: ConfigGen,
):
    self.config: ConfigAcceptanteRate
    super().__init__(config, gen_config)
    self.draft_gen_config: ConfigGen = draft_gen_config
```
Please rebase the `main` branch. To follow the interface of `_Experiment`, please use the same `__init__` signature. That is:
```diff
-def __init__(
-    self,
-    config: ConfigAcceptanteRate,
-    gen_config: ConfigGen,
-    draft_gen_config: ConfigGen,
-):
-    self.config: ConfigAcceptanteRate
-    super().__init__(config, gen_config)
-    self.draft_gen_config: ConfigGen = draft_gen_config
+def __init__(self, config: ConfigAcceptanteRate):
+    self.config: ConfigAcceptanteRate
+    super().__init__(config)
```
The other arguments (`gen_config: ConfigGen` and `draft_gen_config: ConfigGen`) should be passed within `config: ConfigAcceptanteRate`. After rebasing `main`, you'll see that `ExperimentLatency` is now fixed similarly.
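A minimal sketch of what the config side could then look like, assuming pydantic models as in the rest of this PR; the `ConfigGen` stub only mirrors the fields used in `main()` above, and the exact field set of `ConfigAcceptanteRate` is an assumption:

```python
# Sketch only: the base config in the repo may already carry gen_config;
# draft_gen_config mirrors the Field suggestion further down in this review.
from pydantic import BaseModel, Field


class ConfigGen(BaseModel):
    do_sample: bool = False
    temperature: float = 1.0
    top_p: float = 1.0


class ConfigAcceptanteRate(BaseModel):
    model: str = "lmsys/vicuna-7b-v1.3"
    draft_model: str = "double7/vicuna-68m"
    dataset: str = "cnn_dailymail"
    subset: str = "2.0.0"
    gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the target",
    )
    draft_gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the drafter",
    )
```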
"""Includes all the parameters needed for measuring the acceptance rate | ||
of a (target, draft, dataset) triplet. | ||
""" | ||
|
```python
draft_gen_config: ConfigGen = Field(
    default_factory=ConfigGen,
    title="Configuration of the generation from the drafter",
)
```
```python
# Check if tokenizers are the same
if not self._are_tokenizers_same(target_tokenizer, draft_tokenizer):
    raise ValueError("The target and draft tokenizers are not the same.")
```
Could you define a custom exception in `dsi/types/exception.py` instead of using `ValueError`? Doing so aligns with best practices by enhancing the specificity and clarity of our error handling.
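For illustration, a minimal sketch of such an exception; the class name is hypothetical, not existing code in the repo:

```python
# dsi/types/exception.py (sketch; the class name is a suggestion)
class IncompatibleTokenizersError(Exception):
    """Raised when the target and draft tokenizers do not share the same vocabulary."""

    def __init__(self, target_model: str, draft_model: str) -> None:
        super().__init__(
            f"The target ({target_model}) and draft ({draft_model}) tokenizers"
            " are not the same."
        )
```

The check above would then read something like `raise IncompatibleTokenizersError(self.config.model, self.config.draft_model)`.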
```python
model = torch.compile(model) if self.config.draft_compile_model else model
return model, tokenizer

def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
```
Why not use a static method? Also, please rename the function to communicate that it only checks the tokens and their corresponding input ids. (The current function returns `True` for tokenizers with the same vocab, even if they encode an input id to different vectors.)
```diff
-def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
+@staticmethod
+def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
```
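A minimal sketch of such a check, assuming both arguments are Hugging Face tokenizers (whose `get_vocab()` returns the token-to-id mapping); this is an illustration, not the repo's actual implementation:

```python
@staticmethod
def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
    """Check only that both tokenizers map the same tokens to the same input ids."""
    return tokenizer1.get_vocab() == tokenizer2.get_vocab()
```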
```python
# iterate over the tokens generated by the target and
# check whether the draft produces the same token
for i in range(prompt_len, len(output_target[0])):
    inputs["input_ids"] = output_target[0, 0:i].view(1, -1)
```
It's not clear what `output_target` refers to and why we access it with `output_target[0, 0:i]`. To enhance readability and make the code easier to understand without needing to run debug mode or insert print statements, consider encapsulating this logic in a descriptive function. Here's a proposed function:
```python
def get_input_ids_prefix(output_seqs, tok_pos_last: int):
    """Returns a prefix of a sequence of input ids.

    Args:
        output_seqs: The output from Hugging Face transformers' `model.generate`
            function, expected to be a sequence-like data structure.
        tok_pos_last: The last token position to include in the returned sequence.
    """
    return output_seqs[0, 0:tok_pos_last]
```
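For example, the loop body above could then read (sketch):

```python
for i in range(prompt_len, len(output_target[0])):
    inputs["input_ids"] = get_input_ids_prefix(output_target, i).view(1, -1)
```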
```python
if output_draft[-1, i] == output_target[-1, i]:
    n_matches[-1] += 1
elif i < len(output_target[0]) - 1:  # new window
    n_matches.append(0)
```
Please add tests covering the experiment. It seems like this line makes `n_matches = [x, 0]`, then we extend `all_n_matches` such that `all_n_matches += [x, 0]`. At the beginning of the examples loop, we initialize a new `n_matches = [0]`. Is this a bug? Anyway, it is critical to add such tests.
```python
# Since all tokens match, acceptance rate should be 1
assert result.acceptance_rate[0] == 0.8
```
It seems like a discrepancy. The test passes, suggesting that there is a bug.
@keyboardAnt, for the paper we agreed to ignore the last matching window of each dataset input, since generation can stop because we reach EOS or the max length, not because of a bad draft-model prediction. I'm not sure we wrote this in the paper (please correct me). According to the paper's formula, AR is 0.8.
@jmamou, even when ignoring the test's 4th (and last) match, why is the returned acceptance rate < 1? The first 3 tokens match 100%.
@keyboardAnt, for the paper we agreed to ignore the last matching window, not the last matching token. In real scenarios we will never get AR = 1, since n cannot be infinite.
- What is the definition of the acceptance rate for such windows? How is it different from the acceptance rate for single tokens? (For single tokens, the acceptance rate is the probability of accepting the drafter's next-token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.)
- Why is ignoring the last matching token counted as a mismatch in `test_single_repeat_all_match`? Please correct me if I'm wrong here. In the test, the target outputs the sequence of four input ids `[0, 1, 2, 3]`. Then, the drafter correctly predicts all four, one by one. However, the calculated acceptance rate equals 0.8. Why isn't the acceptance rate 1?
@jmamou, as we discussed today over the phone, the acceptance rate has only one definition:
The acceptance rate is the probability of accepting the drafter's next-token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.
We estimate the acceptance rate as follows. Let `A` be the sum of the number of accepted drafts over all iterations, excluding the last iteration of each example, over all the examples in all datasets. Denote the total number of such (non-final) iterations by `N`. Note that `N` equals the total number of iterations minus the number of examples. If `N == 0`, the drafter predicted every example in a single iteration; in other words, we accepted all draft tokens without any rejections. In that case, we can raise a warning and return 1 (recommended) or raise an exception. Otherwise, we calculate the average number of accepted drafts per iteration, `n := A / N`. The estimated acceptance rate is then `1 - 1 / (1 + n)`, as mentioned in the paper. The issue with `test_single_repeat_all_match` is that `1 - 1 / (1 + n) != 1` for any finite `n`.
In practice, drafters may have a 100% acceptance rate. For example, an instance of the target running on faster hardware can serve as a drafter, with both the target and drafter sampled greedily. But in such cases the perfect acceptance rate is guaranteed, and I do not see a reason to test it using our experiment. On the contrary, in most practical settings the acceptance rate is < 1, and estimating it as == 1 means we haven't collected enough samples to reach the desired false discovery rate. In other words, we need to consider more examples. It might also indicate a bug. So this is why I think we should raise a warning or an exception.
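To make the estimator concrete, here is a minimal sketch (not the repo's code; the function name and input format are assumptions), where each inner list holds the number of accepted drafts per iteration for one example:

```python
import warnings


def estimate_acceptance_rate(n_matches_per_example: list[list[int]]) -> float:
    """Estimate the acceptance rate as 1 - 1 / (1 + n), ignoring the last
    iteration (matching window) of each example, which may end due to EOS or
    max length rather than a rejection."""
    accepted = 0  # A: accepted drafts over all non-final iterations
    non_final_iterations = 0  # N: number of non-final iterations
    for per_iteration in n_matches_per_example:
        accepted += sum(per_iteration[:-1])
        non_final_iterations += len(per_iteration) - 1
    if non_final_iterations == 0:
        warnings.warn("No rejections observed; estimating the acceptance rate as 1.")
        return 1.0
    n = accepted / non_final_iterations  # average accepted drafts per iteration
    return 1 - 1 / (1 + n)


# The test case discussed above: 4 accepted drafts, then a final window of 0,
# gives n = 4 and an estimated acceptance rate of 1 - 1/5 = 0.8.
print(estimate_acceptance_rate([[4, 0]]))  # 0.8
```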