
Measure acceptance rate #34


Open · wants to merge 9 commits into main

Conversation

Collaborator

@jmamou jmamou commented Jul 18, 2024

No description provided.

@jmamou jmamou requested a review from keyboardAnt July 21, 2024 13:23
Owner

@keyboardAnt keyboardAnt left a comment

My key concerns are that ExperimentAcceptanceRate doesn't follow the _Experiment interface and therefore fails on initialization, and that there are no tests at all, neither for this bug nor for anything else. Please add tests; they should reveal the bug.

Comment on lines 94 to 99
ar_config = ConfigAcceptanteRate(
    model="lmsys/vicuna-7b-v1.3",
    draft_model="double7/vicuna-68m",
    dataset="cnn_dailymail",
    subset="2.0.0",
)
Owner

Please add all the configurations used in the paper to a new config object similar to the Plots class (see dsi/configs/plot/plots.py).

Collaborator Author

Same config (models, dataset) as Table 1. Please add the Table 1 config.

Owner

Please explain; I couldn't understand your comment.

Collaborator Author

What do you mean by "configurations used in the paper"? Don't you mean the configurations used to generate Table 1?
Besides the AR column, where are the configs related to Table 1 defined?

Owner

The table is configured here and is executed by the CLI here. Please add the configuration used in the paper and the code that calls it. Eventually, reproducing the table should be as easy as running python -m dsi type=table.

Comment on lines 93 to 107
def main():
    ar_config = ConfigAcceptanteRate(
        model="lmsys/vicuna-7b-v1.3",
        draft_model="double7/vicuna-68m",
        dataset="cnn_dailymail",
        subset="2.0.0",
    )
    target_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    draft_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    mar = ExperimentAcceptanceRate(
        config=ar_config,
        gen_config=target_gen_config,
        draft_gen_config=draft_gen_config,
    )
    mar.run()
Owner

Please add this functionality to dsi/main.py. For example, you can add it under a new function offline_acceptance. Then, call the new function offline_acceptance from the existing offline function.
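For illustration, a sketch of what offline_acceptance could look like in dsi/main.py, reusing the body of main() from this module; ConfigAcceptanteRate, ConfigGen, and ExperimentAcceptanceRate are assumed to be importable there, and the wiring is only indicative since the body of the existing offline function isn't shown here:

def offline_acceptance() -> None:
    """Measure the acceptance rate offline (the reviewer's suggested entry point)."""
    ar_config = ConfigAcceptanteRate(
        model="lmsys/vicuna-7b-v1.3",
        draft_model="double7/vicuna-68m",
        dataset="cnn_dailymail",
        subset="2.0.0",
    )
    target_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    draft_gen_config = ConfigGen(do_sample=False, temperature=1.0, top_p=1.0)
    ExperimentAcceptanceRate(
        config=ar_config,
        gen_config=target_gen_config,
        draft_gen_config=draft_gen_config,
    ).run()

The existing offline function would then simply add a call to offline_acceptance().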

Collaborator Author

I guess the relevant method in dsi/main.py will generate Table 1. Please add the method and I will add the code for the AR column.

Owner

It already exists.

Collaborator Author

I see the plots but not Table 1.

Owner

To avoid duplicates, please see this previous comment.

output_target = target_model.generate(**inputs, **target_gen_kwargs)
prompt_len = len(inputs.input_ids[0])

for i in range(prompt_len, len(output_target[0])):
Owner

What is len(output_target[0])? Why use range(prompt_len, len(output_target[0]))? Please document this part.

Owner

It is still unclear what output_target is and why it is accessed with output_target[0]. Please consider adding descriptive variables and/or functions to enhance readability. Documentation would also be helpful.

Comment on lines +76 to +78
inputs["attention_mask"] = torch.tensor(
[[1] * i], device=draft_model.device
)
Owner

Why not use torch.ones?

Suggested change
inputs["attention_mask"] = torch.tensor(
    [[1] * i], device=draft_model.device
)
inputs["attention_mask"] = torch.ones((1, i), device=draft_model.device)

Owner

@keyboardAnt commented

@jmamou, lmk when you want me to do another iteration by hitting the "Re-request review". I added a few comments meanwhile. Also, it seems like the tests only cover the new configuration object, and no tests run the experiment.

@jmamou jmamou requested a review from keyboardAnt July 23, 2024 14:03
Owner

@keyboardAnt keyboardAnt left a comment

Thanks @jmamou. I added several comments. My key concern is that test_single_repeat_all_match passes, indicating a bug. Please fix and add more tests, especially for the core logic. Lmk when you want me to do another iteration by hitting the "Re-request review".

Comment on lines +21 to +29
def __init__(
    self,
    config: ConfigAcceptanteRate,
    gen_config: ConfigGen,
    draft_gen_config: ConfigGen,
):
    self.config: ConfigAcceptanteRate
    super().__init__(config, gen_config)
    self.draft_gen_config: ConfigGen = draft_gen_config
Owner

Please rebase on the main branch. To follow the interface of _Experiment, please use the same __init__ signature. That is:

Suggested change
def __init__(
    self,
    config: ConfigAcceptanteRate,
    gen_config: ConfigGen,
    draft_gen_config: ConfigGen,
):
    self.config: ConfigAcceptanteRate
    super().__init__(config, gen_config)
    self.draft_gen_config: ConfigGen = draft_gen_config
def __init__(self, config: ConfigAcceptanteRate):
    self.config: ConfigAcceptanteRate
    super().__init__(config)

The other arguments (gen_config: ConfigGen and draft_gen_config: ConfigGen) should be passed within config: ConfigAcceptanteRate. After rebasing main, you'll see that ExperimentLatency is now fixed similarly.
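For illustration, a sketch of how the two generation configs could be folded into the config object, in the pydantic Field style used by the suggestion below; the base class name is a placeholder, ConfigGen is assumed to come from the existing dsi configs, and the field names are only indicative:

from pydantic import Field


class ConfigAcceptanteRate(ConfigExperiment):  # base class name assumed
    """Includes all the parameters needed for measuring the acceptance rate
    of a (target, draft, dataset) triplet.
    """

    # ... existing fields (model, draft_model, dataset, subset) ...
    gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the target",
    )
    draft_gen_config: ConfigGen = Field(
        default_factory=ConfigGen,
        title="Configuration of the generation from the drafter",
    )

With that in place, ExperimentAcceptanceRate(config) needs no extra arguments and matches the _Experiment interface.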

"""Includes all the parameters needed for measuring the acceptance rate
of a (target, draft, dataset) triplet.
"""

Owner

Suggested change
draft_gen_config: ConfigGen = Field(
    default_factory=ConfigGen, title="Configuration of the generation from the drafter"
)


# Check if tokenizers are the same
if not self._are_tokenizers_same(target_tokenizer, draft_tokenizer):
    raise ValueError("The target and draft tokenizers are not the same.")
Owner

Could you define a custom exception in dsi/types/exception.py instead of using ValueError? Doing so aligns with best practices by enhancing the specificity and clarity of our error handling.
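For example, a minimal sketch of what dsi/types/exception.py could gain (the exception name is hypothetical):

class IncompatibleTokenizersError(Exception):
    """Raised when the target and draft tokenizers do not share the same vocabulary."""

    def __init__(self) -> None:
        super().__init__("The target and draft tokenizers are not the same.")

The check above would then raise IncompatibleTokenizersError() instead of ValueError.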

model = torch.compile(model) if self.config.draft_compile_model else model
return model, tokenizer

def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
Owner

Why not use a static method? Also, please rename the function to communicate that it only checks the tokens and their corresponding input ids. (The current function returns True for tokenizers with the same vocab, even if they encode an input id to different vectors.)

Suggested change
def _are_tokenizers_same(self, tokenizer1, tokenizer2) -> bool:
@staticmethod
def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
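A minimal sketch of the renamed check, assuming Hugging Face tokenizers, whose get_vocab() returns the mapping from token strings to input ids:

@staticmethod
def _are_vocabs_same(tokenizer1, tokenizer2) -> bool:
    """True only if both tokenizers map the same token strings to the same input ids."""
    return tokenizer1.get_vocab() == tokenizer2.get_vocab()

As noted above, identical vocabs do not guarantee identical embeddings for each input id, which is why the narrower name is clearer.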

# iterate over the tokens generated by the target and
# check whether the draft produces the same token
for i in range(prompt_len, len(output_target[0])):
    inputs["input_ids"] = output_target[0, 0:i].view(1, -1)
Owner

It's not clear what output_target refers to and why we access it with output_target[0, 0:i]. To enhance readability and make the code easier to understand without needing to run debug mode or insert print statements, consider encapsulating this logic in a descriptive function. Here's a proposed function:

def get_input_ids_prefix(output_seqs, tok_pos_last: int):
    """Returns a prefix of a sequence of input ids.

    Args:
        output_seqs: The output from Hugging Face's transformers `model.generate` function, expected to be a sequence-like data structure.
        tok_pos_last: The last token position to include in the returned sequence.
    """
    return output_seqs[0, 0:tok_pos_last]


if output_draft[-1, i] == output_target[-1, i]:
    n_matches[-1] += 1
elif i < len(output_target[0]) - 1:  # new window
    n_matches.append(0)
Owner

Please add tests covering the experiment. It seems like this line makes n_matches = [x, 0], then we extend all_n_matches such that all_n_matches += [x, 0]. At the beginning of the examples loop, we initialize a new n_matches = [0]. Is this a bug? Anyway, it is critical to add such tests.
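To make the concern concrete, here is a standalone sketch of the bookkeeping in this loop with toy tensors (it mirrors the lines above, not the PR's full experiment; the mismatch positions are chosen so that a trailing zero-length window appears):

import torch

output_target = torch.tensor([[9, 9, 5, 6, 7, 8]])  # prompt [9, 9], then 4 generated tokens
output_draft = torch.tensor([[9, 9, 5, 6, 0, 1]])   # the drafter misses the last two positions
prompt_len = 2

n_matches = [0]
for i in range(prompt_len, len(output_target[0])):
    if output_draft[-1, i] == output_target[-1, i]:
        n_matches[-1] += 1
    elif i < len(output_target[0]) - 1:  # new window
        n_matches.append(0)

# n_matches is now [2, 0]; if all_n_matches += n_matches, the trailing 0 of the
# last (incomplete) window gets averaged in, which is what a test should pin down.
print(n_matches)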

Comment on lines +88 to +89
# Since all tokens match, acceptance rate should be 1
assert result.acceptance_rate[0] == 0.8
Owner

It seems like a discrepancy. The test passes, suggesting that there is a bug.

Collaborator Author

@keyboardAnt
For the paper, we agreed to ignore the last matching window for each dataset input, since generation can stop because we reach EOS or the max length, not because of a bad draft-model prediction.
I'm not sure we wrote this in the paper (please correct me).
According to the paper's formula, the AR is 0.8.

Owner

@jmamou
Even when ignoring the test's 4th (and last) match, why is the returned acceptance rate < 1? The first 3 tokens match 100%.

Collaborator Author

@keyboardAnt
For the paper, we agreed to ignore the last matching window, not the last matching token.
In real scenarios, we will never get AR=1 since n cannot be infinite.

Owner

@keyboardAnt keyboardAnt Jul 24, 2024

@jmamou

  1. What is the definition of the acceptance rate for such windows? How is it different from the acceptance rate for single tokens? (For single tokens, the acceptance rate is the probability of accepting the drafter's next-token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.)
  2. Why does ignoring the last matching token effectively count it as a mismatch in test_single_repeat_all_match? Please correct me if I'm wrong here. In the test, the target outputs the sequence of four input ids [0, 1, 2, 3]. Then, the drafter correctly predicts all four, one by one. However, the calculated acceptance rate equals 0.8. Why isn't the acceptance rate 1?

Owner

@keyboardAnt keyboardAnt Jul 24, 2024

@jmamou, as we discussed today over the phone, the acceptance rate has only one definition:

The acceptance rate is the probability of accepting the drafter's next token prediction using an exact match or Miao's rejection sampling algorithm, assuming iid target tokens.

We estimate the acceptance rate as follows. Let A be the sum of the number of drafts accepted over all iterations, except the last iteration per example, over all the examples, in all datasets. Denote the total number of such iterations by N. Note that N <= the total number of examples. If N == the number of examples, the drafter predicts each example in a single iteration. In other words, we accepted all draft tokens without any rejections. In that case, we can raise a warning and return 1 (recommended) or raise an exception. Otherwise, we calculate the average number of accepted drafts n := A/N (dividing the sum by the total number of iterations). The estimated acceptance rate is then 1 - 1 / (1 + n), as mentioned in the paper. The issue with test_single_repeat_all_match is that 1 - 1 / (1 + n) != 1 for any finite n.

In practice, drafters may have a 100% acceptance rate. For example, an instance of the target running on faster hardware as a drafter, with both the target and drafter sampled greedily. But in such cases the perfect acceptance rate is guaranteed, and I do not see a reason to test it using our experiment. On the contrary, in most practical settings, the acceptance rate is < 1, and estimating it as == 1 means we haven't reached the false discovery rate. In other words, we need to consider more examples. Also, it might indicate a bug. So, this is why I think we should raise a warning or exception.
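A sketch of this estimator under the description above; the input format (one list of per-window accepted-draft counts per example) and the function name are assumptions, not existing code:

import warnings


def estimate_acceptance_rate(n_matches_per_example: list[list[int]]) -> float:
    """Estimate the acceptance rate as 1 - 1 / (1 + n), where n is the average
    number of accepted drafts per iteration, excluding each example's last window."""
    kept: list[int] = []
    for windows in n_matches_per_example:
        kept.extend(windows[:-1])  # drop the last window of every example
    if not kept:
        # Every example finished in a single iteration: no rejections observed.
        warnings.warn("No rejections observed; returning an acceptance rate of 1.")
        return 1.0
    n = sum(kept) / len(kept)  # n := A / N
    return 1 - 1 / (1 + n)


# Example: two examples with windows [3, 0, 2] and [4].
# Kept iterations: [3, 0], so n = 1.5 and the estimate is 1 - 1/2.5 = 0.6.
print(estimate_acceptance_rate([[3, 0, 2], [4]]))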
