Add batch inferencing support for GPT2LMHeadModel #7552
Conversation
This enables significantly faster generation.
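For context, the benchmark below builds on the example-usage setup from the PR description. Here is a minimal sketch of that setup, assuming the stock gpt2 checkpoint and two illustrative prompts (the names sentences, tokenizer, and model are what the benchmark uses):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token; reuse EOS and pad on the left so the right-most
# position of every row holds a real token.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained("gpt2")

sentences = ["Hello, my dog is a little", "Today, I"]  # illustrative prompts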
# following above code
data = sentences * 128  # total 256 sentences
model.cuda();
data = [' '.join([x] * 10) for x in data]  # make the prompt longer to be more realistic

from tqdm.auto import tqdm

def test(batchsize=1, max_gen_len=20):
    for i in tqdm(range(0, len(data), batchsize)):
        batch = data[i: i + batchsize]
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        output_sequences = model.generate(
            input_ids=inputs['input_ids'].to(model.device),
            attention_mask=inputs['attention_mask'].to(model.device),
            do_sample=False,  # disable sampling to test if batching affects output
            pad_token_id=tokenizer.eos_token_id,
            max_length=len(inputs['input_ids'][0]) + max_gen_len,  # let it generate longer
        )
        outputs = [tokenizer.decode(x) for x in output_sequences]

# IPython %time magics:
%time test(1, 20)
%time test(32, 20)
%time test(1, 100)
%time test(32, 100)
Hey @cccntu - this is a great addition! I very much like your approach here. With the current implementation, the user would not be able to define his own `position_ids`.
@LysandreJik - this feature was heavily requested by the community (linked a couple of issues below) and I think this is a great way to handle GPT2 batch generation. What do you think?
@cccntu - Great work on this PR! If this PR is merged and you want to help the community a tiny bit more, you could give a short description (similar to what you've done above) on how to do batch generation with GPT2 here: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517. Many people have been asking for this so they would be very glad to see a short forum post about it. Thanks a lot again!
position_ids = kwargs.get("position_ids", None)

if attention_mask is not None and position_ids is None:
    # create position_ids on the fly for batch generation
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    if past:
        position_ids = position_ids[:, -1].unsqueeze(-1)
else:
    position_ids = None
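To see what the added lines compute, here is a small worked example on a left-padded batch (intermediate values shown as comments):

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],   # left-padded prompt
                               [1, 1, 1, 1, 1]])  # full-length prompt
position_ids = attention_mask.long().cumsum(-1) - 1
# tensor([[-1, -1, 0, 1, 2],
#         [ 0,  1, 2, 3, 4]])
position_ids.masked_fill_(attention_mask == 0, 1)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
# Padded positions get a dummy value of 1; they are masked out by the
# attention mask anyway, so any value works. During incremental decoding
# (past is set), only the last column is kept:
# position_ids[:, -1].unsqueeze(-1) -> tensor([[2], [4]])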
@patrickvonplaten
Now that you add

position_ids = kwargs.get("position_ids", None)

I think we can get rid of

else:
    position_ids = None

Also, inspired by the related PR #7355, I think we should move all the `if past` logic together, just above the `return`. Should I add another commit?
No strong opinions on this; I'll let @patrickvonplaten decide whether to merge with or without this change.
@cccntu - yeah, I thought about this as well. The problem with this (and with PR #7355) is that passing position_ids means we would have to incrementally add new tokens to position_ids in generate(), which would be pretty hacky since not all models support position_ids
=> so I'd rather not do this before doing a bigger refactor of generate, see #6949 (will continue on the bigger refactor soon).
We can always change that later without breaking backwards compatibility.
This is great, very simple implementation! Thanks a lot @cccntu.
Awesome, great work @cccntu! It would be amazing if you could write a little description of how your PR works on the forum: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517 - the community would be very thankful I think :-)
@patrickvonplaten Thanks for the suggestions! I just added some description to the forum post. 😃 link to the post for future reference: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517/2
Can you please add batch inferencing for GPT2DoubleHeadsModel too?
* Add support for gpt2 batch inferencing
* add test
* remove typo

Co-authored-by: patrickvonplaten <patrick.v.platen@gmail.com>
I can see how batch generation is now available. I was wondering if there's already a way to do the same but with different arguments of `min_length`/`max_length` for each input.
Hi @spate141, did you mean passing a different `min_length`/`max_length` for each input in the batch?
Actually, the main issue is here: generate() predicts the next token from the logits at the last position (logits[:, -1, :]).
We need those right-most logits to not come from padding, so without modifying generation_utils.py we need to use left-padding, and consequently we need this PR to make sure the positional embedding is correct.
You can also check out the discussions in #3021, or the forum post: https://discuss.huggingface.co/t/batch-generation-with-gpt2/1517/3
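A small illustration of the point above, assuming the left-padded tokenizer setup from earlier in the thread:

# With padding_side="left", pads go to the front of the shorter row, so the
# last column of input_ids is a real token for every row and the last-position
# logits that generate() consumes are meaningful.
inputs = tokenizer(["a longer prompt with several tokens", "short"],
                   return_tensors="pt", padding=True)
# inputs["input_ids"][1]      -> [eos, eos, ..., tokens of "short"]
# inputs["attention_mask"][1] -> [0,   0,   ..., 1]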
I saw the code and I can see why it will fail. #3021 seems informative, I'll take a look. Meanwhile, I found a few alternative ways to get what I mentioned.
@cccntu In your 2nd comment on this pull request, you posted some impressive results on why doing batch generation is ideal, especially when you have a GPU. I'm just trying to figure out whether doing the same in my case is worth the latency when I have to do some post-processing. I'll post some latency results once I have this setup ready.
Update: @cccntu I went with my 1st approach, where I generate text for all texts in a single batch with global min/max values. In cases where my last text chunk in a batch is smaller, meaning its min/max values are smaller than those of the rest of the chunks in the same batch, I just trim the extra tokens (sketched below). Results are impressive so far. Some numbers, in case someone stumbles upon this thread in the future:
Fixed size text batches:
Variable size text batches:
Overall, batch text generation seems very useful (🎉), even though one has to add some overhead on top to manage some use cases.
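A hypothetical sketch of the trimming approach described above (the texts and per-text budgets are illustrative; it reuses the left-padded tokenizer/model from earlier in the thread):

texts = ["first text chunk ...", "second text chunk ...", "short final chunk"]
per_text_max = [100, 100, 30]  # per-text generation budgets (illustrative)
global_max = max(per_text_max)

inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
prompt_len = inputs["input_ids"].shape[1]

output_sequences = model.generate(
    **inputs,
    max_length=prompt_len + global_max,  # generate with the batch-wide budget
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# trim each sequence back to its own per-text budget
trimmed = [seq[: prompt_len + n] for seq, n in zip(output_sequences, per_text_max)]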
@cccntu Thanks for your great work! I stumbled upon this thread and would like to know:
Thanks for the code! I wonder if I could now generate sentences in a batch with other models (BertGeneration, for instance)? Looking forward to your reply!
@cccntu Thanks for your code. By using the correct position_ids in this case, we can do batch inference with the PyTorch model now. But when we export the gpt2 model to ONNX with

onnx_config = GPT2OnnxConfig(model.config)
## or using past_key_values mode
# onnx_config = GPT2OnnxConfig(model.config, use_past=True)

the ONNX model inputs then contain only input_ids and attention_mask, not position_ids.
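For reference, one way to confirm which inputs an exported graph exposes (a sketch; "gpt2.onnx" is a placeholder path):

import onnx

onnx_model = onnx.load("gpt2.onnx")
print([inp.name for inp in onnx_model.graph.input])
# e.g. ['input_ids', 'attention_mask'] -- no 'position_ids',
# matching the report above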
Thank you for the code. I wonder if you have tested whether there is a performance drop when using batch generation, especially when the GPT-2 model is fine-tuned on right-padded data?
What does this PR do?
This adds correct (absolute) positional embeddings when an attention mask is given. The position IDs are calculated from the attention mask.
Fixes #3021
Here is an example usage:
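The original snippet and its printed outputs are not reproduced here; the following is a minimal sketch in the same spirit, reusing the left-padded tokenizer/model/sentences setup from the benchmark comment above (greedy decoding, so results are deterministic):

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in output_sequences:
    # one continuation per prompt, correct even for the shorter, padded prompt
    print(tokenizer.decode(seq, skip_special_tokens=True))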
comment: Maybe I can also update examples/text-generation/run_generation.py, but I don't know much about other models, and it (the code) would be weird if only gpt2 supports batch inferencing.
albert, bert, GPT2, XLM: @LysandreJik
TextGeneration: @TevenLeScao
documentation: @sgugger
@patrickvonplaten