Fp8 Support #1726

Narsil · 2024-04-11T11:12:08Z

What does this PR do?

Fixes # (issue)

Before submitting

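- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?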

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

docs/source/basic_tutorials/launcher.md

HuggingFaceDocBuilderDev · 2024-04-11T11:15:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

drbh

LGTM 👌

server/text_generation_server/utils/layers.py

Co-authored-by: Dong Shin <d0104.shin@gmail.com>

Narsil · 2024-04-12T06:11:58Z

server/text_generation_server/utils/layers.py

+        self.dtype = weight.dtype
+        self.qweight, self.scale = fp8_quantize(weight)
+
+        self.bias = bias.cuda(device) if bias is not None else None


Suggested change

-        self.bias = bias.cuda(device) if bias is not None else None
+        self.bias = bias if bias is not None else None

server/text_generation_server/utils/layers.py

Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm: We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes the weights accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.

Initial Results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
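
To make the algorithm in the commit message above concrete, here is a minimal pure-Python sketch of dynamic per-tensor scaling. It is not the code from this PR or from vLLM: `round_to_e4m3`, `per_tensor_scale`, and `Fp8LinearSketch` are hypothetical names, the E4M3 rounding is simulated with ordinary floats rather than a real FP8 dtype or kernel, and out-of-range values are assumed to saturate at the E4M3 maximum of 448.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x onto a simulated E4M3 grid (assumption: saturate at +/-448)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    exponent = max(e - 1, -6)       # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)    # 3 mantissa bits: 8 steps per power of two
    return round(x / step) * step   # round() breaks ties to even


def per_tensor_scale(values):
    """One scale for the whole tensor: largest magnitude maps to the E4M3 maximum."""
    amax = max(abs(v) for v in values)
    return max(amax, 1e-12) / FP8_E4M3_MAX


def quantize(values, scale):
    return [round_to_e4m3(v / scale) for v in values]


class Fp8LinearSketch:
    """Toy single-output linear layer (hypothetical, for illustration only)."""

    def __init__(self, weight):
        # Weight scale is computed once after loading and stored.
        self.weight_scale = per_tensor_scale(weight)
        self.qweight = quantize(weight, self.weight_scale)

    def forward(self, x):
        # Activation scale is recomputed on every forward pass.
        x_scale = per_tensor_scale(x)
        qx = quantize(x, x_scale)
        acc = sum(qw * qa for qw, qa in zip(self.qweight, qx))
        # Undo both scales after the dot product.
        return acc * self.weight_scale * x_scale


layer = Fp8LinearSketch([0.1, -0.25, 0.5, -1.0])
approx = layer.forward([1.0, 2.0, 3.5, 4.0])
exact = 0.1 * 1.0 - 0.25 * 2.0 + 0.5 * 3.5 - 1.0 * 4.0
print(approx, exact)
```

The quantized result is close to the full-precision dot product but not identical, since each operand keeps only three mantissa bits. Recomputing the activation scale on each call adds a reduction over the input, which is one plausible (unverified) contributor to the FP8 timing reported above.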

Narsil added 8 commits April 11, 2024 10:38

Initial fp8.

50d5a3c

Dummy but working version.

e1e9a18

Updating docs.

be59a6b

typo removal.

6568e48

Marking the flag as really not the fastest and BETA.

eb40f8c

Forgot to update docs.

8cd198a

Update docs2.

c31cb32

Fp8 support.

b24bdb9

Narsil commented Apr 11, 2024

docs/source/basic_tutorials/launcher.md (outdated, resolved)

Update docs/source/basic_tutorials/launcher.md

66195d8

Style.

a352563

drbh previously approved these changes Apr 11, 2024

dongs0104 reviewed Apr 12, 2024

server/text_generation_server/utils/layers.py (outdated, resolved)

Update server/text_generation_server/utils/layers.py

5ef2a48

Co-authored-by: Dong Shin <d0104.shin@gmail.com>

Narsil dismissed drbh’s stale review via 5ef2a48 April 12, 2024 06:11

Narsil commented Apr 12, 2024

server/text_generation_server/utils/layers.py (outdated, resolved)

Update server/text_generation_server/utils/layers.py

666cde0

Narsil merged commit 408dbc4 into main Apr 12, 2024
4 checks passed

Narsil deleted the fp8 branch April 12, 2024 06:13

comaniac mentioned this pull request Apr 16, 2024

[Kernel][FP8] Initial support with dynamic per-tensor scaling vllm-project/vllm#4118

Merged

3 tasks
