
Refactor pytorch engine #2104

Merged: 98 commits into InternLM:main, Sep 10, 2024
Conversation

@grimoire (Collaborator) commented Jul 22, 2024

It is hard to switch kernel implementations in the PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations.

This PR refactors the PyTorch engine. We added an operator abstraction layer that selects the most suitable operator backend based on the current context (a rough sketch of the idea follows the list below).

  • lmdeploy/pytorch/layers: the op abstraction layer; deployed models are built on top of this infrastructure.
  • lmdeploy/pytorch/backends: op implementations are dispatched here according to the device and environment.
  • CUDA graph support, so kernel launch is no longer the main bottleneck.
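Conceptually, a layer builds against an abstract op while a backend registry supplies the concrete implementation for the current device. Below is a minimal sketch of this dispatch idea; the registry, class names, and the op shown are illustrative and not the actual lmdeploy/pytorch internals.

```python
# Illustrative sketch only: map (op name, device type) to an implementation,
# so layers never hard-code a specific kernel.
from typing import Callable, Dict, Tuple

import torch


class OpRegistry:
    def __init__(self):
        self._impls: Dict[Tuple[str, str], Callable] = {}

    def register(self, op_name: str, device_type: str, impl: Callable) -> None:
        self._impls[(op_name, device_type)] = impl

    def dispatch(self, op_name: str, device_type: str) -> Callable:
        # Fall back to the reference implementation when no device-specific
        # kernel was registered.
        return self._impls.get((op_name, device_type),
                               self._impls[(op_name, 'ref')])


registry = OpRegistry()
# Reference implementation in pure PyTorch; a CUDA backend could register a
# fused Triton kernel under ('silu_and_mul', 'cuda') instead.
registry.register('silu_and_mul', 'ref',
                  lambda gate, up: torch.nn.functional.silu(gate) * up)


class SiluAndMul(torch.nn.Module):
    """Layer-level wrapper: resolves the backend impl once, at build time."""

    def __init__(self, device_type: str = 'cpu'):
        super().__init__()
        self._impl = registry.dispatch('silu_and_mul', device_type)

    def forward(self, gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
        return self._impl(gate, up)
```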

# Copyright (c) OpenMMLab. All rights reserved.
import torch

from lmdeploy.pytorch.kernels.cuda import apply_rotary_pos_emb
@yao-fengchen (Collaborator) commented Jul 23, 2024

Here cuda directly uses apply_rotary_pos_emb from lmdeploy.pytorch.kernels.cuda instead of apply_rotary_pos_emb from lmdeploy.pytorch.kernels. Is it possible that apply_rotary_pos_emb in lmdeploy.pytorch.kernels will no longer be used and lmdeploy/pytorch/kernels/apply_rotary_pos_emb.py can be deleted? The same question applies to the other kernels.

@grimoire (Collaborator, Author) replied:

Yes, that is expected.

@grimoire (Collaborator, Author) commented Sep 6, 2024

@zhulinJulia24 Fixed issues 1 and 2.

Unable to reproduce issue 3; issue 4 is caused by awq_kernels.

@zhulinJulia24 (Collaborator) commented Sep 9, 2024

@grimoire
OOM occurs when running the throughput benchmark with 256 concurrency:

root@4c6619530244:/__w/lmdeploy/lmdeploy# CUDA_VISIBLE_DEVICES=2,3 python3 benchmark/profile_throughput.py /nvme/qa_test_models/datasets/ShareGPT_V3_unfiltered_cleaned_split.json /nvme/qa_test_models/internlm/internlm2_5-20b-chat  --backend pytorch --concurrency 256 --num-prompts 3000 --tp 2 
  0%|                                                                                                                                                                                    | 0/3000 [00:00<?, ?it/s][rank0]:[W CUDAGraph.cpp:150] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
  3%|████▎                                                                                                                                                                      | 76/3000 [00:27<07:19,  6.65it/s]2024-09-09 10:57:59,402 - lmdeploy - ERROR - Engine loop failed with error: CUDA out of memory. Tried to allocate 1.40 GiB. GPU 
Traceback (most recent call last):
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/request.py", line 17, in _raise_exception_on_finish
    task.result()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 904, in async_loop
    await self._async_loop()
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 894, in _async_loop
    await __step(True)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 880, in __step
    raise e
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 872, in __step
    raise out
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 817, in _async_loop_background
    await self._async_step_background(
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 717, in _async_step_background
    output = await self._async_model_forward(
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/utils.py", line 236, in __tmp
    return (await func(*args, **kwargs))
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 623, in _async_model_forward
    ret = await __forward(inputs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 601, in __forward
    return await self.model_agent.async_forward(
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 781, in async_forward
    output = self._forward_impl(inputs,
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 748, in _forward_impl
    output = model_forward(
  File "/opt/py3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 154, in model_forward
    output = model(**input_dict)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/backends/cuda/graph_runner.py", line 265, in __call__
    output = runner.forward(**kwargs)
  File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/backends/cuda/graph_runner.py", line 193, in forward
    output = self.output_buffers['logits'][:, :num_tokens].clone()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.40 GiB. GPU


@grimoire (Collaborator, Author) commented Sep 9, 2024

@zhulinJulia24 This is expected; CUDA graph requires more memory. Use a smaller --cache-max-entry-count.

@zhulinJulia24 (Collaborator) replied:

> This is expected; CUDA graph requires more memory. Use a smaller --cache-max-entry-count.

@grimoire

I set --cache-max-entry-count to 0.7 and the throughput benchmark test passed, but api_server still hits OOM. Reproduction steps:

  1. start api server:
    CUDA_VISIBLE_DEVICES=2,3 lmdeploy serve api_server /nvme/qa_test_models/internlm/internlm2_5-20b-chat --session-len 8096 --server-port 23334 --tp 2 --max-batch-size 256 --cache-max-entry-count 0.7 --backend pytorch

  2. start api benchmark:
    python3 benchmark/profile_restful_api.py localhost:23334 /nvme/qa_test_models/internlm/internlm2_5-20b-chat /nvme/qa_test_models/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --stream-output True --num-prompts 2000 --concurrency 256

@zhulinJulia24 (Collaborator) commented:
It seems FTL is still a little high.

@lvhan028 changed the title from "Custom backend support" to "Refactor pytorch engine" on Sep 9, 2024
@@ -212,6 +212,8 @@ class PytorchEngineConfig:
thread_safe: bool = False
enable_prefix_caching: bool = False
device_type: str = 'cuda'
eager_mode: bool = False
custom_module_map: str = None
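For context, here is a minimal usage sketch of the two new options, assuming the usual lmdeploy pipeline entry point; the model path and module-map path are placeholders, and the snippet is illustrative rather than taken from this PR.

```python
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    eager_mode=True,                          # skip CUDA graph capture
    custom_module_map='./custom_modules.py',  # .py file providing the module map
)
pipe = pipeline('internlm/internlm2_5-20b-chat', backend_config=backend_config)
print(pipe(['Hello!']))
```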
Collaborator commented on the diff:

Why is MODULE_MAP here designed as Dict[str, str] instead of Dict[str, class]?

@grimoire (Collaborator, Author) replied:

custom_module_map is the path to a .py file that contains the map and the custom modules.
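For illustration, such a file could look roughly like the following; the class, the mapped key, and the exact shape of MODULE_MAP are assumptions made for this sketch, not the documented contract.

```python
# custom_modules.py -- hypothetical custom module map file.
import torch


class MyCustomAttention(torch.nn.Module):
    """User-provided replacement module (placeholder implementation)."""

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states


# Qualified-name strings on both sides, per the Dict[str, str] design
# discussed below.
MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'custom_modules.MyCustomAttention',
}
```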

Collaborator replied:

I mean the internal MODULE_MAP dict. With the previous patching approach, using strings made sense, but now that we load the weights and run inference ourselves, keeping strings doesn't add much value. Passing the classes directly would be simpler and more direct.

@grimoire (Collaborator, Author) replied:

Loading all the classes up front is slow.
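In other words, string values let the engine resolve only the classes it actually needs, instead of importing every custom module (and its dependencies) up front. A rough sketch of that lazy resolution, with illustrative names:

```python
import importlib


def load_class(path: str):
    """Resolve 'package.module.ClassName' to the class object on demand."""
    module_name, _, class_name = path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


MODULE_MAP = {
    'LlamaAttention': 'my_models.llama.CustomLlamaAttention',
    'InternLM2Attention': 'my_models.internlm2.CustomInternLM2Attention',
}

# Only the entry that is actually used triggers an import:
# attn_cls = load_class(MODULE_MAP['LlamaAttention'])
```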


1. The custom Triton kernel allows us to incorporate new features, such as `paged_attention_fwd`.
2. Fused kernels offer superior performance compared to the pure PyTorch implementation.
class GemmaModelConfigBuilder(AutoModelConfigBuilder):
Collaborator commented:

Would it be better to keep the examples consistent throughout? Here it is Gemma, but later it is llama. Also, could the example end with a complete, runnable Python script, from registering the model to running one inference?

@grimoire (Collaborator, Author) replied:

We don't have a config builder for llama.
The documentation will be improved in future PRs.

@zhulinJulia24 (Collaborator) left a review:

LGTM

@lvhan028 merged commit e8a1a33 into InternLM:main on Sep 10, 2024. 5 checks passed.
@zhyncs (Collaborator) commented Sep 10, 2024

Nice to see the PyTorch Engine being refactored. I am looking forward to the performance of the new PyTorch Engine when CUDA Graph is enabled.
Here are the latest builds:
https://github.com/zhyncs/lmdeploy-build/actions/runs/10789704734
https://github.com/zhyncs/lmdeploy-build/actions/runs/10789706131

@zhyncs (Collaborator) commented Sep 10, 2024

@lvhan028 @grimoire May we release a new version soon? I believe it's a great upgrade.

@lvhan028 (Collaborator) replied:
Yes, we are working on v0.6.0. It will be released this week.

Labels: enhancement (New feature or request)
8 participants