Refactor pytorch engine #2104

Merged (98 commits) on Sep 10, 2024

Commits (98)
038f6bf
attn layer
Jul 16, 2024
68936c9
move to backend
Jul 17, 2024
ccdb3ea
add base layer
Jul 19, 2024
b123e4d
finish llama base
Jul 20, 2024
5a09d9f
add lora and w8a8
Jul 22, 2024
4755b1e
support awq
Jul 23, 2024
60df32f
add add_rms_norm kernel
Jul 23, 2024
67aba31
optimize step context
Jul 24, 2024
4312826
attn meta as input
Jul 24, 2024
ef092e5
add cuda graph support
Jul 25, 2024
9fefda5
disable one of mha kernel
Jul 25, 2024
839f0be
share graph pool
Jul 26, 2024
3345181
del graph
Jul 26, 2024
6746e67
update docstring
Jul 26, 2024
e5a790b
awq cudagraph
Jul 30, 2024
fbc0912
merge main
Jul 30, 2024
a9ec3fa
support llava for llama
Jul 30, 2024
67e427a
fix adapter
Jul 30, 2024
5158b96
fix support cudagraph flag
Jul 31, 2024
580cdd0
support lora cudagraph
Jul 31, 2024
449f947
support logit softcapping
Aug 1, 2024
0e16e69
support transformers 4.43
Aug 1, 2024
e6a3048
fix ut
Aug 2, 2024
364737b
Merge branch 'main' into torch-layers
Aug 5, 2024
93d3746
fix dynamic ntk cudagraph
Aug 5, 2024
2dfcc6f
add moe support
Aug 5, 2024
93c64ee
add custom module support
Aug 6, 2024
3622635
optimize awq kernel
Aug 14, 2024
973d222
optimize attention
Aug 16, 2024
871b788
fix graph runner
Aug 19, 2024
09149ac
optimize prefill
Aug 19, 2024
935c25c
dynamic prefill interval
Aug 20, 2024
c363832
fix response
Aug 20, 2024
b5bb49f
optmize prefill
Aug 21, 2024
8cf2ab2
adjust grid of paged attention
grimoire Aug 21, 2024
3ce4e2d
add attention stages
Aug 21, 2024
30c2066
support llama3
Aug 21, 2024
047e58e
optimize apply rotary
Aug 22, 2024
6ef049d
rename
Aug 22, 2024
7b75a65
fix sampling
Aug 22, 2024
c21bf95
merge main
Aug 22, 2024
b8f7f54
remove print
Aug 22, 2024
40fc417
prepare for new weight loader
Aug 24, 2024
9df5161
refactor add model
Aug 27, 2024
476bce2
optimize nn
Aug 28, 2024
eca95ff
fix linear device
Aug 28, 2024
d14289e
support baichuan 7b 13b
Aug 28, 2024
8424a21
support deepseekv2 no-tp
Aug 28, 2024
6395a00
support deepseek v2 tp
Aug 29, 2024
d34a2b4
add log
Aug 29, 2024
fc7a78a
fix ut
Aug 29, 2024
4e87115
merge main
Aug 29, 2024
9a6855d
support chatglm
Aug 29, 2024
694bb04
support llava
Aug 29, 2024
6d47b63
add falcon
Aug 29, 2024
49e51ca
add internlm2 and mistral
Aug 30, 2024
8550f04
add gemma/gemma2
Aug 30, 2024
6889cc6
add deepseek, qwen1
Aug 30, 2024
481182d
remove request timeout
Aug 30, 2024
b1c4ff7
merge main
Sep 2, 2024
71d55a4
add qwen2, qwen-moe
Sep 2, 2024
90a4a63
add starcoder2 phi-3 phi-3 vision
Sep 2, 2024
5f772ab
support phi3 moe
Sep 2, 2024
1d3b27d
support dbrx
Sep 3, 2024
5607566
support internvl
Sep 3, 2024
494649d
support merged awq weight
Sep 3, 2024
5f33ccf
add cogvlm
Sep 3, 2024
17abf91
update docs
Sep 3, 2024
c6824d5
fused layernorm
Sep 3, 2024
985c769
add gelu and mul
Sep 3, 2024
5de7cd9
support triton==3.0.0
Sep 4, 2024
7716147
update names
Sep 4, 2024
9daa0d1
fix
Sep 4, 2024
7d8ac69
cogvlm2
Sep 4, 2024
e5a6c37
fix
Sep 4, 2024
2236f28
fix
Sep 5, 2024
3327d6d
fix internlm2 awq
Sep 5, 2024
fe18df4
rename
Sep 5, 2024
b74a22d
fix a hanging problem when using cli serve mode and device ascend on …
CyCle1024 Sep 5, 2024
2f29a78
Merge pull request #3 from CyCle1024/fix-ascend-exit-uvicorn
grimoire Sep 5, 2024
d810128
raise -> return
Sep 5, 2024
27ec376
optimize moe
Sep 5, 2024
6e4da93
Merge branch 'torch-layers' of github.com:grimoire/lmdeploy into torc…
Sep 5, 2024
b98153e
fix linear awq bias, default awq kernel
Sep 6, 2024
77616aa
fix
Sep 6, 2024
3bfcae2
optimize default awq
Sep 6, 2024
50f5b3c
fix llama rope, add internlm
Sep 6, 2024
e61ddcf
optimize decoding
Sep 6, 2024
a27bf51
recovery attention
Sep 6, 2024
0387730
fix fill kv cache
Sep 6, 2024
195ed83
fix internlm oom
Sep 9, 2024
9b5bc43
fix llama3 memory usage
Sep 9, 2024
3020ada
remove float deepseekv2
Sep 9, 2024
331e2c0
fix llama3
Sep 9, 2024
aa9c722
update smooth quant flag
Sep 9, 2024
adbc531
fix w8a8
Sep 10, 2024
84e9b01
merge main
Sep 10, 2024
1fae365
fix w8a8 tp
Sep 10, 2024

Files changed

473 changes: 134 additions & 339 deletions docs/en/advance/pytorch_new_model.md

Large diffs are not rendered by default.

22 changes: 0 additions & 22 deletions docs/en/inference/pytorch.md
@@ -47,28 +47,6 @@ ModelAgent consists of two components:
1. **patched_model**: This is the transformer model after patching. In comparison to the original model, the patched model incorporates additional features such as Tensor Parallelism, quantization, and high-performance kernels.
2. **cache_engine**: This component manages the caches. It receives commands from the Scheduler and performs host-device page swaps. Only GPU blocks are utilized for caching key/value pairs and adapters.

## Patching

To make it easier to deploy a new model, we developed a tool that patches modules.

For example, if we want to reimplement the forward method of `LlamaAttention`:

```python
class CustomLlamaAttention(nn.Module):
def forward(self, ...):
# custom forward
```

We register the implementation above into `lmdeploy.pytorch.models.module_map`:

```python
MODULE_MAP.update({
'transformers.models.llama.modeling_llama.LlamaAttention':
'qualname.to.CustomLlamaAttention'})
```

`ModelAgent` would then load and patch `LlamaAttention` with `CustomLlamaAttention` while leaving everything else unchanged. You can then perform inference with the new implementation. For more details about model patching, please refer to [support new model](../advance/pytorch_new_model.md).
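
To make the patch step concrete, here is a minimal sketch of what swapping the attention class could look like, assuming the custom class has already been resolved from its qualname string; the helper below is illustrative and is not the actual `ModelAgent` code.

```python
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaAttention


def patch_attention(model: nn.Module, custom_cls: type) -> nn.Module:
    """Rebind every LlamaAttention submodule to `custom_cls` in place.

    Weights and buffers are untouched; only the methods (including
    forward) change, which is the effect the module map is after.
    """
    for module in model.modules():
        if type(module) is LlamaAttention:
            module.__class__ = custom_cls
    return model
```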

## Features

`lmdeploy.pytorch` supports new features including:
465 changes: 132 additions & 333 deletions docs/zh_cn/advance/pytorch_new_model.md

Large diffs are not rendered by default.

22 changes: 0 additions & 22 deletions docs/zh_cn/inference/pytorch.md
@@ -47,28 +47,6 @@ ModelAgent has two important components:
1. patched_model is the patched transformer model. The patched model adds support for various features, including higher-performance submodule implementations, TP, quantization, and so on.
2. cache_engine is the module for cache allocation and swapping. It receives swap requests from the scheduler and performs host-device memory swaps, adapter loading, and related work.

## Patching

To lower the barrier for adding new models, we implemented a simple patch mechanism that makes it easy to swap in replacement implementations.

Taking `LlamaAttention.forward` of the Llama model as an example, we can write a new forward implementation:

```python
class CustomLlamaAttention(nn.Module):
def forward(self, ...):
# custom forward
```

Then register the module mapping in `lmdeploy.pytorch.models.module_map`:

```python
MODULE_MAP.update({
'transformers.models.llama.modeling_llama.LlamaAttention':
'qualname.to.CustomLlamaAttention'})
```

The patched model will then use the new forward implementation. Features such as TP and quantization also rely on the patch mechanism; please read [lmdeploy.pytorch new model support](../advance/pytorch_new_model.md) for more details.

## Features

- **Continuous Batching**: Since input sequences have different lengths, batching usually requires padding the inputs. Such padding increases the amount of computation in later operations, hurts speed, and significantly increases memory usage. Following the approach of many other mature frameworks, lmdeploy.pytorch adopts continuous batching to pack the inputs contiguously and avoid the wasted resources (see the packing sketch below).
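
As a rough illustration of the packing idea (a standalone sketch, not lmdeploy code): variable-length requests are concatenated into one flat token tensor plus cumulative sequence offsets instead of being padded to the longest length.

```python
import torch
import torch.nn.functional as F

# Three requests of different lengths; values are dummy token ids.
seqs = [torch.arange(5), torch.arange(3), torch.arange(7)]

# Continuous batching: concatenate without padding and keep the offsets.
input_ids = torch.cat(seqs)                                # shape [15]
seq_lens = torch.tensor([len(s) for s in seqs])            # [5, 3, 7]
cu_seqlens = torch.cumsum(F.pad(seq_lens, (1, 0)), dim=0)  # [0, 5, 8, 15]

# A padded batch would be shape [3, 7], wasting 21 - 15 = 6 positions.
print(input_ids.shape, cu_seqlens.tolist())
```
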
2 changes: 2 additions & 0 deletions lmdeploy/lite/apis/smooth_quant.py
@@ -158,6 +158,8 @@ def smooth_quant(model: str,
model.save_pretrained(work_dir,
max_shard_size='2GB',
safe_serialization=False)
model.config.update(
dict(quantization_config=dict(quant_method='smooth_quant')))
tokenizer.save_pretrained(work_dir)

shutil.copy(MODEL_PATH_MAP[type(model).__name__], work_dir)
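
For context, the two added lines record the quantization method in the exported config.json. A minimal sketch of how a loader could check that flag later, assuming the checkpoint layout above (the helper is illustrative, not the actual lmdeploy loader):

```python
import json
import os


def is_smooth_quant_checkpoint(model_path: str) -> bool:
    """Return True if the checkpoint was exported by smooth_quant."""
    with open(os.path.join(model_path, 'config.json')) as f:
        config = json.load(f)
    quant_config = config.get('quantization_config') or {}
    return quant_config.get('quant_method') == 'smooth_quant'
```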
2 changes: 2 additions & 0 deletions lmdeploy/messages.py
@@ -213,6 +213,8 @@ class PytorchEngineConfig:
thread_safe: bool = False
enable_prefix_caching: bool = False
device_type: str = 'cuda'
eager_mode: bool = False
custom_module_map: str = None
RunningLeon marked this conversation as resolved.

Collaborator:

Why is MODULE_MAP designed as Dict[str, str] here instead of Dict[str, class]?


Collaborator (Author):

custom_module_map is a path to the .py file that contains the map and the custom modules.


Collaborator:

I mean the internal MODULE_MAP dict. Using strings was understandable with the old patching approach, but now that we load the weights and run inference ourselves, there is not much point in sticking with strings. Passing the classes directly would be simpler and more direct.


Collaborator (Author):

Loading all the classes eagerly is slow.
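
To illustrate the point, a hedged sketch of lazy qualname resolution: with Dict[str, str], a class is imported only when its key is actually looked up, so importing the map module stays cheap as the number of supported models grows (the map entry below is hypothetical, not the real MODULE_MAP):

```python
import importlib

# Values are dotted paths; nothing behind them is imported at this point.
MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'my_package.attention.CustomLlamaAttention',  # hypothetical target
}


def resolve(qualname: str) -> type:
    """Import a class from its dotted path only when it is first needed."""
    module_name, _, class_name = qualname.rpartition('.')
    return getattr(importlib.import_module(module_name), class_name)
```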

download_dir: str = None
revision: str = None
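
For reference, a hedged usage sketch of the two new fields added above. The semantics are inferred from this PR (eager_mode appears to disable the CUDA graph path; custom_module_map is a path to a user .py file, as stated in the thread above); the model name and file path are placeholders.

```python
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    eager_mode=True,  # run without CUDA graph capture
    custom_module_map='/path/to/custom_module_map.py',  # user-defined rewrites
)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hello!']))
```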
