Refactor pytorch engine #2104

Merged (98 commits) on Sep 10, 2024

Commits (98)
038f6bf
attn layer
Jul 16, 2024
68936c9
move to backend
Jul 17, 2024
ccdb3ea
add base layer
Jul 19, 2024
b123e4d
finish llama base
Jul 20, 2024
5a09d9f
add lora and w8a8
Jul 22, 2024
4755b1e
support awq
Jul 23, 2024
60df32f
add add_rms_norm kernel
Jul 23, 2024
67aba31
optimize step context
Jul 24, 2024
4312826
attn meta as input
Jul 24, 2024
ef092e5
add cuda graph support
Jul 25, 2024
9fefda5
disable one of mha kernel
Jul 25, 2024
839f0be
share graph pool
Jul 26, 2024
3345181
del graph
Jul 26, 2024
6746e67
update docstring
Jul 26, 2024
e5a790b
awq cudagraph
Jul 30, 2024
fbc0912
merge main
Jul 30, 2024
a9ec3fa
support llava for llama
Jul 30, 2024
67e427a
fix adapter
Jul 30, 2024
5158b96
fix support cudagraph flag
Jul 31, 2024
580cdd0
support lora cudagraph
Jul 31, 2024
449f947
support logit softcapping
Aug 1, 2024
0e16e69
support transformers 4.43
Aug 1, 2024
e6a3048
fix ut
Aug 2, 2024
364737b
Merge branch 'main' into torch-layers
Aug 5, 2024
93d3746
fix dynamic ntk cudagraph
Aug 5, 2024
2dfcc6f
add moe support
Aug 5, 2024
93c64ee
add custom module support
Aug 6, 2024
3622635
optimize awq kernel
Aug 14, 2024
973d222
optimize attention
Aug 16, 2024
871b788
fix graph runner
Aug 19, 2024
09149ac
optimize prefill
Aug 19, 2024
935c25c
dynamic prefill interval
Aug 20, 2024
c363832
fix response
Aug 20, 2024
b5bb49f
optmize prefill
Aug 21, 2024
8cf2ab2
adjust grid of paged attention
grimoire Aug 21, 2024
3ce4e2d
add attention stages
Aug 21, 2024
30c2066
support llama3
Aug 21, 2024
047e58e
optimize apply rotary
Aug 22, 2024
6ef049d
rename
Aug 22, 2024
7b75a65
fix sampling
Aug 22, 2024
c21bf95
merge main
Aug 22, 2024
b8f7f54
remove print
Aug 22, 2024
40fc417
prepare for new weight loader
Aug 24, 2024
9df5161
refactor add model
Aug 27, 2024
476bce2
optimize nn
Aug 28, 2024
eca95ff
fix linear device
Aug 28, 2024
d14289e
support baichuan 7b 13b
Aug 28, 2024
8424a21
support deepseekv2 no-tp
Aug 28, 2024
6395a00
support deepseek v2 tp
Aug 29, 2024
d34a2b4
add log
Aug 29, 2024
fc7a78a
fix ut
Aug 29, 2024
4e87115
merge main
Aug 29, 2024
9a6855d
support chatglm
Aug 29, 2024
694bb04
support llava
Aug 29, 2024
6d47b63
add falcon
Aug 29, 2024
49e51ca
add internlm2 and mistral
Aug 30, 2024
8550f04
add gemma/gemma2
Aug 30, 2024
6889cc6
add deepseek, qwen1
Aug 30, 2024
481182d
remove request timeout
Aug 30, 2024
b1c4ff7
merge main
Sep 2, 2024
71d55a4
add qwen2, qwen-moe
Sep 2, 2024
90a4a63
add starcoder2 phi-3 phi-3 vision
Sep 2, 2024
5f772ab
support phi3 moe
Sep 2, 2024
1d3b27d
support dbrx
Sep 3, 2024
5607566
support internvl
Sep 3, 2024
494649d
support merged awq weight
Sep 3, 2024
5f33ccf
add cogvlm
Sep 3, 2024
17abf91
update docs
Sep 3, 2024
c6824d5
fused layernorm
Sep 3, 2024
985c769
add gelu and mul
Sep 3, 2024
5de7cd9
support triton==3.0.0
Sep 4, 2024
7716147
update names
Sep 4, 2024
9daa0d1
fix
Sep 4, 2024
7d8ac69
cogvlm2
Sep 4, 2024
e5a6c37
fix
Sep 4, 2024
2236f28
fix
Sep 5, 2024
3327d6d
fix internlm2 awq
Sep 5, 2024
fe18df4
rename
Sep 5, 2024
b74a22d
fix a hanging problem when using cli serve mode and device ascend on …
CyCle1024 Sep 5, 2024
2f29a78
Merge pull request #3 from CyCle1024/fix-ascend-exit-uvicorn
grimoire Sep 5, 2024
d810128
raise -> return
Sep 5, 2024
27ec376
optimize moe
Sep 5, 2024
6e4da93
Merge branch 'torch-layers' of github.com:grimoire/lmdeploy into torc…
Sep 5, 2024
b98153e
fix linear awq bias, default awq kernel
Sep 6, 2024
77616aa
fix
Sep 6, 2024
3bfcae2
optimize default awq
Sep 6, 2024
50f5b3c
fix llama rope, add internlm
Sep 6, 2024
e61ddcf
optimize decoding
Sep 6, 2024
a27bf51
recovery attention
Sep 6, 2024
0387730
fix fill kv cache
Sep 6, 2024
195ed83
fix internlm oom
Sep 9, 2024
9b5bc43
fix llama3 memory usage
Sep 9, 2024
3020ada
remove float deepseekv2
Sep 9, 2024
331e2c0
fix llama3
Sep 9, 2024
aa9c722
update smooth quant flag
Sep 9, 2024
adbc531
fix w8a8
Sep 10, 2024
84e9b01
merge main
Sep 10, 2024
1fae365
fix w8a8 tp
Sep 10, 2024

Files changed

473 changes: 134 additions & 339 deletions docs/en/advance/pytorch_new_model.md

Large diffs are not rendered by default.

22 changes: 0 additions & 22 deletions docs/en/inference/pytorch.md
@@ -47,28 +47,6 @@ ModelAgent consists of two components:
1. **patched_model**: This is the transformer model after patching. In comparison to the original model, the patched model incorporates additional features such as Tensor Parallelism, quantization, and high-performance kernels.
2. **cache_engine**: This component manages the caches. It receives commands from the Scheduler and performs host-device page swaps. Only GPU blocks are utilized for caching key/value pairs and adapters.

## Patching

To make it easier to deploy a new model, we developed a tool that patches modules.

For example, if we want to reimplement the forward method of `LlamaAttention`:

```python
class CustomLlamaAttention(nn.Module):
def forward(self, ...):
# custom forward
```

We register the implementation above into `lmdeploy.pytorch.models.module_map`:

```python
MODULE_MAP.update({
'transformers.models.llama.modeling_llama.LlamaAttention':
'qualname.to.CustomLlamaAttention'})
```

`ModelAgent` would then load and patch `LlamaAttention` with `CustomLlamaAttention` while leaving everything else unchanged. You can then perform inference with the new implementation. For more details about model patching, please refer to [support new model](../advance/pytorch_new_model.md).
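
To make the patch step concrete, here is a minimal sketch of what swapping the attention class could look like, assuming the custom class has already been resolved from its qualname string; the helper below is illustrative and is not the actual `ModelAgent` code.

```python
import torch.nn as nn
from transformers.models.llama.modeling_llama import LlamaAttention


def patch_attention(model: nn.Module, custom_cls: type) -> nn.Module:
    """Rebind every LlamaAttention submodule to `custom_cls` in place.

    Weights and buffers are untouched; only the methods (including
    forward) change, which is the effect the module map is after.
    """
    for module in model.modules():
        if type(module) is LlamaAttention:
            module.__class__ = custom_cls
    return model
```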

## Features

`lmdeploy.pytorch` supports new features including:
465 changes: 132 additions & 333 deletions docs/zh_cn/advance/pytorch_new_model.md

Large diffs are not rendered by default.

22 changes: 0 additions & 22 deletions docs/zh_cn/inference/pytorch.md
@@ -47,28 +47,6 @@ ModelAgent has two important components:
1. patched_model is the patched transformer model. The patched model adds support for various features, including higher-performance submodule implementations, TP, quantization, and so on.
2. cache_engine is the module for cache allocation and swapping. It receives swap requests from the scheduler and performs host-device memory swaps, adapter loading, and related work.

## Patching

To lower the barrier for adding new models, we implemented a simple patch mechanism that makes it easy to swap in replacement implementations.

Taking `LlamaAttention.forward` of the Llama model as an example, we can write a new forward implementation:

```python
class CustomLlamaAttention(nn.Module):
def forward(self, ...):
# custom forward
```

Then register the module mapping in `lmdeploy.pytorch.models.module_map`:

```python
MODULE_MAP.update({
'transformers.models.llama.modeling_llama.LlamaAttention':
'qualname.to.CustomLlamaAttention'})
```

The patched model will then use the new forward implementation. Features such as TP and quantization also rely on the patch mechanism; please read [lmdeploy.pytorch new model support](../advance/pytorch_new_model.md) for more details.

## Features

- **Continuous Batching**: Since input sequences have different lengths, batching usually requires padding the inputs. Such padding increases the amount of computation in later operations, hurts speed, and significantly increases memory usage. Following the approach of many other mature frameworks, lmdeploy.pytorch adopts continuous batching to pack the inputs contiguously and avoid the wasted resources (see the packing sketch below).
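
As a rough illustration of the packing idea (a standalone sketch, not lmdeploy code): variable-length requests are concatenated into one flat token tensor plus cumulative sequence offsets instead of being padded to the longest length.

```python
import torch
import torch.nn.functional as F

# Three requests of different lengths; values are dummy token ids.
seqs = [torch.arange(5), torch.arange(3), torch.arange(7)]

# Continuous batching: concatenate without padding and keep the offsets.
input_ids = torch.cat(seqs)                                # shape [15]
seq_lens = torch.tensor([len(s) for s in seqs])            # [5, 3, 7]
cu_seqlens = torch.cumsum(F.pad(seq_lens, (1, 0)), dim=0)  # [0, 5, 8, 15]

# A padded batch would be shape [3, 7], wasting 21 - 15 = 6 positions.
print(input_ids.shape, cu_seqlens.tolist())
```
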
2 changes: 2 additions & 0 deletions lmdeploy/lite/apis/smooth_quant.py
@@ -158,6 +158,8 @@ def smooth_quant(model: str,
model.save_pretrained(work_dir,
max_shard_size='2GB',
safe_serialization=False)
model.config.update(
dict(quantization_config=dict(quant_method='smooth_quant')))
tokenizer.save_pretrained(work_dir)

shutil.copy(MODEL_PATH_MAP[type(model).__name__], work_dir)
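
For context, the two added lines record the quantization method in the exported config.json. A minimal sketch of how a loader could check that flag later, assuming the checkpoint layout above (the helper is illustrative, not the actual lmdeploy loader):

```python
import json
import os


def is_smooth_quant_checkpoint(model_path: str) -> bool:
    """Return True if the checkpoint was exported by smooth_quant."""
    with open(os.path.join(model_path, 'config.json')) as f:
        config = json.load(f)
    quant_config = config.get('quantization_config') or {}
    return quant_config.get('quant_method') == 'smooth_quant'
```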
2 changes: 2 additions & 0 deletions lmdeploy/messages.py
@@ -213,6 +213,8 @@ class PytorchEngineConfig:
thread_safe: bool = False
enable_prefix_caching: bool = False
device_type: str = 'cuda'
eager_mode: bool = False
custom_module_map: str = None
RunningLeon marked this conversation as resolved.

Collaborator:

Why is MODULE_MAP designed as Dict[str, str] here instead of Dict[str, class]?


Collaborator (Author):

custom_module_map is a path to the .py file that contains the map and the custom modules.


Collaborator:

I mean the internal MODULE_MAP dict. Using strings was understandable with the old patching approach, but now that we load the weights and run inference ourselves, there is not much point in sticking with strings. Passing the classes directly would be simpler and more direct.


Collaborator (Author):

Loading all the classes eagerly is slow.
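
To illustrate the point, a hedged sketch of lazy qualname resolution: with Dict[str, str], a class is imported only when its key is actually looked up, so importing the map module stays cheap as the number of supported models grows (the map entry below is hypothetical, not the real MODULE_MAP):

```python
import importlib

# Values are dotted paths; nothing behind them is imported at this point.
MODULE_MAP = {
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'my_package.attention.CustomLlamaAttention',  # hypothetical target
}


def resolve(qualname: str) -> type:
    """Import a class from its dotted path only when it is first needed."""
    module_name, _, class_name = qualname.rpartition('.')
    return getattr(importlib.import_module(module_name), class_name)
```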

download_dir: str = None
revision: str = None
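
For reference, a hedged usage sketch of the two new fields added above. The semantics are inferred from this PR (eager_mode appears to disable the CUDA graph path; custom_module_map is a path to a user .py file, as stated in the thread above); the model name and file path are placeholders.

```python
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    eager_mode=True,  # run without CUDA graph capture
    custom_module_map='/path/to/custom_module_map.py',  # user-defined rewrites
)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
print(pipe(['Hello!']))
```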
