Cannot build quantized int8 models for Phi3 128k models [TensorRT-LLM 0.12.0] #2214
Labels: bug
System Info
Who can help?
@Tracin
@ncomly-nvidia
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Steps to reproduce the behavior:
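The exact commands were not preserved in this copy of the report. A representative first step, following the examples/phi README in the TensorRT-LLM repo, is sketched below; the model path, output directory, and the choice of INT8 weight-only (rather than SmoothQuant) are assumptions on my part. The trtllm-build step that then fails is shown under "actual behavior".

```bash
# Representative sketch, not the reporter's exact command.
# Convert the HF checkpoint to a TensorRT-LLM checkpoint with
# INT8 weight-only quantization (paths are placeholders).
python3 examples/phi/convert_checkpoint.py \
    --model_dir ./Phi-3-mini-128k-instruct \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8 \
    --output_dir ./phi3-mini-128k-int8-ckpt
```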
Expected behavior
The engine builds successfully, as per the Phi3 support matrix in TensorRT-LLM v0.12.0: https://github.com/NVIDIA/TensorRT-LLM/tree/28fb9aacaa5a05494635194a9cbb264da9a744bd/examples/phi
actual behavior
Errors occur when building the engine with trtllm-build. Command:
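The original invocation was not captured here; a representative trtllm-build call over the converted checkpoint sketched above would look like the following (output paths are placeholders, and shape/batch limits are omitted):

```bash
# Representative sketch: build the engine from the INT8 weight-only
# checkpoint. This is the step that fails for the 128k variants.
trtllm-build \
    --checkpoint_dir ./phi3-mini-128k-int8-ckpt \
    --output_dir ./phi3-mini-128k-int8-engine \
    --gemm_plugin float16
```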
Errors:
additional notes
TensorRT-LLM is running in a VM with GPU passthrough (Ubuntu host, Ubuntu guest). The models were downloaded from Microsoft's official Phi3 collection on Hugging Face (e.g. https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3) and are the latest versions.
Building and running the models (Phi3 mini 4k, mini 128k, medium 4k, medium 128k) without quantization (dtype bf16 or fp16) works as intended in the same VM environment, whereas with int8 quantization only the 4k models (Phi3 mini 4k, medium 4k) build. I believe the problem lies in the RoPE implementation within TensorRT-LLM, as suggested by the error message and by the fact that only the 128k Phi3 models fail with int8 quantization.
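As a quick illustration of why RoPE is a plausible culprit (my assumption, not confirmed by the maintainers): the 128k checkpoints declare a longrope rope_scaling block in their config.json, which the 4k checkpoints lack, and that appears to be the main configuration difference between the variants that build and those that fail. A small check, assuming transformers is installed:

```bash
# Prints the rope_scaling block from each HF config; at the time of writing
# the 128k variants report type "longrope" while the 4k variants report None.
python3 - <<'EOF'
from transformers import AutoConfig

for name in ("microsoft/Phi-3-mini-4k-instruct",
             "microsoft/Phi-3-mini-128k-instruct"):
    cfg = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name, "->", getattr(cfg, "rope_scaling", None))
EOF
```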
I hope this can be resolved, as models with longer contexts are very useful in practice. If possible, please also consider officially supporting int4 quantization for Phi3 models in the future. Thanks!
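For reference, if int4 were exposed the same way as int8 weight-only, the only change to the conversion sketch above would be the precision flag; whether this actually works for Phi3 is exactly what is being requested here (hypothetical, untested):

```bash
# Hypothetical int4 weight-only conversion: same flags as the int8 sketch
# above with the precision swapped. Not an officially supported path for Phi3.
python3 examples/phi/convert_checkpoint.py \
    --model_dir ./Phi-3-mini-128k-instruct \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --output_dir ./phi3-mini-128k-int4-ckpt
```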