[Nougat] Accuracy problem: different output from both float32 and bfloat16 TRT-LLM engines vs. the float32 Hugging Face original model #2207

Open
ehuaa opened this issue Sep 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments


ehuaa commented Sep 9, 2024

System Info

Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version (GPU?): 2.4.0+cu121 (True)
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Driver Version: 535.161.08
CUDA Version: 12.5
GPU: A40 single card

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Follow the TensorRT-LLM Linux installation tutorial to install Docker and TensorRT-LLM 0.12.0:
    https://nvidia.github.io/TensorRT-LLM/installation/linux.html
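
    For reference, the install from that page comes down to roughly the following (a sketch paraphrased from the linked docs; check the page for the exact dependency list and version pins):

     sudo apt-get -y install libopenmpi-dev
     pip3 install tensorrt_llm==0.12.0 --extra-index-url https://pypi.nvidia.com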

  2. Download my own fine-tuned version of Nougat, which has the same architecture as nougat-base 0.1.0; only the model weights were changed.

     # Clone the fine-tuned nougat model
     git lfs install
     git clone https://huggingface.co/shenzhanyou/table_nougat

    Then copy the model to examples/multimodal/tmp/hf_models/${MODEL_NAME} to align with the official example script, as shown below.
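
    A minimal sketch of that copy step (the MODEL_NAME value here is my own layout, not mandated by the example):

     export MODEL_NAME=table_nougat
     cp -r table_nougat examples/multimodal/tmp/hf_models/${MODEL_NAME}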

  3. Follow the Nougat tutorial and convert the original model above to bfloat16 and float32 engines. (Only the bfloat16 commands are shown; replace bfloat16 with float32 to check float32 accuracy.)

     python ../enc_dec/convert_checkpoint.py --model_type bart \
         --model_dir tmp/hf_models/${MODEL_NAME} \
         --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
         --tp_size 1 \
         --pp_size 1 \
         --dtype bfloat16 \
         --nougat
     
     trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
         --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/decoder \
         --paged_kv_cache disable \
         --moe_plugin disable \
         --enable_xqa disable \
         --gemm_plugin bfloat16 \
         --bert_attention_plugin bfloat16 \
         --gpt_attention_plugin bfloat16 \
         --remove_input_padding enable \
         --max_beam_width 1 \
         --max_batch_size 1 \
         --max_seq_len 101 \
         --max_input_len 1 \
         --max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
    
    python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}
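
     For the float32 comparison engine, per the note in step 3, the same two commands are repeated with every bfloat16 swapped for float32 (abbreviated sketch; all other flags identical to the bfloat16 build above):

     python ../enc_dec/convert_checkpoint.py --model_type bart \
         --model_dir tmp/hf_models/${MODEL_NAME} \
         --output_dir tmp/trt_models/${MODEL_NAME}/float32 \
         --tp_size 1 \
         --pp_size 1 \
         --dtype float32 \
         --nougat

     trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/float32/decoder \
         --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float32/decoder \
         --gemm_plugin float32 \
         --bert_attention_plugin float32 \
         --gpt_attention_plugin float32 \
         ... # remaining flags as in the bfloat16 build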
    
  4. Replace only the test image in examples/multimodal/run.py with my own image below, then check the result:

     python run.py \
         --hf_model_dir tmp/hf_models/${MODEL_NAME} \
         --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
         --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16

[test image attached]
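
Before comparing outputs, one way to double-check which dtype an engine was actually built with is to read the generated engine config (a minimal sketch; the path follows the --output_dir used in trtllm-build above):

import json

# Path follows the trtllm-build --output_dir above; substitute your MODEL_NAME.
with open("tmp/trt_engines/table_nougat/1-gpu/bfloat16/decoder/config.json") as f:
    cfg = json.load(f)
print(cfg["pretrained_config"]["dtype"])  # expect "bfloat16" (or "float32")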

Expected behavior

The output of the original (Hugging Face) NougatModel is:

\begin{tabular}{@{}llcccccccc@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS}-Gesture}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal}-DVS}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{Action}Recognition}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{\textsf{DVS}-SLR}} (Ours)} & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

Actual behavior

The bfloat16 and float32 TRT engines both give the same output, shown below.

\begin{tabular}{@{}lllllllllll@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS-Gesture}}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal-DVS}}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{ActionRecognition}}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{DVS-SLR}}} (\textsf{Ours}) & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

The first line of the TRT-LLM result differs from the transformers result:
\begin{tabular}{@{}llcccccccc@{}} (original transformers)
vs.
\begin{tabular}{@{}lllllllllll@{}} (trtllm engine, bfloat16 and float32)
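
A quick way to pin down such divergences (a minimal sketch using Python's difflib; hf_output and trt_output are assumed to hold the two generated strings pasted above):

import difflib

# hf_output / trt_output: the transformers and TRT-LLM outputs shown above
for line in difflib.unified_diff(
    hf_output.splitlines(), trt_output.splitlines(), lineterm=""
):
    print(line)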

Code for transformers:

from PIL import Image
import torch

from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
model = VisionEncoderDecoderModel.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer  # generation below uses the processor's tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open(filepath)  # test image above
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# verify generation (greedy decoding)
outputs = model.generate(
    pixel_values,
    min_length=1,
    max_length=4096,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[
        [tokenizer.unk_token_id],
    ],
    return_dict_in_generate=True,
    do_sample=False,
)
generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(generated)

Additional notes

Not all images show different results between transformers and TRT-LLM v0.12.0; this image is a weird one.

ehuaa added the bug label Sep 9, 2024