[Nougat] Accuracy problem: different output from both float32 and bfloat16 TRT-LLM engines vs. the float32 Hugging Face original model #2207

Open
ehuaa opened this issue Sep 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments


ehuaa commented Sep 9, 2024

System Info

Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version (GPU?): 2.4.0+cu121 (True)
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Driver Version: 535.161.08
CUDA Version: 12.5
GPU: A40 single card

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Follow the TensorRT-LLM Linux installation tutorial to install Docker and TensorRT-LLM 0.12.0:
    https://nvidia.github.io/TensorRT-LLM/installation/linux.html
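
    For reference, the install from that page comes down to roughly the following (a sketch paraphrased from the linked docs; check the page for the exact dependency list and version pins):

     sudo apt-get -y install libopenmpi-dev
     pip3 install tensorrt_llm==0.12.0 --extra-index-url https://pypi.nvidia.com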

  2. Download my own fine-tuned version of Nougat, which has the same architecture as nougat-base 0.1.0; only the model weights were changed.

     # Clone the fine-tuned nougat model
     git lfs install
     git clone https://huggingface.co/shenzhanyou/table_nougat

    Then copy the model to examples/multimodal/tmp/hf_models/${MODEL_NAME} to align with the official example script, as shown below.
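
    A minimal sketch of that copy step (the MODEL_NAME value here is my own layout, not mandated by the example):

     export MODEL_NAME=table_nougat
     cp -r table_nougat examples/multimodal/tmp/hf_models/${MODEL_NAME}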

  3. Follow the Nougat tutorial and convert the original model above to bfloat16 and float32 engines. (Only the bfloat16 commands are shown; replace bfloat16 with float32 to check float32 accuracy.)

     python ../enc_dec/convert_checkpoint.py --model_type bart \
         --model_dir tmp/hf_models/${MODEL_NAME} \
         --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
         --tp_size 1 \
         --pp_size 1 \
         --dtype bfloat16 \
         --nougat
     
     trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
         --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/decoder \
         --paged_kv_cache disable \
         --moe_plugin disable \
         --enable_xqa disable \
         --gemm_plugin bfloat16 \
         --bert_attention_plugin bfloat16 \
         --gpt_attention_plugin bfloat16 \
         --remove_input_padding enable \
         --max_beam_width 1 \
         --max_batch_size 1 \
         --max_seq_len 101 \
         --max_input_len 1 \
         --max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)
    
    python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}
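
     For the float32 comparison engine, per the note in step 3, the same two commands are repeated with every bfloat16 swapped for float32 (abbreviated sketch; all other flags identical to the bfloat16 build above):

     python ../enc_dec/convert_checkpoint.py --model_type bart \
         --model_dir tmp/hf_models/${MODEL_NAME} \
         --output_dir tmp/trt_models/${MODEL_NAME}/float32 \
         --tp_size 1 \
         --pp_size 1 \
         --dtype float32 \
         --nougat

     trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/float32/decoder \
         --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/float32/decoder \
         --gemm_plugin float32 \
         --bert_attention_plugin float32 \
         --gpt_attention_plugin float32 \
         ... # remaining flags as in the bfloat16 build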
    
  4. Replace only the test image in examples/multimodal/run.py with my own image below, then check the result:

     python run.py \
         --hf_model_dir tmp/hf_models/${MODEL_NAME} \
         --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
         --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16

[test image attached]
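
Before comparing outputs, one way to double-check which dtype an engine was actually built with is to read the generated engine config (a minimal sketch; the path follows the --output_dir used in trtllm-build above):

import json

# Path follows the trtllm-build --output_dir above; substitute your MODEL_NAME.
with open("tmp/trt_engines/table_nougat/1-gpu/bfloat16/decoder/config.json") as f:
    cfg = json.load(f)
print(cfg["pretrained_config"]["dtype"])  # expect "bfloat16" (or "float32")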

Expected behavior

The output of the original (Hugging Face) NougatModel is:

\begin{tabular}{@{}llcccccccc@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS}-Gesture}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal}-DVS}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{Action}Recognition}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{\textsf{DVS}-SLR}} (Ours)} & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

Actual behavior

The bfloat16 and float32 TRT engines both give the same output, shown below.

\begin{tabular}{@{}lllllllllll@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS-Gesture}}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal-DVS}}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{ActionRecognition}}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{DVS-SLR}}} (\textsf{Ours}) & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

The first line of the TRT-LLM result differs from the transformers result:
\begin{tabular}{@{}llcccccccc@{}} (original transformers)
vs.
\begin{tabular}{@{}lllllllllll@{}} (trtllm engine, bfloat16 and float32)
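
A quick way to pin down such divergences (a minimal sketch using Python's difflib; hf_output and trt_output are assumed to hold the two generated strings pasted above):

import difflib

# hf_output / trt_output: the transformers and TRT-LLM outputs shown above
for line in difflib.unified_diff(
    hf_output.splitlines(), trt_output.splitlines(), lineterm=""
):
    print(line)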

Code for transformers:

from PIL import Image
import torch

from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
model = VisionEncoderDecoderModel.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer  # generation below uses the processor's tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open(filepath)  # test image above
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# verify generation (greedy decoding)
outputs = model.generate(
    pixel_values,
    min_length=1,
    max_length=4096,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[
        [tokenizer.unk_token_id],
    ],
    return_dict_in_generate=True,
    do_sample=False,
)
generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(generated)

Additional notes

Not all images show different results between transformers and TRT-LLM v0.12.0; this image is a weird one.

ehuaa added the bug label Sep 9, 2024