[kv_cache] integrated vlm code for benchmark (Stacked on #3527) #3652

chohk88 · 2025-07-03T01:46:11Z

Description

Base branch: kv_cache (PR #3527)

- Integrated VLM benchmark framework
  - Currently supports Eagle2
  - Planned support: PaliGemma, Qwen 2.5-VL, etc.
- Added a custom token-generation function for multi-modal (MM) models
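
For readers unfamiliar with generating from embeddings rather than token ids, the custom generation path for MM models can be pictured roughly as the greedy loop below. This is a framework-free toy, not the code in `tools/llm/utils.py`: `greedy_generate`, `toy_language_model`, and `embed` are hypothetical names, plain lists stand in for tensors, a single sequence stands in for a batch, and the EOS check is an assumption. It re-runs the full sequence every step, which corresponds to the cache-off case.

```python
from typing import Callable, List

Vector = List[float]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_generate(
    language_model: Callable[[List[Vector]], List[List[float]]],
    embed: Callable[[int], Vector],
    prompt_embeds: List[Vector],
    max_new_tokens: int,
    eos_token_id: int,
) -> List[int]:
    """Greedy decoding without a KV cache: feed the whole sequence each step."""
    seq_embeds = list(prompt_embeds)
    new_tokens: List[int] = []
    while len(new_tokens) < max_new_tokens:
        logits = language_model(seq_embeds)  # one row of logits per position
        next_tok = argmax(logits[-1])        # only the last position is used
        new_tokens.append(next_tok)
        if next_tok == eos_token_id:         # assumed stopping rule
            break
        seq_embeds.append(embed(next_tok))   # append the new token's embedding
    return new_tokens


# Hypothetical stand-ins: a 4-token vocabulary with one-hot embeddings and a
# "model" that always predicts (token at this position + 1) mod 4.
VOCAB = 4


def embed(tok: int) -> Vector:
    return [1.0 if i == tok else 0.0 for i in range(VOCAB)]


def toy_language_model(seq: List[Vector]) -> List[List[float]]:
    return [embed((argmax(row) + 1) % VOCAB) for row in seq]


tokens = greedy_generate(
    toy_language_model, embed, [embed(0)], max_new_tokens=8, eos_token_id=3
)
print(tokens)
```

Because the prompt already arrives as embeddings, image features can be spliced into it before the loop starts; the loop itself only needs an embedding lookup for each newly generated token.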

Type of change

New feature (non-breaking change which adds functionality)

Checklist:

My code follows the style guidelines of this project (You can use the linters)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR so that relevant reviewers are notified

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tools/llm/utils.py	2025-07-03 01:46:24.189295+00:00
+++ /home/runner/work/TensorRT/TensorRT/tools/llm/utils.py	2025-07-03 01:46:57.661981+00:00
@@ -318,14 +318,16 @@
    generated = 0

    while generated < osl:
        cur_embeds = seq_embeds  # full seq first step or cache off
        position_ids = (
-                torch.arange(cur_embeds.shape[1]).unsqueeze(0).to(cur_embeds.device)
-            )
+            torch.arange(cur_embeds.shape[1]).unsqueeze(0).to(cur_embeds.device)
+        )
        with torch.no_grad():
-            logits = model.language_model(inputs_embeds=cur_embeds, position_ids=position_ids)
+            logits = model.language_model(
+                inputs_embeds=cur_embeds, position_ids=position_ids
+            )
            if hasattr(logits, "logits"):
                logits = logits.logits

        next_tok = torch.argmax(logits[:, -1, :], dim=-1)  # (B,)
        # append token & embed
@@ -381,13 +383,11 @@
        mask = seq_tokens.view(B * N) == model.image_token_index
        flat[mask] = vit_embeds.reshape(-1, C).to(flat.dtype)[: mask.sum()]
        seq_embeds = flat.view(B, N, C)

    # ───────────────────── KV-cache initialization ─────────────────────
-    kv_cache = get_zeroed_static_cache_inputs(
-        model.language_model
-    )
+    kv_cache = get_zeroed_static_cache_inputs(model.language_model)
    start_idx = 0  # First token index
    end_idx = seq_embeds.size(1)  # Prompt length
    generated = 0
    max_total_len = max_output_seq_length
    output_tokens = seq_tokens.clone()
@@ -607,13 +607,11 @@
        mask = seq_tokens.view(B * N) == model.image_token_index
        flat[mask] = vit_embeds.reshape(-1, C).to(flat.dtype)[: mask.sum()]
        seq_embeds = flat.view(B, N, C)

    # ───────────────────── KV-cache initialization ─────────────────────
-    kv_cache = get_zeroed_static_cache_inputs(
-        model.language_model
-    )
+    kv_cache = get_zeroed_static_cache_inputs(model.language_model)
    start_idx = 0  # First token index
    end_idx = seq_embeds.size(1)  # Prompt length
    generated = 0
    max_total_len = end_idx + max_new_tokens
    output_tokens = seq_tokens.clone()
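
The context lines in the second and third hunks overwrite the embeddings at image-placeholder positions with the vision encoder's output before the KV cache is initialized. A minimal sketch of that merge, using plain lists instead of tensors and a single sequence instead of a batch, might look as follows. `merge_image_embeds` and the sample values are hypothetical; only `image_token_index` and the `[: mask.sum()]` slice come from the diff.

```python
from typing import List

Vector = List[float]


def merge_image_embeds(
    seq_tokens: List[int],
    text_embeds: List[Vector],
    vit_embeds: List[Vector],
    image_token_index: int,
) -> List[Vector]:
    """Replace each image-placeholder position with the next visual embedding."""
    merged = [list(row) for row in text_embeds]  # copy; leave inputs untouched
    positions = [i for i, tok in enumerate(seq_tokens) if tok == image_token_index]
    if len(vit_embeds) < len(positions):
        raise ValueError("fewer visual embeddings than image placeholders")
    # Mirror the `[: mask.sum()]` slice: extra visual rows are simply dropped.
    for pos, vis in zip(positions, vit_embeds[: len(positions)]):
        merged[pos] = list(vis)
    return merged


IMAGE_TOKEN = 99  # hypothetical placeholder id
merged = merge_image_embeds(
    seq_tokens=[5, IMAGE_TOKEN, IMAGE_TOKEN, 7],
    text_embeds=[[0.5, 0.5], [0.0, 0.0], [0.0, 0.0], [0.7, 0.7]],
    vit_embeds=[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    image_token_index=IMAGE_TOKEN,
)
print(merged)
```

Text positions keep their original embeddings, and the placeholder positions receive visual rows in order, so the merged sequence can be passed to the language model as `inputs_embeds`.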

integrated vlm code for benchmark

85f40fe

chohk88 requested a review from peri044 July 3, 2025 01:46

chohk88 self-assigned this Jul 3, 2025

facebook-github-bot added the cla signed label Jul 3, 2025

github-actions bot requested changes Jul 3, 2025
