Do replicate-internal/staging-llama-2-70b-mlc and replicate-internal/llama-2-70b-triton have different maximum input lengths? #260

Open

jdkanu opened this issue Mar 15, 2024 · 0 comments

jdkanu commented Mar 15, 2024
I am getting an error that the prompt length exceeds the maximum input length when calling meta/llama-2-70b through the API. I have included the error log from the Replicate dashboard (see below). I have called the same model in the past without error, and I am almost certain the prompts were identical or similar in length (prediction data for older predictions has expired, so I can't verify this with complete certainty). The prompt is also not very long: just 6 question-answering demonstrations, each with a few intermediate reasoning steps.
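For reference, here is a minimal sketch of the kind of call I am making with the Python client. The prompt below is a placeholder standing in for my real ~1251-token few-shot prompt, and `max_new_tokens` is an assumed parameter name; check the model's input schema for the exact names.

```python
import replicate  # pip install replicate; requires REPLICATE_API_TOKEN to be set

# Placeholder standing in for the real ~1251-token prompt:
# 6 question-answering demonstrations with intermediate reasoning steps.
prompt = (
    "Q: example question\nA: Let's think step by step. ... So the answer is X.\n\n"
) * 6 + "Q: new question\nA:"

output = replicate.run(
    "meta/llama-2-70b",
    input={
        "prompt": prompt,
        "max_new_tokens": 256,  # assumed parameter name for this model
    },
)
print("".join(output))
```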

Inspecting further, I discovered that two different replicate-internal models are being used to serve requests to meta/llama-2-70b:

- replicate-internal/staging-llama-2-70b-mlc (this one gave me no error)
- replicate-internal/llama-2-70b-triton (this one gives the error)

Do these models have different maximum input lengths? If so, how can I call replicate-internal/staging-llama-2-70b-mlc, or another llama-2-70b model with a large enough maximum input length?
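In the meantime, here is how I am checking prompts against the 1024-token limit before sending them: a minimal sketch assuming the Hugging Face Llama-2 tokenizer (meta-llama/Llama-2-70b-hf, gated) matches, or at least approximates, the tokenizer used server-side.

```python
from transformers import AutoTokenizer

# Assumption: the HF Llama-2 tokenizer matches the one the Triton
# deployment used to count the 1251 tokens in the error below.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

MAX_INPUT_LEN = 1024  # limit reported by replicate-internal/llama-2-70b-triton

def fits(prompt: str) -> bool:
    """Return True if the prompt stays within the reported input limit."""
    n_tokens = len(tokenizer.encode(prompt))
    print(f"prompt is {n_tokens} tokens (limit {MAX_INPUT_LEN})")
    return n_tokens <= MAX_INPUT_LEN
```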

The error:

```
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2       0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5       0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6       0x7f49c9212a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1728927168: Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f48929b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f48929b41cd]
2       0x7f48949dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f48949dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f49c93f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f49c93f2253]
5       0x7f49c9181ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f49c9181ac3]
6       0x7f49c9212a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7fbb0e9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7fbb0e9b41cd]
2       0x7fbb109dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7fbb109dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7fbc3cbf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbc3cbf2253]
5       0x7fbc3c981ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbc3c981ac3]
6       0x7fbc3ca12a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7fa92a9b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7fa92a9b41cd]
2       0x7fa92c9dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7fa92c9dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7faa585f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7faa585f2253]
5       0x7faa58381ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7faa58381ac3]
6       0x7faa58412a04 clone + 68
[TensorRT-LLM][ERROR] Cannot process new request: Prompt length (1251) exceeds maximum input length (1024). (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:246)
1       0x7f9a329b41cd /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x7f61cd) [0x7f9a329b41cd]
2       0x7f9a349dbe17 tensorrt_llm::batch_manager::GptManager::fetchNewRequests() + 423
3       0x7f9a349dda98 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 232
4       0x7f9b5fbf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9b5fbf2253]
5       0x7f9b5f981ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9b5f981ac3]
6       0x7f9b5fa12a04 clone + 68
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 224, in _handle_predict_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/cog/server/worker.py", line 253, in _predict_async
    async for r in result:
  File "/src/predict.py", line 180, in predict
    output = event.json()["text_output"]
KeyError: 'text_output'
```
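As a side note, the trailing Python KeyError looks like a secondary failure: when Triton rejects the request, the event payload evidently carries no "text_output" field, and /src/predict.py indexes it unguarded at line 180. A hedged sketch of how that line might surface the underlying backend error instead (the error-payload shape here is an assumption, not taken from the actual source):

```python
# Hypothetical replacement for the line at predict.py:180.
data = event.json()
if "text_output" not in data:
    # Assumed shape: Triton error responses carry an "error" field.
    raise RuntimeError(f"backend returned no text_output: {data.get('error', data)}")
output = data["text_output"]
```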