sync with IBM/main #13

Merged: 155 commits, May 7, 2024

Commits
3bc3053
♻️ install vllm using wheels (#19)
prashantgupta24 May 3, 2024
7df0eb8
fix: Missed TLS config logic from internal fork (#21)
njhill May 3, 2024
11fb3ab
[Bugfix] Support logprobs when using guided_json and other constraine…
jamestwhedbee Apr 18, 2024
0d80844
[Misc] Bump transformers to latest version (#4176)
njhill Apr 18, 2024
12193cb
[CI/CD] add neuron docker and ci test scripts (#3571)
liangfu Apr 18, 2024
2f0297f
[Bugfix] Fix CustomAllreduce nvlink topology detection (#3974)
agt Apr 18, 2024
d7e2c90
[Core] add an option to log every function call to for debugging hang…
youkaichao Apr 18, 2024
35b9861
Support eos_token_id from generation_config.json (#4182)
simon-mo Apr 19, 2024
c260f9c
[Bugfix] Fix LoRA loading check (#4138)
jeejeelee Apr 19, 2024
a791512
Bump version of 0.4.1 (#4177)
simon-mo Apr 19, 2024
54eeec0
[Misc] fix docstrings (#4191)
UranusSeven Apr 19, 2024
c69fae9
[Bugfix][Core] Restore logging of stats in the async engine (#4150)
ronensc Apr 19, 2024
0d6fb1a
[Misc] add nccl in collect env (#4211)
youkaichao Apr 19, 2024
a09273f
Pass `tokenizer_revision` when getting tokenizer in openai serving (#…
chiragjn Apr 20, 2024
6ca2f99
[Bugfix] Add fix for JSON whitespace (#4189)
ayusher Apr 20, 2024
a030b80
Fix missing docs and out of sync `EngineArgs` (#4219)
hmellor Apr 20, 2024
0e49aff
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
comaniac Apr 20, 2024
79d0b52
[Frontend] multiple sampling params support (#3570)
nunjunj Apr 20, 2024
4bedf7a
Updating lm-format-enforcer version and adding links to decoding libr…
noamgat Apr 20, 2024
de140fd
Don't show default value for flags in `EngineArgs` (#4223)
hmellor Apr 21, 2024
69031fd
[Doc]: Update the doc of adding new models (#4236)
YeFD Apr 21, 2024
1b3cb93
Make initialization of tokenizer and detokenizer optional (#3748)
GeauxEric Apr 21, 2024
6651ef7
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic an…
hongxiayang Apr 22, 2024
6b1cfd6
[Core][Distributed] fix _is_full_nvlink detection (#4233)
youkaichao Apr 22, 2024
a415320
[Misc] Add vision language model support to CPU backend (#3968)
Isotr0py Apr 22, 2024
9828c49
[Bugfix] Fix type annotations in CPU model runner (#4256)
WoosukKwon Apr 22, 2024
bcee2b2
[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
sighingnow Apr 22, 2024
2dd273f
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using t…
alexm-neuralmagic Apr 22, 2024
a7fc71d
Add example scripts to documentation (#4225)
hmellor Apr 22, 2024
c10e074
[Core] Scheduler perf fix (#4270)
rkooo567 Apr 22, 2024
764364e
[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276)
Michaelvll Apr 22, 2024
1d8dba2
[Core][Distributed] use absolute path for library file (#4271)
youkaichao Apr 23, 2024
bc7bd7a
Fix `autodoc` directives (#4272)
hmellor Apr 23, 2024
ec844ac
Update vllm to 34128a69
joerunde May 6, 2024
d885361
[Mypy] Part 3 fix typing for nested directories for most of directory…
rkooo567 Apr 23, 2024
c454aec
[Core] Some simplification of WorkerWrapper changes (#4183)
njhill Apr 23, 2024
f877b32
[Core] Scheduling optimization 2 (#4280)
rkooo567 Apr 23, 2024
899ccaa
[Speculative decoding 7/9] Speculative decoding end-to-end correctnes…
cadedaniel Apr 23, 2024
97e9907
[Bugfix] Fixing max token error message for openai compatible server …
jgordley Apr 23, 2024
d209558
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286)
DefTruth Apr 23, 2024
eec0776
[Core][Logging] Add last frame information for better debugging (#4278)
youkaichao Apr 23, 2024
9b5dcec
AQLM CUDA support (#3287)
jaemzfleming Apr 23, 2024
3363b50
[Bugfix][Frontend] Raise exception when file-like chat template fails…
DarkLight1337 Apr 23, 2024
e2ba82f
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
pcmoritz Apr 24, 2024
65bab09
[BUG] fixed fp8 conflict with aqlm (#4307)
robertgshaw2-neuralmagic Apr 24, 2024
671c816
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248)
youkaichao Apr 24, 2024
b0914de
[CI][Build] change pynvml to nvidia-ml-py (#4302)
youkaichao Apr 24, 2024
8a76d72
[Misc] Reduce supported Punica dtypes (#4304)
WoosukKwon Apr 24, 2024
1b3020a
[Core][Distributed] use existing torch.cuda.device (#4318)
youkaichao Apr 24, 2024
1ceb7bd
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279)
ywang96 Apr 24, 2024
2824405
[Bugfix] Fix marlin kernel crash on H100 (#4218)
alexm-neuralmagic Apr 24, 2024
824d100
[Doc] Add note for docker user (#4340)
youkaichao Apr 24, 2024
89f5f3d
[Misc] Use public API in benchmark_throughput (#4300)
zifeitong Apr 24, 2024
d6bba29
[Model] Adds Phi-3 support (#4298)
caiom Apr 25, 2024
d7132d1
[Core] Move ray_utils.py from `engine` to `executor` package (#4347)
njhill Apr 25, 2024
34939d6
[Bugfix][Model] Refactor OLMo model to support new HF format in trans…
Isotr0py Apr 25, 2024
04e737d
[CI/Build] Adding functionality to reset the node's GPUs before proce…
Alexei-V-Ivanov-AMD Apr 25, 2024
f34a48f
[Doc] README Phi-3 name fix. (#4372)
caiom Apr 25, 2024
63a5b9b
[Core]refactor aqlm quant ops (#4351)
jikunshang Apr 25, 2024
32ab5e9
[Mypy] Typing lora folder (#4337)
rkooo567 Apr 25, 2024
0fbf6ef
[Misc] Fix flash attention backend log (#4368)
esmeetu Apr 25, 2024
a0e79e9
[Core] Add `shutdown()` method to `ExecutorBase` (#4349)
njhill Apr 25, 2024
d2e8642
[Core] Move function tracing setup to util function (#4352)
njhill Apr 25, 2024
67a1c2c
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376)
hongxiayang Apr 26, 2024
c1779e1
[Bugfix] Fix parameter name in `get_tokenizer` (#4107)
DarkLight1337 Apr 26, 2024
860ea76
[Frontend] Add --log-level option to api server (#4377)
normster Apr 26, 2024
5e7eaa9
[CI] Disable non-lazy string operation on logging (#4326)
rkooo567 Apr 26, 2024
ad2f90d
[Core] Refactoring sampler and support prompt logprob for chunked pre…
rkooo567 Apr 26, 2024
e8184d0
[Misc][Refactor] Generalize linear_method to be quant_method (#4373)
comaniac Apr 26, 2024
f582fe6
[Misc] add RFC issue template (#4401)
youkaichao Apr 26, 2024
af70b6b
[Core] Introduce `DistributedGPUExecutor` abstract class (#4348)
njhill Apr 27, 2024
6a6df45
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static sca…
pcmoritz Apr 27, 2024
8483952
[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355)
DarkLight1337 Apr 27, 2024
838de71
[Misc] Fix logger format typo (#4396)
esmeetu Apr 27, 2024
8ec345a
[ROCm][Hardware][AMD] Enable group query attention for triton FA (#4406)
hongxiayang Apr 27, 2024
cfc48ff
[Kernel] Full Tensor Parallelism for LoRA Layers (#3524)
FurtherAI Apr 27, 2024
9e4039a
[Model] Phi-3 4k sliding window temp. fix (#4380)
caiom Apr 27, 2024
0c5790d
[Bugfix][Core] Fix get decoding config from ray (#4335)
esmeetu Apr 27, 2024
6c3326a
[Bugfix] Abort requests when the connection to /v1/completions is int…
chestnut-Q Apr 27, 2024
1a91ddf
[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389)
njhill Apr 27, 2024
c7334d2
[Core] Support offline use of local cache for models (#4374)
prashantgupta24 Apr 27, 2024
b2ae047
[BugFix] Fix return type of executor execute_model methods (#4402)
njhill Apr 27, 2024
3a3ea57
[BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418)
robertgshaw2-neuralmagic Apr 27, 2024
53299da
[Misc] fix typo in llm_engine init logging (#4428)
DefTruth Apr 28, 2024
91aabb0
Add more Prometheus metrics (#2764)
ronensc Apr 28, 2024
46a9863
[CI] clean docker cache for neuron (#4441)
simon-mo Apr 28, 2024
0733647
[mypy][5/N] Support all typing on model executor (#4427)
rkooo567 Apr 29, 2024
70d7507
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922)
robertgshaw2-neuralmagic Apr 29, 2024
1e25f6a
[CI] hotfix: soft fail neuron test (#4458)
simon-mo Apr 29, 2024
826a21c
[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444)
youkaichao Apr 29, 2024
aa6c82d
[Misc] Upgrade to `torch==2.3.0` (#4454)
mgoin Apr 30, 2024
61e6343
[Bugfix][Kernel] Fix compute_type for MoE kernel (#4463)
WoosukKwon Apr 30, 2024
c92cac9
[Core]Refactor gptq_marlin ops (#4466)
jikunshang Apr 30, 2024
d1176c8
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
leiwen83 Apr 30, 2024
eb69d24
[Doc] add visualization for multi-stage dockerfile (#4456)
prashantgupta24 Apr 30, 2024
e65c20e
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332)
robertgshaw2-neuralmagic Apr 30, 2024
d9e9d52
[Frontend] Support complex message content for chat completions endpo…
fgreinacher Apr 30, 2024
3fc345a
[Frontend] [Core] Tensorizer: support dynamic `num_readers`, update v…
alpayariyak Apr 30, 2024
80d0058
[Bugfix][Minor] Make ignore_eos effective (#4468)
bigPYJ1151 Apr 30, 2024
f8e22b3
fix_tokenizer_snapshot_download_bug (#4493)
kingljl Apr 30, 2024
ecb3620
Unable to find Punica extension issue during source code installation…
kingljl May 1, 2024
edd9c67
[Core] Centralize GPU Worker construction (#4419)
njhill May 1, 2024
1b710d6
[Misc][Typo] type annotation fix (#4495)
HarryWu99 May 1, 2024
28a0f80
[Misc] fix typo in block manager (#4453)
Juelianqvq May 1, 2024
c49d777
Allow user to define whitespace pattern for outlines (#4305)
robcaulk May 1, 2024
8eba757
[Misc]Add customized information for models (#4132)
jeejeelee May 1, 2024
9f85f52
[Test] Add ignore_eos test (#4519)
rkooo567 May 1, 2024
e38157e
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to…
AnyISalIn May 1, 2024
e5814bc
[Bugfix] Fix 307 Redirect for `/metrics` (#4523)
robertgshaw2-neuralmagic May 1, 2024
6e3a823
[Doc] update(example model): for OpenAI compatible serving (#4503)
fpaupier May 1, 2024
4d3057b
[Bugfix] Use random seed if seed is -1 (#4531)
sasha0552 May 1, 2024
f1f98b6
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534)
tjohnson31415 May 1, 2024
4107938
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
leiwen83 May 1, 2024
b2eed51
[Core] Enable prefix caching with block manager v2 enabled (#4142)
leiwen83 May 1, 2024
4c28921
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers…
njhill May 1, 2024
0c9d74c
[Kernel] Update fused_moe tuning script for FP8 (#4457)
pcmoritz May 1, 2024
a7e8e4d
[Bugfix] Add validation for seed (#4529)
sasha0552 May 1, 2024
0793528
[Bugfix][Core] Fix and refactor logging stats (#4336)
esmeetu May 1, 2024
07632b0
[Core][Distributed] fix pynccl del error (#4508)
youkaichao May 1, 2024
e742ab4
[Misc] Remove Mixtral device="cuda" declarations (#4543)
pcmoritz May 1, 2024
6fe52f3
[Misc] Fix expert_ids shape in MoE (#4517)
WoosukKwon May 1, 2024
2ec5dd2
[MISC] Rework logger to enable pythonic custom logging configuration …
May 2, 2024
121847e
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.…
rkooo567 May 2, 2024
3bb37bd
[CI]Add regression tests to ensure the async engine generates metrics…
ronensc May 2, 2024
fe82250
[mypy][6/N] Fix all the core subdirectory typing (#4450)
rkooo567 May 2, 2024
9ff783f
[Core][Distributed] enable multiple tp group (#4512)
youkaichao May 2, 2024
d3ab1c7
[Kernel] Support running GPTQ 8-bit models in Marlin (#4533)
alexm-neuralmagic May 2, 2024
beecc8e
[mypy][7/N] Cover all directories (#4555)
rkooo567 May 2, 2024
b553e05
[Misc] Exclude the `tests` directory from being packaged (#4552)
itechbear May 2, 2024
37f8957
[BugFix] Include target-device specific requirements.txt in sdist (#4…
markmc May 2, 2024
d7f5c58
[Misc] centralize all usage of environment variables (#4548)
youkaichao May 2, 2024
df04c10
[kernel] fix sliding window in prefix prefill Triton kernel (#4405)
mmoskal May 2, 2024
299066f
[CI/Build] AMD CI pipeline with extended set of tests. (#4267)
Alexei-V-Ivanov-AMD May 2, 2024
3e9f425
[Core] Ignore infeasible swap requests. (#4557)
rkooo567 May 2, 2024
977a6cd
[Core][Distributed] enable allreduce for multiple tp groups (#4566)
youkaichao May 3, 2024
de6d42a
[BugFix] Prevent the task of `_force_log` from being garbage collecte…
Atry May 3, 2024
deb0ccc
[Misc] remove chunk detected debug logs (#4571)
DefTruth May 3, 2024
9500596
[Doc] add env vars to the doc (#4572)
youkaichao May 3, 2024
a5d0d0e
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)
rkooo567 May 3, 2024
ab445b1
[Bugfix] Allow "None" or "" to be passed to CLI for string args that …
mgoin May 3, 2024
83f0437
Fix/async chat serving (#2727)
schoennenbeck May 3, 2024
0c86070
[Kernel] Use flashinfer for decoding (#4353)
LiuXiaoxuanPKU May 3, 2024
81a9e09
[Speculative decoding] Support target-model logprobs (#4378)
cadedaniel May 3, 2024
cf0665c
[Misc] add installation time env vars (#4574)
youkaichao May 3, 2024
ecb55eb
[Misc][Refactor] Introduce ExecuteModelData (#4540)
comaniac May 4, 2024
8e82b90
[Doc] Chunked Prefill Documentation (#4580)
rkooo567 May 4, 2024
ba2be94
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with…
mgoin May 4, 2024
71bb251
[CI] check size of the wheels (#4319)
simon-mo May 4, 2024
ac5ccb6
[Bugfix] Fix inappropriate content of model_name tag in Prometheus me…
DearPlanet May 4, 2024
52b5bcb
bump version to v0.4.2 (#4600)
simon-mo May 5, 2024
c7426c1
[CI] Reduce wheel size by not shipping debug symbols (#4602)
simon-mo May 5, 2024
352ef7c
Disable cuda version check in vllm-openai image (#4530)
zhaoyang-star May 5, 2024
06241cf
[Bugfix] Fix `asyncio.Task` not being subscriptable (#4623)
DarkLight1337 May 6, 2024
4c758aa
Update vLLM to 323f27b9
joerunde May 6, 2024
b180134
sync with IBM/main@4c758aa2
dtrifiro May 7, 2024
36 changes: 36 additions & 0 deletions .buildkite/check-wheel-size.py
@@ -0,0 +1,36 @@
import os
import zipfile

MAX_SIZE_MB = 100


def print_top_10_largest_files(zip_file):
    with zipfile.ZipFile(zip_file, 'r') as z:
        file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
        file_sizes.sort(key=lambda x: x[1], reverse=True)
        for f, size in file_sizes[:10]:
            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")


def check_wheel_size(directory):
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(".whl"):
                wheel_path = os.path.join(root, f)
                wheel_size = os.path.getsize(wheel_path)
                wheel_size_mb = wheel_size / (1024 * 1024)
                if wheel_size_mb > MAX_SIZE_MB:
                    print(
                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
                    print_top_10_largest_files(wheel_path)
                    return 1
                else:
                    print(f"Wheel {wheel_path} is within the allowed size "
                          f"({wheel_size_mb} MB).")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(check_wheel_size(sys.argv[1]))
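The checker takes a single directory argument and signals failure through its exit code, which is what makes it usable as a CI gate. A minimal sketch of a local run, assuming wheels were built into `dist/` (the path is illustrative, not taken from this diff):

```bash
# Hypothetical invocation: scan a directory of built wheels and exit 1 if any
# exceeds MAX_SIZE_MB (100 MB); the ten largest members are printed on failure.
python3 .buildkite/check-wheel-size.py dist/
```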
64 changes: 35 additions & 29 deletions .buildkite/run-amd-test.sh
@@ -1,38 +1,44 @@
-# This script build the ROCm docker image and run the API server inside the container.
-# It serves a sanity check for compilation and basic model usage.
+# This script build the ROCm docker image and runs test inside it.
 set -ex
 
 # Print ROCm version
 echo "--- ROCm info"
 rocminfo
 
-# Try building the docker image
-docker build -t rocm -f Dockerfile.rocm .
-
-# Setup cleanup
-remove_docker_container() { docker rm -f rocm || true; }
-trap remove_docker_container EXIT
-remove_docker_container
-
-# Run the image
-docker run --device /dev/kfd --device /dev/dri --network host --name rocm rocm python3 -m vllm.entrypoints.api_server &
-
-# Wait for the server to start
-wait_for_server_to_start() {
-  timeout=300
-  counter=0
-
-  while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
-    sleep 1
-    counter=$((counter + 1))
-    if [ $counter -ge $timeout ]; then
-      echo "Timeout after $timeout seconds"
-      break
-    fi
-  done
-}
-wait_for_server_to_start
-
-# Test a simple prompt
-curl -X POST -H "Content-Type: application/json" \
-  localhost:8000/generate \
-  -d '{"prompt": "San Francisco is a"}'
+echo "--- Resetting GPUs"
+
+echo "reset" > /opt/amdgpu/etc/gpu_state
+
+while true; do
+    sleep 3
+    if grep -q clean /opt/amdgpu/etc/gpu_state; then
+        echo "GPUs state is \"clean\""
+        break
+    fi
+done
+
+echo "--- Building container"
+sha=$(git rev-parse --short HEAD)
+container_name=rocm_${sha}
+docker build \
+    -t ${container_name} \
+    -f Dockerfile.rocm \
+    --progress plain \
+    .
+
+remove_docker_container() {
+    docker rm -f ${container_name} || docker image rm -f ${container_name} || true
+}
+trap remove_docker_container EXIT
+
+echo "--- Running container"
+
+docker run \
+    --device /dev/kfd --device /dev/dri \
+    --network host \
+    --rm \
+    -e HF_TOKEN \
+    --name ${container_name} \
+    ${container_name} \
+    /bin/bash -c $(echo $1 | sed "s/^'//" | sed "s/'$//")
5 changes: 5 additions & 0 deletions .buildkite/run-benchmarks.sh
@@ -53,6 +53,11 @@ echo '```' >> benchmark_results.md
 tail -n 20 benchmark_serving.txt >> benchmark_results.md # last 20 lines
 echo '```' >> benchmark_results.md
 
+# if the agent binary is not found, skip uploading the results, exit 0
+if [ ! -f /workspace/buildkite-agent ]; then
+    exit 0
+fi
+
 # upload the results to buildkite
 /workspace/buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md
 
51 changes: 51 additions & 0 deletions .buildkite/run-neuron-test.sh
@@ -0,0 +1,51 @@
# This script build the Neuron docker image and run the API server inside the container.
# It serves a sanity check for compilation and basic model usage.
set -e

# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# prune old image and containers to save disk space, and only once a day
# by using a timestamp file in tmp.
if [ -f /tmp/neuron-docker-build-timestamp ]; then
    last_build=$(cat /tmp/neuron-docker-build-timestamp)
    current_time=$(date +%s)
    if [ $((current_time - last_build)) -gt 86400 ]; then
        docker system prune -f
        echo $current_time > /tmp/neuron-docker-build-timestamp
    fi
else
    echo $(date +%s) > /tmp/neuron-docker-build-timestamp
fi

docker build -t neuron -f Dockerfile.neuron .

# Setup cleanup
remove_docker_container() { docker rm -f neuron || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &

# Wait for the server to start
wait_for_server_to_start() {
    timeout=300
    counter=0

    while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
        sleep 1
        counter=$((counter + 1))
        if [ $counter -ge $timeout ]; then
            echo "Timeout after $timeout seconds"
            break
        fi
    done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
    localhost:8000/generate \
    -d '{"prompt": "San Francisco is a"}'
30 changes: 25 additions & 5 deletions .buildkite/test-pipeline.yaml
@@ -15,32 +15,41 @@ steps:
   commands:
   - VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_basic_correctness.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_basic_correctness.py
+  - VLLM_ATTENTION_BACKEND=ROCM_FLASH pytest -v -s basic_correctness/test_basic_correctness.py
   - VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_chunked_prefill.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py
+  - VLLM_ATTENTION_BACKEND=ROCM_FLASH pytest -v -s basic_correctness/test_chunked_prefill.py
   - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
 
 - label: Core Test
+  mirror_hardwares: [amd]
   command: pytest -v -s core
 
 - label: Distributed Comm Ops Test
   command: pytest -v -s test_comm_ops.py
   working_dir: "/vllm-workspace/tests/distributed"
-  num_gpus: 2 # only support 1 or 2 for now.
+  num_gpus: 2
 
 - label: Distributed Tests
   working_dir: "/vllm-workspace/tests/distributed"
-  num_gpus: 2 # only support 1 or 2 for now.
+  num_gpus: 2
+  mirror_hardwares: [amd]
   commands:
   - pytest -v -s test_pynccl.py
   - pytest -v -s test_pynccl_library.py
   - TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_chunked_prefill_distributed.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_chunked_prefill_distributed.py
 
+- label: Distributed Tests (Multiple Groups)
+  working_dir: "/vllm-workspace/tests/distributed"
+  num_gpus: 4
+  commands:
+  - pytest -v -s test_pynccl.py
+
 - label: Engine Test
-  command: pytest -v -s engine tokenization test_sequence.py test_config.py
+  mirror_hardwares: [amd]
+  command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py
 
 - label: Entrypoints Test
   commands:
@@ -50,6 +59,7 @@ steps:
 
 - label: Examples Test
   working_dir: "/vllm-workspace/examples"
+  mirror_hardwares: [amd]
   commands:
   # install aws cli for llava_example.py
   - pip install awscli
@@ -63,29 +73,35 @@ steps:
   parallelism: 4
 
 - label: Models Test
+  mirror_hardwares: [amd]
   commands:
   - bash ../.buildkite/download-images.sh
   - pytest -v -s models --ignore=models/test_llava.py --ignore=models/test_mistral.py
 
 - label: Llava Test
+  mirror_hardwares: [amd]
   commands:
   - bash ../.buildkite/download-images.sh
   - pytest -v -s models/test_llava.py
 
 - label: Prefix Caching Test
+  mirror_hardwares: [amd]
   commands:
   - pytest -v -s prefix_caching
 
 - label: Samplers Test
   command: pytest -v -s samplers
 
 - label: LogitsProcessor Test
+  mirror_hardwares: [amd]
   command: pytest -v -s test_logits_processor.py
 
 - label: Worker Test
+  mirror_hardwares: [amd]
   command: pytest -v -s worker
 
 - label: Speculative decoding tests
+  mirror_hardwares: [amd]
   command: pytest -v -s spec_decode
 
 - label: LoRA Test %N
@@ -98,8 +114,12 @@ steps:
 - label: Metrics Test
   command: pytest -v -s metrics
 
+- label: Quantization Test
+  command: pytest -v -s quantization
+
 - label: Benchmarks
   working_dir: "/vllm-workspace/.buildkite"
+  mirror_hardwares: [amd]
   commands:
   - pip install aiohttp
   - bash run-benchmarks.sh
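The two step-level keys threaded through this diff, `mirror_hardwares` and `num_gpus`, are consumed by the template in the next file. A sketch of a hypothetical step using both; the label and test path are invented for illustration:

```yaml
# Illustrative step definition; "My New Test" and the test file are hypothetical.
- label: My New Test
  mirror_hardwares: [amd]   # also mirror this step onto the AMD queue
  num_gpus: 2               # mapped to the gpu-priority-cls-2 priority class below
  command: pytest -v -s my_new_test.py
```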
28 changes: 24 additions & 4 deletions .buildkite/test-template.j2
@@ -16,12 +16,29 @@ steps:
     limit: 5
 - wait
 
-- label: "AMD Test"
+- group: "AMD Tests"
+  depends_on: ~
+  steps:
+  {% for step in steps %}
+  {% if step.mirror_hardwares and "amd" in step.mirror_hardwares %}
+  - label: "AMD: {{ step.label }}"
+    agents:
+      queue: amd
+    command: bash .buildkite/run-amd-test.sh "'cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}'"
+    env:
+      DOCKER_BUILDKIT: "1"
+  {% endif %}
+  {% endfor %}
+
+- label: "Neuron Test"
+  depends_on: ~
   agents:
-    queue: amd
-  command: bash .buildkite/run-amd-test.sh
+    queue: neuron
+  command: bash .buildkite/run-neuron-test.sh
+  soft_fail: true
 
-- label: "CPU Test"
+- label: "Intel Test"
+  depends_on: ~
   command: bash .buildkite/run-cpu-test.sh
 
 {% for step in steps %}
@@ -39,6 +56,9 @@ steps:
     plugins:
     - kubernetes:
         podSpec:
+          {% if step.num_gpus %}
+          priorityClassName: gpu-priority-cls-{{ step.num_gpus }}
+          {% endif %}
           volumes:
          - name: dshm
             emptyDir:
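To make the Jinja loop concrete, this is roughly what the new AMD group would render to for the mirrored Core Test step from the pipeline above; the working directory is an assumption standing in for `default_working_dir`, which is defined elsewhere in the template:

```yaml
# Approximate rendered output for one mirrored step; the cd path is assumed.
- group: "AMD Tests"
  depends_on: ~
  steps:
  - label: "AMD: Core Test"
    agents:
      queue: amd
    command: bash .buildkite/run-amd-test.sh "'cd /vllm-workspace/tests && pytest -v -s core'"
    env:
      DOCKER_BUILDKIT: "1"
```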
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/200-installation.yml
@@ -18,6 +18,7 @@ body:
         # For security purposes, please feel free to check the contents of collect_env.py before running it.
         python collect_env.py
         ```
+        It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
     value: |
       ```text
       The output of `python collect_env.py`
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/300-usage.yml
@@ -18,6 +18,7 @@ body:
         # For security purposes, please feel free to check the contents of collect_env.py before running it.
         python collect_env.py
         ```
+        It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
     value: |
       ```text
       The output of `python collect_env.py`
3 changes: 3 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -18,6 +18,7 @@ body:
         # For security purposes, please feel free to check the contents of collect_env.py before running it.
         python collect_env.py
         ```
+        It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
     value: |
       ```text
       The output of `python collect_env.py`
@@ -57,6 +58,8 @@ body:
       If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
 
       Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
+
+      If you experienced crashes or hangs, it would be helpful to run vllm with `export VLLM_TRACE_FUNCTION=1`. All the function calls in vllm will be recorded. Inspect these log files, and tell which function crashes or hangs.
   placeholder: |
     A clear and concise description of what the bug is.
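The tracing advice added to the bug template ties back to commit d7e2c90 in this sync, which introduced the per-function-call logging option. A minimal sketch of using it to localize a hang; the reproduction script name is a placeholder:

```bash
# Illustrative only: repro_hang.py is a hypothetical reproduction script.
export VLLM_TRACE_FUNCTION=1   # have vLLM record every function call to trace logs
python repro_hang.py           # reproduce the hang, then inspect the logs to see
                               # which function last executed
```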
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/700-performance discussion.yml
@@ -39,6 +39,7 @@ body:
         # For security purposes, please feel free to check the contents of collect_env.py before running it.
         python collect_env.py
         ```
+        It is suggested to download and execute the latest script, as vllm might frequently update the diagnosis information needed for accurately and quickly responding to issues.
     value: |
       ```text
       The output of `python collect_env.py`