Commit: Merge branch 'master' into integrate_sanity_tests_with_pytest
Showing 28 changed files with 483 additions and 19 deletions.
```diff
@@ -2,6 +2,8 @@ on:
   push:
     branches:
       - master
+  merge_group:
+
 jobs:
   build_docs_job:
     runs-on: ubuntu-20.04
```
```diff
@@ -7,6 +7,8 @@ on:
   pull_request:
     branches:
       - master
+  merge_group:
+
 
 jobs:
   mypy:
```
@@ -0,0 +1,38 @@
# Model Inference Optimization Checklist

This checklist describes some steps that should be completed when diagnosing model inference performance issues. Some of these suggestions are only applicable to NLP models (e.g., ensuring the input is not over-padded and sequence bucketing), but the general principles are useful for other models too.

## General System Optimizations

- Check the versions of PyTorch, the Nvidia driver, and other components, and update to the latest compatible releases. Oftentimes known performance bugs have already been fixed.

- Collect system-level activity logs to understand the overall resource utilization. As the first step of optimization, it is useful to know how the model inference pipeline is using the system resources at a high level. Even simple CLI tools such as nvidia-smi and htop can be helpful.

- Start with the target that has the highest impact on performance. It should be obvious from the system activity logs where the biggest bottleneck is; look beyond model inference, as pre/post processing can be expensive and can affect the end-to-end throughput just as much.

- Quantify and mitigate the influence of slow I/O such as disk and network on end-to-end performance. While optimizing I/O is out of scope for this checklist, look for techniques that use async, concurrency, pipelining, etc. to effectively "hide" the cost of I/O.
- For model inference on input sequences of dynamic length (e.g., transformers for NLP), make sure the tokenizer is not over-padding the input. If a transformer was trained with padding to a constant length (e.g., 512) and deployed with the same padding, it would run unnecessarily slowly (by orders of magnitude) on short sequences; see the sketch after this list.

- Vision models with input in JPEG format often benefit from faster JPEG decoding on CPU, such as libjpeg-turbo and Pillow-SIMD, and on GPU, such as torchvision.io.decode_jpeg and Nvidia DALI. As this [example](https://colab.research.google.com/drive/1NMaLS8PG0eYhbd8IxQAajXgXNIZ_AvHo?usp=sharing) shows, Nvidia DALI is about 20% faster than torchvision, even on an old K80 GPU.
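
The sketch below illustrates the over-padding point. It assumes a Hugging Face `transformers` tokenizer; the model name and inputs are placeholders rather than anything prescribed by this checklist.

```python
# Hypothetical sketch: pad to the longest sequence in the batch rather than
# to a fixed max_length, so short inputs do not drag in hundreds of pad tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
texts = ["a short query", "another, slightly longer input sentence"]

# Over-padded: every sequence becomes 512 tokens regardless of its real length.
over_padded = tokenizer(texts, padding="max_length", max_length=512,
                        truncation=True, return_tensors="pt")

# Right-sized: pad only to the longest sequence in this batch.
right_sized = tokenizer(texts, padding="longest", truncation=True,
                        return_tensors="pt")

print(over_padded["input_ids"].shape, right_sized["input_ids"].shape)
```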
## Model Inference Optimizations

Start model inference optimization only after other factors, the "low-hanging fruit", have been extensively evaluated and addressed.

- Use fp16 for GPU inference. The speed will most likely more than double on newer GPUs with tensor cores, with negligible accuracy degradation. Technically fp16 is a type of quantization, but since it seldom suffers from loss of accuracy for inference, it should always be explored. As shown in this [article](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#abstract), use of fp16 offers speed-ups in large neural network applications. A minimal sketch is given after this list.

- Use model quantization (i.e., int8) for CPU inference. Explore the different quantization options: dynamic quantization, static quantization, and quantization-aware training, as well as tools such as Intel Neural Compressor that provide more sophisticated quantization methods. Note that quantization comes with some loss in accuracy and might not always offer a significant speed-up on some hardware, so it is not always the right approach. See the sketch after this list.

- Balance throughput and latency with smart batching. While meeting the latency SLA, try larger batch sizes to increase the throughput.

- Try optimized inference engines such as onnxruntime, tensorRT, lightseq, ctranslate-2, etc. These engines often provide additional optimizations such as operator fusion, in addition to model quantization.

- Try model distillation. This is more involved and often requires training data, but the potential gain can be large. For example, MiniLM achieves 99% of the accuracy of the original BERT base model while being 2X faster.

- If working on CPU, you can try core pinning. You can find more information on how to work with this [in this blog post](https://pytorch.org/tutorials/intermediate/torchserve_with_ipex#grokking-pytorch-intel-cpu-performance-from-first-principles).

- For batch processing on sequences with different lengths, sequence bucketing could potentially improve the throughput by 2X. A simple implementation of sequence bucketing is to sort all inputs by sequence length before feeding them to the model, as this reduces unnecessary padding when batching the sequences; a sketch follows this list.
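
The following sketch illustrates the fp16 and int8 items above; the model, shapes, and data are placeholders, not part of the checklist.

```python
# Hypothetical sketch of the fp16 (GPU) and dynamic int8 (CPU) items above.
import copy
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 128)
).eval()  # placeholder model
example = torch.randn(8, 512)  # placeholder batch

# GPU inference in fp16 via autocast: ops run in half precision on tensor cores.
if torch.cuda.is_available():
    gpu_model = copy.deepcopy(model).cuda()
    with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
        out_fp16 = gpu_model(example.cuda())

# CPU inference with dynamic int8 quantization of the Linear layers.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
with torch.inference_mode():
    out_int8 = quantized(example)
```

And a sketch of the sequence-bucketing idea, again with placeholder data; batching and restoring the original order will depend on the serving pipeline.

```python
# Hypothetical sketch of sequence bucketing: sort inputs by length before batching
# so each batch only pads to the length of its longest member.
def bucket_by_length(sequences, batch_size):
    """sequences: list of token-id lists. Returns (batches, order), where `order`
    lets the caller restore the original ordering of the outputs."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = [
        [sequences[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]
    return batches, order
```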
While this checklist is not exhaustive, going through the items will likely help you squeeze more performance out of your model inference pipeline.
examples/large_models/Huggingface_accelerate/llama2/Readme.md (60 changes: 60 additions & 0 deletions)
@@ -0,0 +1,60 @@
# Loading meta-llama/Llama-2-70b-chat-hf on AWS EC2 g5.24xlarge using accelerate

This document describes serving large Hugging Face models with limited resources using accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the meta device (with empty weights) and the state dict is then loaded inside it (shard by shard in the case of a sharded checkpoint).
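
As a rough sketch of what this loading path looks like in user code (the dtype and `device_map` choices below are assumptions for illustration; the handler in this example may configure things differently):

```python
# Hypothetical sketch of low-memory loading with accelerate: the model is
# created on the meta device with empty weights, then the state dict is
# loaded shard by shard.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    low_cpu_mem_usage=True,     # build on the meta device, avoid a full in-RAM copy
    torch_dtype=torch.float16,  # assumed dtype for illustration
    device_map="auto",          # let accelerate place shards across available GPUs
)
```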

### Step 1: Download model permission

Follow [these instructions](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to get permission.

Log in with a Hugging Face account:
```
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

```bash
python ../Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-chat-hf
```
The model will be saved at `model/models--meta-llama--Llama-2-70b-chat-hf`.

### Step 2: Generate MAR file

Add the downloaded path to `model_path:` in `model-config.yaml` (a hypothetical sketch of this file is shown after the command below) and run the following.

```bash
torch-model-archiver --model-name llama2-70b-chat --version 1.0 --handler custom_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive
```
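
For orientation only, here is a hypothetical sketch of what `model-config.yaml` might contain. Only `model_path` is referenced in this README; every other key and value below is an assumption, so use the file that ships with the example.

```yaml
# Hypothetical sketch only; the example ships its own model-config.yaml.
minWorkers: 1
maxWorkers: 1
responseTimeout: 1200
deviceType: "gpu"
handler:
    model_name: "meta-llama/Llama-2-70b-chat-hf"          # assumed handler parameter
    model_path: "model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/<snapshot-dir>"  # placeholder
    max_new_tokens: 50                                     # assumed handler parameter
```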

If you are using conda and notice issues with mpi4py, install openmpi-mpicc with the following:

```
conda install -c conda-forge openmpi-mpicc
```

### Step 3: Add the MAR file to the model store

```bash
mkdir model_store
mv llama2-70b-chat model_store
mv model model_store/llama2-70b-chat
```

### Step 4: Start TorchServe

Update config.properties and start TorchServe:

```bash
torchserve --start --ncs --ts-config config.properties --model-store model_store --models llama2-70b-chat
```

### Step 5: Run inference

```bash
curl -v "http://localhost:8080/predictions/llama2-70b-chat" -T sample_text.txt
```

This results in output similar to the following:
```
Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings'
```
examples/large_models/Huggingface_accelerate/llama2/config.properties (6 changes: 6 additions & 0 deletions)
@@ -0,0 +1,6 @@
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
enable_envvars_config=true
install_py_dep_per_model=true