Update llama documentation (#2683)
* update llama documentation

* update llama documentation

* update llama documentation

* lint

* removed TP based on review comments

* spellcheck

* review comments

* review comments

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-50-11.us-west-2.compute.internal>
agunapal and Ubuntu committed Oct 6, 2023
1 parent e29512a commit f57240f
Showing 6 changed files with 40 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ Refer to [torchserve docker](docker/README.md) for details.


## 🏆 Highlighted Examples
* [Serving Llama 2 with TorchServe](examples/LLM/llama2/README.md)
* [Chatbot with Llama 2 on Mac 🦙💬](examples/LLM/llama2/chat_app)
* [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration/ Flash Attention & Xformer Memory Efficient ](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
* [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)
38 changes: 38 additions & 0 deletions examples/LLM/llama2/README.md
@@ -0,0 +1,38 @@
# Llama 2: Next generation of Meta's Language Model
![Llama 2](./images/llama.png)

TorchServe supports serving Llama 2 in a number of ways. The examples in this document range from a newcomer to TorchServe learning how to serve Llama 2 through a chat app, to an advanced user serving Llama 2 with micro batching and streaming responses.

## 🦙💬 Llama 2 Chatbot

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/LLM/llama2/chat_app)

This example shows how to deploy a Llama 2 chat app using TorchServe.
We use [streamlit](https://github.com/streamlit/streamlit) to create the app.

The model is served with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).

You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how `batch_size` affects inference time.
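The worker scaling mentioned above is driven through TorchServe's management API (port 8081 by default). A minimal sketch, assuming a locally running TorchServe with a registered model named `llamacpp` (the model name here is an assumption for illustration):

```python
# Sketch: scaling TorchServe backend workers via the management API.
# Assumes TorchServe is running locally and a model named "llamacpp"
# is registered -- the model name is an assumption, not from this example.
import json
import urllib.request


def scale_url(base, model, min_worker, max_worker):
    """Build the management-API URL that sets the worker range for a model."""
    return f"{base}/models/{model}?min_worker={min_worker}&max_worker={max_worker}"


def set_workers(model, min_worker, max_worker, base="http://localhost:8081"):
    """Send the PUT request asking TorchServe to rescale the model's workers."""
    req = urllib.request.Request(
        scale_url(base, model, min_worker, max_worker), method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage (commented out -- requires a running TorchServe instance):
# print(set_workers("llamacpp", min_worker=1, max_worker=2))
```

With curl, the equivalent call is `curl -X PUT "http://localhost:8081/models/llamacpp?min_worker=2"`.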

![Chatbot Architecture](./chat_app/screenshots/architecture.png)

## Llama 2 with HuggingFace

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2)

This example shows how to serve the Llama 2 70B model with limited resources using [HuggingFace](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf). It demonstrates the following optimizations:
1) HuggingFace `accelerate`. This option is activated with `low_cpu_mem_usage=True`.
2) Quantization from [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes), activated with `load_in_8bit=True`.

With `low_cpu_mem_usage=True`, the model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
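As a sketch, the two optimizations above translate into keyword arguments for `from_pretrained()` (the model id is taken from the example link; the loading call itself is left commented out because it downloads the 70B weights):

```python
# Sketch: the two low-memory loading optimizations as from_pretrained() kwargs.
# Assumes the `transformers`, `accelerate`, and `bitsandbytes` packages
# are installed.
def low_memory_load_kwargs(load_in_8bit=True):
    """Keyword arguments enabling low-memory loading of a large checkpoint."""
    return {
        "low_cpu_mem_usage": True,     # accelerate: empty weights on the meta device
        "load_in_8bit": load_in_8bit,  # bitsandbytes: int8 quantization
        "device_map": "auto",          # let accelerate place shards across devices
    }


# Usage (commented out -- loading the 70B checkpoint needs large GPUs):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-70b-chat-hf", **low_memory_load_kwargs())
```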

## Llama 2 on Inferentia

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2/llama2)

### [PyTorch Blog](https://pytorch.org/blog/high-performance-llama/)

This example shows how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion, with support for [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1).

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used; it takes care of model partitioning and running inference.
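A streaming response arrives as a sequence of HTTP chunks that the client reads incrementally and concatenates. A minimal client sketch (the inference URL and the model name `llama-2-70b` are assumptions for illustration):

```python
# Sketch of a client consuming a TorchServe streaming response.
# The inference URL and model name below are assumptions, not taken
# from this example.
import urllib.request


def join_stream(chunks):
    """Accumulate streamed byte chunks into the full completion text."""
    return b"".join(chunks).decode("utf-8", errors="replace")


def stream_prediction(prompt, url="http://localhost:8080/predictions/llama-2-70b"):
    """Yield raw byte chunks of the response as TorchServe streams them back."""
    req = urllib.request.Request(url, data=prompt.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        for chunk in iter(lambda: resp.read(64), b""):
            yield chunk


# Usage (commented out -- requires a running TorchServe instance):
# print(join_stream(stream_prediction("Today the weather is really nice and")))
```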

![Inferentia 2 Software Stack](./images/software_stack_inf2.jpg)
1 change: 0 additions & 1 deletion examples/LLM/llama2/chat_app/client_app.py
@@ -6,7 +6,6 @@
# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

# Replicate Credentials
with st.sidebar:
st.title("🦙💬 Llama 2 Chatbot")

Binary file added examples/LLM/llama2/images/llama.png
Binary file added examples/LLM/llama2/images/software_stack_inf2.jpg
2 changes: 1 addition & 1 deletion ts_scripts/spellcheck_conf/wordlist.txt
@@ -1117,4 +1117,4 @@ sharding
quantized
Chatbot
LLM

bitsandbytes
