Update llama documentation (#2683)
* update llama documentation

* update llama documentation

* update llama documentation

* lint

* removed TP based on review comments

* spellcheck

* review comments

* review comments

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-50-11.us-west-2.compute.internal>
agunapal and Ubuntu committed Oct 6, 2023
1 parent e29512a commit f57240f
Showing 6 changed files with 40 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -77,6 +77,7 @@ Refer to [torchserve docker](docker/README.md) for details.


## 🏆 Highlighted Examples
* [Serving Llama 2 with TorchServe](examples/LLM/llama2/README.md)
* [Chatbot with Llama 2 on Mac 🦙💬](examples/LLM/llama2/chat_app)
* [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration/ Flash Attention & Xformer Memory Efficient ](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
* [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)
38 changes: 38 additions & 0 deletions examples/LLM/llama2/README.md
@@ -0,0 +1,38 @@
# Llama 2: Next generation of Meta's Language Model
![Llama 2](./images/llama.png)

TorchServe supports serving Llama 2 in a number of ways. The examples in this document range from a newcomer to TorchServe learning how to serve Llama 2 through a chat app, to an advanced user serving Llama 2 with micro batching and streaming responses.

## 🦙💬 Llama 2 Chatbot

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/LLM/llama2/chat_app)

This example shows how to deploy a Llama 2 chat app using TorchServe.
We use [streamlit](https://github.com/streamlit/streamlit) to create the app.

The model is served with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python).

You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how `batch_size` affects inference time.
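The worker scaling mentioned above is driven through TorchServe's management API (port 8081 by default). A minimal sketch, assuming a locally running TorchServe with a registered model named `llamacpp` (the model name here is an assumption for illustration):

```python
# Sketch: scaling TorchServe backend workers via the management API.
# Assumes TorchServe is running locally and a model named "llamacpp"
# is registered -- the model name is an assumption, not from this example.
import json
import urllib.request


def scale_url(base, model, min_worker, max_worker):
    """Build the management-API URL that sets the worker range for a model."""
    return f"{base}/models/{model}?min_worker={min_worker}&max_worker={max_worker}"


def set_workers(model, min_worker, max_worker, base="http://localhost:8081"):
    """Send the PUT request asking TorchServe to rescale the model's workers."""
    req = urllib.request.Request(
        scale_url(base, model, min_worker, max_worker), method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage (commented out -- requires a running TorchServe instance):
# print(set_workers("llamacpp", min_worker=1, max_worker=2))
```

With curl, the equivalent call is `curl -X PUT "http://localhost:8081/models/llamacpp?min_worker=2"`.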

![Chatbot Architecture](./chat_app/screenshots/architecture.png)

## Llama 2 with HuggingFace

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_accelerate/llama2)

This example shows how to serve the Llama 2 70B model with limited resources using [HuggingFace](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf). It demonstrates the following optimizations:
1) HuggingFace `accelerate`. This option is activated with `low_cpu_mem_usage=True`.
2) Quantization from [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes), activated with `load_in_8bit=True`.

With `low_cpu_mem_usage=True`, the model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
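As a sketch, the two optimizations above translate into keyword arguments for `from_pretrained()` (the model id is taken from the example link; the loading call itself is left commented out because it downloads the 70B weights):

```python
# Sketch: the two low-memory loading optimizations as from_pretrained() kwargs.
# Assumes the `transformers`, `accelerate`, and `bitsandbytes` packages
# are installed.
def low_memory_load_kwargs(load_in_8bit=True):
    """Keyword arguments enabling low-memory loading of a large checkpoint."""
    return {
        "low_cpu_mem_usage": True,     # accelerate: empty weights on the meta device
        "load_in_8bit": load_in_8bit,  # bitsandbytes: int8 quantization
        "device_map": "auto",          # let accelerate place shards across devices
    }


# Usage (commented out -- loading the 70B checkpoint needs large GPUs):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-70b-chat-hf", **low_memory_load_kwargs())
```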

## Llama 2 on Inferentia

### [Example Link](https://github.com/pytorch/serve/tree/master/examples/large_models/inferentia2/llama2)

### [PyTorch Blog](https://pytorch.org/blog/high-performance-llama/)

This example shows how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion, with support for [micro batching](https://github.com/pytorch/serve/tree/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/examples/micro_batching) and [streaming response](https://github.com/pytorch/serve/blob/96450b9d0ab2a7290221f0e07aea5fda8a83efaf/docs/inference_api.md#curl-example-1).

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used; it takes care of model partitioning and running inference.
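A streaming response arrives as a sequence of HTTP chunks that the client reads incrementally and concatenates. A minimal client sketch (the inference URL and the model name `llama-2-70b` are assumptions for illustration):

```python
# Sketch of a client consuming a TorchServe streaming response.
# The inference URL and model name below are assumptions, not taken
# from this example.
import urllib.request


def join_stream(chunks):
    """Accumulate streamed byte chunks into the full completion text."""
    return b"".join(chunks).decode("utf-8", errors="replace")


def stream_prediction(prompt, url="http://localhost:8080/predictions/llama-2-70b"):
    """Yield raw byte chunks of the response as TorchServe streams them back."""
    req = urllib.request.Request(url, data=prompt.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        for chunk in iter(lambda: resp.read(64), b""):
            yield chunk


# Usage (commented out -- requires a running TorchServe instance):
# print(join_stream(stream_prediction("Today the weather is really nice and")))
```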

![Inferentia 2 Software Stack](./images/software_stack_inf2.jpg)
1 change: 0 additions & 1 deletion examples/LLM/llama2/chat_app/client_app.py
@@ -6,7 +6,6 @@
# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

# Replicate Credentials
with st.sidebar:
st.title("🦙💬 Llama 2 Chatbot")

Binary file added examples/LLM/llama2/images/llama.png
Binary file added examples/LLM/llama2/images/software_stack_inf2.jpg
2 changes: 1 addition & 1 deletion ts_scripts/spellcheck_conf/wordlist.txt
@@ -1117,4 +1117,4 @@ sharding
quantized
Chatbot
LLM

bitsandbytes
