From 70508ba7bba73b7dbffef5d120a187a0a811a484 Mon Sep 17 00:00:00 2001
From: Michael Clifford
Date: Wed, 3 Jan 2024 18:00:14 -0500
Subject: [PATCH] remove old files, update README: refactor 2 of 2

---
 README.md                                   | 102 ++----------------
 arm/Containerfile                           |   9 --
 .../model_services/builds/arm/Containerfile |   2 +-
 .../model_services/builds/x86/Containerfile |   2 +-
 chatbot/model_services/chat_service.py      |   2 +-
 src/app.py                                  |   9 --
 src/chat.py                                 |  98 -----------------
 src/run_locallm.py                          |  11 --
 x86/Containerfile                           |  12 ---
 9 files changed, 14 insertions(+), 233 deletions(-)
 delete mode 100644 arm/Containerfile
 delete mode 100644 src/app.py
 delete mode 100644 src/chat.py
 delete mode 100644 src/run_locallm.py
 delete mode 100644 x86/Containerfile

diff --git a/README.md b/README.md
index b15d36d28..5b0c5e805 100644
--- a/README.md
+++ b/README.md
@@ -1,103 +1,23 @@
 # Locallm
 
-This repo contains the assets required to build and run an application on your Mac that uses a local instance of a large language model (LLM).
+This repo contains artifacts that can be used to build and run LLM (Large Language Model) services locally on your Mac using podman. These containerized LLM services help developers quickly prototype new LLM-based applications without relying on any externally hosted services. Because the services are already containerized, they also make it easier to move from prototype to production.
 
-This README outlines three different approaches to running the application:
-* [Pull and Run](#pull-and-run)
-* [Build and Run](#build-and-run)
-* [Deploy on Openshift](#deploy-on-openshift)
+## Current Locallm Services:
+* [Chatbot](#chatbot)
+* [Text Summarization](#text-summarization)
+* [Fine-tuning](#fine-tuning)
 
+### Chatbot
 
-## Pull and Run
-
-If you have [podman](https://podman-desktop.io/) installed on your Mac and don't want to build anything, you can pull the image directly from my [quay.io](quay.io) repository and run the application locally following the instructions below.
-
-_Note: You can increase the speed of the LLM's response time by increasing the resources allocated to your podman's virtual machine._
-
-### Pull the image from quay.
-```bash
-podman pull quay.io/michaelclifford/locallm
-```
-### Run the container
-```bash
-podman run -it -p 7860:7860 quay.io/michaelclifford/locallm:latest
-```
-
-Go to `0.0.0.0:7860` in your browser and start to chat with the LLM.
+A simple chatbot using the Gradio UI. Learn how to build and run this model service here: [Chatbot](/chatbot/).
 
 ![](/assets/app.png)
 
-## Build and Run
-
-If you'd like to customize the application or change the model, you can rebuild and run the application using [podman](https://podman-desktop.io/).
-
-
-_Note: If you would like to build this repo as is, it expects that you have downloaded this [model](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q5_K_S.gguf) ([llama-2-7b-chat.Q5_K_S.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/blob/main/llama-2-7b-chat.Q5_K_S.gguf)) from huggingface and saved it into the top directory of this repo._
-
-### Build the image locally for arm
-
-```bash
-podman build -t locallm . -f arm/Containerfile
-```
-
-### Run the image
-
-```bash
-podman run -it -p 7860:7860 locallm:latest
-```
-
-Go to `0.0.0.0:7860` in your browser and start to chat with the LLM.
-
-![](/assets/app.png)
-
-## Deploy on Openshift
-
-Now that we've developed an application locally that leverages an LLM, we likely want to share it with a wider audience. Let's get it off our machine and run it on OpenShift.
-
-### Rebuild for x86
-We'll need to rebuild the image for the x86 architecture for most use case outside of our Mac. Since this is an AI workload, we will also want to take advantage of Nvidia GPU's available outside our local machine. Therefore, this image's base image contains CUDA and builds llama.cpp specifically for a CUDA environment.
+### Text Summarization
 
-```bash
-podman build -t locallm:x86 . -f x86/Containerfile
-```
-
- Before building the image, you can change line 6 of `x86/Containerfile` if you'd like to **NOT** use CUDA and GPU acceleration by setting `-DLLAMA_CUBLAS` to `off`
-
-```Containerfile
-ENV CMAKE_ARGS="-DLLAMA_CUBLAS=off"
-```
-
-### Push to Quay
-
-Once you login to [quay.io](quay.io) you can push your own newly built version of this LLM application to your repository for use by others.
-
-```bash
-podman login quay.io
-```
-
-```bash
-podman push localhost/locallm quay.io//locallm
-```
-
-### Deploy
-
-Now that your model lives in a remote repository we can deploy it. Go to your OpenShift developer dashboard and select "+Add" to use the Openshift UI to deploy the application.
-
-![](/assets/add_image.png)
-
-Select "Container images"
-
-![](/assets/container_images.png)
-
-Then fill out the form on the Deploy page with your [quay.io](quay.io) image name and make sure to set the "Target port" to 7860.
-
-![](/assets/deploy.png)
-
-Hit "Create" at the bottom and watch your application start.
-
-Once the pods are up and the application is working, navigate to the "Routs" section and click on the link created for you to interact with your app.
-
-![](/assets/app.png)
+An LLM app that can summarize arbitrarily long text inputs. Learn how to build and run this model service here: [Text Summarization](/summarizer/).
 
+### Fine-tuning
+This application allows a user to select a model and a dataset to fine-tune that model on. Once the job completes, it outputs a new fine-tuned model that the user can apply to the other LLM services. Learn how to build and run this model training job here: [Fine-tuning](/finetune/).
\ No newline at end of file
diff --git a/arm/Containerfile b/arm/Containerfile
deleted file mode 100644
index f2dc3eea1..000000000
--- a/arm/Containerfile
+++ /dev/null
@@ -1,9 +0,0 @@
-FROM registry.access.redhat.com/ubi9/python-39:1-158
-WORKDIR /locallm
-COPY ../requirements.txt /locallm/requirements.txt
-RUN pip install --upgrade pip
-RUN pip install --no-cache-dir --upgrade -r /locallm/requirements.txt
-ENV MODEL_FILE=llama-2-7b-chat.Q5_K_S.gguf
-COPY ../${MODEL_FILE} /locallm/
-COPY ../src/ /locallm
-ENTRYPOINT [ "python", "app.py" ]
diff --git a/chatbot/model_services/builds/arm/Containerfile b/chatbot/model_services/builds/arm/Containerfile
index 2a0469c8f..d4b9e50cc 100644
--- a/chatbot/model_services/builds/arm/Containerfile
+++ b/chatbot/model_services/builds/arm/Containerfile
@@ -4,7 +4,7 @@ COPY builds/requirements.txt /locallm/requirements.txt
 RUN pip install --upgrade pip
 RUN pip install --no-cache-dir --upgrade -r /locallm/requirements.txt
 ENV MODEL_FILE=llama-2-7b-chat.Q5_K_S.gguf
-COPY builds/${MODEL_FILE} /locallm/
+COPY builds/${MODEL_FILE} /locallm/models/
 COPY builds/src/ /locallm
 COPY chat_service.py /locallm/chat_service.py
 ENTRYPOINT [ "python", "chat_service.py" ]
diff --git a/chatbot/model_services/builds/x86/Containerfile b/chatbot/model_services/builds/x86/Containerfile
index eb91b7d67..b56ec7817 100644
--- a/chatbot/model_services/builds/x86/Containerfile
+++ b/chatbot/model_services/builds/x86/Containerfile
@@ -6,7 +6,7 @@ ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on"
 ENV FORCE_CMAKE=1
 RUN pip install --upgrade --force-reinstall --no-cache-dir -r /locallm/requirements.txt
 ENV MODEL_FILE=llama-2-7b-chat.Q5_K_S.gguf
-COPY builds/${MODEL_FILE} /locallm/
+COPY builds/${MODEL_FILE} /locallm/models/
 COPY builds/src/ /locallm
 COPY chat_service.py /locallm/chat_service.py
 ENTRYPOINT [ "python", "chat_service.py" ]
diff --git a/chatbot/model_services/chat_service.py b/chatbot/model_services/chat_service.py
index 9fa0e3f8a..269a5f1ad 100644
--- a/chatbot/model_services/chat_service.py
+++ b/chatbot/model_services/chat_service.py
@@ -5,7 +5,7 @@
 from llamacpp_utils import clip_history
 
 
-llm = Llama("llama-2-7b-chat.Q5_K_S.gguf",
+llm = Llama("models/llama-2-7b-chat.Q5_K_S.gguf",
             n_gpu_layers=-1,
             n_ctx=2048,
             max_tokens=512,
diff --git a/src/app.py b/src/app.py
deleted file mode 100644
index dec783058..000000000
--- a/src/app.py
+++ /dev/null
@@ -1,9 +0,0 @@
-import gradio as gr
-from chat import Chat
-
-if __name__ == "__main__":
-
-    chat = Chat()
-    demo = gr.ChatInterface(chat.ask)
-    demo.launch(server_name="0.0.0.0")
-    
\ No newline at end of file
diff --git a/src/chat.py b/src/chat.py
deleted file mode 100644
index 63fd29e80..000000000
--- a/src/chat.py
+++ /dev/null
@@ -1,98 +0,0 @@
-import os
-from llama_cpp import Llama
-
-class Chat:
-
-    n_ctx = 2048
-
-    def __init__(self) -> None:
-        self.chat_history = [
-            {"role": "system", "content": """You are a helpful assistant that is comfortable speaking
-             with C level executives in a professional setting."""},
-        ]
-        self.llm = Llama(model_path=os.getenv("MODEL_FILE",
-                                              "llama-2-7b-chat.Q5_K_S.gguf"),
-                         n_ctx=Chat.n_ctx,
-                         n_gpu_layers=-1,
-                         n_batch=Chat.n_ctx,
-                         f16_kv=True,
-                         stream=True,)
-
-
-    def reset_system_prompt(self, prompt=None):
-        if not prompt:
-            self.chat_history[0] = {"role":"system", "content":""}
-        else:
-            self.chat_history[0] = {"role":"system",
-                                    "content": prompt}
-        print(self.chat_history[0])
-
-
-    def clear_history(self):
-        self.chat_history = [self.chat_history[0]]
-
-
-    def count_tokens(self, messages):
-        num_extra_tokens = len(self.chat_history) * 6 # accounts for tokens outside of "content"
-        token_count = sum([len(self.llm.tokenize(bytes(x["content"], "utf-8"))) for x
-                           in messages]) + num_extra_tokens
-        return token_count
-
-
-    def clip_history(self, prompt):
-        context_length = Chat.n_ctx
-        prompt_length = len(self.llm.tokenize(bytes(prompt["content"], "utf-8")))
-        history_length = self.count_tokens(self.chat_history)
-        input_length = prompt_length + history_length
-        print(input_length)
-        while input_length > context_length:
-            print("Clipping")
-            self.chat_history.pop(1)
-            self.chat_history.pop(1)
-            history_length = self.count_tokens(self.chat_history)
-            input_length = history_length + prompt_length
-            print(input_length)
-
-
-    def ask(self, prompt, history):
-        prompt = {"role":"user", "content":prompt}
-        self.chat_history.append(prompt)
-        self.clip_history(prompt)
-        chat_response = self.llm.create_chat_completion(self.chat_history, stream=True)
-        reply = ""
-        for i in chat_response:
-            token = i["choices"][0]["delta"]
-            if "content" in token.keys():
-                reply += token["content"]
-                yield reply
-        self.chat_history.append({"role":"assistant","content":reply})
-
-
-def chunk_tokens(llm, prompt, chunk_size):
-    tokens = tokenize(llm, prompt)
-    num_tokens = count_tokens(llm, prompt)
-    chunks = []
-    for i in range((num_tokens//chunk_size)+1):
-        chunk = str(llm.detokenize(tokens[:chunk_size]),"utf-8")
-        chunks.append(chunk)
-        tokens = tokens[chunk_size:]
-    return chunks
-
-def tokenize(llama, prompt):
-    return llama.tokenize(bytes(prompt, "utf-8"))
-
-def count_tokens(llama,prompt):
-    return len(tokenize(llama,prompt)) + 5
-
-def clip_history(llama, prompt, history, n_ctx, max_tokens):
-    prompt_len = count_tokens(llama, prompt)
-    history_len = sum([count_tokens(llama, x["content"]) for x in history])
-    input_len = prompt_len + history_len
-    print(input_len)
-    while input_len >= n_ctx-max_tokens:
-        print("Clipping")
-        history.pop(1)
-        history_len = sum([count_tokens(llama, x["content"]) for x in history])
-        input_len = history_len + prompt_len
-        print(input_len)
-    return history
diff --git a/src/run_locallm.py b/src/run_locallm.py
deleted file mode 100644
index 0efaa6866..000000000
--- a/src/run_locallm.py
+++ /dev/null
@@ -1,11 +0,0 @@
-
-from src.chat import Chat
-
-if __name__ == "__main__":
-
-    chat = Chat()
-    print("\n Start Chatting with Llama2...")
-    while True:
-        query = input("\n User: ")
-        response = chat.ask(query)
-        print("\n Agent: " + response)
diff --git a/x86/Containerfile b/x86/Containerfile
deleted file mode 100644
index 2090c185c..000000000
--- a/x86/Containerfile
+++ /dev/null
@@ -1,12 +0,0 @@
-
-FROM quay.io/opendatahub/workbench-images:cuda-ubi9-python-3.9-20231206
-WORKDIR /locallm
-COPY ../requirements.txt /locallm/requirements.txt
-RUN pip install --upgrade pip
-ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on"
-ENV FORCE_CMAKE=1
-RUN pip install --upgrade --force-reinstall --no-cache-dir -r /locallm/requirements.txt
-ENV MODEL_FILE=llama-2-7b-chat.Q5_K_S.gguf
-COPY ../${MODEL_FILE} /locallm/
-COPY ../src/ /locallm
-ENTRYPOINT [ "python", "app.py" ]
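
For reviewers who want to sanity-check the relocated model path, below is a minimal sketch (not part of the patch) of how the chatbot model service is expected to load the model from the new `models/` directory and clip chat history before requesting a completion. The `Llama()` arguments mirror `chatbot/model_services/chat_service.py` above; the `count_tokens()` and `clip_history()` helpers are assumptions modeled on the functions removed from `src/chat.py`, since `llamacpp_utils.py` itself is not shown in this diff.

```python
from llama_cpp import Llama  # provided by the llama-cpp-python package

# Load the model from its new location inside the image (the Containerfiles above
# now COPY the model file into /locallm/models/).
llm = Llama(model_path="models/llama-2-7b-chat.Q5_K_S.gguf",
            n_gpu_layers=-1,
            n_ctx=2048)

def count_tokens(llama, text):
    # Rough token count for one message body, padded slightly for chat-format tokens.
    return len(llama.tokenize(bytes(text, "utf-8"))) + 5

def clip_history(llama, prompt, history, n_ctx=2048, max_tokens=512):
    # Drop the oldest non-system message until prompt + history fit in the context
    # window, leaving room for the completion. Index 0 holds the system prompt,
    # so it is never removed.
    total = count_tokens(llama, prompt) + sum(count_tokens(llama, m["content"]) for m in history)
    while total >= n_ctx - max_tokens and len(history) > 1:
        history.pop(1)
        total = count_tokens(llama, prompt) + sum(count_tokens(llama, m["content"]) for m in history)
    return history

history = [{"role": "system", "content": "You are a helpful assistant."}]
user_prompt = "What does this repo provide?"
history = clip_history(llm, user_prompt, history)
history.append({"role": "user", "content": user_prompt})
response = llm.create_chat_completion(messages=history, max_tokens=512)
print(response["choices"][0]["message"]["content"])
```

This assumes it is run from a directory that contains `models/llama-2-7b-chat.Q5_K_S.gguf`, for example `/locallm` inside the rebuilt chatbot image.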