feat(blog): Add new post on llama-cpp-python and instructor library usage #434

Merged · 6 commits · Feb 12, 2024
docs/blog/posts/llama-cpp-python.md (119 additions, 0 deletions)
---
draft: False
date: 2024-02-12
slug: llama-cpp-python
tags:
- patching
authors:
- jxnl
---

# Structured Outputs with llama-cpp-python

Open-source LLMs are gaining popularity, and llama-cpp-python makes it possible to obtain structured outputs from `llama-cpp` models using JSON schema via a mixture of [constrained sampling](https://llama-cpp-python.readthedocs.io/en/latest/#json-schema-mode) and [speculative decoding](https://llama-cpp-python.readthedocs.io/en/latest/#speculative-decoding). It also provides an [OpenAI-compatible client](https://llama-cpp-python.readthedocs.io/en/latest/#openai-compatible-web-server), which can be used in-process to obtain structured output without any network dependency.

<!-- more -->

## Patching

Instructor's patch enhances a create call with the following features:

- `response_model` in `create` calls that returns a pydantic model
- `max_retries` in `create` calls that retries failed calls with a backoff strategy
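
Here is a minimal sketch of what a patched call looks like; the model path, the `UserDetail` model, and the prompt are illustrative placeholders, not part of the full example below:

```python
import llama_cpp
import instructor
from pydantic import BaseModel

llama = llama_cpp.Llama(
    model_path="path/to/model.gguf",  # placeholder: any local GGUF model
    chat_format="chatml",
    n_ctx=2048,
    verbose=False,
)

# Patch the create call so it accepts `response_model` and `max_retries`.
create = instructor.patch(
    create=llama.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,
)


class UserDetail(BaseModel):
    name: str
    age: int


user = create(
    response_model=UserDetail,  # parsed and validated into a pydantic model
    max_retries=3,  # re-ask the model if validation fails
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
print(user)  # UserDetail(name='Jason', age=25)
```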

!!! note "Learn More"

To learn more, please refer to the [docs](../../index.md). To understand the benefits of using Pydantic with Instructor, visit the tips and tricks section of the [why use Pydantic](../../why.md) page. If you want to check out examples of using Pydantic with Instructor, visit the [examples](../../examples/index.md) page.

## llama-cpp-python

llama-cpp-python recently added support for structured outputs via JSON schema. This is a time-saving alternative to extensive prompt engineering and can be used to obtain structured outputs.
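
For comparison, here is a minimal sketch of llama-cpp-python's JSON schema support used directly, without Instructor, reusing the `llama` instance from the sketch above; the schema and prompt are illustrative:

```python
# Constrain generation to a JSON schema with llama-cpp-python alone.
response = llama.create_chat_completion(
    messages=[{"role": "user", "content": "Return a user as JSON."}],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
)
print(response["choices"][0]["message"]["content"])  # JSON matching the schema
```

Instructor builds on this by validating the output against a pydantic model, retrying on failure, and supporting partial streaming.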

In this example we'll cover a more advanced use case, using `JSON_SCHEMA` mode to stream out partial models. To learn more about partial streaming, check out [partial streaming](../../concepts/partial.md).

```python
import llama_cpp
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

import instructor

from pydantic import BaseModel
from typing import List
from rich.console import Console


llama = llama_cpp.Llama(
model_path="../../models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
n_gpu_layers=-1,
chat_format="chatml",
n_ctx=2048,
draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2), #(1)!
logits_all=True,
verbose=False
)


create = instructor.patch(
create=llama.create_chat_completion_openai_v1, mode=instructor.Mode.JSON_SCHEMA #(2)!
)


text_block = """
In our recent online meeting, participants from various backgrounds joined to discuss
the upcoming tech conference. The names and contact details of the participants were as follows:

- Name: John Doe, Email: johndoe@email.com, Twitter: @TechGuru44
- Name: Jane Smith, Email: janesmith@email.com, Twitter: @DigitalDiva88
- Name: Alex Johnson, Email: alexj@email.com, Twitter: @CodeMaster2023

During the meeting, we agreed on several key points. The conference will be held on March 15th, 2024,
at the Grand Tech Arena located at 4521 Innovation Drive. Dr. Emily Johnson, a renowned AI researcher,
will be our keynote speaker.

The budget for the event is set at $50,000, covering venue costs, speaker fees, and promotional activities.
Each participant is expected to contribute an article to the conference blog by February 20th.

A follow-up meeting is scheduled for January 25th at 3 PM GMT to finalize the agenda and confirm the list of speakers.
"""


class User(BaseModel):
name: str
email: str
twitter: str


class MeetingInfo(BaseModel):
users: List[User]
date: str
location: str
budget: int
deadline: str


extraction_stream = create(
response_model=instructor.Partial[MeetingInfo], #(3)!
messages=[
{
"role": "user",
"content": f"Get the information about the meeting and the users {text_block}",
},
],
stream=True,
)


console = Console()

for extraction in extraction_stream:
obj = extraction.model_dump()
console.clear() #(4)!
console.print(obj)
```

1. We use `LlamaPromptLookupDecoding` to speed up structured output generation via speculative decoding: the draft model proposes candidate tokens during generation. `num_pred_tokens=10` is good for GPU, 2 is good for CPU.
2. We use `instructor.Mode.JSON_SCHEMA` to return a response constrained by the JSON schema of the response model.
3. We use `instructor.Partial` to stream out partial models.
4. This is just a simple example of how to stream out partial models and clear the console.
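
To run this example you'll need the packages imported above (`llama-cpp-python`, `instructor`, `pydantic`, and `rich`) installed, along with a local copy of the GGUF model weights referenced in `model_path`.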

![](../../img/partial.gif)