feat: add blog post and example script for extracting YouTube video chapters using OpenAI models #831

Merged · 4 commits · Jul 15, 2024
Binary file added docs/blog/posts/img/youtube-clips.gif
273 changes: 273 additions & 0 deletions docs/blog/posts/youtube-transcripts.md
---
draft: False
date: 2024-07-11
slug: youtube-transcripts
comments: true
authors:
- jxnl
---

# Analyzing YouTube Transcripts with Instructor

## Extracting Chapter Information

!!! info "Code Snippets"

    As always, the code is available in the `examples/youtube` folder of our repo, in the `run.py` file.

In this post, we'll show you how to summarize YouTube video transcripts into distinct chapters using `instructor`, then explore some ways to adapt the code to different applications.

By the end of this article, you'll be able to build an application like the one shown in the video below.

![](img/youtube-clips.gif)

Let's first install the required packages.

```bash
pip install openai instructor pydantic youtube_transcript_api
```

!!! info "Quick Note"

    The video we'll be using in this tutorial is [A Hacker's Guide To Language Models](https://www.youtube.com/watch?v=jkrNMKz9pWU) by Jeremy Howard. Its video id is `jkrNMKz9pWU`.

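If you'd rather paste a full URL than a bare video id, a small helper can pull the id out of the query string. This is a convenience sketch of our own — the `extract_video_id` name isn't part of any library:

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    # Handles standard watch URLs like https://www.youtube.com/watch?v=jkrNMKz9pWU
    query = parse_qs(urlparse(url).query)
    return query["v"][0]


print(extract_video_id("https://www.youtube.com/watch?v=jkrNMKz9pWU"))
#> jkrNMKz9pWU
```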
Next, let's start by defining a Pydantic Model for the structured chapter information that we want.

```python
from pydantic import BaseModel, Field


class Chapter(BaseModel):
    start_ts: float = Field(
        ...,
        description="Starting timestamp for a chapter.",
    )
    end_ts: float = Field(
        ...,
        description="Ending timestamp for a chapter.",
    )
    title: str = Field(
        ..., description="A concise and descriptive title for the chapter."
    )
    summary: str = Field(
        ...,
        description="A brief summary of the chapter's content, don't use words like 'the speaker'",
    )
```
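Before wiring the model up to an LLM, it's worth sanity-checking it in isolation — Pydantic validates and serializes the data for us. The values below are made up purely for illustration:

```python
from pydantic import BaseModel, Field


class Chapter(BaseModel):
    start_ts: float = Field(..., description="Starting timestamp for a chapter.")
    end_ts: float = Field(..., description="Ending timestamp for a chapter.")
    title: str = Field(..., description="A concise and descriptive title for the chapter.")
    summary: str = Field(..., description="A brief summary of the chapter's content")


# Hypothetical values, just to exercise validation and serialization
chapter = Chapter(
    start_ts=0.539,
    end_ts=9.72,
    title="Introduction",
    summary="An introduction to the talk.",
)
print(chapter.model_dump())
#> {'start_ts': 0.539, 'end_ts': 9.72, 'title': 'Introduction', 'summary': 'An introduction to the talk.'}
```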

We can use `youtube-transcript-api` to fetch the transcript of a video with the following function:

```python
from youtube_transcript_api import YouTubeTranscriptApi


def get_youtube_transcript(video_id: str) -> str:
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(
            [f"ts={entry['start']} - {entry['text']}" for entry in transcript]
        )
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return ""
```
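To see what the model will actually receive, here's the same `ts=` formatting applied to a couple of hand-written entries shaped like `get_transcript`'s output (mock data, no network call):

```python
# Mock entries in the shape returned by YouTubeTranscriptApi.get_transcript
transcript = [
    {"start": 0.539, "text": "hi I am Jeremy Howard from fast.ai and"},
    {"start": 4.62, "text": "this is a hacker's guide to language"},
]

formatted = " ".join(f"ts={entry['start']} - {entry['text']}" for entry in transcript)
print(formatted)
#> ts=0.539 - hi I am Jeremy Howard from fast.ai and ts=4.62 - this is a hacker's guide to language
```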

Once we've done so, we can put it all together in the following script.

```python hl_lines="30-31 38-48"
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from youtube_transcript_api import YouTubeTranscriptApi

# Set up OpenAI client
client = instructor.from_openai(OpenAI())


class Chapter(BaseModel):
    start_ts: float = Field(
        ...,
        description="The start timestamp indicating when the chapter starts in the video.",
    )
    end_ts: float = Field(
        ...,
        description="The end timestamp indicating when the chapter ends in the video.",
    )
    title: str = Field(
        ..., description="A concise and descriptive title for the chapter."
    )
    summary: str = Field(
        ...,
        description="A brief summary of the chapter's content, don't use words like 'the speaker'",
    )


def get_youtube_transcript(video_id: str) -> list[str]:
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return [f"ts={entry['start']} - {entry['text']}" for entry in transcript]
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return []


def extract_chapters(transcript: str):
    return client.chat.completions.create_iterable(
        model="gpt-4o",  # You can experiment with different models
        response_model=Chapter,
        messages=[
            {
                "role": "system",
                "content": "Analyze the given YouTube transcript and extract chapters. For each chapter, provide a start timestamp, end timestamp, title, and summary.",
            },
            {"role": "user", "content": transcript},
        ],
    )


if __name__ == "__main__":
    transcripts = get_youtube_transcript("jkrNMKz9pWU")

    for transcript in transcripts[:2]:
        print(transcript)
        #> ts=0.539 - hi I am Jeremy Howard from fast.ai and
        #> ts=4.62 - this is a hacker's guide to language

    formatted_transcripts = " ".join(transcripts)
    chapters = extract_chapters(formatted_transcripts)

    for chapter in chapters:
        print(chapter.model_dump_json(indent=2))
        """
        {
          "start_ts": 0.539,
          "end_ts": 9.72,
          "title": "Introduction",
          "summary": "Jeremy Howard from fast.ai introduces the video, mentioning it as a hacker's guide to language models, focusing on a code-first approach."
        }
        """
        """
        {
          "start_ts": 9.72,
          "end_ts": 65.6,
          "title": "Understanding Language Models",
          "summary": "Explains the code-first approach to using language models, suggesting prerequisites such as prior deep learning knowledge and recommends the course.fast.ai for in-depth learning."
        }
        """
        """
        {
          "start_ts": 65.6,
          "end_ts": 250.68,
          "title": "Basics of Language Models",
          "summary": "Covers the concept of language models, demonstrating how they predict the next word in a sentence, and showcases OpenAI's text DaVinci for creative brainstorming with examples."
        }
        """
        """
        {
          "start_ts": 250.68,
          "end_ts": 459.199,
          "title": "How Language Models Work",
          "summary": "Dives deeper into how language models like ULMfit and others were developed, their training on datasets like Wikipedia, and the importance of learning various aspects of the world to predict the next word effectively."
        }
        """
        # ... other chapters
```
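Long videos can produce transcripts that exceed a model's context window. One simple adaptation — our own, not part of `instructor` — is to group the formatted entries into fixed-size chunks and run the extraction per chunk:

```python
def chunk_transcript(entries: list[str], chunk_size: int = 500) -> list[str]:
    """Group formatted transcript entries into chunks of at most chunk_size entries each."""
    return [
        " ".join(entries[i : i + chunk_size])
        for i in range(0, len(entries), chunk_size)
    ]


# 1200 mock entries split into chunks of 500 -> 500 + 500 + 200
entries = [f"ts={i} - line {i}" for i in range(1200)]
chunks = chunk_transcript(entries)
print(len(chunks))
#> 3
```

Each chunk can then be passed to `extract_chapters` in turn; you may want to merge chapters that straddle a chunk boundary afterwards.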

## Alternative Ideas

Now that we've seen a complete example of chapter extraction, let's explore some alternative ideas using different Pydantic models. These models can be used to adapt our YouTube transcript analysis for various applications.

### 1. Study Notes Generator

```python
from pydantic import BaseModel, Field
from typing import List


class Concept(BaseModel):
    term: str = Field(..., description="A key term or concept mentioned in the video")
    definition: str = Field(
        ..., description="A brief definition or explanation of the term"
    )


class StudyNote(BaseModel):
    timestamp: float = Field(
        ..., description="The timestamp where this note starts in the video"
    )
    topic: str = Field(..., description="The main topic being discussed at this point")
    key_points: List[str] = Field(..., description="A list of key points discussed")
    concepts: List[Concept] = Field(
        ..., description="Important concepts mentioned in this section"
    )
```

This model structures the video content into clear topics, key points, and important concepts, making it ideal for revision and study purposes.
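Once the model returns `StudyNote` objects, rendering them as a revision sheet is plain string work. The helper below is a sketch over plain dicts in the same shape, so you can see the output format without calling an LLM:

```python
def render_study_note(note: dict) -> str:
    # note mirrors the StudyNote fields: timestamp, topic, key_points, concepts
    lines = [f"## {note['topic']} (at {note['timestamp']:.1f}s)"]
    lines += [f"- {point}" for point in note["key_points"]]
    for concept in note["concepts"]:
        lines.append(f"**{concept['term']}**: {concept['definition']}")
    return "\n".join(lines)


# Hypothetical note, shaped like a StudyNote the model might return
note = {
    "timestamp": 65.6,
    "topic": "Language Models",
    "key_points": ["Predict the next word", "Trained on large text corpora"],
    "concepts": [
        {"term": "ULMFiT", "definition": "An early transfer-learning approach for NLP"}
    ],
}
print(render_study_note(note))
```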

### 2. Content Summarization

```python
from pydantic import BaseModel, Field
from typing import List


class ContentSummary(BaseModel):
    title: str = Field(..., description="The title of the video")
    duration: float = Field(
        ..., description="The total duration of the video in seconds"
    )
    main_topics: List[str] = Field(
        ..., description="A list of main topics covered in the video"
    )
    key_takeaways: List[str] = Field(
        ..., description="The most important points from the entire video"
    )
    target_audience: str = Field(
        ..., description="The intended audience for this content"
    )
```

This model provides a high-level overview of the entire video, perfect for quick content analysis or deciding whether a video is worth watching in full.

### 3. Quiz Generator

```python
from pydantic import BaseModel, Field
from typing import List


class QuizQuestion(BaseModel):
    question: str = Field(..., description="The quiz question")
    options: List[str] = Field(
        ..., min_items=2, max_items=4, description="Possible answers to the question"
    )
    correct_answer: int = Field(
        ...,
        ge=0,
        lt=4,
        description="The index of the correct answer in the options list",
    )
    explanation: str = Field(
        ..., description="An explanation of why the correct answer is correct"
    )


class VideoQuiz(BaseModel):
    title: str = Field(
        ..., description="The title of the quiz, based on the video content"
    )
    questions: List[QuizQuestion] = Field(
        ...,
        min_items=5,
        max_items=20,
        description="A list of quiz questions based on the video content",
    )
```

This model transforms video content into an interactive quiz, perfect for testing comprehension or creating engaging content for social media.
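As a quick illustration of how a `QuizQuestion` might be presented to a user, here's a small formatter over a plain dict in the same shape (the data is hypothetical):

```python
def render_question(q: dict) -> str:
    # q mirrors the QuizQuestion fields: question, options, correct_answer, explanation
    lines = [q["question"]]
    for i, option in enumerate(q["options"]):
        lines.append(f"  {i}. {option}")
    return "\n".join(lines)


question = {
    "question": "What does a language model predict?",
    "options": ["The next word", "The previous word", "Random words"],
    "correct_answer": 0,
    "explanation": "Language models are trained to predict the next token.",
}
print(render_question(question))
```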

To use these alternative models, you would replace the `Chapter` model in our original code with one of these alternatives and adjust the system prompt in the `extract_chapters` function accordingly.

## Conclusion

The power of this approach lies in its flexibility. By defining the result of our function calls as Pydantic models, we can quickly adapt the same code to a wide variety of applications, whether that's generating quizzes, creating study materials, or optimizing content for SEO.
96 changes: 96 additions & 0 deletions examples/youtube/run.py
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from youtube_transcript_api import YouTubeTranscriptApi
from rich.console import Console
from rich.table import Table
from rich.live import Live

client = instructor.from_openai(OpenAI())


class Chapter(BaseModel):
    start_ts: float = Field(
        ...,
        description="The start timestamp indicating when the chapter starts in the video.",
    )
    end_ts: float = Field(
        ...,
        description="The end timestamp indicating when the chapter ends in the video.",
    )
    title: str = Field(
        ..., description="A concise and descriptive title for the chapter."
    )
    summary: str = Field(
        ...,
        description="A brief summary of the chapter's content, don't use words like 'the speaker'",
    )


def get_youtube_transcript(video_id: str) -> str:
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(
            [f"ts={entry['start']} - {entry['text']}" for entry in transcript]
        )
    except Exception as e:
        print(f"Error fetching transcript: {e}")
        return ""


def extract_chapters(transcript: str):
    class Chapters(BaseModel):
        chapters: list[Chapter]

    return client.chat.completions.create_partial(
        model="gpt-4o",  # You can experiment with different models
        response_model=Chapters,
        messages=[
            {
                "role": "system",
                "content": "Analyze the given YouTube transcript and extract chapters. For each chapter, provide a start timestamp, end timestamp, title, and summary.",
            },
            {"role": "user", "content": transcript},
        ],
    )


if __name__ == "__main__":
    video_id = input("Enter a Youtube Url: ")
    video_id = video_id.split("v=")[1]
    console = Console()

    with console.status("[bold green]Processing YouTube URL...") as status:
        transcripts = get_youtube_transcript(video_id)
        status.update("[bold blue]Generating Clips...")
        chapters = extract_chapters(transcripts)

    table = Table(title="Video Chapters")
    table.add_column("Title", style="magenta")
    table.add_column("Description", style="green")
    table.add_column("Start", style="cyan")
    table.add_column("End", style="cyan")

    with Live(refresh_per_second=4) as live:
        for extraction in chapters:
            if not extraction.chapters:
                continue

            new_table = Table(title="Video Chapters")
            new_table.add_column("Title", style="magenta")
            new_table.add_column("Description", style="green")
            new_table.add_column("Start", style="cyan")
            new_table.add_column("End", style="cyan")

            for chapter in extraction.chapters:
                new_table.add_row(
                    chapter.title,
                    chapter.summary,
                    f"{chapter.start_ts:.2f}" if chapter.start_ts is not None else "",
                    f"{chapter.end_ts:.2f}" if chapter.end_ts is not None else "",
                )
                new_table.add_row("", "", "", "")  # Add an empty row for spacing

            live.update(new_table)

    console.print("\nChapter extraction complete!")
11 changes: 11 additions & 0 deletions tests/llm/test_openai/docs/test_posts.py
import pytest
from pytest_examples import find_examples, CodeExample, EvalExample


@pytest.mark.parametrize("example", find_examples("docs/blog/posts"), ids=str)
def test_index(example: CodeExample, eval_example: EvalExample):
    if eval_example.update_examples:
        eval_example.format(example)
        eval_example.run_print_update(example)
    else:
        eval_example.lint(example)