llama2.jl

Tired of low-level languages? Ever wanted to run inference on a baby Llama 2 model in pure Julia? Great news: you can now do so in under 300 lines of Julia.

This is a fork of Andrej's llama2.c, ported to Julia. For now the code is still slightly hacky. This README is heavily inspired by the Rust port llama.rs.

Don't want to read? Got ya back!

```
git clone https://github.com/juvi21/llama2.jl && cd llama2.jl && \
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin && \
julia jl_helpers/install_pkg.jl && \
julia run.jl stories15M.bin tokenizer.bin
```

How to run?

Grab Andrej's baby Llama2 (see the original instructions) pretrained on the TinyStories dataset:

```
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
```

Ensure you have the tokenizer binary - tokenizer.bin (if not, see tokenizer.py).
Run run.jl:

Single-threaded:
```
julia run.jl <model> <tokenizer> --temp [temperature]
```
Multi-Threaded: In Progress
CUDA: In Progress
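For readers wondering what the `--temp` value does, here is a minimal sketch of temperature sampling. It is not taken from run.jl: the function names `softmax` and `sample_token` are hypothetical, and it assumes the flag follows the llama2.c convention where a temperature of 0 means greedy (argmax) decoding.

```julia
using Random

# Numerically stable softmax over a vector of logits.
function softmax(logits::AbstractVector{<:Real})
    shifted = logits .- maximum(logits)   # subtract the max to avoid overflow in exp
    exps = exp.(shifted)
    return exps ./ sum(exps)
end

# Hypothetical helper: pick a 1-based token index from raw logits.
# Assumption: temperature == 0 means "always take the most likely token".
function sample_token(logits::AbstractVector{<:Real}, temperature::Real;
                      rng = Random.default_rng())
    temperature == 0 && return argmax(logits)
    probs = softmax(logits ./ temperature)  # lower temperature sharpens the distribution
    r = rand(rng)
    cdf = 0.0
    for (i, p) in enumerate(probs)
        cdf += p
        r < cdf && return i
    end
    return length(probs)                    # guard against floating-point rounding
end
```

Lower temperatures make the output more deterministic, while higher ones flatten the distribution and make the stories more varied.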

Performance

On my current workstation, performance is quite fast. However, I have been away visiting my parents for a few days, so I could only test on one of my first, less powerful machines. More testing is coming soon! NOTE: I compiled llama2.c with the command provided in Andrej's README, which is only the basic build to get started and not heavily optimized:

```
gcc -O3 -o run run.c -lm
```

| system | model | llama2.c | llama2.c -Ofast | llama2.jl |
| --- | --- | --- | --- | --- |
| Ubuntu 22.04 AMD Ryzen 2600 | stories15M.bin | 85.418752 tok/s | 189.591078 tok/s | 257.445516 tok/s |
| Ubuntu 22.04 AMD Ryzen 2600 | stories42M.bin | 30.761836 tok/s | 78.485688 tok/s | 92.567484 tok/s |
| Ubuntu 22.04 AMD Ryzen 2600 | stories110M.bin | 11.585283 tok/s | 30.375223 tok/s | 38.543434 tok/s |
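To put the numbers above in relative terms, the small sketch below computes how much faster llama2.jl ran than each llama2.c build. The throughput values are copied from the table. The field names are made up for illustration, and the plain llama2.c column is assumed to be the `gcc -O3` build mentioned above.

```julia
# Throughput in tok/s, copied from the benchmark table.
# c_o3    = llama2.c built with the basic gcc -O3 command (assumed)
# c_ofast = llama2.c built with -Ofast
# jl      = llama2.jl
results = [
    (model = "stories15M.bin",  c_o3 = 85.418752, c_ofast = 189.591078, jl = 257.445516),
    (model = "stories42M.bin",  c_o3 = 30.761836, c_ofast = 78.485688,  jl = 92.567484),
    (model = "stories110M.bin", c_o3 = 11.585283, c_ofast = 30.375223,  jl = 38.543434),
]

# Relative speedup of llama2.jl over each C build.
speedups = [(model = r.model,
             vs_o3 = r.jl / r.c_o3,
             vs_ofast = r.jl / r.c_ofast) for r in results]

for s in speedups
    println(s.model, ": ",
            round(s.vs_o3; digits = 2), "x vs -O3, ",
            round(s.vs_ofast; digits = 2), "x vs -Ofast")
end
```

These ratios come from a single machine and a single run per configuration, so treat them as a rough indication rather than a rigorous benchmark.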

Contributions

Join the dark side and code in Julia. Contributions are highly encouraged!

Contribution Ideas:

Make it faster.
Add CUDA support.
Introduce multi-threaded support.
Add custom prompt support.
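If you want to pick up the multi-threading idea, one possible starting point is to parallelize the matrix-vector products, which are the hot loop in llama2.c. The sketch below is only an illustration: `matvec_threaded!` is a hypothetical helper and not part of run.jl, and it assumes weights are stored as an ordinary Julia matrix.

```julia
using Base.Threads

# Hypothetical helper: computes out = W * x, splitting the rows of W across threads.
function matvec_threaded!(out::AbstractVector{T}, W::AbstractMatrix{T},
                          x::AbstractVector{T}) where {T<:Real}
    @assert size(W, 1) == length(out) && size(W, 2) == length(x)
    @threads for i in axes(W, 1)
        acc = zero(T)
        @inbounds for j in axes(W, 2)
            acc += W[i, j] * x[j]
        end
        out[i] = acc   # each iteration writes a distinct row, so there is no data race
    end
    return out
end

# Usage: start Julia with several threads, e.g. `julia --threads 4`.
W = Float32[1 2; 3 4]
x = Float32[1, 1]
out = zeros(Float32, 2)
matvec_threaded!(out, W, x)
```

Julia arrays are column-major, so walking along a row as done here is not cache-friendly. Storing the weights transposed, or simply benchmarking against the built-in `W * x`, would be worth trying before committing to this layout.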

Art

@Midjourney
