Curated collection of papers and resources on how to build an efficient KV Cache system for LLM inference serving.
The template is derived from Awesome-LLM-Reasoning. Still a work in progress.
- Long-Context Language Modeling with Parallel Context Encoding (ACL 2024)
  Howard Yen, Tianyu Gao, Danqi Chen [Paper] [Code], 2024.2
  Re-introduce an Encoder module to compress long contexts.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (EMNLP 2023)
  Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai [Paper], 2023.5
  Reuse the KV Cache across groups of attention heads.
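The head-sharing idea behind GQA can be sketched in a few lines: several query heads attend with the same cached K/V head, so the cache shrinks by the group size (8 query heads with 2 KV heads cache 4x less). A toy NumPy sketch, with illustrative names and sizes:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d) -- the whole KV cache.
    Each group of n_q_heads // n_kv_heads query heads shares one KV head.
    """
    n_q_heads = q.shape[0]
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # map query head -> shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_q_heads` this degenerates to standard multi-head attention, and with `n_kv_heads == 1` to multi-query attention; GQA interpolates between the two.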
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (Preprint)
  William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley [Paper], 2024.5
  Reuse the KV Cache across layers.
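A minimal sketch of the layer-sharing idea (the sharing factor and names here are illustrative, not the paper's exact scheme): with a factor of 2, each odd layer attends with the KV cache written by the even layer below it, so only half the layers ever allocate cache:

```python
def kv_producer(layer: int, sharing_factor: int = 2) -> int:
    """Return the layer whose KV cache `layer` attends with."""
    return (layer // sharing_factor) * sharing_factor

n_layers = 8
# Only producer layers allocate KV storage; consumer layers alias it.
kv_cache = {i: [] for i in range(n_layers) if kv_producer(i) == i}
print(sorted(kv_cache))  # layers 0, 2, 4, 6 hold cache; 1, 3, 5, 7 reuse it
```

The total cache size thus drops by the sharing factor, at the cost of consumer layers attending over keys and values produced by a different layer's projections.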
- You Only Cache Once: Decoder-Decoder Architectures for Language Models (Preprint)
  Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei [Paper] [Code], 2024.5
  Use a linear model to generate the KV Cache for all layers at once.
- GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (Preprint)
  Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah [Paper] [Code], 2024.7
  Use a linear model to generate the KV Cache for all layers at once, with even more aggressive compression.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (Preprint)
  DeepSeek-AI Team [Paper], 2024.5
  Instead of reusing, map the KV Cache into a compressed latent space.
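The latent-space idea can be sketched roughly as follows: cache only a small per-token latent vector and reconstruct full keys and values from it via up-projections when attending. A toy NumPy sketch under assumed dimensions (all names and sizes are illustrative; the actual Multi-head Latent Attention design has more moving parts, e.g. decoupled positional keys):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 8, 16  # toy sizes; real models are far larger

W_down = rng.standard_normal((d_model, d_latent)) * 0.1  # shared down-projection
W_uk = rng.standard_normal((d_latent, d_head)) * 0.1     # up-projection for keys
W_uv = rng.standard_normal((d_latent, d_head)) * 0.1     # up-projection for values

def step(x, latent_cache):
    """Cache only the small latent vector; rebuild K and V on the fly."""
    latent_cache.append(x @ W_down)        # stores d_latent floats per token
    C = np.stack(latent_cache)             # (seq, d_latent)
    return C @ W_uk, C @ W_uv              # reconstructed full K and V

cache = []
for _ in range(5):
    K, V = step(rng.standard_normal(d_model), cache)
```

Per token, the cache holds `d_latent` floats instead of separate key and value vectors per head, which is where the memory saving comes from.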
- Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)
  Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica [Paper] [Code], 2023.10
  Manage the KV Cache like paged virtual memory.
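The paging analogy can be sketched with a block table: logical token positions map to (physical block, offset) pairs, and blocks are allocated on demand from a shared pool, so no sequence needs a large contiguous buffer. A toy sketch with illustrative names and sizes:

```python
BLOCK = 16  # tokens per KV block, analogous to a page size

class BlockTable:
    """Toy paged KV cache bookkeeping for one sequence."""
    def __init__(self, free_pool):
        self.free = free_pool  # shared pool of free physical block ids
        self.blocks = []       # physical block id per logical block

    def slot(self, pos):
        """Return (physical_block, offset) for token position `pos`,
        allocating new blocks on demand."""
        while pos // BLOCK >= len(self.blocks):
            self.blocks.append(self.free.pop(0))
        return self.blocks[pos // BLOCK], pos % BLOCK

pool = list(range(1024))
seq_a, seq_b = BlockTable(pool), BlockTable(pool)
seq_a.slot(0)   # seq A grabs block 0
seq_b.slot(0)   # seq B grabs block 1 -- no reserved contiguous region per sequence
```

Internal fragmentation is bounded by one partially filled block per sequence, and freed blocks return to the pool for any other request.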
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition (ACL 2024)
  Lu Ye, Ze Tao, Yong Huang, Yang Li [Paper] [Code], 2024.2
  Use a prefix tree to manage and reuse the KV Cache across decoding requests.
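The prefix-tree idea can be sketched as a trie keyed on token chunks: requests that share a prompt prefix (a common system prompt, say) walk the same path and therefore share the KV blocks stored on it. A toy sketch with illustrative names:

```python
class PrefixNode:
    """Node of a toy prefix tree over token chunks."""
    def __init__(self):
        self.children = {}    # token chunk (tuple) -> PrefixNode
        self.kv_block = None  # placeholder for the chunk's cached KV

def insert(root, chunks):
    """Walk/extend the tree; return how many chunks hit an existing node,
    i.e. how many chunks of KV are reused instead of recomputed."""
    hits, node = 0, root
    for chunk in chunks:
        if chunk in node.children:
            hits += 1
        else:
            node.children[chunk] = PrefixNode()
        node = node.children[chunk]
    return hits
```

A second request arriving with the same system-prompt chunk scores a hit on every shared chunk, so only its unique suffix needs prefill.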
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (ATC 2024)
  Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo [Paper], 2024.3
  Manage and reuse the KV Cache in a hierarchical system spanning multiple storage devices.
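The hierarchical idea can be sketched as a tiered store: instead of discarding a conversation's KV Cache between turns, overflow from the fast tier is demoted to slower, larger tiers (e.g. HBM to DRAM to SSD) and fetched back on the next turn. A toy sketch; tier names, capacities, and the FIFO demotion policy are illustrative, not the paper's:

```python
class TieredKVStore:
    """Toy multi-tier KV store: demote on overflow instead of discarding."""
    def __init__(self, capacities=(2, 4)):  # e.g. HBM slots, DRAM slots
        self.tiers = [dict() for _ in capacities]
        self.caps = capacities

    def put(self, session, kv):
        self.tiers[0][session] = kv
        for i, cap in enumerate(self.caps):
            while len(self.tiers[i]) > cap:
                # demote the oldest entry to the next tier (or drop at the bottom)
                victim, v = next(iter(self.tiers[i].items()))
                del self.tiers[i][victim]
                if i + 1 < len(self.tiers):
                    self.tiers[i + 1][victim] = v

    def get(self, session):
        for tier in self.tiers:
            if session in tier:
                return tier[session]  # promotion back to tier 0 omitted for brevity
        return None
```

A returning session then costs one fetch from a slower device rather than a full prefill of the conversation history.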
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024)
  Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim [Paper], 2024.6
  Speculate which tokens matter for attention, then pre-fetch their KV entries from slower but larger memory.
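The speculation step can be sketched as scoring tokens with a cheap low-dimensional proxy of the keys kept in fast memory, then fetching only the top-k full KV entries from slower memory. All names and the sketching scheme below are illustrative, not the paper's actual mechanism:

```python
import numpy as np

def select_prefetch(query_sketch, key_sketches, k=4):
    """Toy speculative KV selection: rank tokens by approximate attention
    logits computed from small key sketches, and return the indices of the
    top-k tokens whose full KV should be pre-fetched."""
    scores = key_sketches @ query_sketch   # (n_tokens,) approximate logits
    return np.argsort(scores)[-k:][::-1]   # highest-scoring tokens first

rng = np.random.default_rng(0)
full_keys = rng.standard_normal((32, 64))  # resident in slow, large memory
sketches = full_keys[:, :8]                # cheap per-token proxy in fast memory
idx = select_prefetch(rng.standard_normal(8), sketches, k=4)
# attention then runs over only full_keys[idx], fetched ahead of time
```

Because attention mass concentrates on few tokens, fetching only the speculated set keeps quality while the bulk of the cache stays in cheap memory.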
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (Preprint)
  Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin [Paper], 2024.1
  Distribute attention computation and the KV Cache across instances to fully utilize a distributed inference cluster.
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (Preprint)
  Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu [Paper], 2024.6
  A KVCache-centric disaggregated architecture designed and used in production by a major AI company.
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory (Preprint)
  Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun [Paper], 2024.2
  Offload most of the KV Cache to CPU memory, keeping only a few representative Keys on GPU for retrieval.
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines (Preprint)
  Jiaao He, Jidong Zhai [Paper], 2024.3
  Utilize modern CPUs to assist with attention computation.
Work in progress.
| Field | Benchmarks |
|---|---|
| Efficiency | |
| Retrieval | |
| Reasoning | |
- Awesome-LLM-Reasoning Curated collection of papers and resources on how to unlock the reasoning ability of LLMs and MLLMs.
- Awesome-Controllable-Generation Collection of papers and resources on Controllable Generation using Diffusion Models.
- Chain-of-ThoughtsPapers A trend starts from "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".
- LM-reasoning Collection of papers and resources on Reasoning in Large Language Models.
- Prompt4ReasoningPapers Repository for the paper "Reasoning with Language Model Prompting: A Survey".
- ReasoningNLP Paper list on reasoning in NLP.
- Awesome-LLM Curated list of Large Language Models.
- Awesome LLM Self-Consistency Curated list of Self-consistency in Large Language Models.
- Deep-Reasoning-Papers Recent Papers including Neural-Symbolic Reasoning, Logical Reasoning, and Visual Reasoning.
- Add a new paper or update an existing paper, thinking about which category the work should belong to.
- Use the same format as existing entries to describe the work.
- Add the abstract link of the paper (`/abs/` format if it is an arXiv publication).
Don't worry if you do something wrong, it will be fixed for you!