# Large Language Menagerie

A Taxonomy of open-source and/or open-weights LLMs.

At the current pace of innovation, this table can quickly become outdated. Feel free to open PRs to improve it or keep it up to date.

| Model Name | Institution | First release date | Source Code License | Weights License | Dataset License | Dataset Size | Dataset Language(s) | Model Size | Base Model | Training modality | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Flan-T5-{} | Google | October 2022 | Apache 2.0 | Apache 2.0 | N/A | N/A | Many languages | 80M-11B | T5 | Instruction fine-tuned | |
| GPT4All | Nomic-ai | March 2023 | Apache License 2.0 | | | 800k examples | English | 11B Params | GPT-J | Instruction & dialog fine-tuned | |
| LLaMA | Meta | February 2023 | GPL-3 | Non-commercial bespoke license | N/A | >1T tokens | 20 languages | 7B, 13B, 33B, and 65B | N/A | Causal LM | First highly performant "small" LLM |
| Alpaca | Stanford | March 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | Claims CC BY-NC 4.0, but not clear it is! | 54k examples | English | 7B, 13B | LLaMA | Instruction fine-tuned | See Alpaca comment below |
| Vicuna | UC Berkeley, UCSD, CMU, MBZUAI | March 2023 | Apache License 2.0 | Apache License 2.0 (LLaMA weight diff) | N/A | 70k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See Vicuna comment below |
| Koala | BAIR (Berkeley) | April 2023 | Apache License 2.0 | Unclear [1] | N/A | >350k examples | N/A | 13B | LLaMA | Instruction & dialog fine-tuned | See Koala comment below |
| FastChat-T5 | UC Berkeley, UCSD, CMU, MBZUAI | April 2023 | Apache License 2.0 | Unclear [2] | N/A | 70k examples | N/A | 3B | Flan-T5-XL | Dialog fine-tuned | |
| Pythia | EleutherAI | April 2023 | Apache License 2.0 | Apache License 2.0 | "Open source" [3] | 300B tokens | Mostly English (though multiple languages) | 70M-12B | GPT-NeoX | Causal LM | |
| StableLM-Alpha | Stability AI | April 2023 | N/A | CC BY-SA 4.0 | N/A | 1.5T tokens | Mostly English (though multiple languages) | 3B-7B | GPT-NeoX | Causal LM | 4k context window; training code not available |
| Dolly-v2 | Databricks | March 2023 | Apache License 2.0 | Apache License 2.0 | CC BY-SA 3.0 | 15k examples | English | 3B-12B | Pythia | Instruction fine-tuned | Not state-of-the-art, but one of the first commercially licensed |
| Cerebras-GPT | Cerebras | March 2023 | N/A | Apache License 2.0 | "Open source" [3] | 300B tokens | Mostly English (though multiple languages) | 111M-13B | GPT-2 | Causal LM | Developed mostly as a demo of Cerebras's hardware capabilities; performance is sub-par |
| WizardLM-{7/13/15/30} | WizardLM | April 2023 | Apache License 2.0 | CC BY-NC 4.0 (LLaMA weight diff) | CC BY-NC 4.0 | 70k instructions | Mostly English | 7B, 13B, 15B, 30B | LLaMA | Causal LM | |
| MPT-7B | MosaicML | May 2023 | Apache License 2.0 | Apache License 2.0 | Various datasets [4] | 300B tokens | Mostly English (though multiple languages) | 7B | MPT | Causal LM | Commercially usable, comparable performance to LLaMA, up to 64k context length |
| Falcon-7B/40B | Technology Innovation Institute (UAE) | May 2023 | N/A | TII Falcon LLM License | Apache License 2.0 + other | 1T tokens | English, German, Spanish, French | 7B, 40B | N/A | Causal LM | Commercially usable up to capped revenues; top performer on the OpenLLM leaderboard as of the date of launch |
| Orca-13B | Microsoft Research | June 2023 | Unclear, but likely non-commercial | N/A | N/A | 6M examples | Mostly English | 13B | Presumably LLaMA | Causal LM | Pending publication of artifacts and their licenses, but likely restrictive since it seems to be based on LLaMA |

[1] Claimed to be "subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Any other usage of the model weights, including but not limited to commercial usage, is strictly prohibited."

[2] Uses ShareGPT data, which comes from users posting data from ChatGPT, whose terms and conditions are restrictive...

[3] The code to generate The Pile is MIT-licensed, and the data itself can be downloaded with no strings attached from here. But nowhere does it say what the actual license for the dataset is, other than claiming it is "open source".

[4] Includes a variety of data sources (mC4, C4, RedPajama, The Stack, Semantic Scholar), mostly public datasets built from public data, but each with a possibly different license.
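For readers who want to try one of the permissively licensed entries above, here is a minimal sketch of loading an open-weights checkpoint with the Hugging Face `transformers` library. It assumes the packages `transformers` and `torch` are installed and uses `google/flan-t5-small` purely as an example identifier; any other Apache-2.0-licensed model from the table could be substituted with the matching model class.

```python
# Minimal sketch: load an Apache-2.0-licensed open-weights model from the
# Hugging Face Hub and run a single instruction-style prompt.
# Assumption: "google/flan-t5-small" is used here only as an example of an
# openly licensed checkpoint from the table above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Translate to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```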

## Comments

- Alpaca: First evidence that a small, high-quality dataset can make a relatively small LLM competitive with much bigger models; a LLaMA fine-tune obtained at a cost of ~600 USD (dataset generation + training).
- Vicuna: Another LLaMA-based model, which GPT-4 grades better than ChatGPT and Alpaca. Fine-tuned at a cost of ~300 USD.
- Koala: Another LLaMA-based model, fine-tuned on a large, partially proprietary dialog dataset, with performance comparable to Alpaca according to human evaluators.
