
LlamaGPT

A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.
umbrel.com »

Demo

https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd

How to install

Install LlamaGPT on your umbrelOS home server

Running LlamaGPT on an umbrelOS home server takes just one click: simply install it from the Umbrel App Store.

LlamaGPT on Umbrel App Store

Install LlamaGPT anywhere else

You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.
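Before cloning the repo, you can verify that Docker and the Compose plugin are available. This is a minimal check, assuming Docker Engine v20.10+ with the Compose v2 plugin:

```shell
# Confirm Docker and the Compose plugin are installed before proceeding
if command -v docker >/dev/null 2>&1; then
  docker --version
  docker compose version
else
  echo "Docker is not installed; see https://docs.docker.com/get-docker/"
fi
```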

Then, clone this repo and cd into it:

git clone https://github.com/getumbrel/llama-gpt.git
cd llama-gpt

You can now run LlamaGPT with any of the following models, depending on your hardware:

Model size   Model used                             Minimum RAM required   How to start LlamaGPT
7B           Nous Hermes Llama 2 7B (GGML q4_0)     8GB                    docker compose up -d
13B          Nous Hermes Llama 2 13B (GGML q4_0)    16GB                   docker compose -f docker-compose-13b.yml up -d
70B          Meta Llama 2 70B Chat (GGML q4_0)      48GB                   docker compose -f docker-compose-70b.yml up -d

You can access LlamaGPT at http://localhost:3000.
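Besides the web UI, the stack's api service exposes a chat-completions endpoint you can query directly. The sketch below assumes an OpenAI-compatible API listening on port 3001; adjust the port and path if your compose file differs:

```shell
# Hypothetical example: send a chat request to the API service
# (port 3001 is an assumption -- check docker-compose.yml for the actual mapping)
PAYLOAD='{"messages":[{"role":"user","content":"How does the universe expand?"}]}'
curl --max-time 10 -s http://localhost:3001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "API not reachable; is the stack running?"
```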

To stop LlamaGPT, run (if you started it with a specific compose file, pass the same file via -f):

docker compose down

Benchmarks

We've tested LlamaGPT models on the following hardware, using the default system prompt and the user prompt "How does the universe expand?" at temperature 0 to guarantee deterministic results. Generation speed is averaged over the first 10 generations.
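The averaging described above can be sketched as a one-liner; the ten sample speeds below are hypothetical values, not real measurements:

```shell
# Average tokens/sec over 10 runs (sample values are made up for illustration)
printf '%s\n' 8.0 8.1 8.2 8.3 8.4 8.0 8.1 8.2 8.3 8.4 \
  | awk '{ sum += $1 } END { printf "%.1f tokens/sec\n", sum / NR }'
```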

Feel free to add your own benchmarks to this table by opening a pull request.

Nous Hermes Llama 2 7B (GGML q4_0)

Device                            Generation speed
M1 Max MacBook Pro (64GB RAM)     8.2 tokens/sec
Umbrel Home (16GB RAM)            2.7 tokens/sec
Raspberry Pi 4 (8GB RAM)          0.9 tokens/sec

Nous Hermes Llama 2 13B (GGML q4_0)

Device                            Generation speed
M1 Max MacBook Pro (64GB RAM)     3.7 tokens/sec
Umbrel Home (16GB RAM)            1.5 tokens/sec

Meta Llama 2 70B Chat (GGML q4_0)

Unfortunately, we don't have any benchmarks for this model yet. If you have one, please open a pull request to add it to this table.

Roadmap and contributing

We're looking to add more features to LlamaGPT. You can see the roadmap here. The highest priorities are:

  • Adding CUDA and Metal support.
  • Moving the model out of the Docker image and into a separate volume.
  • Updating the front-end to show model download progress and to allow users to switch between models.
  • Making it easy to run custom models.

If you're a developer who'd like to help with any of these, please open an issue to discuss the best way to tackle the challenge. If you're looking to help but not sure where to begin, check out these issues that have specifically been marked as being friendly to new contributors.

Acknowledgements

A massive thank you to the following developers and teams for making LlamaGPT possible:


License

See LICENSE.md for details.