QuIP#: SOTA 2 bit LLMs


Large language models (LLMs) exhibit amazing performance on a wide variety of tasks such as text modeling and code generation. However, they are also very large. For example, Llama 2 70B has 70 billion parameters, which require 140GB of memory to store in half precision. This presents many challenges, such as needing multiple GPUs just to serve a single LLM. To address these issues, researchers have developed compression methods that reduce the size of models without destroying performance.

One class of methods, post-training quantization, compresses trained model weights into lower precision formats to reduce memory requirements. For example, quantizing a model from 16 bit to 2 bit precision would reduce the size of the model by 8x, meaning that even Llama 2 70B would fit on a single 24GB GPU. In this work, we introduce QuIP#, which combines lattice codebooks with incoherence processing to create state-of-the-art 2 bit quantized models. These two methods allow QuIP# to significantly close the gap between 2 bit quantized LLMs and unquantized 16 bit models.
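The memory arithmetic above can be sketched with a short back-of-the-envelope calculation (the parameter count for Llama 2 70B comes from the text; real deployments add overhead for activations, the KV cache, and quantization metadata, which this ignores):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

llama2_70b = 70e9  # parameters, per the text

fp16 = model_size_gb(llama2_70b, 16)  # half precision: ~140 GB
int2 = model_size_gb(llama2_70b, 2)   # 2 bit: ~17.5 GB, fits on a 24GB GPU

print(f"16 bit: {fp16:.1f} GB, 2 bit: {int2:.1f} GB, ratio: {fp16 / int2:.0f}x")
```

This is where the 8x figure comes from: 16 bits per weight divided by 2 bits per weight, independent of the parameter count.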

Project Page: https://cornell-relaxml.github.io/quip-sharp/

Code: https://github.com/Cornell-RelaxML/quip-sharp