A 1-bit LLM performs comparably to full-precision Transformer LLMs with the same model size and number of training tokens, but is much more efficient in terms of latency, memory, throughput, and energy consumption.
Strictly speaking it’s 1.58 bits: the weights are ternary {-1, 0, +1}, and log2(3) ≈ 1.58. Adding 0 to the original BitNet’s {-1, +1} weight set was the significant change/improvement in this experiment, since a zero weight lets the model drop a feature outright. The paper isn’t too dense and has some decent tables that explain things fairly accessibly.
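The quantization itself is simple: the paper describes an absmean scheme where each weight matrix is scaled by its mean absolute value, then rounded and clipped into {-1, 0, +1}. Here is a minimal NumPy sketch of that idea (the function name and example values are mine, not from the paper):

```python
import numpy as np

def absmean_ternary_quantize(W: np.ndarray, eps: float = 1e-6):
    """Quantize a weight matrix to ternary values {-1, 0, +1}.

    Sketch of the absmean scheme described in the BitNet b1.58 paper:
    scale by the mean absolute weight, then round and clip to [-1, 1].
    Returns the ternary weights plus the scale for dequantization.
    """
    gamma = np.mean(np.abs(W)) + eps                 # per-tensor scale (absmean)
    W_ternary = np.clip(np.rint(W / gamma), -1, 1)   # RoundClip(W / gamma, -1, 1)
    return W_ternary.astype(np.int8), gamma

# Small weights collapse to 0, large ones to +/-1.
W = np.array([[0.9, -0.05, 0.4],
              [-1.2, 0.02, -0.6]])
Wq, gamma = absmean_ternary_quantize(W)
print(Wq)                              # ternary matrix of -1 / 0 / +1
print(Wq.astype(np.float32) * gamma)   # rough reconstruction of W
```

Since each weight takes one of three values, it carries log2(3) ≈ 1.58 bits of information, which is where the 1.58 in the name comes from.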