Isn't that super slow? I mean, that could end up slower than running llama.cpp on the CPU, couldn't it? (If you're constantly transferring layers from SSD to RAM and then over the PCIe bus into the GPU...)
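As a rough sanity check on why re-streaming weights every token can lose to plain CPU inference, here's a back-of-envelope sketch. All the numbers are assumptions picked for illustration (model size, SSD bandwidth, CPU throughput), not measurements:

```python
# Back-of-envelope: streaming all weights from SSD per token vs. CPU-only.
# Assumed numbers (hypothetical): ~40 GB of quantized weights for a large
# model, an NVMe SSD sustaining ~3 GB/s reads, and a CPU managing ~2 tok/s.

model_bytes = 40e9   # assumed: ~40 GB of quantized weights
ssd_bps     = 3e9    # assumed: ~3 GB/s sustained NVMe read

# If every layer must be re-read from SSD for each generated token,
# the transfer time alone per token is:
ssd_seconds_per_token = model_bytes / ssd_bps
print(f"SSD streaming: ~{ssd_seconds_per_token:.1f} s/token (transfers only)")

cpu_tokens_per_second = 2.0  # assumed CPU-only throughput
print(f"CPU-only:      ~{1 / cpu_tokens_per_second:.1f} s/token")
```

Under these assumptions the SSD transfers alone cost over 13 seconds per token, an order of magnitude slower than the assumed CPU baseline, and that's before counting RAM-to-GPU copies over PCIe. Caching hot layers in RAM or VRAM shrinks the gap, but the point stands: it trades speed for the ability to run at all.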
I expect so, but as we start to get more agents capable of doing jobs without hand-holding, there are some jobs where time isn't as important as capability. You could potentially run a very powerful model on a GPU with only 24 GB of memory.