Isn't that super slow? I mean, that could end up slower than running llama.cpp on the CPU, couldn't it? (If you're constantly transferring layers from SSD to RAM and then over the PCIe bus into the GPU...)
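As a rough sanity check on why re-streaming weights every token can lose to plain CPU inference, here's a back-of-envelope sketch. All the numbers are assumptions picked for illustration (model size, SSD bandwidth, CPU throughput), not measurements:

```python
# Back-of-envelope: streaming all weights from SSD per token vs. CPU-only.
# Assumed numbers (hypothetical): ~40 GB of quantized weights for a large
# model, an NVMe SSD sustaining ~3 GB/s reads, and a CPU managing ~2 tok/s.

model_bytes = 40e9   # assumed: ~40 GB of quantized weights
ssd_bps     = 3e9    # assumed: ~3 GB/s sustained NVMe read

# If every layer must be re-read from SSD for each generated token,
# the transfer time alone per token is:
ssd_seconds_per_token = model_bytes / ssd_bps
print(f"SSD streaming: ~{ssd_seconds_per_token:.1f} s/token (transfers only)")

cpu_tokens_per_second = 2.0  # assumed CPU-only throughput
print(f"CPU-only:      ~{1 / cpu_tokens_per_second:.1f} s/token")
```

Under these assumptions the SSD transfers alone cost over 13 seconds per token, an order of magnitude slower than the assumed CPU baseline, and that's before counting RAM-to-GPU copies over PCIe. Caching hot layers in RAM or VRAM shrinks the gap, but the point stands: it trades speed for the ability to run at all.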
I expect so, but as we start to get more agents capable of doing jobs without hand-holding, there are some jobs where time isn't as important as capability. You could potentially run a very powerful model on a GPU with only 24 GB of memory.