
New "Context Shifting" feature in KoboldCPP 1.48

github.com Release koboldcpp-1.48 · LostRuins/koboldcpp

koboldcpp-1.48 Harder Better Faster Stronger Edition NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from con...


"This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing."

This means a major speed increase for people like me who rely on (slow) CPU inference (or big models). Consider a chatbot scenario: in a long chat, old lines of dialogue need to be evicted from the context to stay within the context size (4096 tokens). Previously, the context had to be re-computed starting with the first changed or now-missing token. This feature detects that situation, deletes the affected tokens from the KV cache and shifts the subsequent tokens in the KV cache so it can be re-used, avoiding a computationally expensive re-calculation.
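To make the idea concrete, here is a rough Python sketch of the detection step as I understand it. This is not KoboldCPP's actual code: the function name `plan_context_shift` and the plain token-ID lists are made up for illustration, and a real backend works on the KV tensors themselves and also has to correct the positions of the shifted entries, which this toy ignores.

```python
from typing import List, Optional, Tuple


def plan_context_shift(
    cached: List[int], new: List[int]
) -> Optional[Tuple[int, int, int]]:
    """Hypothetical helper: decide whether the cache can be shifted.

    Returns (keep, evict, reuse) when `new` looks like `cached` with one
    contiguous block of `evict` tokens removed after a shared prefix of
    `keep` tokens, and the next `reuse` tokens unchanged. Returns None
    when no such eviction explains the difference.
    """
    # Length of the shared prefix (e.g. system prompt / character card).
    keep = 0
    while keep < len(cached) and keep < len(new) and cached[keep] == new[keep]:
        keep += 1

    # Try every possible size for the evicted block, smallest first.
    for evict in range(1, len(cached) - keep + 1):
        tail = cached[keep + evict:]
        # The surviving tail must reappear unchanged right after the prefix.
        if tail and new[keep:keep + len(tail)] == tail:
            return keep, evict, len(tail)
    return None


# Token 0 stands in for a fixed prompt; 1..7 are old chat tokens.
cached_tokens = [0, 1, 2, 3, 4, 5, 6, 7]
# The frontend dropped tokens 1 and 2 and appended new tokens 8 and 9.
new_tokens = [0, 3, 4, 5, 6, 7, 8, 9]

plan = plan_context_shift(cached_tokens, new_tokens)
if plan is not None:
    keep, evict, reuse = plan
    # Only the tokens after the reused region still need to be processed.
    to_process = new_tokens[keep + reuse:]
    print(f"keep={keep} evict={evict} reuse={reuse} process={to_process}")
```

In this toy, only the two new tokens need to be processed instead of everything after the prompt. If a token is inserted into the middle of the history instead, no single evicted block explains the difference, the function returns `None`, and the usual re-computation is needed.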

This is probably also more or less related to recent advancements like StreamingLLM.

This won't help once text gets inserted "in the middle" or the prompt gets changed in some other way. But I managed to connect KoboldCPP as a backend for SillyTavern/Oobabooga, and now I'm able to have unlimited-length conversations without waiting excessively once the chat history hits max tokens and the frontend starts dropping text.

It's just a clever way to re-use the KV cache in one specific case. But I've wished for this for quite some time.
