Skip Navigation

What's the deal with LlamaCPP and caching?

I'm curious what it is doing from a top down perspective.

I've been playing with a 70B chat model that has several datasets on top of Llama2. There are some unusual features somewhere in this LLM and I am not sure what was trained versus (unusual layers?). The model has built in roleplaying stories I've never seen other models perform. These stories are not in the Oobabooga Textgen WebUI. The model can do stuff like a Roman Gladiator, and some NSFW stuff. These are not very realistic stories and play out with the depth of a child's videogame. They are structured rigidly like they are coming from a hidden system context.

Like with the gladiators story it plays out like Tekken on the original PlayStation. No amount of dialogue context about how real gladiators will change the story flow. Like I tried modifying by adding how gladiators were mostly nonlethal fighters and showmen more closely aligned with the wrestler-actors that were popular in the 80's and 90's, but no amount of input into the dialogue or system contexts changed the story from a constant series of lethal encounters. These stories could override pretty much anything I added to system context in Textgen.

There was one story that turned an escape room into objectification of women, and another where name-1 is basically like a Loki-like character that makes the user question what is really happening by taking on elements in system context but changing them slightly. Like I had 5 characters in system context and it shifted between them circumstantially in a story telling fashion that was highly intentional with each shift. (I know exactly what a bad system context can do, and what errors look like in practice, especially with this model. I am 100% certain these are either (over) trained or programic in nature. Asking the model to generate a list of built in roleplaying stories creates a similar list of stories the couple of times I cared to ask. I try to stay away from these "built-in" roleplays as they all seem rather poorly written. I think this model does far better when I write the entire story in system context. One of the main things the built in stories do that surprise me is maintaining a consistent set of character identities and features throughout the story. Like the user can pick a trident or gladius, drop into a dialogue that is far longer than the batch size and then return with the same weapon in the next fight. Normally, I expect that kind of persistence would only happen if the detail was added to the system context.

Is this behavior part of some deeper layer of llama.cpp that I do not see in the Python version or Textgen source, like is there an additional persistent context stored in the cache?

15
15 comments
You've viewed 15 comments.