  • What's the deal with LlamaCPP and caching?
  • Without knowing anything about this model or what it was trained on or how it was trained, it's impossible to say exactly why it displays this behavior. But there is no "hidden layer" in llama.cpp that allows for "hardcoded"/"built-in" content.

    It is absolutely possible for the model to "override pretty much anything in the system context". Consider any regular "censored" model, and how any attempt at adding system instructions to change/disable this behavior is mostly ignored. This model is probably doing much the same thing except with a "built-in story" rather than a message that says "As an AI assistant, I am not able to ...".

    As I say, without knowing anything more about what model this is or what the training data looked like, it's impossible to say exactly why/how it has learned this behavior or even if it's intentional (this could just be a side-effect of the model being trained on a small selection of specific stories, or perhaps those stories were over-represented in the training data).

  • Nearly 10% of people ask AI chatbots for explicit content. Will it lead LLMs astray? [Article from October 3]
  • IMO, local LLMs lack the capabilities or depth of understanding to be useful for most practical tasks (e.g. writing code, automation, language analysis). This will heavily skew any local LLM "usage statistics" further towards RP/storytelling (a significant proportion of which will always be NSFW in nature).

  • Nearly 10% of people ask AI chatbots for explicit content. Will it lead LLMs astray? [Article from October 3]
  • The Stable Diffusion 2 base model was trained using what we would today refer to as a "censored" dataset. The Stable Diffusion 1 dataset included NSFW images; the base model doesn't seem particularly biased towards or away from them, and it can be further trained in either direction as it has the foundational understanding of what those things are.

  • Deleted
    *Permanently Deleted*
  • So... If this doesn't actually increase the context window or otherwise increase the amount of text that the LLM is actually able to see/process, then how is it fundamentally different to just "manually" truncating the input to fit in the context size like everyone's already been doing?

  • Musical notation
  • I tried getting it to write out a simple melody using MIDI note numbers once. I didn't think of asking it for LilyPond format; I couldn't think of a text-based format for music notation at the time.

    It was able to produce a mostly accurate output for a few popular children's songs. It was also able to "improvise" a short blues riff (mostly keeping to the correct scale, and showing some awareness of/reference to common blues themes), and write an "answer" phrase (which was suitable and made musical sense) to a prompt phrase that I provided.
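
    (For reference, "MIDI note numbers" here just means the integer pitch values, so a melody becomes a plain list of numbers that the model can read and write as ordinary text. A hypothetical illustration of the format, not the model's actual output:)

    # Opening phrase of "Twinkle Twinkle Little Star" as MIDI note numbers (60 = middle C).
    melody = [60, 60, 67, 67, 69, 69, 67]    # C C G G A A G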

  • Mistral 7B model
  • To be honest, the same could be said of LLaMa/Facebook (which doesn't particularly claim to be "open", but I don't see many people criticising Facebook for doing a potential future marketing "bait and switch" with their LLMs).

    They're only giving these away for free because they aren't commercially viable. If anyone actually develops a leading-edge LLM, I doubt they will be giving it away for free regardless of their prior "ethics".

    And the chance of a leading-edge LLM being developed by someone other than a company with prior plans to market it commercially is quite small, as they wouldn't attract the same funding to cover the development costs.

  • Mistral 7B model
  • IMO the availability of the dataset is less important than the model, especially if the model is under a license that allows fairly unrestricted use.

    Datasets aren't useful to most people and carry more risk of a lawsuit or of being ripped off by a competitor than the model does. Publishing a dataset with copyrighted content is legally grey at best, while the jury is still out regarding a model trained on that dataset, and the model also carries with it some short-term plausible deniability.

  • [Help] Trying to run a local Story telling model with KoboldCpp
  • Yeah, I think you need to set the contextsize and ropeconfig. Documentation isn't completely clear and in some places sort of implies that it should be autodetected based on the model when using a recent version, but the first thing I would try is setting these explicitly as this definitely looks like an encoding issue.

  • [Help] Trying to run a local Story telling model with KoboldCpp
  • I would guess that this is possibly an issue due to the model being a "SuperHOT" model. This affects the way that the context is encoded and if the software that uses the model isn't set up correctly for it you will get issues such as repeated output or incoherent rambling with words that are only vaguely related to the topic.

    Unfortunately I haven't used these models myself so I don't have any personal experience here, but hopefully this is a starting point for your searches. Check out the contextsize and ropeconfig parameters (see the example below); if you are using the wrong context size or scaling factor then you will get incorrect results.

    It might help if you posted a screenshot of your model settings (the screenshot that you posted is of your sampler settings). I'm not sure if you configured this in the GUI or if the only model settings that you have are the command-line ones (which are all defaults and probably not correct for an 8k model).
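
    For example, something along these lines (I'm guessing at the exact values since I haven't used SuperHOT models myself; an 8k SuperHOT model normally wants a linear RoPE scale, but check the model card for the correct numbers):

    python koboldcpp.py your-superhot-8k-model.ggmlv3.bin --contextsize 8192 --ropeconfig 0.25 10000

    where --ropeconfig takes the RoPE scale followed by the RoPE base.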

  • Pygmalion-2 has been released
  • TBH my experience with SillyTavern was that it merely added another layer of complexity/confusion to the prompt formatting/template experience, as it runs on top of text-generation-webui anyway. It was easy for me to end up with configurations where e.g. the SillyTavern turn template would be wrapped inside the text-generation-webui one, and it is very difficult to verify what the prompt actually looks like by the time it reaches the model as this is not displayed in any UI or logs anywhere.

    For most purposes I have given up on any UI/frontend and I just work with llama-cpp-python directly. I don't even trust text-generation-webui's "notebook" mode to use my configured sampling settings or to not insert extra end-of-text tokens or whatever.

  • How usable are AMD GPUs?
  • Yes, that makes more sense. I was concerned initially that you were looking to buy a new GPU with more VRAM solely because you were unable to do something that you should already be able to do, and that this would be an unnecessary spend of money and/or not actually fix the problem, and that you would be somewhat mad at yourself if you found out afterwards that "oh, I just needed to change this setting".

  • How usable are AMD GPUs?
  • Fair enough, but if your baseline for comparison is wrong then you can't make good assessments of the capabilities of different GPUs. And it's possible that you don't actually need a new GPU/more VRAM anyway, if your goal is to generate 1024x1024 in Stable Diffusion and run a 13B LLM, both of which I can do with 8 GB of VRAM.

  • Pygmalion-2 has been released
  • text-generation-webui "chat" and "chat-instruct" modes are... weird and badly documented when it comes to using a specific prompt template. If you don't want to use the notepad mode, use "instruct" mode and set your turn template with the required tags and include your system prompt in the context (? I forget what it is labeled as) box.

    EDIT: Actually I think text-generation-webui might use <|user|> as a special string to mean "substitute the user prefix set in the box directly above the turn template box". Why they have to have a turn template field with "macro" functionality and then separate fields for user and bot prefixes when you could just... put the prefix directly in the turn template I have no idea. It's not as though you would ever want or need to change one without the other anyway. But it's possible that as a result of this you can't actually use <|user|> itself in the turn template...
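
    If that guess is right, the expansion would amount to something like this (a hypothetical illustration of the behavior, not the actual text-generation-webui code):

    # Hypothetical illustration: expand the turn template macros using the separate
    # user/bot prefix fields, then drop the actual messages into place.
    turn_template = "<|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n"
    user_prefix = "### User:"    # the "user string" field
    bot_prefix = "### Response:"    # the "bot string" field

    turn = (turn_template
            .replace("<|user|>", user_prefix)
            .replace("<|bot|>", bot_prefix)
            .replace("<|user-message|>", "Hello!")
            .replace("<|bot-message|>", "Hi! How can I help you today?"))

    which would also explain why a literal <|user|> placed in the turn template itself might not survive the substitution.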

  • How usable are AMD GPUs?
  • What sort of issues are you getting trying to generate 1024x1024 images in Stable Diffusion? I've generated up to 1536x1024 without issue on a 1070 (although it takes a few minutes) and could probably go even larger (this was in img2img mode which uses more VRAM as well - although at that size you usually won't get good results with txt2img anyway). What model are you using?

  • What is your favorite offline LLM for technical utility, and have you noticed anything unexpected about certain models?
  • > I’m slowly working my way into the codebase for Textgen, and will hopefully get to the point where I can directly use the command line for prompting.

    The llama.cpp Python API is super simple to use and you don't need to dig into the text-generation-webui codebase at all. Literally just:

    import llama_cpp_cuda as llama_cpp    # use llama_cpp_cuda version for support for running GGML models on the GPU
    
    model = llama_cpp.Llama(model_path="", seed=-1, n_ctx=2048, n_gpu_layers=28, low_vram=True)    # use whatever settings here that you would set in text-generation-webui when loading the model, make sure to include n_gqa=8 when using LLaMa v2 70B model
    
    # now you can either do things with the "all-in-one" API...
    completion = model.create_completion(prompt, max_tokens=200, temperature=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)    # you pass your temperature, top_p, top_k, etc. settings here, these are the same as the settings in text-generation-webui, note that you don't need to pass all the parameters e.g. you can leave out the mirostat parameters if you aren't using mirostat mode
    text = completion["choices"][0]["text"]    # create_completion() returns an OpenAI-style dict, so pull the generated text out of it
    
    # ...or the "manual" way
    prompt_tokens = model.tokenize(prompt.encode('utf-8'))
    model.reset()
    model.eval(prompt_tokens)
    generated_tokens = []
    while True:
        next_token = model.sample(temp=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, frequency_penalty=0.0, presence_penalty=0.0, tfs_z=1.0, mirostat_mode=0, mirostat_tau=5.0, mirostat_eta=0.1)
        if next_token != model.token_eos():
            generated_tokens.append(next_token)
            model.eval([next_token])
        else:
            break
    text = model.detokenize(generated_tokens).decode('utf-8')    # detokenize() takes the list of tokens directly
    

    See the documentation at https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ for more information. You only really need to pay attention to __init__(), tokenize(), detokenize(), reset(), eval(), sample(), and generate(). create_completion() provides an "all-in-one" wrapper around eval/sample/generate that is intended to be (loosely) compatible as a drop-in replacement for the OpenAI Python library. create_chat_completion() is likewise intended to be a replacement for OpenAI, but if you want direct control over the prompt format then ignore it entirely (it's not even documented exactly how the prompt is formatted when using this function...).

    > Do you happen to know if the prompt processing differences in Textgen, and others like Kobold, are all arbitrary processing done before llama.cpp is called (or some similar code), or is there some other API level that more complex character prompts are tapping into?

    They are not doing anything special with the model (no fancy API or anything). All they are doing is including some extra text before your input that describes the characters, scene, etc., possibly along with a direct instruction to roleplay as that character, and then sending that assembled prompt to the model/backend API as you would with any other text. Unfortunately the documentation isn't particularly transparent about how the extra text is included (with regards to the exact formatting used, what order things appear in, etc.), and neither do the logs produced by e.g. text-generation-webui include the actual raw prompt as seen by the model.

    > I’m aware I’m blindly walking into this space with my arms out trying to find the walls; aware, but unworried about giant potential holes in the floor.

    The key point to understand here is that all current LLMs (this may change in the future) work only with raw text. They take in some text and then generate other text that goes after it. Any more complex applications such as conversation are just layers built on top of this. The conversation is turned into a plain-text transcript that is sent to the model. The model generates the next part of the conversation transcript, which is then parsed back out and appended to the list of conversation messages. From the model's perspective, it's all just one continuous stream of raw text. You can always achieve exactly the same results by manually constructing the same prompt yourself and passing it directly to the model.

    For example, if I pass the following string as the prompt into model.create_completion() from above:

    "### User:\nPlease can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.\n\n### Response:\n"

    I will get exactly the same result as if I used instruct mode in text-generation-webui with ### User: as the user string, ### Response: as the bot string, and <|user|>\n<|user-message|>\n\n<|bot|>\n<|bot-message|>\n\n as the turn template, and then sent the message "Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block." in the chat box.

    (Although imo doing it the manual way is less error-prone and guaranteed to give me exactly the prompt that I think I should be getting, noting that text-generation-webui doesn't give me any way at all to actually verify that the prompt seen by the model is actually the way I intended it to be and it's not as though I haven't encountered UI bugs before where the produced formatting doesn't match what I entered...)
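
    To make that concrete, here is a minimal sketch using the model object from earlier (the build_prompt() helper is just for illustration, not part of any library):

    # Hypothetical helper: flatten a list of (role, text) conversation turns into one
    # raw-text prompt using the "### User:" / "### Response:" template from above.
    def build_prompt(messages):
        prompt = ""
        for role, text in messages:
            prefix = "### User:" if role == "user" else "### Response:"
            prompt += prefix + "\n" + text + "\n\n"
        prompt += "### Response:\n"    # leave the final heading open for the model to continue
        return prompt

    conversation = [("user", "Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block.")]
    completion = model.create_completion(build_prompt(conversation), max_tokens=512)
    text = completion["choices"][0]["text"]

    Each new user message and model reply just gets appended to the list and the whole thing re-flattened; there is no separate "conversation" API underneath.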

    > Like if the reply changes writing perspective context arbitrarily, I need to recall the last question, alter it, and regenerate.

    You don't necessarily need to alter your question in that case; often just regenerating is enough to "fix" this. This is, as I have said, particularly an issue with the LLaMa 2 non-chat models as they aren't specifically trained to follow a conversation, so sometimes they will arbitrarily decide to provide a commentary or reaction to the conversation, or treat the conversation as part of a webpage and try to generate a heading for the next part of an article, or some other such seemingly-"random" behavior, instead of continuing the conversation itself. If that happens, just regenerate the response until the RNG works out in your favor and the model starts writing in the correct role. Once it starts writing a particular "type" of output it will generally keep writing in the same role until it has finished.

    Sometimes it is also helpful to write the first part of the response yourself. For example, you could write "Sure! Here is a program that does <summary>" (try to copy the particular style used by a particular model) and then let the model continue from there (there's an option in text-generation-webui labeled "Start reply with" that does this, or if you're constructing the prompt yourself then this is trivial to accomplish - make sure to not include a space or newline after the part that you've written). This will make it more likely to write a program for you instead of providing a commentary like "The user has asked the assistant to write a program. It is possible that someone may respond to such a request by ...".
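
    If you're constructing the prompt yourself, "starting the reply" just means appending your text after the response heading before calling the model, along these lines (a sketch reusing the format from above):

    # Prefill the start of the response so the model continues it rather than
    # commenting on the request. Note: no trailing space or newline after the prefill.
    prompt = (
        "### User:\n"
        "Please can you write a program in Python that will split a file into "
        "19200-byte blocks and calculate the SHA256 hash of each block.\n\n"
        "### Response:\n"
        "Sure! Here is a program that"
    )
    completion = model.create_completion(prompt, max_tokens=512)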

    > If the reply is the same, I know the context tokens are ruined.

    This seems to be (sort of) a known issue with LLaMa 2 specifically, where it will keep regenerating the previous response even though you continue the conversation. It's not exactly clear what causes this; it's not a software bug in the traditional sense. The model is receiving your follow-up message but it's just deciding to repeat whatever it said last time instead of saying something different. This is believed to possibly be an issue with how the training data was formatted.

    This might make more sense if you think of this in terms of what the model is seeing. The model is seeing something such as the following:

    ### User:
    Please can you write a program in Python that will split a file into 19200-byte blocks and calculate the SHA256 hash of each block. The hash should be written to a file with the name ".blockhashes." (index is padded to 5 digits).
    
    ### Response:
    Certainly! Here's an example program that does what you described:
    
    [33-line code snippet removed]
    
    This program takes two arguments: the input file and the output directory. It first calculates the number of blocks needed to store the entire file, and then loops over each block, reading it from the input file and calculating its SHA256 hash. The hash is written to a separate file with the format `.blockhashes.`.
    
    I hope this helps! Let me know if you have any questions or need further clarification.
    
    ### User:
    Please can you fix the following two issues with your program:
    
    * The output filename must have the block index padded to 5 digits.
    
    * The output file must contain only the SHA256 hash in hex form and no other text/contents.
    
    Please write out only the parts of the program that you have changed.
    
    ### Response:
    

    At this point, the model sees the heading ### Response:. For some reason, the LLaMa 2 models have an over-tendency to refer back in the text, notice that last time ### Response: was followed by Certainly! Here's an example program that does what you described:, and repeat that exact same text again, because the model has concluded that ### Response: should now always be followed by that exact sentence instead of seeing the higher-level view where ### User: and ### Response: are taking turns in a conversation.

    If this happens, you don't always need to clear/reset the conversation. Often, you can just regenerate it a few times, and once the model starts writing a different response it will continue into something else other than repeating the same text as before. As with the previous point, it can also help if you write the first part of the response yourself to force it to say something different.

  • What is your favorite offline LLM for technical utility, and have you noticed anything unexpected about certain models?
  • I haven't got any experience with the 70B version specifically, but based on my experience with LLaMa 2 13B (still annoyed that there's no 30B version of v2...) it is more sensitive to prompting variations than other models as it isn't specifically trained for "chat", "instruct", or "completion" style interactions. It is capable of all three, but without using a clear prompt and template it can be somewhat random as to what kind of response you will get.

    For example, using

    ### User:
    Please write an article about [subject].
    
    ### Response:
    

    as the prompt will get results varying from a written article to "The user's response to an article about [subject] is" to "My response to this request is to ask the user about [clarifying questions]" to "One possible counterargument to an article about [subject] is" to literally the text "Generating response, please wait... [random URL]". Whereas most conversationally-fine-tuned models will understand and follow this template or other similar templates and play their side of the conversation even if it doesn't match exactly what they were trained on.

    I would recommend using llama.cpp (or the Python binding) directly for more awareness of and control over the exact prompt text as seen by the model. Or using text-generation-webui in "notebook" mode (which just gives you a blank text box that both you and the LLM will type into and it's up to you to provide the prompt format). This will also avoid any formatting issues with the chat view in text-generation-webui (again I don't have any specific experience with LLaMa 2 70B but I have encountered times when models don't output the markdown code block tags and text-generation-webui will mess up the formatting).

    Note that, for some reason, the chat, instruct, and chat-instruct modes in text-generation-webui are confusingly named. instruct mode does not include an "instruction" (e.g. "Continue the conversation") before the conversation unless you include one in the conversation template (the conversation template is referred to as "Instruction template" in the UI). chat-instruct mode includes an instruction such as "Continue the conversation by writing a single response for Assistant" before the conversation, followed by the conversation template. chat and chat-instruct modes also include text that describes the character that the model will speak as (mostly used for roleplay, but the default "None" character describes a generic AI assistant character). It is possible that the inclusion of this text is what is helping LLaMa 2 stay on track in your case and understand that it is participating in a conversation. I'm not sure what conversation template chat mode uses, but afaik it is not the same turn template as set in instruct and chat-instruct modes, and I don't see an option to configure it anywhere.

  • A note on the importance of prompt and template formatting - as seen from starcoder
  • I have also encountered "rate limits" where the request is not dropped/errored out but is simply stalled until the timeout expires.

    Usually this happens in a client library, though, rather than over the network itself: the library blocks the thread until it knows that the rate limit is due to expire before issuing the request to the server (and then blocks and reissues again if the server still returns a rate-limit error). This allows the application developer to know that their request will complete "at some point" rather than having to handle the error and timeout themselves. This is usually preferred in a single-threaded application, or one where all the API work happens on a single thread (i.e. one request at a time, with no new request issued until the previous one has completed).
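
    As a rough sketch of that pattern (a minimal illustration, not taken from any particular library):

    import time

    # Minimal client-side rate limiter: instead of dropping the request or raising an
    # error, block the calling thread until the rate-limit window has expired.
    class BlockingRateLimiter:
        def __init__(self, min_interval_seconds):
            self.min_interval = min_interval_seconds
            self.last_request_time = None

        def wait(self):
            if self.last_request_time is not None:
                elapsed = time.monotonic() - self.last_request_time
                if elapsed < self.min_interval:
                    time.sleep(self.min_interval - elapsed)    # stall until the limit expires
            self.last_request_time = time.monotonic()

    The application just calls wait() before each request, so from its point of view the request always completes "at some point" instead of failing with a rate-limit error.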

  • A note on the importance of prompt and template formatting - as seen from starcoder
  • More generally, make sure that you have the correct template format selected in the chat settings when you're using a conversational model.

    Some models supposedly require an additional "instruction" template where the "instruction" is something like "Continue the following conversation between <user> and <bot> by writing a single reply for <bot>", although personally I get better results without this even on models that are instruction-tuned rather than conversation-tuned. Most models that have any form of basic tuning beyond a bare "continue/complete the text" model (which requires an entirely different approach to prompting) seem to be able to understand the basic format/concept of a conversation.

  • A note on the importance of prompt and template formatting - as seen from starcoder
  • How would you ask for a follow-up change using this instruction template?

    Personally I interpreted the request as "if it's been less than 2 minutes, sleep/block until it's 2 minutes since last time" rather than dropping/discarding the string immediately and continuing. Suppose this is what I had actually wanted: can you ask the model to modify its code accordingly without having to go back and edit the original prompt to start over?

    I find that a lot of programming questions require multiple rounds of refinements. I tend to favor models that are able to modify existing code in a back-and-forth discussion, and that are capable of writing out just the modified parts of their code with each change to save on time and token count (seriously, so many models will insist on repeating the entire thing no matter how firmly you tell them not to - if you're lucky, they'll actually include the changes in their second reply instead of thereafter getting stuck in a loop of writing out identical code every time).

  • What is wrong with LLM benchmarks, and why are we still using them?

    You are probably familiar with the long list of various benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess the model's ability to perform in various aspects of language understanding, logical reasoning, information recall, and so on.

    However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at "some percentage of GPT-3" and yet not one of these models has ever been able to produce correctly-functioning code for any non-trivial task or follow a line of argument/reasoning. Talking to GPT-3 I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas other models that I have tried always feel as though they have only a superficial/surface-level understanding regardless of what the benchmarks claim.

    My most recent frustration, and the one that prompted this post, is regarding the newly-released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft's own published benchmark results for their yet-unreleased model, and scores an "average" result of 74.0% against GPT-3's 75.7% while the LLaMa model that I was using previously apparently scores merely 63%.

    I've used GPT-3 (text-davinci-003), and this model does not "come within comparison" of it. Even giving it as much of a fair chance as I can, giving it plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases) but it is significantly worse at it than LLaMa 13B (which is also pretty bad). This model does not understand basic reasoning and fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims that it will do, but the answer itself contradicts the provided steps or the steps themselves are wrong/illogical. The model has only learnt to produce "step by step reasoning" as an output format, and has a worse understanding of what that actually means than any other model does when asked to "explain your reasoning" (at least, for other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).

    There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to be measuring a model's ability to actually understand the prompt/task, but possibly only measuring its ability to provide an output that "looks correct" according to some format. These benchmarks are not a reliable way to compare model performance and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform "almost as good as GPT-3" but yet fail spectacularly in any task/prompt that I can think of to throw at them.

    (I keep using coding as an example however I have also tried other tasks besides code as I realise that code is possibly a particularly challenging task due to requirements like needing exact syntax. My interpretation of the various models' level of understanding is based on experience across a variety of tasks.)
