LLMs are black box bullshit that can only be prompted, not recoded. The gab one that was told 3 or 4 times not to reveal its initial prompt was easily jailbroken.
This is ultimately because LLMS are intelligent in the same way the subconscious is intelligent. It can rapidly make association but they are their initial knee jerk associations. In the same way that you can be tricked with word games if you're not thinking things through, the LLM gets tricked by saying the first thing on their mind.
However we're not far off from resolving this. Current methods are just to force the LLM to make a step by step plan before returning the final result.
Currently though there's the hot topic of Q* from OpenAI. No one knows what it is but a good theory is that it's applying the A* maze solving algorithm to the neural network. Essentially the LLM will explore possible routes in their neural network to try and discover the best answer. In other word it would let them think ahead and compare solutions, this would be far more similar to what the conscious mind does.
This would likely patch up these holes because it would discard pathways that lead to contradicting itself/the prompt, in favor of one that fits the entire prompt (In this case, acknowledging the attempt to have it break it's initial rules).