Google's Gemini 2.5 pro is out of beta.
I love to show that kind of shit to AI boosters. (In case you're wondering, the numbers were chosen randomly and the answer is incorrect).
They go waaa waaa it's not a calculator, and then I can point out that it got the leading 6 digits and the last digit correct, which is a lot better than it did on the "softer" parts of the test.
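If it helps, the check is trivial to reproduce. Here's roughly what I did, with placeholder operands and a placeholder model answer rather than the actual numbers from my test:

```
# Placeholder operands and model output, not the actual ones from my test.
a, b = 348_712, 905_463
model_answer = "315745813556"   # whatever the model printed
true_answer = str(a * b)

# how many leading digits match
lead = 0
for x, y in zip(true_answer, model_answer):
    if x != y:
        break
    lead += 1

# how many trailing digits match
trail = 0
for x, y in zip(reversed(true_answer), reversed(model_answer)):
    if x != y:
        break
    trail += 1

print(f"{lead} leading and {trail} trailing digits match "
      f"(true product: {true_answer})")
```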
So the "show thinking" button is essentially just for when you want to read even more untrue text?
It’s just more LLM output, in the style of “imagine you can reason about the question you’ve just been asked. Explain how you might have arrived at your answer.” It has no resemblance to how a neural network functions, nor to the output filters the service providers use.
It’s how the AI doomers get themselves into a flap over “deceptive” models… “omg it lied about its train of thought!”, because of course it didn’t lie, it just emitted a stream of tokens that were statistically similar to something classified as reasoning during training.
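To make that “just more output” point concrete: conceptually it’s something like the sketch below. `generate` is a stand-in for whatever the provider’s actual sampler does (nobody outside knows the internals), and the prompts are made up; the point is only that the “thinking” comes from the same token-by-token sampling as the answer.

```
def generate(prompt: str) -> str:
    """Stand-in for the provider's token sampler; the real internals are unknown."""
    raise NotImplementedError

def answer_with_thinking(question: str) -> tuple[str, str]:
    # The "thinking" is just another completion, steered (by prompting or
    # fine-tuning) to look like deliberation. It is not a readout of what
    # the network actually computed internally.
    thinking = generate(
        "Think step by step about the following question before answering:\n"
        + question
    )
    answer = generate(
        f"Question: {question}\nNotes:\n{thinking}\nFinal answer:"
    )
    return thinking, answer
```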
I was hoping, until seeing this post, that the reasoning text was actually related to how the answer is generated, especially regarding features such as using external tools, generating and executing code, and so on.
I get how LLMs work (roughly; I didn't take too many courses in ML at Uni, and GANs were still all the rage then), which is why I specifically didn't call it lies. But the part I'm always unsure about is how much external structure is imposed on LLM-based chatbots through traditional programming filling the gaps between rounds of token generation, something like the sketch below.
Apparently I was too optimistic :-)
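To be concrete about what I mean by “external structure”: I picture something like this hypothetical loop, plain code wrapped around the model call. The tool-calling convention (`TOOL_CALL` / `TOOL_RESULT`) and the function names are made up for illustration.

```
import json

def run_chat(user_message: str, generate, tools: dict) -> str:
    """Hypothetical orchestration loop around an LLM.

    `generate` stands in for the model call; `tools` maps tool names to
    ordinary Python functions. Everything outside `generate` is plain,
    traditionally-programmed glue code.
    """
    transcript = f"User: {user_message}\n"
    while True:
        output = generate(transcript)          # one round of token generation
        transcript += output + "\n"
        if output.startswith("TOOL_CALL "):    # model asked for a tool (made-up convention)
            _, name, raw_args = output.split(" ", 2)
            result = tools[name](**json.loads(raw_args))
            transcript += f"TOOL_RESULT {json.dumps(result)}\n"
            continue                           # feed the result back and generate again
        return output                          # otherwise treat it as the final reply
```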
I think there’s an aspect where having it generate a train of thought helps it produce better answers.
Always_has_been.jpeg
Depending on the task it can significantly improve the quality of the output, but it doesn't help with everything. It's more useful for stuff that has to be reasoned about in multiple iterations, not something that's a direct answer.
Except not really, because even if "stuff that has to be reasoned about in multiple iterations" were a distinct category of problems, reasoning models by all accounts hallucinate a whole bunch more.