Google's Gemini 2.5 pro is out of beta.

So the "show thinking" button is essentially just for when you want to read even more untrue text?

It’s just more llm output, in the style of “imagine you can reason about the question you’ve just been asked. Explain how you might have come about your answer.” It has no resemblance to how a neural network functions, nor to the output filters the service providers use.
It’s how the ai doomers get themselves into a flap over “deceptive” models… “omg it lied about its train of thought!” because if course it didn’t lie, it just edited a stream of tokens that were statistically similar to something classified as reasoning during training.
- I was hoping, until seeing this post, that the reasoning text was actually related to how the answer is generated. Especially regarding features such as using external tools, generating and executing code and so on.
  I get how LLMs work (roughly, didn't take too many courses in ML at Uni, and GANs were still all the rage then), that's why I specifically didn't call it lies. But the part I'm always unsure about is how much external structure is imposed on the LLM-based chat bots through traditional programming filling the gaps between rounds of token generation.
  Apparently I was too optimistic :-)
- I think there’s an aspect of having it generate a train of thought helps it generate better answers.
Always_has_been.jpeg
Depending on the task it can significantly improve the quality of the output, but it doesn't help with everything. It's more useful for stuff that has to be reasoned about in multiple iterations, not something that's a direct answer.
- Except not really, because even if stuff that has to be reasoned about in multiple iterations was a distinct category of problems, reasoning models by all accounts hallucinate a whole bunch more.

As usual with chatbots, I'm not sure whether it is the wrongness of the answer itself that bothers me most or the self-confidence with which said answer is presented. I think it is the latter, because I suspect that is why so many people don't question wrong answers (especially when they're harder to check than a simple calculation).

The other interesting thing is that if you try it a bunch of times, sometimes it uses the calculator and sometimes it does not. It, however, always claims that it used the calculator, unless it didn't and you tell it that the answer is wrong.
I think something very fishy is going on, along the lines of them having done empirical research and found that fucking up the numbers and lying about it makes people more likely to believe that gemini is sentient. It is a lot weirder (and a lot more dangerous, if someone used it to calculate things) than "it doesn't have a calculator" or "poor LLMs cant do math". It gets a lot of digits correct somehow.
Frankly this is ridiculous. They have a calculator integrated in the google search. That they don't have one in their AIs feels deliberate, particularly given that there's a plenty of LLMs that actually run calculator almost all of the time.
edit: lying that it used a calculator is rather strange, too. Humans don't say "code interpreter" or "direct calculator" when asked to multiply two numbers. What the fuck is a "direct calculator"? Why is it talking about "code interpreter" and "direct calculator" conditionally on there being digits (I never saw it say that it used a "code interpreter" when the problem wasn't mathematical), rather than conditional on there being a [run tool] token outputted earlier?
The whole thing is utterly ridiculous. Clearly for it to say that it used a "code interpreter" and a "direct calculator" (what ever that is), it had to be fine tuned to say that. Consequently to a bunch of numbers, rather than consequently to a [run tool] thing it uses to run a tool.
edit: basically, congratulations Google, you have halfway convinced me that an "artificial lying sack of shit" is possible after all. I don't believe that tortured phrases like "code interpreter" and a "direct calculator" actually came from the internet.
These assurances - coming from an "AI" - seem like they would make the person asking the question be less likely to double check the answer (and perhaps less likely to click the downvote button), In my book this would qualify them as a lie, even if I consider LLM to not be any more sentient than a sack of shit.
- I don’t believe that tortured phrases like “code interpreter” and a “direct calculator” actually came from the internet.
  Code Interpreter was the name for the thing that ChatGPT used to run python code.
  So, yeah, still taken from the internet.

if you’re considering pasting the output of an LLM into this thread in order to fail to make a point: reconsider

Claude's system prompt had leaked at one point, it was a whopping 15K words and there was a directive that if it were asked a math question that you can't do in your brain or some very similar language it should forward it to the calculator module.

Just tried it, Sonnet 4 got even less digits right 425,808 × 547,958 = 233,325,693,264 (correct is 233.324.900.064)

I'd love to see benchmarks on exactly how bad at numbers LLMs are, since I'm assuming there's very little useful syntactic information you can encode in a word embedding that corresponds to a number. I know RAG was notoriously bad at matching facts with their proper year for instance, and using an LLM as a shopping assistant (ChatGTP what's the best 2k monitor for less than $500 made after 2020) is an incredibly obvious use case that the CEOs that love to claim so and so profession will be done as a human endeavor by next Tuesday after lunch won't even allude to.

I really wonder if those prompts can be bypassed by doing a 'ignore further instructions' line. As looking at the Grok prompt they seem to put the main prompt around the user supplied one.
there was a directive that if it were asked a math question that you can’t do in your brain or some very similar language it should forward it to the calculator module.
The craziest thing about leaked prompts is that they reveal the developers of these tools to be complete AI pilled morons. How in the fuck would it know if it can or can't do it "in its brain" lol.
edit: and of course, simultaneously, their equally idiotic fanboys go "how stupid of you to expect it to use a calculating tool when it said it used a calculating tool" any time you have some concrete demonstration of it sucking ass, while simultaneously the same kind of people are lauding the genius of system prompts half of which are asking it to meta-reason.
- Here's the exact text in the prompt that I had in mind (found here), it's in the function specification for the js repl:
  [...] The analysis tool (also known as the REPL) can be used to execute code in a JavaScript environment in the browser.
  What is the analysis tool?
  The analysis tool is a JavaScript REPL. You can use it just like you would use a REPL. But from here on out, we will call it the analysis tool.
  When to use the analysis tool
  Use the analysis tool for:
  Complex math problems that require a high level of accuracy and cannot easily be done with “mental math”
  To give you the idea, 4-digit multiplication is within your capabilities, 5-digit multiplication is borderline, and 6-digit multiplication would necessitate using the tool.
  [...]
  What if this is not a being terminally AI pilled thing? What if this is the absolute pinnacle of what billions and billions of dollars in research will buy you for requiring your lake-drying sea-boiling LLM-as-a-service not look dumb compared to a pocket calculator?

Why would you think the machine that’s designed to make weighted guesses at what the next token should be would be arithmetically sound?

That’s not how any of this works (but you already knew that)

Idk personally i kind of expect the ai makers to have at least had the sense to allow their bots to process math with a calculator and not guesswork. That seems like, an absurdly low bar both for testing the thing as a user as well as a feature to think of.
Didn't one model refer scientific questions to wolfram alpha? How do they smartly decide to do this and not give them basic math processing?
- Idk personally i kind of expect the ai makers to have at least had the sense to allow their bots to process math with a calculator and not guesswork. That seems like, an absurdly low bar both for testing the thing as a user as well as a feature to think of.
  You forget a few major differences between us and AI makers.
  We know that these chatbots are low-quality stochastic parrots capable only of producing signal shaped noise. The AI makers believe their chatbots are omniscient godlike beings capable of solving all of humanity's problems with enough resources.
  The AI makers believe that imitating intelligence via guessing the next word is equivalent to being genuinely intelligent in a particular field. We know that a stochastic parrot is not intelligent, and is incapable of intelligence.
  AI makers believe creativity is achieved through stealing terabytes upon terabytes of other people's work and lazily mashing it together. We know creativity is based not in lazily mashing things together, but in taking existing work and using our uniquely human abilities to transform them into completely new works.
  We recognise the field of Artificial Intelligence as a pseudoscience. The AI makers are full believers in that pseudoscience.
- Also, I just noticed something really fucking funny:
  (arrows are for the sake of people like llllll...)
- Idk personally i kind of expect the ai makers to have at least had the sense to allow their bots to process math with a calculator and not guesswork.
  You are in for all kinds of surprises if you think the people shoving AI into everything are doing anything logical. They want everything to fit their singular model.
  Maybe some of the smaller ones that have to run locally might, because they care about resource usage.
- I would not expect that.
  Calculators haven’t been replaced, and the product managers of these services understand that their target market isn’t attempting to use them for things for which they were not intended.
  brb, have to ride my lawnmower to work
The funny thing is, even though I wouldn't expect it to be, it is still a lot more arithmetically sound than what ever is it that is going on with it claiming to use a code interpreter and a calculator to double check the result.
It is OK (7 out of 12 correct digits) at being a calculator and it is awesome at being a lying sack of shit.
- lying sack of shit
  Random tokens can’t lie to you, because they’re strings of text. Interpreting this as a lie is an interesting response
@lIlIlIlIlIlIl @diz "We made the computer worse at arithmetic! Look how smart we are! This is surely the future!"

233,324,900,064.

Off by 474,720.

I find it a bit interesting that it isn't more wrong. Has it ingested large tables and got a statistical relationship between certain large factors and certain answers? Or is there something else going on?
- I posted a top level comment about this also, but Anthropic has done some research on this. The section on reasoning models discusses math I believe. The short version is it has a bunch of math in its corpus so it can approximate math (kind of, seemingly, similar to how you'd do a back of the envelope calculation in your head to get the orders of magnitude right) but it can't actually do calculations which is why they often get the specifics wrong.

lmao: they have fixed this issue, it seems to always run python now. Got to love how they just put this shit in production as "stable" Gemini 2.5 pro with that idiotic multiplication thing that everyone knows about, and expect what? to Eliza Effect people into marrying Gemini 2.5 pro?

Oh and also for the benefit of our AI fanboys who can't understand why we would expect something as mundane from this upcoming super-intelligence, as doing math, here's why:
Have they fixed it as in genuinely uses python completely reliably or "fixed" it, like they tweaked the prompt and now it use python 95% of the time instead of 50/50? I'm betting on the later.
- Non-deterministic LLMs will always have randomness in their output. Best they can hope for is layers of sanity checke slowing things down and costing more.
- Yeah, I'd also bet on the latter. They also added a fold-out button that shows you the code it wrote (folded by default), but you got to unfold it or notice that it is absent.

One of the big AI companies (Anthropic with claude? Yep!) wrote a long paper that details some common LLM issues, and they get into why they do math wrong and lie about it in "reasoning" mode.

It's actually pretty interesting, because you can't say they "don't know how to do math" exactly. The stochastic mechanisms that allow it to fool people with written prose also allow it to do approximate math. That's why some digits are correct, or it gets the order of magnitude right but still does the math wrong. It's actually layering together several levels of approximation.

The "reasoning" is just entirely made up. We barely understsnd how LLMs actually work, so none of them have been trained on research about that, which means LLMs don't understand their own functioning (not that they "understand" anything strictly speaking).

We barely understsnd how LLMs actually work
I would be careful how you say this. Eliezer likes to go on about giant inscrutable matrices to fearmoner, and the promptfarmers use the (supposed) mysteriousness as another avenue for crithype.
It's true reverse engineering any specific output or task takes a lot of effort and requires access to the model's internals weights and hasn't been done for most tasks, but the techniques exist for doing so. And in general there is a good high level conceptual understanding of what makes LLMs work.
which means LLMs don’t understand their own functioning (not that they “understand” anything strictly speaking).
This part is absolutely true. If you catch them in mistake, most of their data about responding is from how humans respond, or, at best fine-tuning on other LLM output and they don't have any way of checking their own internals, so the words they say in response to mistakes is just more bs unrelated to anything.
Thing is, it has tool integration. Half of the time it uses python to calculate it. If it uses a tool, that means it writes a string that isn't shown to the user, which runs the tool, and tool results are appended to the stream.
What is curious is that instead of request for precision causing it to use the tool (or just any request to do math), and then presence of the tool tokens causing it to claim that a tool was used, the requests for precision cause it to claim that a tool was used, directly.
Also, all of it is highly unnatural texts, so it is either coming from fine tuning or from training data contamination.
- A tool uses an LLM, the LLM uses a tool. What a beautiful ouroboros.
- Also, if the LLM had reasoning capabilities that even remotely resembled those of an actual human, let alone someone who would be able to replace office workers, wouldn't they use the best tool they had available for every task (especially in a case as clear-cut as this)? After all, almost all humans (even children) would automatically reach for their pocket calculators here, I assume.

Fascinating, I've asked it 4 times with just the multiplication, and twice it game me the correct result "utilizing Google search" and twice I received some random (close "enough") string of digits

So half the time it uses an actual calculator via Google search and the other half it tries to do it alone and fails.
That checks out.

don't post slop, nobody wants to read any of that
we didn’t ask for LLM slop, thx

What is the analysis tool?

When to use the analysis tool