The chatbot gave wildly different answers to the same math problem, with one version of ChatGPT even refusing to show how it came to its conclusion.
Can we discuss how it's possible that the paid model (gpt4) got worse and the free one (gpt3.5) got better? Is it because the free one is being trained on a larger pool of users or what?
It's because the research in question used a really small and unrepresentative dataset. I want to see these findings reproduced on a proper task collection.