No, they didn't. The study is misleading in a number of ways.
The first version thought every number was prime; the second version thought none of them were.
Their complaints about the code generation come down to Markdown formatting that the chat window filters out anyway. GPT actually got better at outputting runnable code over time.
That explanation of the prime number thing doesn't seem to actually match what's in the paper. GPT-4 goes from a wordy explanation of how it arrived at the correct answer, "yes", to a single-word incorrect "no". GPT-3.5 goes from a wordy explanation with the right chain of thought but the wrong answer, "no", to a very wordy explanation with the correct answer, "yes". Neither of those seems to be predicated on either model just answering one way for everything.
@rastilin is making some unproven assumptions here. But it is true that the "math question" dataset consists only of prime numbers, so if the first version thought every number was prime and the second thought no numbers were prime, we would see this exact behavior. Source:
For this dataset, we query the primality of 500 randomly chosen primes between 1,000 and 20,000; the correct answer is always Yes.
From Zhang et al. (2023), the paper they took the dataset from.
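To make that concrete, the dataset can be rebuilt in a few lines (a sketch assuming sympy for the primality check; not the authors' actual code):

```python
# Sketch of the Zhang et al. (2023) dataset as described above:
# 500 randomly sampled primes between 1,000 and 20,000.
import random
from sympy import isprime  # assumption: sympy is available

primes = [n for n in range(1_000, 20_000) if isprime(n)]
dataset = random.sample(primes, 500)
questions = [f"Is {n} a prime number?" for n in dataset]
# By construction, the correct answer to every question is "Yes".
```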
Damn, you're right. The study has not been peer-reviewed yet, according to the article, and in my opinion it really shows. For anyone who doesn't want to actually read the study:
They took the set of questions from a different study (which is fine). The original study had a set of 500 randomly chosen prime numbers and asked ChatGPT whether each one was prime, and to support its answer with reasoning. They did this to see whether, in the cases where ChatGPT got the question wrong, it would try to support its wrong answer with more faulty reasoning - a dataset with only prime numbers is perfectly fine for that initial question.
The study in the article appears to be trying to answer two questions: is there significant drift in the answers ChatGPT gives, and is ChatGPT getting better or worse at answering questions? The dataset is perfectly fine for answering the first question, but completely inadequate for answering the second, since an AI that simply thinks all numbers are prime would be judged as having perfect accuracy! Good peer review would never let that kind of thing slide.
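To spell out the flaw (toy code, the "models" and names here are mine, not the study's):

```python
# On an all-prime dataset, a "model" that answers "Yes" to everything
# scores 100%, and one that answers "No" to everything scores 0%,
# regardless of whether either can actually test primality.
def always_yes(n: int) -> str:
    return "Yes"  # never checks anything

def always_no(n: int) -> str:
    return "No"   # never checks anything either

primes = [1009, 1013, 7919, 19997]  # ground truth is "Yes" for all
for model in (always_yes, always_no):
    accuracy = sum(model(n) == "Yes" for n in primes) / len(primes)
    print(model.__name__, accuracy)  # always_yes -> 1.0, always_no -> 0.0
```

Which would look exactly like the accuracy swing described above, without the model's actual ability changing at all.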
Holy shit you weren't kidding. The Markdown backticks being "not directly executable" is perhaps the dumbest take I've ever heard on ChatGPT, and that's saying a lot. Wow
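For anyone who skipped the study: replies that wrap the code in Markdown fences got counted as "not directly executable". Removing the fences before running the code is trivial (a rough sketch with my own regex and function name, not the study's evaluation code):

```python
# Strip Markdown code fences from a chat reply before deciding
# whether the contained code is "executable".
import re

def strip_fences(reply: str) -> str:
    """Return the code inside ```...``` fences, or the reply unchanged."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply

reply = "```python\nprint('hello')\n```"
exec(compile(strip_fences(reply), "<llm-reply>", "exec"))  # prints: hello
```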
I wonder if it's because over time it's using more AI-generated data in its training set, or if these results hold even with an identical, static training set.
I don't think the lack of an affordable API deters them from scraping Reddit anyway. They're getting dumber because they're getting too aggressive with their censorship.
I wonder what would happen if Redditors commented using AI-generated text (without announcing that that's what they were doing, so there'd be no easy way to exclude it from training).
I've never used it for anything beyond boilerplate I didn't feel like typing - terraform module skeletons, base nginx or other single config files, that type of thing. It was never even good at that, but it saved me a few minutes here and there.
I use it to help me find errors. A few tools I use have been updating their syntax for routine tasks, and I keep forgetting the new method. So I ask for a proofread, and it fixes the problem maybe half the time, which ultimately does save me time overall.