GPT-4 performance comparable with physicians on official medical board residency examinations. Model performance near or above official passing rate in all medical specialties tested
It's just a multiple-choice test with question prompts. This is the exact sort of thing an LLM should be very good at. This isn't ChatGPT trying to do the job of an actual doctor; it would be quite abysmal at that. And even this multiple-choice test had to be stacked in favor of ChatGPT.
Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.
Don't get me wrong, though: I think there are some interesting ways AI can provide useful assistive tools in medicine, especially for tasks involving integrating large amounts of data. I think the authors use some misleading language, though, saying things like AI models "are performing at the standard we require from physicians," which would only be true if the job of a physician were filling out multiple-choice tests.
I mean, LLMs pretty much just try to guess what to say in a way that matches their training data. Research, on the other hand, is usually about testing or measuring something in reality, looking at the data, and drawing conclusions from it, so it doesn't seem feasible for LLMs to do research.
They may be used as part of research, but they can't do the whole thing, since a crucial part of most research is the actual data, and you'd need a LOT more than just LLMs to get that.
Despite using search results, it also hallucinates, like when it told me last week, with no citation, that IKEA had built a model of aircraft during World War 2.
I was trying to remember the name of a well-known consumer goods company that had made an aircraft and also had an aerospace division. The answer is Ball, the jar and soda can company.
Bigger models do start to show more emergent intelligent behavior, and there are components being added to LLMs to make them more logical and robust. At least, this is what OpenAI and others are saying about even bigger models and datasets.
For me, the biggest indicator that we're barking up the wrong tree is energy consumption.
Compare the energy required to feed a human with that required to train and run the current "leading edge" systems.
From a software development perspective, I think machine learning is a very useful way to model unknown variables, but that's not the same as "intelligence".
Cohere's command-r models are trained for exactly this type of task. The real struggle is finding a way to feed relevant sources into the model. There are plenty of projects that have attempted it, but few can do more than pull the first few search results.
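For what it's worth, command-r's chat endpoint has a documents parameter built for exactly that grounding step. Here's a minimal sketch of the pattern, assuming Cohere's Python SDK (v1 chat API); the document snippets are placeholders standing in for whatever your own retrieval step would return:

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key

    # In a real pipeline these would come from your search/retrieval
    # step; they're made up here purely for illustration.
    docs = [
        {"title": "Study abstract",
         "snippet": "GPT-4 scored near the passing threshold across specialties."},
        {"title": "Methods",
         "snippet": "Questions requiring image interpretation were excluded."},
    ]

    response = co.chat(
        model="command-r",
        message="What did the study actually test?",
        documents=docs,  # the model grounds its answer in these
    )

    print(response.text)
    print(response.citations)  # answer spans linked back to the supplied docs

The API call is the easy half, though. Everything you're describing about shallow retrieval is about what goes into docs, and nothing in the call itself makes that part less shallow.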
Researchers reduced [the task] to producing a plausible corpus of text, and then published the not-so-shocking results that the thing that is good at generating plausible text did a good job generating plausible text.
From the OP, buried deep in the methodology:
Because GPT models cannot interpret images, questions including imaging analysis, such as those related to ultrasound, electrocardiography, x-ray, magnetic resonance, computed tomography, and positron emission tomography/computed tomography imaging, were excluded.
Yet here's their conclusion:
The advancement from GPT-3.5 to GPT-4 marks a critical milestone in which LLMs achieved physician-level performance. These findings underscore the potential maturity of LLM technology, urging the medical community to explore its widespread applications.
It's literally always the same. They reduce a task such that ChatGPT can do it, then report that it can in the headline, with the caveats buried way later in the text.
This research has been done a lot of times, but I don't see the point of it. Exams are something I would expect LLMs, especially the higher-end ones, to do well on because of their nature. But it says next to nothing about how reliable the LLM would be as an actual doctor.
Even those who do well in testing of rote knowledge can perform poorly in practical exercises. That's why medical doctors have to train and qualify through several years of supervised residency before being allowed to practice even basic medicine.
It says as much for an LLM as it does for a person, but doctors still have to get a lot of field experience after passing these tests before they get certified as doctors.
LLMs can't design experiments or think about consequences or quality of life.
They also don't "learn" by asking questions or from a one-time input. They'd need to see hundreds or thousands of people die from something before recognizing the pattern of something new.