Interesting that the article ends with “The new ChatGPT catcher even performed well with introductions from journals it wasn’t trained on”. Isn’t that the whole point? If you only judge a model on the data it was trained on, you get a biased, overly optimistic estimate of how good it is. The term for this is overfitting: the model fits its own training set too closely and may generalize poorly to anything else. So of course it will get near-100% accuracy on what it was trained with. I’d be curious to see what the accuracy on other papers actually is.
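
To make the training-versus-unseen-data point concrete, here is a toy sketch in Python. It has nothing to do with the actual detector from the article: the `Memorizer` class, the random data, and all the names are made up for illustration. The "model" just memorizes its training examples, and the labels are random, so there is no real pattern to learn.

```python
import random


def make_examples(n, rng):
    # Each example is (features, label). Labels are coin flips,
    # so no model can genuinely learn anything from the features.
    return [
        (tuple(rng.random() for _ in range(5)), rng.randint(0, 1))
        for _ in range(n)
    ]


class Memorizer:
    """Hypothetical 'model' that stores every training example verbatim."""

    def fit(self, examples):
        self.table = dict(examples)  # features -> label
        return self

    def predict(self, features):
        # Look up the memorized label; fall back to 0 for anything unseen.
        return self.table.get(features, 0)


def accuracy(model, examples):
    correct = sum(model.predict(x) == y for x, y in examples)
    return correct / len(examples)


rng = random.Random(0)
train = make_examples(200, rng)
held_out = make_examples(200, rng)

model = Memorizer().fit(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {held_out_acc:.2f}")
```

The training accuracy comes out perfect because the model has simply seen those exact examples, while the held-out accuracy should land around chance. That gap is why the number on journals the detector wasn't trained on is the one worth reporting.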