Large language models are fed on whatever data can be found by crawling the Internet. The more data you feed ChatGPT, Gemini, or Claude, the higher the quality of their outputs. But that can work in rever…
And some AI companies like Perplexity just shout “YOLO” and try to stealth-crawl it anyway. [404 Media; Wired]
it’s fucking wild that automated scraping without permission used to be something you did as a last resort, under strict restrictions and secrecy, ’cause whoever had the data you needed wasn’t exposing a usable API. but not in the AI industry; there it’s the fucking foundation of the entire company
Yeah. There probably was a fair bit of stealth-crawling up to this point, but the perps knew they needed to keep it on the down-low.
The AI bubble, on the other hand, lacks the ability to keep it subtle, making it plainly obvious people's shit was getting stolen and showcasing AI bros/techbros' utter disregard for anyone but themselves (e.g. by ignoring robots.txt).
(Off-the-cuff prediction: anti-AI scraping measures will likely start feeding false info to AI scrapers they detect - beyond simply throwing a wrench into those models, it'd also make it less likely AI scrapers will realise "hey, our shit's getting blocked")
It means there's no attempt to block AI models from using this article about AI models being blocked. Mind you, I don't know how effective it would be if there were.