OpenAI's existential problem is that they'll eat their own lunch and then have nothing left. The reason people make useful content now and give it away for free is that they can get paid for the traffic.
Take that traffic away and all the content goes behind paywalls and login screens where OpenAI can't touch it.
But the content has already been absorbed. I wouldn’t be surprised if they have all of it sucked up (many would argue illegally) and stored as a corpus to iterate on. It’s not like they re-crawl the entire web every time they train a new version of the model.
One of the craziest facts about GPT (to me) is that it was trained on roughly 570GB of text data. That’s obviously a lot of text, but it’s bewildering that I could theoretically store their entire training dataset on my laptop.
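For scale, here's a quick back-of-envelope sketch in Python. The 1TB drive size and the ~4 bytes per token ratio are my own assumptions (4 bytes/token is a common rule of thumb for English prose, not an OpenAI figure):

```python
# Back-of-envelope: how does a 570GB text corpus compare to a laptop
# drive, and roughly how many tokens might it hold?

DATASET_BYTES = 570 * 10**9      # ~570GB of text
LAPTOP_SSD_BYTES = 1 * 10**12    # assumption: a typical 1TB laptop SSD
BYTES_PER_TOKEN = 4              # assumption: rough heuristic for English prose

ssd_fraction = DATASET_BYTES / LAPTOP_SSD_BYTES
approx_tokens = DATASET_BYTES / BYTES_PER_TOKEN

print(f"Fraction of a 1TB SSD used: {ssd_fraction:.0%}")              # ~57%
print(f"Approximate token count: {approx_tokens / 1e9:.0f} billion")  # ~142 billion
```

Under those assumptions, the whole corpus takes up barely half of an ordinary consumer SSD.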