There’s already more than enough training data out there. The important thing that remains is to filter it so it doesn’t also include humanity’s stupidest data.
That, and make the algorithms smarter so they resist hallucination and misinformation. That's not a data problem, it's an architecture problem.
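Well, it's established wisdom that the dataset size needs to scale with the number of model parameters. Roughly linearly, IIRC: the Chinchilla results suggest something like 20 training tokens per parameter for compute-optimal training, and it's the training compute that ends up growing about quadratically. If you don't have that much data the training basically won't work well; it will overfit or just stop improving.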
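To put rough numbers on that, here's a quick back-of-the-envelope sketch. It assumes the Chinchilla-style rule of thumb of about 20 tokens per parameter and the common estimate that training costs about 6 FLOPs per parameter per token. Both constants and both function names are my own assumptions for illustration, not exact laws. Under those assumptions the data requirement grows linearly with parameter count and the compute grows quadratically.

```python
# Back-of-the-envelope scaling estimate.
# Both constants are rule-of-thumb assumptions, not exact laws.
TOKENS_PER_PARAM = 20        # approximate Chinchilla-style tokens-per-parameter ratio
FLOPS_PER_PARAM_TOKEN = 6    # common approximation: C ~ 6 * N * D


def compute_optimal_tokens(n_params: float) -> float:
    """Estimated training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimated training compute in FLOPs for n_params and n_tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


if __name__ == "__main__":
    for n in (1e9, 10e9, 70e9):
        d = compute_optimal_tokens(n)
        c = training_flops(n, d)
        print(f"{n:.0e} params -> {d:.1e} tokens, {c:.1e} FLOPs")
```

So under this estimate, doubling the parameter count doubles the tokens you want but quadruples the compute bill.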