I’m not surprised they used Reddit data to train. I am a bit shocked at how fucking lazy and haphazard they were with the data.
There are only logical arguments for anonymizing the data, which it looks like they didn’t do. Not doing so is a massive privacy risk, and it also puts the company at legal risk. Who knows what other bizarre info it’ll leak.
The silliness of anonymizing data that's already wide open to the public aside, if you were to anonymize the usernames you'd end up producing a worse AI, because often the literal username of the person in question is significant to the context of what's being written. Think of all the "relevant username" comments, for example. People make puns about usernames, berate people for having offensive usernames, and so forth. If those usernames were all replaced with anonymized substitutes, the AI would be training on nonsense.