This article seems misleading. It uses the loaded Western term "selfie" to generate these images of people from different cultures smiling. If you use the term "group photo" instead, you get much more natural-looking results, where people from certain cultures are smiling and others aren't.
Isn't that the essence of the issue: that those models are loaded with biases that might or might not overlap with the dominant ones in inscrutable ways, hence producing new levels of confusion and indirection?
The content about the reasons for different smiles is cool. And the highlighting of the training data influencing things is also good stuff.
But as for realistic image generation with culturally relatable smiling, that sounds like a skill issue. You can't just generate images of specific times, settings, or people with "defaults". You have to be specific in your prompt.