Stable Diffusion together with ControlNet. You basically feed it the text as a black and white image, along with a prompt describing the picture you want (cats, in this case). It then generates the output using the black and white image as a base. It's fairly simple to do, but it can take a while to get a quality result such as this one.
Is this the kind of thing you mean? https://files.catbox.moe/0c97kh.png Getting the text to be more subtle isn't as simple as telling the model to do it, but changing some of the settings got me this, and I think it's a little better.