Could someone please give me a walk through on how to crawl an entire web domain and scrape the images only?
  • Thank you very much. I hadn't tried it until you suggested it, and it's the only program that appears to work so far. I started it and it's crawling right now.
    1.) The instructions don't mention WebP image files, and I wasn't sure whether wget handles that image type, so I just used the accept list from the instructions: "jpeg,jpg,bmp,gif,png".

    But I definitely want WebP, and in fact all image file formats. I'm just not sure whether wget recognizes all of them.

    2.) wget's "recursive retrieval" follows links to a default maximum depth of five levels.
    Is that enough? How do I set it deeper? How deep is too deep for a website, and can it be too deep? Logically speaking, once the domain or subdomain name changes completely, that seems to me the best indication to stop.

    3.) If the website's server causes any errors or timeouts, will wget tell me at the end how many URLs and images it was blocked from downloading?
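
A rough sketch covering the three questions above, not a definitive recipe. It relies on wget's `--accept` list matching on file name suffix (so `webp` is just another entry), on `--level` accepting a number or `inf`, and on `--output-file` writing a log that can be searched afterwards. The helper names (`build_wget_command`, `count_failures`), the suffix list, the `wget.log` and `images` names, and the error patterns are all assumptions made for illustration; the exact log wording may differ between wget versions.

```python
import re
import shlex

# Suffixes for wget's --accept list. wget filters on the file name suffix,
# so "webp" needs no special support. This list is illustrative, not exhaustive.
IMAGE_SUFFIXES = ["jpg", "jpeg", "png", "gif", "bmp", "webp", "svg", "tif", "tiff", "ico"]


def build_wget_command(start_url, depth="inf", log_file="wget.log", out_dir="images"):
    """Return the argument list for a recursive, images-only wget run."""
    return [
        "wget",
        "--recursive",                            # follow links
        f"--level={depth}",                       # default is 5; "inf" removes the limit
        "--no-parent",                            # never climb above the start directory
        "--no-directories",                       # save all files into one folder
        f"--accept={','.join(IMAGE_SUFFIXES)}",   # keep only these suffixes
        f"--directory-prefix={out_dir}",          # where the files are saved
        f"--output-file={log_file}",              # write messages to a log file
        "--wait=1",                               # pause one second between requests
        start_url,
    ]


def count_failures(log_text):
    """Heuristically count log lines that look like failed requests."""
    pattern = re.compile(r"ERROR \d{3}|failed:|timed out", re.IGNORECASE)
    return sum(1 for line in log_text.splitlines() if pattern.search(line))


if __name__ == "__main__":
    # Print the command so it can be checked before running it by hand.
    print(shlex.join(build_wget_command("https://example.com/")))
```

wget stays on the starting host unless `--span-hosts` is given, so a crawl normally stops on its own once the domain changes. As far as I know, the closing summary reports what was downloaded rather than what failed, which is why the sketch counts failures from the log instead.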

  • Data Hoarder @selfhosted.forum stayjuicecom @alien.top

    I've got ZorinOS/Ubuntu. I've tried HTTrack, but it gets Slimjet launch terminal errors. I've tried getting ChatGPT to write Python scripts for me. I've tried WFDownloaderApp, but its GUI glitches horribly. I've tried "DownloadThemAll!", but it's just a browser extension that downloads only a single web page, and I see no way to enable crawling or filters.

    Please help, thanks.
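
For the script route, here is a minimal sketch of just the page-parsing step, using only the Python standard library: it collects `<img>` sources and the links that stay on the same host. The class name `ImageLinkParser` and the example URLs are hypothetical, and the fetching, the crawl queue, and the actual downloads are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageLinkParser(HTMLParser):
    """Collect image URLs and same-host page links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.images = []  # absolute URLs found in <img src="...">
        self.pages = []   # absolute URLs of links on the same host

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            # Only queue links on the same host, so the crawl
            # stops once the domain changes.
            if urlparse(url).netloc == self.host:
                self.pages.append(url)
```

Feeding a page's HTML to `ImageLinkParser(page_url).feed(html)` fills `images` and `pages`; a crawler would download the former and visit the latter, keeping a set of already-seen URLs to avoid loops.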
