Data Hoarder @selfhosted.forum A_Zythera @alien.top

Searching the Internet Archive for URLs containing a substring

Hi all, not a data hoarder myself, but I have been digging into using the Wayback Machine to find old images and videos for the past few weeks. I've been trying to find a way to search all URLs on the archive for any that contain particular substrings (typically video/image IDs) but haven't had much luck. Yesterday I was directed to the Wayback CDX API and its search functions, but I have some major issues with its usage for my desired outcome:

  1. Using the search function via the CDX API requires a domain input. I'm not looking for specific sites per se; I'm looking for URLs on any domain that contain the specific strings in question.

  2. Even when searching within a large domain, the system seems to retrieve entries for the domain up to an upper limit first and only then apply the search filters. This means that entries containing the desired substring may not be among those retrieved before filtering, and so will not be found.

I have tried using the built-in Pagination API to retrieve all relevant domain entries by splitting them into blocks but, due to the way the filters are applied, each query only tells me whether a matching entry is in the current block, so I have to search each block manually. I have basically no coding knowledge (sorry), so just figuring out how to use the CDX search properly was a bit of a challenge. I definitely don't have the ability to automate the search process for the paginated data.
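
For reference, automating the block-by-block search described above could look roughly like the Python sketch below. It assumes the CDX server's `showNumPages`, `page`, `matchType=domain`, `fl` and `output=json` parameters behave as documented. The function names (`build_query`, `matching_rows`, `search_domain`) are made up for illustration, and the substring check is done locally on each page rather than with the server-side filter. It still needs a domain to search under, so it only addresses the second issue, not the first.

```python
import json
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
HEADER_ROW = ["timestamp", "original"]


def build_query(domain, page=None, count_pages=False):
    """Build a CDX request URL covering every capture under `domain`."""
    params = {"url": domain, "matchType": "domain", "fl": "timestamp,original"}
    if count_pages:
        # Ask only for the number of pages available for this query.
        params["showNumPages"] = "true"
    else:
        params["output"] = "json"
        params["page"] = str(page)
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def matching_rows(rows, substring):
    """Keep [timestamp, original] rows whose URL contains `substring`."""
    needle = substring.lower()
    return [
        row for row in rows
        if row != HEADER_ROW and len(row) == 2 and needle in row[1].lower()
    ]


def fetch_text(url):
    """Download a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def search_domain(domain, substring, fetch=fetch_text):
    """Walk every page of results for `domain` and collect matching rows."""
    total_pages = int(fetch(build_query(domain, count_pages=True)).strip())
    hits = []
    for page in range(total_pages):
        body = fetch(build_query(domain, page=page)).strip()
        rows = json.loads(body) if body else []
        hits.extend(matching_rows(rows, substring))
    return hits
```

Calling something like `search_domain("example.com", "abc123")` would return a list of `[timestamp, original]` pairs, and each one can be opened at `https://web.archive.org/web/<timestamp>/<original>`. A large domain may have a great many pages, so a run like this could take a long time.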

Maybe a long shot and sorry for my lack of understanding, but would anyone here know how I could go about solving my issue? It's possible you may have to explain to me like I'm 5 but I normally pick stuff up pretty quick.

Thanks for any help in advance!
