
Is there a network to retrieve files by their hash?

inb4 IPFS: unfortunately it doesn't work for this, because you can't take the hash of an arbitrarily large file and retrieve that file from the network. IPFS content IDs (CIDs) are a hash of the tree of chunks, so a change to the chunk size can also change the CID!

Basically, I'd like to take the SHA-256, SHA-3, BLAKE2, or MD5 hash of a file and either retrieve the file from a network or get a list of sources for it. Does something like that exist already, or will I have to build it?

16 comments
  • BitTorrent and Hyphanet have mechanisms that do this.

    Magnet URIs are a standard way of encoding this.
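
    For reference, a few of the common magnet URI "xt" forms (placeholder values, not real hashes):

        magnet:?xt=urn:btih:<BitTorrent info-hash>&dn=<file name>
        magnet:?xt=urn:tree:tiger:<base32 TTH>&dn=<file name>
        magnet:?xt=urn:sha1:<base32 SHA-1>&dn=<file name>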

    EDIT: You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.

    You typically want to be able to "chunk" a large file, so that you can pull it from multiple sources. The problem is that you can only validate that information is correct once you have the whole file. So, say you "chunk" the file, get part of it from one source and part from another. A malicious source could feed you incorrect data. You can validate that the end file does not hash to the right value, but then you have no idea what part of the file that some source fed you is invalid, so you don't know who to re-fetch data from.
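
    Roughly the failure mode I mean, as toy Python with made-up chunks and sources:

        # Flat whole-file hash only: corruption is detectable, but not locatable.
        import hashlib

        chunks_from_a = [b"chunk0", b"chunk1"]       # honest source
        chunks_from_b = [b"chunk2", b"CORRUPTED!!"]  # malicious source

        # The only thing known in advance: the hash of the whole file.
        expected = hashlib.sha256(b"chunk0chunk1chunk2chunk3").hexdigest()

        assembled = b"".join(chunks_from_a + chunks_from_b)
        if hashlib.sha256(assembled).hexdigest() != expected:
            # We know the file is bad, but not which chunk or which source,
            # so we don't know what to re-fetch or whom to stop trusting.
            print("corrupt file; no way to tell which chunk is bad")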

    What's more-common is a system where you have the hash of a hash tree of a file. That way, you can take the hash, request the hash tree from the network, validate that the hash tree hashes to the hash, and then start requesting chunks of the file, where a leaf node in the hash tree is the hash of a chunk. That way, you can validate data at a chunk level, and know that a chunk is invalid after requesting no more than one chunk from a given source.
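
    A toy version of that flow, using a flat list of leaf hashes instead of a real Merkle/Tiger tree, again with made-up chunks and sources:

        import hashlib

        def leaf_hashes(chunks):
            return [hashlib.sha256(c).digest() for c in chunks]

        def root_hash(leaves):
            # The short identifier you publish / look up on the network.
            return hashlib.sha256(b"".join(leaves)).hexdigest()

        # Publisher: split the file into chunks, publish the root.
        chunks = [b"chunk0", b"chunk1", b"chunk2", b"chunk3"]
        root = root_hash(leaf_hashes(chunks))

        # Downloader: given only `root`, first fetch the leaf hashes and
        # check that they really hash to the root...
        leaves = leaf_hashes(chunks)  # pretend these came from a peer
        assert root_hash(leaves) == root

        # ...then chunks can come from any mix of sources, and each one is
        # checked the moment it arrives, so a bad chunk is caught right away
        # and pinned on whoever sent it.
        incoming = {0: b"chunk0", 1: b"chunk1", 2: b"CORRUPTED!!", 3: b"chunk3"}
        for i, data in incoming.items():
            ok = hashlib.sha256(data).digest() == leaves[i]
            print(f"chunk {i}: {'ok' if ok else 'bad, re-fetch from another source'}")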

    See Merkle tree, which also mentions Tiger Tree Hash; TTH is typically used as a key in magnet URIs.

    EDIT2:

    "Can't think of a way to do it with a DHT"

    All of the DHTs that I can think of exist to implement this sort of thing.

    EDIT3: Oh, skimmed over your concern, didn't notice that you took issue with using a hash tree. I think that one normally does want a hash tree, that it's a mistake to use a straight hash. I mean, you can generate the hash of a hash tree as easily as the hash of a file, if you have that file, which it sounds like you do. On Linux, rhash(1) can generate hashes of hash trees. So if you already have the file, that's probably what you want.

    Hypothetically, I guess you could go build some kind of index mapping hashes to hashes of hash trees. Don't know whether you can pull the hash off BitTorrent or something, but I wouldn't be surprised if you can. But...you're probably better off with hash trees, unless you can't see the file and are already committed to a straight hash of the file.
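
    A toy sketch of such an index, with a placeholder one-level tree hash standing in for whatever the real network uses (TTH, BitTorrent piece hashes, etc.):

        import hashlib, os

        CHUNK = 256 * 1024

        def flat_sha256(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(CHUNK), b""):
                    h.update(block)
            return h.hexdigest()

        def tree_root(path):
            # Stand-in one-level tree: hash of the chunk hashes.
            leaves = []
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(CHUNK), b""):
                    leaves.append(hashlib.sha256(block).digest())
            return hashlib.sha256(b"".join(leaves)).hexdigest()

        # Build the index over whatever files you happen to host.
        index = {flat_sha256(p): tree_root(p)
                 for p in os.listdir(".") if os.path.isfile(p)}

        # Lookup: flat hash in, tree hash out; then fetch by the tree hash.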

    EDIT4:

    I mean:

        $ rhash --sha1 --hex pkgs
        7d3a772009aacfe465cb44be414aaa6604ca1ef0  pkgs
        $ rhash -T --hex pkgs
        18cab20ffdc55614ed45c5620d85b0230951432cdae2303a  pkgs
        $

    Either way, straight hash or hash of a hash tree, you're getting a hex string that identifies your file uniquely. It's just that in the hash-tree case, you also solve some significant problems related to the other thing you want to do: fetch the file. It might be more compute-intensive to generate the hash of a hash tree, but unless you're really compute-constrained... *shrugs*
