2 comments
  • Wouldn’t their patch embeddings return different results depending on where the patch boundaries fall in the image? They don’t appear to use overlapping patches. That makes it significantly less resource-intensive, but surely the chance of losing significant signal in the image-to-text translation goes up accordingly?

    • Good question. I'm not sure how they account for that. Maybe a higher-level layer is responsible for handling the boundaries?
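
To make the trade-off in this thread concrete, here is a rough sketch comparing non-overlapping and overlapping patch grids. The numbers (a 224-pixel image, 16-pixel patches, a stride of 8 for the overlapping case) and the helper names are assumptions for illustration only; nothing here is taken from the model being discussed.

```python
def patch_starts(size: int, patch: int, stride: int) -> list[int]:
    """Start offsets of patches along one axis (no padding)."""
    return list(range(0, size - patch + 1, stride))


def patch_count(size: int, patch: int, stride: int) -> int:
    """Number of patches in a square image of side `size`."""
    return len(patch_starts(size, patch, stride)) ** 2


def patches_containing(lo: int, hi: int, size: int, patch: int, stride: int) -> list[int]:
    """Start offsets of patches that fully contain the 1-D span [lo, hi)."""
    return [s for s in patch_starts(size, patch, stride)
            if s <= lo and hi <= s + patch]


# Hypothetical settings, chosen only for illustration.
SIZE, PATCH = 224, 16

# Cost: halving the stride gives roughly 3.7x as many patches to embed.
non_overlapping = patch_count(SIZE, PATCH, stride=16)
overlapping = patch_count(SIZE, PATCH, stride=8)

# Boundary effect: an 8-pixel feature at [12, 20) straddles the edge at x = 16.
split_feature = patches_containing(12, 20, SIZE, PATCH, stride=16)
whole_feature = patches_containing(12, 20, SIZE, PATCH, stride=8)

print(non_overlapping, overlapping)   # 196 729
print(split_feature, whole_feature)   # [] [8]
```

With a stride equal to the patch size, no single patch sees the whole feature, so any recovery has to happen in later layers that mix information across patches, which is the possibility raised in the reply above. With overlap, one patch does contain it, at the cost of many more patches to process.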