Wouldn’t their patch embeddings produce different results depending on where the patch boundaries fall on the image? They don’t appear to use any overlap redundancy; that makes it significantly less resource-intensive, but surely the chance of losing important signals in the image-to-text translation must be correspondingly high?
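To make the boundary concern concrete, here's a minimal sketch (the `patchify` helper is hypothetical, not the model's actual code) contrasting non-overlapping tiling with an overlapping stride. A thin feature that straddles a patch boundary gets split across two patches in the non-overlapping case, whereas with overlap at least one patch sees it near its centre:

```python
import numpy as np

def patchify(img: np.ndarray, patch: int, stride: int) -> np.ndarray:
    """Extract square patches from an (H, W) image with a given stride.

    stride == patch  -> non-overlapping tiling (the scheme in question)
    stride <  patch  -> overlapping patches (redundant, but costlier)
    """
    H, W = img.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    out = np.empty((rows * cols, patch, patch), dtype=img.dtype)
    k = 0
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            out[k] = img[y:y + patch, x:x + patch]
            k += 1
    return out

# A thin 2-pixel-wide feature sitting exactly on a patch boundary:
img = np.zeros((32, 32))
img[:, 15:17] = 1.0  # straddles the seam between columns 0-15 and 16-31

tiles = patchify(img, patch=16, stride=16)      # 4 patches, no overlap
overlapped = patchify(img, patch=16, stride=8)  # 9 patches, 2x overlap

# Without overlap, each adjacent patch sees only half the feature,
# pushed to its extreme edge; with stride=8 one patch contains it whole.
print(tiles.shape, overlapped.shape)  # (4, 16, 16) (9, 16, 16)
```

Whether that split actually destroys the signal depends on how well the downstream attention layers can stitch edge information back together across patches.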