LocalLLaMA @sh.itjust.works hok @lemmy.dbzer0.com 5 days ago

Llama 3.3 70b - End of open-weight pretrained models from Meta or just a better Llama 3.1 405b finetune?

People are talking about the new Llama 3.3 70b release, which has generally better performance than Llama 3.1 (approaching 3.1's 405b performance): https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3

However, something to note:

Llama 3.3 70B is provided only as an instruction-tuned model; a pretrained version is not available.

Is this the end of open-weight pretrained models from Meta, or is Llama 3.3 70b instruct just a better-instruction-tuned version of a 3.1 pretrained model?

Comparing the model cards: 3.1: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md 3.3: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md

The same knowledge cutoff, same amount of training data, and same training time give me hope that it's just a better finetune of maybe Llama 3.1 405b.

6 comments

This is making me realize that I don’t fully understand the relationship between “instruction-tuned” and “pre-trained”. I thought instruction tuning was a form of fine-tuning, and that fine-tuning comes after the primary training of the model.
- A base-model / pre-trained is fed with a large dataset of random text files. Books, Wikipedia etc. After that the model can autocomplete text. And it has learned language and concepts about the world. But it won't answer your questions. It'll refine them, or think you're writing an email or long list of unanswered questions and write some more questions underneath, instead of engaging with you. Or think it's writing a novel and autocomplete "...that's what character asked while rolling their eyes." Or something completely arbitrary like that.
  
  After that major first step it'll get fine-tuned to some task. The procedure is the same, it'll get fed different text in almost the same way. And this just continues the training. But now it's text that tunes it to it's role. For example be a Chatbot. It'll get lots of text that is a question, then a special character/token and then an answer to the question. And it'll learn to reply with an (correct) answer if you put in a question and that token. It'll probably also be fine-tuned to write dialogue as a Chatbot. And follow instructions. (And refuse some things and speak more unbiased, be nice...)
  
  You can also put in domain-specific data, make it learn/focus on medicine... I think that's also called fine-tuning. But as far as I understand teaching knowledge with arbitrary data comes before teaching/tuning it to follow instructions, or it might forget that.
  
  I think instruction tuning is a form of fine-tuning. It's just called that to distinguish it from other forms of fine-tuning. But I'm not really an expert on any of this.
- I was also not sure what this meant, so I asked Google's Gemini, and I think this clears it up for me:
  
  This means that the creators of Llama 3.3 have chosen to release only the version of the model that has been fine-tuned for following instructions. They are not making the original, "pretrained" version available.
  
  Here's a breakdown of why this is significant:
  
  Pretrained models: These are large language models (LLMs) trained on a massive dataset of text and code. They have learned to predict the next word in a sequence, and in doing so, have developed a broad understanding of language and a wide range of general knowledge. However, they may not be very good at following instructions or performing specific tasks.
  
  Instruction-tuned models: These models are further trained on a dataset of instructions and desired outputs. This fine-tuning process teaches them to follow instructions more effectively, generate more relevant and helpful responses, and perform specific tasks with greater accuracy.
  
  In the case of Llama 3.3 70B, you only have access to the model that has already been optimized for following instructions and engaging in dialogue. You cannot access the initial pretrained model that was used as the foundation for this instruction-tuned version.
  
  Possible reasons why Meta (the creators of Llama) might have made this decision:
  
  Focus on specific use cases: By releasing only the instruction-tuned model, Meta might be encouraging developers to use Llama 3.3 for assistant-like chat applications and other tasks where following instructions is crucial.
  
  Competitive advantage: The pretrained model might be considered more valuable intellectual property, and Meta may want to keep it private to maintain a competitive advantage.
  
  Safety and responsibility: Releasing the pretrained model could potentially lead to its misuse for generating harmful or misleading content. By releasing only the instruction-tuned version, Meta might be trying to mitigate these risks.
  
  Ultimately, the decision to release only the instruction-tuned model reflects Meta's strategic goals for Llama 3.3 and their approach to responsible AI development.
AFAIK it is still a tuning of llama 3[.1], the new Base models will come with the release of 4 and the "Training Data" section of both the model cards is basically a copy paste.

Honestly I didn't even consider the fact they would not be giving Base models anymore before reading this post and, even now, I don't think this is the case. I went to search the announcements posts to see if there was something that could make me think about it being a possibility, but nothing came out.

It is true that they released Base models with 3.2, but there they had added a new projection layer on top of that, so the starting point was actually different. And 3.1 did supersede 3...

So I went and checked the 3.3 hardware section and compare it with the 3 one, the 3.1 one and the 3.2 one.

3 3.1 3.2 3.3

7.7M GPU hours 39.3M GPU hours 2.02M GPU hours 39.3M GPU hours

So yeah, I'm pretty sure the base of 3.3 is just 3.1 and they just renamed the model in the card and added the functional differences. The instruct and base versions of the models have the same numbers in the HW section, I'll link them at the end just because.

All these words to say: I've no real proof, but I will be quite surprised if they will not release the Base version of 4.

Mark Zuckerberg on threads

Link to post on threads
zuck a day ago
Last big AI update of the year:
•⁠ ⁠Meta AI now has nearly 600M monthly actives
•⁠ ⁠Releasing Llama 3.3 70B text model that performs similarly to our 405B
•⁠ ⁠Building 2GW+ data center to train future Llama models
Next stop: Llama 4. Let's go! 🚀

Meta for Developers

Link to post on facebook
Today we're releasing Llama 3.3 70B which delivers similar performance to Llama 3.1 405B allowing developers to achieve greater quality and performance on text-based applications at a lower price point.
Download from Meta: --

3

70B

70B Instruct

3.1

70B

70B Instruct

3.2

90B Vision

90B Vision Instruct

3.3

70B Instruct

Small note: I did delete my previous post because I had messed up the links, so I had to recheck them, whoops
On Huggingface, someone said it's still the same base model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/discussions/10

And I remember watching some interview with Zuckerberg this year, where he said releasing the models to the public, including base models, is what he wants and part of their strategy.
- Thank you so much, that exactly answers my question with the official response (that guy works at Meta) that confirms it's the same base model!
  
  I was concerned primarily because in the release notes it strangely didn't mention it anywhere, and I thought it would have been important enough to mention.
AFAIK it is still a tuning of llama 3[.1], the new Base models will come with the release of 4 and the "Training Data" section of both the model cards is basically a copy paste.

Honestly I didn't even consider the fact they would not be giving Base models anymore before reading this post and, even now, I don't think this is the case. I went to search the announcements posts to see if there was something that could make me think about it being a possibility, but nothing came out.

It is true that they released Base models with 3.2, but there they had added a new projection layer on top of that, so the starting point was actually different. And 3.1 did supersede 3...

So I went and checked the 3.3 hardware section and compare it with the 3 one, the 3.1 one and the 3.2 one.

3 3.1 3.2 3.3

7.7M GPU hours 39.3M GPU hours 2.02M GPU hours 39.3M GPU hours

So yeah, I'm pretty sure the base of 3.3 is just 3.1

All these words to say: I've no real proof, but I will be quite surprised if they will not release the Base version of 4.

Mark Zuckerberg on threads

Link to post on threads
zuck a day ago
Last big AI update of the year:
•⁠ ⁠Meta AI now has nearly 600M monthly actives
•⁠ ⁠Releasing Llama 3.3 70B text model that performs similarly to our 405B
•⁠ ⁠Building 2GW+ data center to train future Llama models
Next stop: Llama 4. Let's go! 🚀

Meta for Developers

Link to post on facebook
Today we're releasing Llama 3.3 70B which delivers similar performance to Llama 3.1 405B allowing developers to achieve greater quality and performance on text-based applications at a lower price point.
Download from Meta: https://bit.ly/4g2WWOp

3	3.1	3.2	3.3
7.7M GPU hours	39.3M GPU hours	2.02M GPU hours	39.3M GPU hours

3	3.1	3.2	3.3
7.7M GPU hours	39.3M GPU hours	2.02M GPU hours	39.3M GPU hours

You've viewed 6 comments.