DeepSeek R1 just got a 2X speed boost, the code for the boost was written by R1 itself!

53 comments
  • I'm going to say 2 things that are going to be very unpopular but kinda need to be heard.

    1. DeepSeek is turning this place into /r/OpenAI but red, which is incredibly lame
    2. If LLMs are significantly helping your development workflow, you are doing grunt work, you're not improving your skills, and you're not working on problems whose difficulty goes much beyond the tech equivalent of multiplication-table recall.

    This optimization is actually grunt work; it's not a new discovery, it's simply using SIMD instructions on matrix operations, something that should have been done in the first place, either by hand or by a compiler.
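
    A toy illustration of the "by a compiler" half of that claim (a sketch, not the actual llama.cpp kernel): a plain inner-product loop like the one below is the kind of code GCC and Clang will usually auto-vectorize into SIMD on their own.

    ```c++
    // Toy example, not the actual llama.cpp code: a plain inner product.
    // Compiled with -O3 -ffast-math (the reduction needs permission to be
    // reordered), GCC and Clang will typically emit SIMD for this loop.
    float dot(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            sum += a[i] * b[i];
        }
        return sum;
    }
    ```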

    • The reality is that most code is very boring, and a lot of optimizations are the result of doing really basic things like this. A model being able to look through the code, notice patterns, and then point out these kinds of obvious improvements is in fact very useful. It's no different from using a profiler to find bottlenecks. Having done development for over two decades, I don't feel like combing through code to find these kinds of things is a really good use of my time, or that it's improving my skills in any way.

      • This type of tooling isn't new and doesn't require AI models. Performance linters exist in many languages: rubocop-performance for Ruby, perflint in Python, eslint perf rules, etc. For C++, clang-tidy and cpp-perf exist (concrete example at the end of this comment).

        The only reason LLMs are in this space is because there is a lack of good modern tooling in many languages. Jumping straight to LLMs is boiling the ocean (literally and figuratively).

        Not only that, but if we're really gonna argue that "most code is very boring", that already negates your premise: most boring code isn't highly perf-sensitive or unique enough to be treated individually through LLMs. Needing to point out SIMD instructions in your C++ code directly basically shows that either your compiler toolchain sucks or you're writing your code in such a "clever" way that it isn't getting picked up. This is an optimization scenario from 1999.

        Likewise, if you're not looking through the code, you're not actually understanding where the performance gaps are, or whether the LLM is introducing new ones by generating sub-optimal code. Sometimes the machine spirits react to the prayer protocols and sometimes they don't; that's the level of technology you're arguing at. These aren't well-defined, repeatable performance transformations being applied. Once your system is full of this kind of junk, you won't actually understand what's going on or how things practically work.

        Standard perf linters already aren't side-effect free in some cases, but at least they publish their side effects. LLMs cannot do this by comparison. That's software engineering: it's mostly risk management and organization. Yes, it's boring.
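
        For a concrete sense of what those linters already catch, here's a generic illustration (not anything from llama.cpp): clang-tidy's performance-unnecessary-value-param check flags the parameter below and suggests the const-reference fix.

        ```c++
        #include <string>
        #include <vector>

        // clang-tidy (performance-unnecessary-value-param) flags `haystack`:
        // the vector is copied on every call even though it is only read.
        int count_hits(std::vector<std::string> haystack, const std::string &needle) {
            int hits = 0;
            for (const auto &s : haystack) {
                if (s == needle) ++hits;
            }
            return hits;
        }
        // Suggested fix: take `const std::vector<std::string> &haystack` instead,
        // which avoids the copy without changing behaviour.
        ```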

  • It is 2025

    AI writes code to update itself

    I still have to load the dishwasher by hand

    I still have to change the baby's diapers

    I still have to go to work tomorrow

    Do things ever happen?

    Some say nothing ever happens.

    Others argue that everything always happens.

    I love the real movement.

  • the code for the boost was written by R1 itself!

    Pretty neat, but this kind of thing will impress me a lot more when it's genuinely new and creative output, not just the result of being prompted to optimize an existing routine in a prescribed way (using SIMD instructions to calculate inner products).
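
    For the curious, the "prescribed way" looks roughly like the sketch below: an AVX2/FMA inner product over plain floats. This is only an illustration, not the actual R1-generated code, which operates on quantized integer blocks.

    ```c++
    #include <immintrin.h>

    // Rough AVX2/FMA sketch of a SIMD inner product (compile with -mavx2 -mfma).
    // For brevity, n is assumed to be a multiple of 8. Not the actual kernel.
    float dot_avx2(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb, 8 lanes at a time
        }
        // Horizontal sum of the 8 accumulator lanes.
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int i = 0; i < 8; ++i) sum += lanes[i];
        return sum;
    }
    ```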

    • I'm still impressed, because it was able to look at the existing solution, recognize a bottleneck, and write the code to address it. Most code is very boring; you don't need genius solutions in it. And this could be of huge help for developers as well: you could have it analyze code and suggest where improvements can be made, which could be faster than profiling things.

  • My question is: who is naming these functions 'qX_K_q8_K'?

    • This is a quantization function. It's a fairly "math brained" name, I agree, but the function is called qX_K_q8_K because it quantizes a value with a quantization index of X (unknown) to one with a quantization index of 8 (bits), which correlates with the memory usage. The 0 vs K portions are how it does rounding: 0 means it rounds by equal distribution (without offset), and K means it creates a distribution that is more fine-grained around the most common values and rougher around the least common ones. E.g. if I have a data set with a lot of values between 4 and 5 but not a lot of 10s, I might have, let's say, 10 brackets between 4 and 5 but only 3 between 5 and 10.

      Basically it's a lossy compression of a data set into a specific enumeration (which roughly correlates with size): given, say, 1,000,000 numbers from 1 to 1,000,000, it's a way of putting their values into a smaller range of numbers based on the q level (rough sketch at the end of this comment). How using different quantization functions affects the output of models is more voodoo than anything else. You get better "quality" output from a higher memory footprint, but quality is a complex metric and doesn't necessarily map to factual accuracy in the output, just statistical correlation with the model's data set.

      An example of a common quantizer is an analog-to-digital converter: it takes continuous values from a wave that ranges from 0 to 1 and transforms them into discrete digital values at a specific sample rate.

      Taking a 32-bit float and copying the value into another 32-bit float is an identity quantizer.
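
      Here's the rough sketch mentioned above, assuming a made-up block layout: uniform ("0"-style, no offset) 8-bit quantization of a block of 32 floats sharing one scale. It only illustrates the idea; it is not the actual ggml/llama.cpp code.

      ```c++
      #include <algorithm>
      #include <cmath>
      #include <cstdint>

      // Toy block: one scale plus 32 signed 8-bit values (layout is made up).
      struct BlockQ8 {
          float  scale;
          int8_t q[32];
      };

      // Quantize 32 floats: map [-amax, amax] onto the integer range [-127, 127].
      BlockQ8 quantize_block(const float *x) {
          float amax = 0.0f;
          for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));

          BlockQ8 out;
          out.scale = amax / 127.0f;
          const float inv = out.scale != 0.0f ? 1.0f / out.scale : 0.0f;
          for (int i = 0; i < 32; ++i)
              out.q[i] = (int8_t) std::lround(x[i] * inv);  // round to nearest bucket
          return out;
      }

      // Dequantize: the lossy inverse, values are recovered only approximately.
      void dequantize_block(const BlockQ8 &b, float *x) {
          for (int i = 0; i < 32; ++i) x[i] = b.scale * b.q[i];
      }
      ```

      The K-style variants described above refine this by spending more of the available buckets where the values actually cluster, instead of spacing them evenly.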

    • C devs love cryptic names :)
