I was recently going through some old episodes of Software Engineering Radio when I came across one featuring Casey Muratori, in which he discusses his February 2023 video, "‘Clean’ Code, Horrible Performance". I was actually already aware ...
Loop unrolling is not really the source of the speedup; autovectorization is. Unrolling often helps the autovectorizer, but it is not enough on its own, especially with floating-point numbers. To vectorize a reduction, the accumulation operation needs to be associative, and floating-point addition is not: (x + y) + z is not always equal to x + (y + z). Autovectorizing the code would therefore change its semantics, and by default the compiler is not allowed to do that. It will only reorder the additions if you explicitly relax the rules, for example with -ffast-math on GCC or Clang.
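To make the associativity point concrete, here is a minimal sketch in C++. It is not code from the video; the function names `sum_sequential` and `sum_four_lanes` are made up for illustration, and the second one only emulates in scalar code what a 4-lane SIMD reduction would do (four independent partial sums that are combined at the end).

```cpp
#include <cstddef>
#include <vector>

// Strict left-to-right sum: exactly the order the source code asks for.
float sum_sequential(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Hypothetical scalar emulation of a 4-lane vectorized reduction:
// element i is accumulated into lane i % 4, and the lanes are
// combined only after the main loop.
float sum_four_lanes(const std::vector<float>& v) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        lane[0] += v[i + 0];
        lane[1] += v[i + 1];
        lane[2] += v[i + 2];
        lane[3] += v[i + 3];
    }
    // Horizontal add of the four partial sums.
    float acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    // Leftover elements when the size is not a multiple of 4.
    for (; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}
```

For the input {1e30f, 1.0f, -1e30f, 1.0f}, `sum_sequential` returns 1 while `sum_four_lanes` returns 0, because adding 1 to 1e30 is lost to rounding and the two versions lose it at different points. Neither matches the exact mathematical sum of 2, but the point is that the two orderings disagree, which is why the compiler cannot silently swap one for the other.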