A response to the blog post "{n} times faster than C". Our final program achieved a speedup of 128x (36 GiB/s throughput) by reformulating the problem and leveraging SIMD intrinsics.
The Reddit thread has some interesting discussion, including a solution with no SIMD intrinsics that is more than 200x faster: it uses `.chunks_exact()` and lets the compiler auto-vectorize the loop.
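To make the `.chunks_exact()` idea concrete, here is a minimal sketch of the general shape such a solution can take. It is not the code from the Reddit thread. It assumes the task from the original blog post is to add 1 for every `s` byte and subtract 1 for every `p` byte; the function name `count_chunked` and the chunk size of 64 are hypothetical choices made for illustration.

```rust
/// Hypothetical helper (not from the Reddit thread): returns
/// (number of b's' bytes) - (number of b'p' bytes) in `input`.
/// Assumption: this matches the counting task in the original blog post.
pub fn count_chunked(input: &[u8]) -> i64 {
    // Illustrative chunk size. It stays below 128 so that the per-chunk
    // i8 accumulator (range -64..=64 here) can never overflow.
    const CHUNK: usize = 64;

    let mut total: i64 = 0;

    let chunks = input.chunks_exact(CHUNK);
    // Bytes left over after the last full chunk (fewer than CHUNK of them).
    let remainder = chunks.remainder();

    for chunk in chunks {
        // Fixed-length, branch-free inner loop over narrow integers:
        // a shape the compiler has a good chance of auto-vectorizing.
        let mut acc: i8 = 0;
        for &b in chunk {
            acc += (b == b's') as i8 - (b == b'p') as i8;
        }
        total += acc as i64;
    }

    // Handle the tail with a plain scalar loop.
    for &b in remainder {
        total += (b == b's') as i64 - (b == b'p') as i64;
    }

    total
}
```

Calling `count_chunked(b"ssp")` returns 1. Whether the inner loop is actually vectorized depends on the compiler version, optimization level, and target features, so it is worth checking the generated assembly rather than assuming it.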