Description
The intrinsic-test crate runs incredibly slowly, and takes a long time both on CI and locally. I'd like to speed that up.
The code also shows signs of its age, and I think we can do a much better job today. Based on some rough profiling, the main bottleneck appears to be the compilation of 3K+ C++ files into executables. On my machine each file takes roughly 280ms to compile. By using C instead, and compiling to an object file rather than an executable, I'm able to get a ~4X speedup.
My idea is to emit C files like this (we emit many C files because clang won't parallelize its workload by itself):
#include <arm_neon.h>
#include <arm_acle.h>
#include <arm_fp16.h>

const uint32_t a_vals[] = {
    0x0,
    0x800000,
    0x3effffff,
    0x3f000000,
    // ...
};

const uint8_t b_vals[] = {
    0x0,
    0x1,
    0x2,
    0x3,
    // ...
};

uint32_t __crc32b_output[20] = {};

extern uint32_t *c___crc32b_generate(void) {
    for (int i = 0; i < 20; i++) {
        __crc32b_output[i] = __crc32b(a_vals[i], b_vals[i]);
    }
    return __crc32b_output;
}
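To make the "many C files" part concrete, here is a rough sketch of how the Rust driver could split the intrinsics into groups and start one clang process per generated file. Everything here is an assumption for illustration: `split_into_chunks`, `compile_command` and `compile_all` are hypothetical names, not existing intrinsic-test code, and the real invocation would also need the target and feature flags, which are left out.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Split `items` into at most `jobs` roughly equal groups, one group per
/// generated C file. (Hypothetical helper, not part of intrinsic-test.)
fn split_into_chunks<T>(items: &[T], jobs: usize) -> Vec<&[T]> {
    if items.is_empty() {
        return Vec::new();
    }
    let jobs = jobs.max(1);
    // Round up so that we never produce more than `jobs` groups.
    let chunk_len = (items.len() + jobs - 1) / jobs;
    items.chunks(chunk_len).collect()
}

/// Build (but do not run) a clang invocation that compiles one C file to an
/// object file next to it. Target and feature flags are omitted here.
fn compile_command(src: &Path) -> Command {
    let mut cmd = Command::new("clang");
    cmd.arg("-c").arg(src).arg("-o").arg(src.with_extension("o"));
    cmd
}

/// Start one clang process per source file so they run concurrently, then
/// wait for all of them. Returns `Ok(true)` only if every compile succeeded.
fn compile_all(sources: &[PathBuf]) -> std::io::Result<bool> {
    let children = sources
        .iter()
        .map(|src| compile_command(src).spawn())
        .collect::<std::io::Result<Vec<_>>>()?;

    let mut all_ok = true;
    for mut child in children {
        all_ok &= child.wait()?.success();
    }
    Ok(all_ok)
}
```

Choosing the number of groups to match the number of cores keeps the process count bounded while still letting the compiles run side by side.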
Then for Rust, we can generate all the tests in one binary (not sure if splitting into files is useful there; it might be), and link it together with the C object files. The final check of the output can then happen in Rust (calling the Rust and C versions of each test and comparing the results). Crucially, this means we only need to compile the formatting logic once (and in Rust, so it's trivially consistent).
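The comparison on the Rust side could look roughly like the sketch below. It is only an illustration under some assumptions: in the real setup the C function would be declared in an `extern "C"` block and resolved against the object file at link time, whereas here a Rust function with the C ABI stands in for it so the example is self-contained. `placeholder_op`, `rust_generate`, `compare` and `check_intrinsic` are hypothetical names, and `placeholder_op` is a stand-in for the intrinsic under test, not a real intrinsic.

```rust
use std::slice;

/// Number of test inputs in this sketch (the C example above uses 20).
const LEN: usize = 4;

// In the proposed setup this would instead be something like
//     extern "C" { fn c___crc32b_generate() -> *const u32; }
// provided by the C object file. The static and function below are
// stand-ins so that the sketch builds without a C toolchain.
static C_SIDE_OUTPUT: [u32; LEN] = [1, 3, 5, 7];

extern "C" fn c_generate_stand_in() -> *const u32 {
    C_SIDE_OUTPUT.as_ptr()
}

/// Stand-in for the intrinsic under test (assumption: NOT a real intrinsic).
fn placeholder_op(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// Rust-side generator: applies the operation to the same inputs as C.
fn rust_generate() -> [u32; LEN] {
    let a_vals = [0u32, 1, 2, 3];
    let b_vals = [1u32, 2, 3, 4];
    std::array::from_fn(|i| placeholder_op(a_vals[i], b_vals[i]))
}

/// Compare both outputs element by element and report the first mismatch.
/// The message formatting lives only here, on the Rust side.
fn compare(name: &str, rust: &[u32], c: &[u32]) -> Result<(), String> {
    if rust.len() != c.len() {
        return Err(format!("{name}: length {} vs {}", rust.len(), c.len()));
    }
    for (i, (r, cv)) in rust.iter().zip(c).enumerate() {
        if r != cv {
            return Err(format!("{name}: mismatch at index {i}: rust={r:#x}, c={cv:#x}"));
        }
    }
    Ok(())
}

fn check_intrinsic() -> Result<(), String> {
    let rust = rust_generate();
    // SAFETY: the stand-in returns a pointer to a static array of LEN values.
    let c = unsafe { slice::from_raw_parts(c_generate_stand_in(), LEN) };
    compare("placeholder_op", &rust, c)
}
```

Because both sides hand back raw values and only `compare` produces text, there is no second formatting implementation that could drift out of sync.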
cc @adamgemmell @Jamesbarford if you have thoughts on this idea, or other ways to speed up this program.