I have an implementation put together here: https://github.com/tommyyliu/lut_mm/blob/master/src/ternary_mm_avx512.cpp
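The table holds every possible dot product of a group of 5 activations with ternary weights, instead of a group of just 3. That makes the table bigger (3^5 = 243 entries rather than 3^3 = 27), but each lookup now covers 5 weights instead of 3, and you can see a pretty substantial speedup.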
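To make the idea concrete, here is a rough scalar sketch of the 5-activation lookup. It is not the AVX-512 code from the repo: the names (`build_table`, `pack_weights`, `lut_dot`), the int8 activations, and the base-3 packing of weights into a single byte are all assumptions I made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar sketch; names and data layout are assumptions,
// not taken from the linked AVX-512 implementation.
constexpr int kGroup = 5;        // activations covered by one lookup
constexpr int kTableSize = 243;  // 3^5 possible ternary weight patterns

// Build the table for one group of 5 activations.
// Entry idx holds sum_i w_i * a[i], where w_i = (i-th base-3 digit of idx) - 1.
std::array<int32_t, kTableSize> build_table(const int8_t* a) {
    std::array<int32_t, kTableSize> table{};
    for (int idx = 0; idx < kTableSize; ++idx) {
        int rest = idx;
        int32_t sum = 0;
        for (int i = 0; i < kGroup; ++i) {
            int w = rest % 3 - 1;  // digit 0,1,2 maps to weight -1,0,+1
            sum += w * a[i];
            rest /= 3;
        }
        table[idx] = sum;
    }
    return table;
}

// Pack 5 ternary weights (-1, 0, +1) into one base-3 index in [0, 242].
uint8_t pack_weights(const int8_t* w) {
    int idx = 0;
    for (int i = kGroup - 1; i >= 0; --i) {
        idx = idx * 3 + (w[i] + 1);
    }
    return static_cast<uint8_t>(idx);
}

// Dot product of n activations with n ternary weights, given as n/5 packed bytes.
// One table lookup replaces 5 multiply-adds.
int32_t lut_dot(const int8_t* a, const uint8_t* packed, int n) {
    assert(n % kGroup == 0);
    int32_t acc = 0;
    for (int g = 0; g < n / kGroup; ++g) {
        const auto table = build_table(a + g * kGroup);
        acc += table[packed[g]];
    }
    return acc;
}
```

In this sketch the table is rebuilt for every group inside `lut_dot`, which only pays off in a real matmul where each activation group's table is built once and then reused across all the weight rows.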