There are two variants:
- AVX512_VNNI (Tiger Lake, Rocket Lake): 512-bit/256-bit/128-bit
- AVX_VNNI (upcoming Alder Lake): 256-bit/128-bit
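For 8-bit integer dot products, VNNI replaces a three-instruction SIMD sequence (`vpmaddubsw`, `vpmaddwd`, `vpaddd`) with a single instruction (`vpdpbusd`).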
It seems that we can use it inside MultiplyGroup().
https://software.intel.com/content/www/us/en/develop/articles/intel-advanced-vector-extensions-512-intel-avx-512-new-vector-neural-network-instruction.html
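
To make the "three instructions into one" point concrete, here is a rough scalar model in C of what a single 32-bit lane computes in each case. It is only a sketch: the function names (`vnni_dpbusd_lane`, `legacy_lane`, `saturate_i16`) are made up for illustration, it does not use real intrinsics, and it says nothing about how MultiplyGroup() is actually written. One difference worth noting is that the legacy sequence saturates the intermediate 16-bit pair sum, while `vpdpbusd` accumulates the four products in 32 bits without intermediate saturation.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit value to the int16_t range,
   as VPMADDUBSW does for each pair sum. */
int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of one 32-bit lane of VPDPBUSD:
   acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
   where a holds unsigned bytes and b holds signed bytes.
   No intermediate saturation; the final add wraps around. */
int32_t vnni_dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return (int32_t)((uint32_t)acc + (uint32_t)sum);
}

/* Scalar model of the same lane with the pre-VNNI sequence:
   1. VPMADDUBSW: multiply u8*s8, add adjacent pairs, saturate to int16
   2. VPMADDWD with a vector of ones: add adjacent int16 pairs into int32
   3. VPADDD: add the result to the accumulator */
int32_t legacy_lane(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int16_t lo = saturate_i16((int32_t)a[0] * b[0] + (int32_t)a[1] * b[1]);
    int16_t hi = saturate_i16((int32_t)a[2] * b[2] + (int32_t)a[3] * b[3]);
    int32_t sum = (int32_t)lo * 1 + (int32_t)hi * 1;
    return (int32_t)((uint32_t)acc + (uint32_t)sum);
}
```

For typical inputs the two models agree, but they can differ when a pair sum exceeds the int16 range (for example, two products of 255 * 127), so results may not be bit-identical between a VNNI path and a fallback path.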