Title: TL1/TL2 codegen fails for any configuration with bm=16 on Windows 11

When generating TL1/TL2 kernels with bm=16, every configuration fails at one of two stages:
(1) while running codegen_tl1.py / codegen_tl2.py, or
(2) during the CMake build (llama-bench fails to build).

The failure is consistent across all BM/BK settings.
Other bm values (e.g., bm=32, bm=64, bm=128) work normally.
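
For anyone trying to reproduce this, here is a minimal sketch of a sweep over bm values with BM and BK held fixed. It only builds and prints the command lines; it does not run them. The script path, the flag names (--model, --BM, --BK, --bm), the single-value argument form, the placeholder model name, and the BM=256 / BK=128 values are all assumptions for illustration and may need adjusting to match the actual codegen scripts.

```python
import shlex
import sys

# bm values named in this report: 16 fails, the others work.
BM_INNER_VALUES = [16, 32, 64, 128]


def build_codegen_cmd(script, model, BM, BK, bm):
    """Return the argv list for one codegen run.

    The flag names below are assumptions, not confirmed against the scripts.
    """
    return [
        sys.executable, script,
        "--model", model,
        "--BM", str(BM),
        "--BK", str(BK),
        "--bm", str(bm),
    ]


def build_sweep(script, model, BM, BK, bm_values=BM_INNER_VALUES):
    """Build one command per bm value, keeping BM and BK fixed."""
    return [build_codegen_cmd(script, model, BM, BK, bm) for bm in bm_values]


if __name__ == "__main__":
    # Hypothetical path, model placeholder, and BM/BK values.
    for cmd in build_sweep("utils/codegen_tl1.py", "<model-name>", 256, 128):
        print(shlex.join(cmd))
```

Running each printed command in turn, then the CMake build, should show which of the two stages fails for bm=16 on a given BM/BK pair.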