
Conversation

@Ph0rk0z Ph0rk0z commented Dec 21, 2025

So the cublas ops used in ComfyUI is actually a Python package: https://github.com/aredden/torch-cublas-hgemm

It uses custom CUDA kernels to speed up matmuls and does wonders for FP16 weights. On my 2080 Ti 22G I actually flip from s/it to it/s using the BF16-cast weights. Not so dramatic here: only about 1.40 s/it down to 1.12 s/it. In practical terms it turns a 14.x-second generation into a 10.3-second one, with LoRA.
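For context, a minimal sketch of the kind of drop-in swap this enables, assuming `cublas_ops` exposes a `CublasLinear` layer that mirrors `nn.Linear`'s constructor (the class name and signature are assumptions on my part, not this PR's code; check the repo's README):

```python
import torch
import torch.nn as nn

# Assumed import: aredden/torch-cublas-hgemm installs as `cublas_ops`.
# CublasLinear as a drop-in nn.Linear replacement is an assumption here;
# confirm the exact class name against the repo's README.
from cublas_ops import CublasLinear


def swap_linears_for_cublas(module: nn.Module) -> nn.Module:
    """Recursively replace FP16 nn.Linear layers with the cublas-backed one."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and child.weight.dtype == torch.float16:
            new = CublasLinear(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            new = new.to(device=child.weight.device, dtype=torch.float16)
            new.load_state_dict(child.state_dict())
            setattr(module, name, new)
        else:
            swap_linears_for_cublas(child)
    return module
```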

Everything comes with a price, and FP16 is not always the best quality. Then again, if you're already doing sage attention, that's INT8, and no, that isn't lossless either.

Being a shitty developer, I whipped this up with some help, for your perusal. If you want faster gens on meh hardware, it's worth looking into. I probably did some stupid things. My gens in SillyTavern aren't meant to be a masterpiece or a deliverable, so I'd rather not wait on them.

The models I tested this with are z-image Q8 and, of course, a GGUF Qwen. We know those two work, and dare I say it appears to run faster and with better quality than what I got out of nunchaku (2080 Ti). Of course I'm also compiling with torch.
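The torch.compile part is just the stock API, nothing specific to this PR:

```python
import torch
import torch.nn as nn


def compile_diffusion_model(model: nn.Module) -> nn.Module:
    # Stock torch.compile; "max-autotune" spends longer warming up in
    # exchange for faster steady-state kernels, which suits repeated
    # sampling steps.
    return torch.compile(model, mode="max-autotune")
```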

Edit: this does not work for T5. It dequants to F32 and doesn't survive the casting. My only viable idea is to detect F32/BF16 weights and skip cublas_ops for those.
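One way that detect-and-skip could look, purely as a sketch of the idea (the helper name is made up):

```python
import torch


def should_use_cublas(weight: torch.Tensor) -> bool:
    # The half-precision GEMM path only pays off for genuine FP16 CUDA weights.
    # Anything that dequants to F32 (like T5 here) or stays BF16 falls back to
    # the stock matmul instead of being force-cast through FP16.
    return weight.is_cuda and weight.dtype == torch.float16
```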
