Use cublas_ops for a 15-20% speedup on matmuls, with some loss of quality #389
So cublas_ops in ComfyUI is actually a Python package: https://github.com/aredden/torch-cublas-hgemm

It uses custom CUDA kernels to speed up matmuls and does wonders for FP16 weights. On my 2080 Ti 22G it actually flips from s/it to it/s when using the BF16 weights cast to FP16. It's not as dramatic here, only about 1.40 s/it down to 1.12 s/it, but in practical terms that turns a 14.x second generation (with LoRA) into a 10.3 second one.
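For anyone who hasn't clicked through to the repo, here's roughly what the drop-in looks like. This is a minimal sketch assuming the `CublasLinear` class the repo describes is a drop-in replacement for `nn.Linear`; the exact import and constructor may differ, so check its README before copying this.

```python
import torch
import torch.nn as nn
# Assumption: cublas_ops exposes CublasLinear as an nn.Linear-compatible module.
from cublas_ops import CublasLinear

ref = nn.Linear(4096, 4096).half().cuda()
fast = CublasLinear(4096, 4096).half().cuda()
fast.load_state_dict(ref.state_dict())  # same weights, different matmul path

x = torch.randn(2, 77, 4096, dtype=torch.float16, device="cuda")
# Same math as ref(x), but the matmul is routed through the custom cuBLAS hgemm kernels.
print((fast(x) - ref(x)).abs().max())
```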
Everything comes with a price, and FP16 isn't always the best quality. Then again, if you're already running sage attention, that's int8, and that isn't lossless either.
Being a shitty developer, I whipped this up with some help, for your perusal. If you want faster gens on meh hardware, it's worth looking into. I probably did some stupid things. My gens in SillyTavern aren't meant to be a masterpiece or a deliverable, so I'd rather not wait.
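The gist of the patch is just walking the loaded model and swapping its FP16 linears for the cublas-backed ones. This is a rough sketch, not the exact code in this PR; the helper name is made up and `CublasLinear` is assumed from the repo above:

```python
import torch
import torch.nn as nn

def swap_linears_for_cublas(model: nn.Module) -> nn.Module:
    """Replace FP16 nn.Linear modules with cublas-backed ones, in place (sketch)."""
    try:
        from cublas_ops import CublasLinear  # assumption: class name per the repo
    except ImportError:
        return model  # package not installed, leave the model untouched

    for parent in model.modules():
        for name, child in list(parent.named_children()):
            # Only touch plain FP16 linears; everything else keeps the normal path.
            if type(child) is nn.Linear and child.weight.dtype == torch.float16:
                new = CublasLinear(child.in_features, child.out_features,
                                   bias=child.bias is not None)
                new.load_state_dict(child.state_dict())
                new = new.to(device=child.weight.device, dtype=torch.float16)
                setattr(parent, name, new)
    return model
```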
The model I tested this with is z-image Q8 and, of course, a GGUF Qwen. We know those two will work, and dare I say it appears to run faster and with better quality than what I got out of Nunchaku (on the 2080 Ti). Of course, I am also compiling with torch.
Edit: this does not work for T5. It dequants to F32 and doesn't survive the casting. My only viable idea is to detect F32/BF16 weights and skip cublas_ops for those modules.
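Something like this guard is what I'm picturing for that (sketch only, the function name is made up):

```python
import torch
import torch.nn as nn

def can_use_cublas(module: nn.Module) -> bool:
    """Return True only for FP16 CUDA linears; F32/BF16 (e.g. dequanted T5) skip cublas_ops."""
    if not isinstance(module, nn.Linear):
        return False
    w = module.weight
    # The hgemm path only helps (and only works cleanly) for FP16 weights,
    # so skip F32 and BF16 instead of force-casting them.
    return w.is_cuda and w.dtype == torch.float16
```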