Conversation

@joeldushouyu
Contributor

@joeldushouyu joeldushouyu commented Dec 17, 2025

Following the discussion regarding the refactor idea here, I have re-evaluated the approach. Upon reviewing the current implementations for activation, unary, and binary functions, I observed that the existing code heavily relies on L2 prefetching while under-utilizing the VTCM and DMA.

While L2 prefetching offers some benefits, it is limited by two main factors:

  1. The L2 cache is significantly smaller than the VTCM.
  2. The L2 cache is shared, since it backs both the L1 instruction cache and the L1 data cache.

This PR optimizes the code (using GELU as the initial implementation) by shifting the workload to heavily utilize the VTCM and DMA, thereby freeing up the L2 cache for the L1 instruction and data caches.

Optimization Strategy

Instead of relying on L2 prefetching, this implementation employs a DMA ping-pong buffering approach:

  1. Fetch: DMA moves data from DDR into VTCM buffer A.
  2. Compute: the DSP processes buffer A while DMA fetches the next chunk into buffer B.
  3. Write back: DMA moves processed data from VTCM back to DDR.

This allows for overlapping computation and memory transfer, resulting in significantly higher throughput.
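
For illustration, here is a minimal C sketch of the ping-pong loop under stated assumptions: dma_fetch, dma_store, dma_wait, dma_drain, and compute_gelu are hypothetical stand-ins, not the helpers actually used in this PR.

    #include <stddef.h>
    #include <stdint.h>

    // Hypothetical DMA helpers (stand-ins, not the htp-dma.h API):
    // dma_fetch starts an async DDR->VTCM copy, dma_store an async
    // VTCM->DDR copy, dma_wait blocks until the transfer targeting
    // `vtcm` has completed, dma_drain waits for all outstanding copies.
    extern void dma_fetch(uint8_t * vtcm, const uint8_t * ddr, size_t n);
    extern void dma_store(uint8_t * ddr, const uint8_t * vtcm, size_t n);
    extern void dma_wait(const uint8_t * vtcm);
    extern void dma_drain(void);
    extern void compute_gelu(uint8_t * dst, const uint8_t * src, size_t n);

    #define NBUFS 2 // ping-pong

    void gelu_pingpong(uint8_t * ddr_dst, const uint8_t * ddr_src,
                       uint8_t * src_spad[NBUFS], uint8_t * dst_spad[NBUFS],
                       size_t chunk, int n_chunks) {
        dma_fetch(src_spad[0], ddr_src, chunk); // prime: start chunk 0

        for (int n = 0; n < n_chunks; n++) {
            const int cur = n % NBUFS;
            const int nxt = (n + 1) % NBUFS;

            // kick off the next fetch first so it overlaps with compute
            if (n + 1 < n_chunks) {
                dma_fetch(src_spad[nxt], ddr_src + (size_t)(n + 1) * chunk, chunk);
            }

            dma_wait(src_spad[cur]); // chunk n is now in VTCM
            compute_gelu(dst_spad[cur], src_spad[cur], chunk);
            dma_store(ddr_dst + (size_t)n * chunk, dst_spad[cur], chunk);
            // a real implementation must also ensure the previous store
            // from dst_spad[cur] finished before reuse; the PR sequences
            // this with dummy DMA transactions (see the discussion below)
        }
        dma_drain(); // wait for the final write-back to reach DDR
    }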

Performance Benchmarks

The performance improvements are significant. Below is a comparison between the existing implementation and the new DMA/VTCM approach:

| Dimensions  | Old Implementation (L2 Prefetch) | New Implementation (DMA/VTCM) |
| ----------- | -------------------------------- | ----------------------------- |
| 4096 x 4096 | ~5000 µs                         | ~3800 µs                      |
| 4096 x 4304 | ~5300 µs                         | ~3700 µs                      |

NOTE: I used GELU as an example, but this approach can easily extend to other operations.

Unaligned Load Resolution:
This approach inherently solves the unaligned load issues encountered previously. Since data is fetched from DDR via DMA, the DMA engine ensures that data is stored into aligned addresses within the VTCM, even if the source data in DDR is unaligned.
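
For example, alignment can be handled once when sizing the VTCM scratchpad rows; the 128-byte HVX vector size and the helper below are illustrative assumptions, not the PR's exact code:

    #include <stddef.h>

    #define HVX_VECTOR_BYTES 128 // HVX vector width

    // Round a DDR row size up to a vector-aligned VTCM row size, so every
    // row the DMA deposits in VTCM starts on a vector boundary even when
    // the source rows in DDR are packed and unaligned.
    static inline size_t row_size_aligned(size_t row_size) {
        return (row_size + HVX_VECTOR_BYTES - 1) & ~(size_t)(HVX_VECTOR_BYTES - 1);
    }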

@max-krasnyansky

@joeldushouyu
Contributor Author

Obviously, I need to do more cleanup for this PR. But if you think this approach is good, I can probably apply it to all activation, unary, and binary (mul, add, sub) functions and similar? I think this should bring some improvement for all models, especially for long contexts.

@joeldushouyu joeldushouyu changed the title Hexagon gelu optimization ggml-hexagon: gelu optimization Dec 17, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 18, 2025
ctx->n_threads = n_hvx;
for (int i = 0; i < ctx->n_threads; i++) {
-    ctx->dma[i] = dma_queue_create(HTP_SPAD_SRC0_NROWS * 2);
+    ctx->dma[i] = dma_queue_create(ctx->vtcm_size); // NOTE: for now, whole vtcm size
Contributor

Nit: looks like the capacity here is the max DMA queue depth, which in the mul_mat case is HTP_SPAD_SRC0_NROWS * 2 for each thread. It's not the VTCM size for each thread, I think...

Contributor Author

Yes. But with this PR, each thread could theoretically fetch up to the whole VTCM size. Thus, I set it to the size of the VTCM instead.

Contributor

@chraac chraac Dec 18, 2025

Though theoretically it's not a size in bytes, but the count of rows that will be fetched into VTCM...

Contributor Author

Actually, as I take another look, it is neither a size in bytes nor a count of rows. It seems to me it is the maximum size of the DMA linked list that the queue can support. Maybe I should set it to something like total_thread * (4 * 2), where 4 stands for the number of inputs/outputs and 2 for the ping-pong buffers.

Collaborator

It's the number of DMA descriptors, aka the number of DMA transactions that can be enqueued at a time.
The DMA engine is per thread, i.e. each HW/SW thread literally has its own DMA HW state.
So there is no need to include the number of threads in the sizing.
Also, the size of the queue doesn't affect performance; it's just an array of preallocated descriptors.
So we can just set it to something reasonable like 64 (i.e. 64 outstanding transactions per thread), add an error message for a full queue in debug builds, and that should be good enough.
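
A minimal sketch of that sizing policy, assuming a fixed array of preallocated descriptors (the struct layout and names below are illustrative, not the actual htp-dma.h types):

    #include <assert.h>
    #include <stdint.h>

    #define DMA_QUEUE_CAPACITY 64 // outstanding transactions per thread

    typedef struct {
        uint64_t desc[DMA_QUEUE_CAPACITY]; // preallocated descriptors (opaque here)
        uint32_t head;                     // oldest in-flight descriptor
        uint32_t tail;                     // next free slot
    } dma_queue_sketch;

    // Grab the next free descriptor; in debug builds (NDEBUG unset), loudly
    // flag a full queue instead of silently overwriting in-flight entries.
    static inline uint64_t * dma_queue_alloc_desc(dma_queue_sketch * q) {
        assert(q->tail - q->head < DMA_QUEUE_CAPACITY && "dma queue full: raise DMA_QUEUE_CAPACITY");
        return &q->desc[q->tail++ % DMA_QUEUE_CAPACITY];
    }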

Contributor Author

see f3e9f6a

@max-krasnyansky
Collaborator

Good timing.
We just had an internal discussion on L2-prefetch vs DMA/VTCM, and the conclusion is that using DMA/VTCM should be better in most scenarios. And yes, it'd naturally fix most if not all alignment issues.
I didn't get a chance to read through the changes yet, but overall it's a step in the right direction.

@joeldushouyu
Contributor Author

Thanks for confirming!

Should I go ahead and apply the same optimization to all the other functions, or wait until you've had a thorough look and I've fixed the code format/style before applying it elsewhere?

@max-krasnyansky
Collaborator

Alright. Got the chance to review and play with it.
Looks good overall but I do have a bunch of suggestions :)

The current implementation of the ping-pong buffer looks quite "forced", i.e. lots of conditional logic.
Ideally, ping pong buffering should happen naturally by just sequencing things properly.

Why don't we do the following:

init:
  push : dst0 (nrows=0) // noop
  push : src0
  push : dst1 (nrows=0) // noop
  push : src1

iter0
  pop  : dst0
  pop  : src0
  proc : src0 -> dst0
  push : dst0
  push : src0

iter1
  pop  : dst1
  pop  : src1
  proc : src1 -> dst1
  push : dst1
  push : src1

iter2
  pop  : dst0
  pop  : src0
  proc : src0 -> dst0
  push : dst0
  push : src0
...

In other words, we can prefetch both buffers and use dummy DMA transactions (nrows=0) to properly sequence/interleave vtcm-to-ddr (dst flushing).
This way we do not need any conditional logic ("if first row", etc.) inside the loop, because the DMA queue is already primed.
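
A compact C rendering of that steady state (helper names and signatures are simplified assumptions, not the htp-dma.h API; note that past the end of the tensor the src prefetch degenerates into the same nrows=0 dummy transaction, which keeps every iteration's pops and pushes paired):

    #include <stdint.h>

    typedef struct dma_queue dma_queue; // opaque here; real type is in htp-dma.h

    // Assumed helper signatures (simplified): push enqueues an async
    // transfer where nrows = 0 is a no-op, pop blocks until the oldest
    // queued transaction completes, flush drains the whole queue.
    extern void dma_push_fetch(dma_queue * q, uint8_t * vtcm, const uint8_t * ddr, uint32_t nrows);
    extern void dma_push_flush(dma_queue * q, uint8_t * ddr, const uint8_t * vtcm, uint32_t nrows);
    extern void dma_pop(dma_queue * q);
    extern void dma_flush_all(dma_queue * q);
    extern void process_block(uint8_t * dst, const uint8_t * src, uint32_t nrows);

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    void op_primed(dma_queue * q, uint8_t * src_spad[2], uint8_t * dst_spad[2],
                   const uint8_t * ddr_src, uint8_t * ddr_dst,
                   uint32_t nrows, uint32_t row_sz, uint32_t block) {
        // prime: a no-op dst flush plus a src fetch per buffer, so the loop
        // below can unconditionally pop a (dst, src) pair each iteration
        for (uint32_t b = 0; b < 2; b++) {
            const uint32_t r = b * block;
            dma_push_flush(q, ddr_dst, dst_spad[b], 0); // noop
            dma_push_fetch(q, src_spad[b], ddr_src + (uint64_t)MIN(r, nrows) * row_sz,
                           r < nrows ? MIN(block, nrows - r) : 0);
        }

        for (uint32_t ir = 0; ir < nrows; ir += block) {
            const uint32_t b  = (ir / block) % 2;
            const uint32_t nr = MIN(block, nrows - ir);

            dma_pop(q); // buffer b's previous dst flush has completed
            dma_pop(q); // this block's src rows have landed in VTCM

            process_block(dst_spad[b], src_spad[b], nr);

            dma_push_flush(q, ddr_dst + (uint64_t)ir * row_sz, dst_spad[b], nr);

            // prefetch block N+2 into the buffer we just freed; past the
            // end this is the same nrows = 0 dummy-transaction trick
            const uint32_t pref = ir + 2 * block;
            dma_push_fetch(q, src_spad[b], ddr_src + (uint64_t)MIN(pref, nrows) * row_sz,
                           pref < nrows ? MIN(block, nrows - pref) : 0);
        }
        dma_flush_all(q); // drain the final dst flushes (and dummy fetches)
    }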

I did a quick-n-dirty update to this branch to test this idea and it works great.

$ HB=0 NHVX=1 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
...

$ adb logcat -e CDSP
...
CDSP0:[SU]: gelu-f32: 128x2x2x2 -> 128x2x2x2 : src0-spad-size 4194304 src1-spad-size 0 dst-spad-size 4194304
CDSP0:[SU]: dma-push: i 0 dst FDE97000 src FC400000 width 512 nrows 0 <<<
CDSP0:[SU]: dma-push: i 1 dst FC000000 src FDE95000 width 512 nrows 8
CDSP0:[SU]: dma-pop: i 0 dst FDE97000
CDSP0:[SU]: dma-pop: i 1 dst FC000000
CDSP0:[SU]: dma-push: i 2 dst FDE97000 src FC400000 width 512 nrows 8
CDSP0:[SU]: dma-pop: i 2 dst FDE97000
CDSP0:[SU]: gelu-f32 0/1: 128x2x2x2 (0:8) -> 128x2x2x2 usec 94

CDSP0:[SU]: gelu-f32: 5x7x11x13 -> 5x7x11x13 : src0-spad-size 4194304 src1-spad-size 0 dst-spad-size 4194304
CDSP0:[SU]: dma-push: i 3 dst FDE9AE80 src FC400000 width 20 nrows 0  <<<
CDSP0:[SU]: dma-push: i 4 dst FC000000 src FDE95000 width 20 nrows 1001
CDSP0:[SU]: dma-pop: i 3 dst FDE9AE80
CDSP0:[SU]: dma-pop: i 4 dst FC000000
CDSP0:[SU]: dma-push: i 5 dst FDE9AE80 src FC400000 width 20 nrows 1001
CDSP0:[SU]: dma-pop: waiting for DMA : 5
CDSP0:[SU]: dma-pop: waiting for DMA : 5
CDSP0:[SU]: dma-pop: i 5 dst FDE9AE80
CDSP0:[SU]: gelu-f32 0/1: 5x7x11x13 (0:1001) -> 5x7x11x13 usec 244

Note that I enabled logs in htp-dma.h and used NHVX=1 to make logs easier to parse.

I'd also replace all the code like if (ping_pong) { ptr = buf_ping ... } with ptr = buf[buf_idx].
No need for conditionals there, and later buf_idx could be >1 (i.e. basically what we have in the q4/q8 matmuls).
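
Something like this (names illustrative):

    // before: branchy ping-pong selection
    uint8_t * ptr = ping_pong ? buf_ping : buf_pong;

    // after: index into an array of buffers; buf_idx also generalizes to
    // queue depths > 2 (as in the q4/q8 matmuls)
    uint8_t * ptr = buf[buf_idx];
    buf_idx = (buf_idx + 1) % NBUFS;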

And yes, if you have the time & energy :) switching to using DMA + VTCM for all ops would be great.
We could iterate on and merge the GELU first though.

@joeldushouyu
Copy link
Contributor Author

joeldushouyu commented Dec 21, 2025

Thanks for the suggestion and feedback!
Yes, I do find the code ugly (it was intended as a proof of concept, so :)). I have incorporated your feedback in commit f590092; take a look and see if you like it @max-krasnyansky

Yes, I think it's a good idea to merge GELU first and have another PR that applies it to all the other code (which I can probably work on during/after the holidays :)).

@max-krasnyansky
Collaborator

I was going to add more comments and suggestions and realized that it's probably easier to just update the code.
Take a look at
https://github.com/qualcomm/llama.cpp/tree/hexagon-gelu-optimization
qualcomm@dccbcb2

Here is the summary:

  • It's better to allocate a single buffer for tracking dst/src pointers in the dma queue
  • Improved dma_queue_flush that we can use at the end of the loop (see the sketch after this list)
  • There is no need to use internal dma_queue index
  • We can prime the queue exactly as I described earlier and make things simple and readable
    (i.e. always push and pop a pair of DMA transactions and reuse buffer pointers)
  • The prefill and compute loops are similar and consistent with matmul implementation
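
As referenced in the list above, a sketch of the flush semantics being described (the pending-count helper and names are assumptions; the actual implementation in htp-dma.h may differ):

    typedef struct dma_queue dma_queue; // opaque handle from htp-dma.h

    extern void dma_queue_pop(dma_queue * q);     // blocks on the oldest transaction
    extern int  dma_queue_pending(dma_queue * q); // assumed: descriptors in flight

    // Drain every outstanding transaction so all dst rows have reached DDR
    // (and any dummy transactions are retired) before the op returns.
    static inline void dma_queue_flush_sketch(dma_queue * q) {
        while (dma_queue_pending(q)) {
            dma_queue_pop(q);
        }
    }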

It looks pretty clean now and can be used as a template for other ops.
Please pull & test on your setup, update this PR with my commits and we can merge it.

@joeldushouyu joeldushouyu force-pushed the hexagon-gelu-optimization branch from f590092 to dccbcb2 Compare December 22, 2025 14:25
@joeldushouyu
Contributor Author

joeldushouyu commented Dec 22, 2025

Thanks for the refactor! I apologize for misunderstanding your earlier suggestions—seeing your implementation really clarified what you were describing. The DMA queue handling is definitely much cleaner this way.

I’ve pulled the changes and tested everything on my end, and it's working great!

@joeldushouyu joeldushouyu marked this pull request as ready for review December 22, 2025 14:34
@joeldushouyu joeldushouyu requested a review from lhez as a code owner December 22, 2025 14:34
@max-krasnyansky
Collaborator

No apologies necessary. Happy to collaborate.

We need one more update though. I realized (it was a bit late last night) that I got the N+2 prefetch logic wrong.
I guess we never really hit that case (even with full gemma-3n).

The current logic does:

  • prefetch block N
  • prefetch block N+1
  • process block N
  • prefetch block N+1 << this is wrong, it needs to be N+2 here
  • process block N+1

Basically this

        // prefetch N+2 loop iteration if any
        const uint32_t pref_block = (ir + BLOCK * 2);
        if (pref_block < src0_end_row) {
            const uint32_t pref_block_size = MIN(BLOCK, src0_end_row - pref_block);
            dma_queue_push_ddr_to_vtcm(dma_queue,
                dma_make_ptr(src0_spad, data_src0 + (pref_block * src0_row_size)),
                src0_row_size_aligned, src0_row_size, pref_block_size);
        }

I also realized that we can simplify the inner block loop a bit.
Pushed another commit to https://github.com/qualcomm/llama.cpp/tree/hexagon-gelu-optimization

Can you please double-check it again? It looks good over here:

$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o GELU
 ...
   GELU(type=f32,ne_a=[128,2,2,2],v=0): OK
  GELU(type=f32,ne_a=[5,7,11,13],v=0): OK

$ M=../gguf/gemma-3n-E2B-it-Q4_0.gguf NDEV=1 D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -f ../surfing.txt -no-cnv
...
Sensible output 

@joeldushouyu
Contributor Author

joeldushouyu commented Dec 22, 2025

Thanks for the catch!

I can confirm it works as expected on my end. The test cases in ggml are probably not large enough to catch the bug.

I tried plotting the GELU curve from -100 to 100, and it does reflect the bug fix. Picture one is the older commit, and picture two is the newer commit.

[Screenshots: GELU curve plotted with the older commit vs. the newer commit]

@max-krasnyansky max-krasnyansky merged commit bf6bc3c into ggml-org:master Dec 22, 2025
67 of 68 checks passed