
Conversation

@ngdxzy commented Dec 16, 2025

Background

The current ggml-hexagon backend uses a worker pool to launch user-defined tasks such as quantization and matrix multiplication. These worker threads are pre-created and execute independently, and the framework currently provides no synchronization primitives that can be safely used inside user task callbacks.

As a result:

  1. User callbacks cannot coordinate or exchange state
  2. Future optimizations that require staged execution, pipelining, or shared intermediate state are difficult to implement
  3. Data sharing between worker threads within a single op is not possible

What this PR proposes

This PR explores adding a minimal atomic synchronization mechanism to the existing framework by introducing a shared atomic variable in htp_ops_context. This mechanism enables basic coordination (such as “all quant jobs finished”) while preserving the current worker pool design and execution model.
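
As a concrete illustration, here is a minimal sketch of what the mechanism could look like, assuming C11 atomics; the `sync_counter` field and both helper names are hypothetical, not the actual backend API:

```c
#include <stdatomic.h>

/* Hypothetical field added to htp_ops_context:
 *     atomic_uint sync_counter;   // shared by all worker threads
 * The dispatcher resets it to 0 before launching the op. */

/* Each worker calls this once its quant job is done. */
static inline void quant_job_done(atomic_uint * counter) {
    atomic_fetch_add_explicit(counter, 1, memory_order_release);
}

/* Spin until all n_jobs quant jobs have finished. */
static inline void wait_all_quant_done(atomic_uint * counter, unsigned n_jobs) {
    while (atomic_load_explicit(counter, memory_order_acquire) < n_jobs) {
        /* a pause/yield hint could go here */
    }
}
```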

With this minor change, together with previous work (a thread id is now provided to the worker function), the NPU can be programmed almost like a SIMT architecture.
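
For example, a worker body in this style might look like the sketch below (reusing the helpers above; `worker_fn`, `my_op_ctx`, and the kernel calls are illustrative placeholders, and the only assumption taken from the framework is that each worker receives its thread id and the worker count):

```c
/* Assumed context, for illustration only:
 *     struct my_op_ctx { atomic_uint sync_counter; unsigned n_rows; ... };
 */
static void worker_fn(struct my_op_ctx * ctx, unsigned tid, unsigned n_threads) {
    /* Phase 1: each thread quantizes an interleaved slice of rows. */
    for (unsigned row = tid; row < ctx->n_rows; row += n_threads) {
        quantize_row(ctx, row);            /* placeholder for the real kernel */
    }
    quant_job_done(&ctx->sync_counter);

    /* Phase 2: no thread starts its matmul slice until every
     * quant slice is complete. */
    wait_all_quant_done(&ctx->sync_counter, n_threads);
    matmul_slice(ctx, tid, n_threads);     /* placeholder */
}
```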

Motivation

In the current design, multi-precision matrix multiplication requires the entire quantized src1 tensor to be stored in VTCM. This imposes a hard limit on the problem size that can be handled by the MM kernel.

Since src1 typically corresponds to the hidden states in an LLM, this effectively constrains the maximum context length that can be executed on the NPU.

If the proposed atomic synchronization mechanism is accepted, it would enable more flexible execution patterns and staged processing, allowing VTCM to be used more efficiently. This opens the door to follow-up work that reduces VTCM pressure and relaxes the current context-length limitations without major changes to the existing framework.

Request for Feedback

I would appreciate feedback on:

  1. Whether exposing a shared atomic in htp_ops_context is acceptable
  2. Whether this aligns with the intended direction of the worker pool design
  3. Suggestions for alternative lightweight synchronization mechanisms

If this approach is considered acceptable, I will follow up with a separate commit to remove the concept-demonstration logic currently added in matmul-ops.c, leaving only the minimal infrastructure changes required to support synchronization.

@max-krasnyansky (Collaborator)

Sorry for the delay.
I have no objections to adding a simple sync using atomics, or perhaps even a proper full threadpool barrier like we do in the CPU backend.
Though I don't think it makes sense to merge this for now. I'd say once we have the first use case that requires it, we can include this sync mechanism in that PR.

BTW, the comment about the context size is not quite correct. We do store the entire src1 in VTCM, but we limit the batch size to 128, so longer prompts simply need to be chunked into batches of 128. I tested up to 16K context with Qwen3 and Llama-3.2. Larger contexts are going to be tricky to fit into memory (for example, 16K @ FP16 is over 2 GB for Qwen3-4B).

@ngdxzy commented Dec 22, 2025

> Sorry for the delay.
>
> I have no objections to adding a simple sync using atomics, or perhaps even a proper full threadpool barrier like we do in the CPU backend.
>
> Though I don't think it makes sense to merge this for now. I'd say once we have the first use case that requires it, we can include this sync mechanism in that PR.
>
> BTW, the comment about the context size is not quite correct. We do store the entire src1 in VTCM, but we limit the batch size to 128, so longer prompts simply need to be chunked into batches of 128. I tested up to 16K context with Qwen3 and Llama-3.2. Larger contexts are going to be tricky to fit into memory (for example, 16K @ FP16 is over 2 GB for Qwen3-4B).

Thanks for the reply!

Batch-size (L-dim) constraints can work, but they force multi-round matmuls and add CPU–NPU round-trip overhead. A true MM tiling scheme would let us use the full ~4 GB of NPU thread memory, resulting in much higher speed and efficiency.

We believe we know how to implement this, but it requires this infrastructure modification (atomic sync).
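
For illustration, the staged pattern we have in mind looks roughly like this (all names are placeholders, not actual backend code, and a production barrier would need a phase/generation counter so it can be reused across iterations, which is elided here):

```c
/* src1 is quantized and consumed tile-by-tile, so only one tile of it
 * has to be resident in VTCM at a time. */
static void tiled_mm_worker(struct my_op_ctx * ctx, unsigned tid, unsigned n_threads) {
    for (unsigned t = 0; t < ctx->n_tiles; t++) {
        /* Stage 1: cooperatively quantize tile t of src1 into VTCM. */
        quantize_tile_slice(ctx, t, tid, n_threads);
        barrier(ctx, tid, n_threads);      /* built on the shared atomic */

        /* Stage 2: multiply against the tile now resident in VTCM. */
        matmul_tile_slice(ctx, t, tid, n_threads);
        barrier(ctx, tid, n_threads);      /* tile can now be evicted/reused */
    }
}
```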
