
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 18, 2025

Alternative to #17959

Fix #17948

Before this PR, the logic for loading models from different sources (cache / local / custom INI) was quite messy and did not allow an INI preset to take precedence over the other sources.

With this PR, we unify the method for loading server models and presets:

  • preset.cpp is responsible for collecting all model sources (cache / local) and generating a base preset for each known GGUF
  • preset.cpp then loads the INI file and parses the global section ([*])
  • it is then up to downstream code (e.g. server-models.cpp) to decide how to cascade these presets

The current cascading rules can be found in the server docs (illustrated by the sketch after the list):

  1. Command-line arguments passed to llama-server (highest priority)
  2. Model-specific options defined in the preset file (e.g. [ggml-org/MY-MODEL...])
  3. Global options defined in the preset file ([*])
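
As an illustration only (not the actual code in preset.cpp or server-models.cpp; the type and function names below are made up), the cascade boils down to a flat key/value merge in which higher-priority sources overwrite lower-priority ones:

#include <map>
#include <string>

// hypothetical flat key/value view of one preset section
using preset_kv = std::map<std::string, std::string>;

// merge the three sources; entries from higher-priority sources win
// priority: CLI args > per-model section > global [*] section
static preset_kv cascade(const preset_kv & global_section,
                         const preset_kv & per_model_section,
                         const preset_kv & cli_args) {
    preset_kv result = global_section;                 // lowest priority: [*]
    for (const auto & [key, val] : per_model_section) {
        result[key] = val;                             // overrides [*]
    }
    for (const auto & [key, val] : cli_args) {
        result[key] = val;                             // highest priority: CLI
    }
    return result;
}

For example, if a per-model section sets c = 131072 and the CLI passes -c 1024, the spawned instance ends up with --ctx-size 1024.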

@ServeurpersoCom
Collaborator

Looks good! I'll deploy this on my test server tonight and report back with results!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 18, 2025

First basic test OK: in my case, as a user, this lets me keep a complete configuration file; it is no longer necessary to modify the command line (systemd unit) each time I want to change a global option.

./llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

Presets (backend.ini)

[*]
fit = off
ngl = 999
ctk = q8_0
ctv = q8_0
fa = on
mlock = on
np = 4
kvu = on

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072

etc...

Log:

srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 48949
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   -kvu
srv          load:   --mlock
srv          load:   --port
srv          load:   48949
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   131072
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  log_server_r: request: POST /models/load 127.0.0.1 200
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
[48949] ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[48949] ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[48949] ggml_cuda_init: found 1 CUDA devices:

-> Minor nitpick (not in this PR): if we want --kv-unified in the logs instead of -kvu, we could swap the order in arg.cpp to {"-kvu", "--kv-unified"}, since to_args() uses .back(). -> EDIT: #18196 (for all args + doc)
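
For reference, a simplified self-contained sketch of that mechanism (the struct and method names below are illustrative, not the real definitions in arg.cpp): each option keeps all of its spellings in a list, and the spelling used when the parsed options are turned back into argv is the last element, so ordering the long form last makes the spawned instance log --kv-unified.

#include <string>
#include <vector>

// illustrative stand-in for an option definition holding all of its spellings
struct opt_def {
    std::vector<std::string> args;      // e.g. {"-kvu", "--kv-unified"}

    // simplified counterpart of what to_args() does with .back()
    const std::string & spelling_for_argv() const {
        return args.back();             // the last spelling wins
    }
};

// opt_def kvu{{"--kv-unified", "-kvu"}};  // long form first  -> logs "-kvu"
// opt_def kvu{{"-kvu", "--kv-unified"}};  // short form first -> logs "--kv-unified"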

CLI with -c 1024 overrides the .ini per-model config -> wanted behavior -> OK

llama-server --port 8082 -c 1024 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

main:       it is not recommended to use this mode in untrusted environments
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  ensure_model: model name=Dense-Devstral-Small-2-24B-Instruct-2512 is not loaded, loading...
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 41119
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   -kvu
srv          load:   --mlock
srv          load:   --port
srv          load:   41119
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   1024
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  ensure_model: waiting until model name=Dense-Devstral-Small-2-24B-Instruct-2512 is fully loaded...

Single model mode testing (with some args) OK!

Major functionality validated, we can merge it!

@ngxson
Collaborator Author

ngxson commented Dec 18, 2025

-> Minor nitpick (not in this PR): if we want --kv-unified in the logs instead of -kvu, we could swap the order in arg.cpp to {"-kvu", "--kv-unified"}, since to_args() uses .back()

Thanks for testing. Yes, feel free to create a new PR to fix this. Our convention is to have the short form first, followed by the long form.

@ngxson ngxson changed the title from "presets: refactor, allow cascade presets from different sources" to "presets: refactor, allow cascade presets from different sources, add global section" Dec 18, 2025
Member

@ggerganov ggerganov left a comment

I'm traveling for a few days and won't be able to do very detailed testing/review. Approving so as not to block this work, and I added @ServeurpersoCom to the write-access group for additional approvals if needed.

}
// 2. local models specificed via --models-dir
common_presets cached_models = ctx_preset.load_from_cache();
SRV_INF("Loaded %zu cached model presets\n", cached_models.size());
Member

nit: for most logs we prefix with the function name:

Suggested change
SRV_INF("Loaded %zu cached model presets\n", cached_models.size());
SRV_INF("%s: Loaded %zu cached model presets\n", __func__, cached_models.size());

Collaborator

We can just test on Windows (build OK on my side, needs some basic tests), merge, and then I'll complete my separate "special nits" PR #18196 (that way we don't have any conflicts) :)

Collaborator Author

the SRV_INF macro already prefixes the message with the function name, so I think this is not necessary:

#define SRV_INF(fmt, ...) LOG_INF("srv  %12.*s: " fmt, 12, __func__, __VA_ARGS__)
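
As a quick self-contained illustration (LOG_INF is replaced by a plain printf stand-in here so the snippet compiles on its own), the macro already injects the caller's __func__ into the "srv ...:" prefix, so the suggested change would print the function name twice:

#include <cstdio>

// stand-in for LOG_INF so this sketch is self-contained
#define LOG_INF(fmt, ...) printf(fmt, __VA_ARGS__)
// macro as quoted above: __func__ is already part of the prefix
#define SRV_INF(fmt, ...) LOG_INF("srv  %12.*s: " fmt, 12, __func__, __VA_ARGS__)

static void load_models() {
    // prints: srv   load_models: Loaded 3 cached model presets
    SRV_INF("Loaded %zu cached model presets\n", (size_t) 3);
}

int main() {
    load_models();
    return 0;
}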

@ngxson ngxson merged commit 98c1c7a into ggml-org:master Dec 19, 2025
70 of 71 checks passed