
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 18, 2025

Alternative to #17959

Fix #17948

Before this PR, the logic for loading models from different sources (cache / local / custom INI) was quite messy and did not allow an INI preset to take precedence over the other sources.

With this PR, we unify the method for loading server models and presets:

  • preset.cpp is responsible for collecting all model sources (cache / local) and generating a base preset for each known GGUF
  • preset.cpp then loads the INI file and parses the global section ([*])
  • it is then up to downstream code (e.g. server-models.cpp) to decide how to cascade these presets

The current cascading rules can be found in the server docs (illustrated by the sketch after the list):

  1. Command-line arguments passed to llama-server (highest priority)
  2. Model-specific options defined in the preset file (e.g. [ggml-org/MY-MODEL...])
  3. Global options defined in the preset file ([*])
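
As an illustration only (not the actual code in preset.cpp or server-models.cpp; the type and function names below are made up), the cascade boils down to a flat key/value merge in which higher-priority sources overwrite lower-priority ones:

#include <map>
#include <string>

// hypothetical flat key/value view of one preset section
using preset_kv = std::map<std::string, std::string>;

// merge the three sources; entries from higher-priority sources win
// priority: CLI args > per-model section > global [*] section
static preset_kv cascade(const preset_kv & global_section,
                         const preset_kv & per_model_section,
                         const preset_kv & cli_args) {
    preset_kv result = global_section;                 // lowest priority: [*]
    for (const auto & [key, val] : per_model_section) {
        result[key] = val;                             // overrides [*]
    }
    for (const auto & [key, val] : cli_args) {
        result[key] = val;                             // highest priority: CLI
    }
    return result;
}

For example, if a per-model section sets c = 131072 and the CLI passes -c 1024, the spawned instance ends up with --ctx-size 1024.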

@ServeurpersoCom
Collaborator

Looks good! I'll deploy this on my test server tonight and report back with results!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 18, 2025

First basic test OK: in my case, as a user, this lets me keep a complete configuration file; it is no longer necessary to modify the command line (systemd unit) each time I want to change a global option.

./llama-server --port 8082 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

Presets (backend.ini)

[*]
fit = off
ngl = 999
ctk = q8_0
ctv = q8_0
fa = on
mlock = on
np = 4
kvu = on

[Dense-Devstral-Small-2-24B-Instruct-2512]
m = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
; chat-template-file = unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512.jinja
c = 131072

etc...

Log:

srv  log_server_r: request: GET /v1/models 127.0.0.1 200
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 48949
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   -kvu
srv          load:   --mlock
srv          load:   --port
srv          load:   48949
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   131072
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  log_server_r: request: POST /models/load 127.0.0.1 200
srv  log_server_r: request: GET /v1/models 127.0.0.1 200
[48949] ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[48949] ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[48949] ggml_cuda_init: found 1 CUDA devices:

-> Minor nitpick (not in this PR): if we want --kv-unified in the logs instead of -kvu, we could swap the order in arg.cpp to {"-kvu", "--kv-unified"}, since to_args() uses .back(). -> EDIT: #18196 (for all args + doc)
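
For reference, a simplified self-contained sketch of that mechanism (the struct and method names below are illustrative, not the real definitions in arg.cpp): each option keeps all of its spellings in a list, and the spelling used when the parsed options are turned back into argv is the last element, so ordering the long form last makes the spawned instance log --kv-unified.

#include <string>
#include <vector>

// illustrative stand-in for an option definition holding all of its spellings
struct opt_def {
    std::vector<std::string> args;      // e.g. {"-kvu", "--kv-unified"}

    // simplified counterpart of what to_args() does with .back()
    const std::string & spelling_for_argv() const {
        return args.back();             // the last spelling wins
    }
};

// opt_def kvu{{"--kv-unified", "-kvu"}};  // long form first  -> logs "-kvu"
// opt_def kvu{{"-kvu", "--kv-unified"}};  // short form first -> logs "--kv-unified"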

CLI with -c 1024 overrides the .ini per-model config -> wanted behavior -> OK

llama-server --port 8082 -c 1024 --models-max 1 --models-preset backend.ini --webui-config-file frontend.json

main:       it is not recommended to use this mode in untrusted environments
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  log_server_r: request: GET /props 127.0.0.1 200
srv  ensure_model: model name=Dense-Devstral-Small-2-24B-Instruct-2512 is not loaded, loading...
srv          load: spawning server instance with name=Dense-Devstral-Small-2-24B-Instruct-2512 on port 41119
srv          load: spawning server instance with args:
srv          load:   /root/llama.cpp.pascal/build/bin/llama-server
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   -kvu
srv          load:   --mlock
srv          load:   --port
srv          load:   41119
srv          load:   --webui-config-file
srv          load:   frontend.json
srv          load:   --alias
srv          load:   Dense-Devstral-Small-2-24B-Instruct-2512
srv          load:   --ctx-size
srv          load:   1024
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --flash-attn
srv          load:   on
srv          load:   --fit
srv          load:   off
srv          load:   --model
srv          load:   unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q6_K_XL.gguf
srv          load:   --n-gpu-layers
srv          load:   999
srv          load:   --parallel
srv          load:   4
srv  ensure_model: waiting until model name=Dense-Devstral-Small-2-24B-Instruct-2512 is fully loaded...

Single model mode testing (with some args) OK!

Major functionality validated, we can merge it!

@ngxson
Collaborator Author

ngxson commented Dec 18, 2025

-> Minor nitpick (not in this PR): if we want --kv-unified in the logs instead of -kvu, we could swap the order in arg.cpp to {"-kvu", "--kv-unified"}, since to_args() uses .back()

Thanks for testing. Yes, feel free to create a new PR to fix this. Our convention is to have the short form first, followed by the long form.

@ngxson ngxson changed the title from "presets: refactor, allow cascade presets from different sources" to "presets: refactor, allow cascade presets from different sources, add global section" Dec 18, 2025
Member

@ggerganov ggerganov left a comment

I'm traveling for a few days and won't be able to do very detailed testing/review. Approving so as not to block this work, and I added @ServeurpersoCom to the write-access group for additional approvals if needed.

}
// 2. local models specificed via --models-dir
common_presets cached_models = ctx_preset.load_from_cache();
SRV_INF("Loaded %zu cached model presets\n", cached_models.size());
Member

nit: for most logs we prefix with the function name:

Suggested change
SRV_INF("Loaded %zu cached model presets\n", cached_models.size());
SRV_INF("%s: Loaded %zu cached model presets\n", __func__, cached_models.size());

Collaborator

We can just test on Windows (build OK on my side, needs some basic tests), merge, and then I'll complete my separate "special nits" PR #18196 (that way we don't have any conflicts) :)

Collaborator Author

the SRV_INF macro already prefixes the message with the function name, so I think this is not necessary:

#define SRV_INF(fmt, ...) LOG_INF("srv  %12.*s: " fmt, 12, __func__, __VA_ARGS__)
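
As a quick self-contained illustration (LOG_INF is replaced by a plain printf stand-in here so the snippet compiles on its own), the macro already injects the caller's __func__ into the "srv ...:" prefix, so the suggested change would print the function name twice:

#include <cstdio>

// stand-in for LOG_INF so this sketch is self-contained
#define LOG_INF(fmt, ...) printf(fmt, __VA_ARGS__)
// macro as quoted above: __func__ is already part of the prefix
#define SRV_INF(fmt, ...) LOG_INF("srv  %12.*s: " fmt, 12, __func__, __VA_ARGS__)

static void load_models() {
    // prints: srv   load_models: Loaded 3 cached model presets
    SRV_INF("Loaded %zu cached model presets\n", (size_t) 3);
}

int main() {
    load_models();
    return 0;
}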

@ngxson ngxson merged commit 98c1c7a into ggml-org:master Dec 19, 2025
70 of 71 checks passed