
Misc. bug: visual <think> tag issues with Ring-mini-2.0 GGUF #17832

@pandruszkow

Description

Name and Version

$ llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 3385 (ff90508)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m /llm-models/inclusionai/Ring-mini-2.0-Q8_0.gguf@inclusionAI -c 32768 --no-warmup -s 499715 -ngl 999 -ot 'blk\.0\.ffn_.*_exps\.weight=CPU' --flash-attn on -ub 8192 -b 16384 -ctk bf16 -ctv bf16 --host 0.0.0.0 --port 58093 --log-colors on --parallel 2 --threads 10

llama-server -m /llm-models/inclusionai/Ring-mini-2.0-Q8_0.gguf@inclusionAI -c 32768 --no-warmup -s 499715 -ngl 999 -ot 'blk\.0\.ffn_.*_exps\.weight=CPU' --flash-attn on -ub 8192 -b 16384 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 58093 --log-colors on --parallel 2 --threads 10

llama-server -m /llm-models/inclusionai/inclusionAI_Ring-mini-2.0-Q8_0.gguf@bartowski -c 32768 --no-warmup -s 499715 -ngl 999 -ot 'blk\.0\.ffn_.*_exps\.weight=CPU' --flash-attn on -ub 8192 -b 16384 -ctk bf16 -ctv bf16 --host 0.0.0.0 --port 58093 --log-colors on --parallel 2 --threads 10

llama-server -m /llm-models/inclusionai/inclusionAI_Ring-mini-2.0-Q8_0.gguf@bartowski -c 32768 --no-warmup -s 499715 -ngl 999 -ot 'blk\.0\.ffn_.*_exps\.weight=CPU' --flash-attn on -ub 8192 -b 16384 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 58093 --log-colors on --parallel 2 --threads 10

Problem description & steps to reproduce

General problem description:

For inclusionAI's Ring-mini-2.0, the <think> tag doesn't seem to be handled properly in the Web UI. This doesn't occur for Ring-flash-2.0.

Repro:

  1. Open llama-server's built-in Web UI
  2. In Settings, enable "Show thought in progress" and disable "Show raw LLM output" (if enabled)
  3. Submit a request with the following properties (a scripted equivalent is sketched after this list):
  • default (empty) system prompt
  • default chat template (unchanged)
  • a simple user message, such as: Please say "hello"
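
For a scripted equivalent of the Web UI repro, the sketch below sends the same streaming request that the Web UI issues (compare the request JSON in the log further down) and prints each delta. It is a hypothetical helper, assuming Python 3 with the requests package and the server listening on localhost:58093 as in the command lines above; it only makes it easy to confirm that no leading <think> tag (and no separate reasoning_content field) ever arrives in the stream.

import json
import requests

# Mirror the request the Web UI sends (cf. the logged POST /v1/chat/completions body).
resp = requests.post(
    "http://localhost:58093/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": 'Say "hello"'}],
        "stream": True,
        "reasoning_format": "auto",
    },
    stream=True,
    timeout=600,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    # Expected: the first content delta starts with "<think>" (or reasoning arrives in a
    # reasoning_content field). Observed with Ring-mini-2.0: plain content, with only a
    # closing </think> appearing later in the stream.
    print(delta)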

Expected behaviour:

The response contains a leading <think> tag, which causes the UI to wrap the thinking section in a collapsible box, visually separating the thinking from the final response.

Actual behaviour:

The response does not contain a leading <think> tag, which causes the UI to mix the thinking section into the same text stream/UI text box as the final response, making them difficult to separate.

Screenshot of Web UI showing actual behaviour:
(image attachment)
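
To illustrate the separation the UI is expected to perform: with the opening <think> tag missing, the only marker left in the observed output is the closing </think>. The snippet below is a rough client-side sketch (a hypothetical helper, not part of llama.cpp) that treats everything before the first </think> as reasoning; it is only meant to show what information is recoverable from the stream as it stands.

def split_reasoning(text: str) -> tuple[str, str]:
    """Fallback split when the opening <think> tag is absent (as observed with
    Ring-mini-2.0): everything up to the first </think> is treated as reasoning,
    the remainder as the final answer."""
    reasoning, sep, answer = text.partition("</think>")
    if not sep:  # no closing tag either: nothing to split
        return "", text
    return reasoning.strip(), answer.strip()

# With the output captured in the log below:
reasoning, answer = split_reasoning('Hmm, the user just said "Say \'hello\'". ...\n</think>\nhello')
print(answer)  # -> hello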

Models affected:

Q8_0 variants of the following (the GGUFs used in the command lines above):

  • Ring-mini-2.0-Q8_0.gguf (inclusionAI)
  • inclusionAI_Ring-mini-2.0-Q8_0.gguf (bartowski)

Models not affected:

Ring-flash-2.0, where the <think> section is handled as expected

First Bad Commit

n/a (I'm trying this model for the first time, so I'm not sure the expected behaviour ever existed)

Relevant log output

+ llama-server -m /llm-models/inclusionai/inclusionAI_Ring-mini-2.0-Q8_0.gguf@bartowski -c 32768 --no-warmup -s 499715 -ngl 999 -ot 'blk\.0\.ffn_.*_exps\.weight=CPU' --flash-attn on -ub 8192 -b 16384 -ctk bf16 -ctv bf16 --host 0.0.0.0 --port 58093 --log-colors on -lv 1 --parallel 2 --threads 10
+ grep -v ' is not marked as EOG'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 3385 (ff90508d) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 10, n_threads_batch = 10, total_threads = 34

system_info: n_threads = 10 (n_threads_batch = 10) / 34 | CUDA : ARCHS = 610,860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 33 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/llm-models/inclusionai/inclusionAI_Ring-mini-2.0-Q8_0.gguf@bartowski'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23862 MiB free
llama_model_loader: loaded meta data with 54 key-value pairs and 278 tensors from /llm-models/inclusionai/inclusionAI_Ring-mini-2.0-Q8_0.gguf@bartowski (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bailingmoe2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Ring Mini 2.0
llama_model_loader: - kv   3:                            general.version str              = 2.0
llama_model_loader: - kv   4:                           general.basename str              = Ring
llama_model_loader: - kv   5:                         general.size_label str              = mini
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Ling Mini Base 2.0 20T
llama_model_loader: - kv   9:               general.base_model.0.version str              = 2.0
llama_model_loader: - kv  10:          general.base_model.0.organization str              = inclusionAI
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/inclusionAI/Li...
llama_model_loader: - kv  12:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  13:                    bailingmoe2.block_count u32              = 20
llama_model_loader: - kv  14:                 bailingmoe2.context_length u32              = 32768
llama_model_loader: - kv  15:               bailingmoe2.embedding_length u32              = 2048
llama_model_loader: - kv  16:            bailingmoe2.feed_forward_length u32              = 5120
llama_model_loader: - kv  17:           bailingmoe2.attention.head_count u32              = 16
llama_model_loader: - kv  18:        bailingmoe2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                 bailingmoe2.rope.freq_base f32              = 600000.000000
llama_model_loader: - kv  20: bailingmoe2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:              bailingmoe2.expert_used_count u32              = 8
llama_model_loader: - kv  22:           bailingmoe2.attention.key_length u32              = 128
llama_model_loader: - kv  23:         bailingmoe2.attention.value_length u32              = 128
llama_model_loader: - kv  24:           bailingmoe2.rope.dimension_count u32              = 64
llama_model_loader: - kv  25:              bailingmoe2.rope.scaling.type str              = none
llama_model_loader: - kv  26:      bailingmoe2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  27:                     bailingmoe2.vocab_size u32              = 157184
llama_model_loader: - kv  28:     bailingmoe2.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  29: bailingmoe2.expert_shared_feed_forward_length u32              = 512
llama_model_loader: - kv  30:           bailingmoe2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  31:                   bailingmoe2.expert_count u32              = 256
llama_model_loader: - kv  32:            bailingmoe2.expert_shared_count u32              = 1
llama_model_loader: - kv  33:             bailingmoe2.expert_group_count u32              = 8
llama_model_loader: - kv  34:        bailingmoe2.expert_group_used_count u32              = 4
llama_model_loader: - kv  35:            bailingmoe2.expert_weights_norm bool             = true
llama_model_loader: - kv  36:             bailingmoe2.expert_gating_func u32              = 2
llama_model_loader: - kv  37:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  38:                         tokenizer.ggml.pre str              = bailingmoe2
llama_model_loader: - kv  39:                      tokenizer.ggml.tokens arr[str,157184]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  40:                  tokenizer.ggml.token_type arr[i32,157184]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  41:                      tokenizer.ggml.merges arr[str,156635]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv  42:                tokenizer.ggml.bos_token_id u32              = 156891
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 156892
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 156892
llama_model_loader: - kv  45:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  46:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  47:                    tokenizer.chat_template str              = {% for message in messages %}{% set r...
llama_model_loader: - kv  48:               general.quantization_version u32              = 2
llama_model_loader: - kv  49:                          general.file_type u32              = 7
llama_model_loader: - kv  50:                      quantize.imatrix.file str              = /models_out/Ring-mini-2.0-GGUF/inclus...
llama_model_loader: - kv  51:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  52:             quantize.imatrix.entries_count u32              = 176
llama_model_loader: - kv  53:              quantize.imatrix.chunks_count u32              = 826
llama_model_loader: - type  f32:  119 tensors
llama_model_loader: - type q8_0:  159 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 16.11 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: printing all EOG tokens:
load:   - 156892 ('<|endoftext|>')
load: special tokens cache size = 262
load: token to piece cache size = 1.0010 MB
print_info: arch             = bailingmoe2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_embd_inp       = 2048
print_info: n_layer          = 20
print_info: n_head           = 16
print_info: n_head_kv        = 4
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 5120
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: n_expert_groups  = 8
print_info: n_group_used     = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = none
print_info: freq_base_train  = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 16B.A1B
print_info: model params     = 16.26 B
print_info: general.name     = Ring Mini 2.0
print_info: n_layer_dense_lead   = 1
print_info: n_ff_exp             = 512
print_info: n_ff_shexp           = 512
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: nextn_predict_layers = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 157184
print_info: n_merges         = 156635
print_info: BOS token        = 156891 '<|startoftext|>'
print_info: EOS token        = 156892 '<|endoftext|>'
print_info: EOT token        = 156892 '<|endoftext|>'
print_info: PAD token        = 156892 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 156892 '<|endoftext|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_qkv.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q_norm.weight
create_tensor: loading tensor blk.0.attn_k_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_qkv.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q_norm.weight
create_tensor: loading tensor blk.1.attn_k_norm.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.ffn_gate_shexp.weight
create_tensor: loading tensor blk.1.ffn_down_shexp.weight
create_tensor: loading tensor blk.1.ffn_up_shexp.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_qkv.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q_norm.weight
create_tensor: loading tensor blk.2.attn_k_norm.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.exp_probs_b.bias
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.ffn_gate_shexp.weight
create_tensor: loading tensor blk.2.ffn_down_shexp.weight
create_tensor: loading tensor blk.2.ffn_up_shexp.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_qkv.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.exp_probs_b.bias
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.ffn_gate_shexp.weight
create_tensor: loading tensor blk.3.ffn_down_shexp.weight
create_tensor: loading tensor blk.3.ffn_up_shexp.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_qkv.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q_norm.weight
create_tensor: loading tensor blk.4.attn_k_norm.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.exp_probs_b.bias
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.ffn_gate_shexp.weight
create_tensor: loading tensor blk.4.ffn_down_shexp.weight
create_tensor: loading tensor blk.4.ffn_up_shexp.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_qkv.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q_norm.weight
create_tensor: loading tensor blk.5.attn_k_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.exp_probs_b.bias
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.ffn_gate_shexp.weight
create_tensor: loading tensor blk.5.ffn_down_shexp.weight
create_tensor: loading tensor blk.5.ffn_up_shexp.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_qkv.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q_norm.weight
create_tensor: loading tensor blk.6.attn_k_norm.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.exp_probs_b.bias
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.ffn_gate_shexp.weight
create_tensor: loading tensor blk.6.ffn_down_shexp.weight
create_tensor: loading tensor blk.6.ffn_up_shexp.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_qkv.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.exp_probs_b.bias
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.ffn_gate_shexp.weight
create_tensor: loading tensor blk.7.ffn_down_shexp.weight
create_tensor: loading tensor blk.7.ffn_up_shexp.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_qkv.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q_norm.weight
create_tensor: loading tensor blk.8.attn_k_norm.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.exp_probs_b.bias
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.ffn_gate_shexp.weight
create_tensor: loading tensor blk.8.ffn_down_shexp.weight
create_tensor: loading tensor blk.8.ffn_up_shexp.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_qkv.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q_norm.weight
create_tensor: loading tensor blk.9.attn_k_norm.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.exp_probs_b.bias
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.ffn_gate_shexp.weight
create_tensor: loading tensor blk.9.ffn_down_shexp.weight
create_tensor: loading tensor blk.9.ffn_up_shexp.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_qkv.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q_norm.weight
create_tensor: loading tensor blk.10.attn_k_norm.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.exp_probs_b.bias
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.ffn_gate_shexp.weight
create_tensor: loading tensor blk.10.ffn_down_shexp.weight
create_tensor: loading tensor blk.10.ffn_up_shexp.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_qkv.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.exp_probs_b.bias
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.ffn_gate_shexp.weight
create_tensor: loading tensor blk.11.ffn_down_shexp.weight
create_tensor: loading tensor blk.11.ffn_up_shexp.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_qkv.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q_norm.weight
create_tensor: loading tensor blk.12.attn_k_norm.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.exp_probs_b.bias
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.ffn_gate_shexp.weight
create_tensor: loading tensor blk.12.ffn_down_shexp.weight
create_tensor: loading tensor blk.12.ffn_up_shexp.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_qkv.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q_norm.weight
create_tensor: loading tensor blk.13.attn_k_norm.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.exp_probs_b.bias
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.ffn_gate_shexp.weight
create_tensor: loading tensor blk.13.ffn_down_shexp.weight
create_tensor: loading tensor blk.13.ffn_up_shexp.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_qkv.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q_norm.weight
create_tensor: loading tensor blk.14.attn_k_norm.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.exp_probs_b.bias
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.ffn_gate_shexp.weight
create_tensor: loading tensor blk.14.ffn_down_shexp.weight
create_tensor: loading tensor blk.14.ffn_up_shexp.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_qkv.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.exp_probs_b.bias
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.ffn_gate_shexp.weight
create_tensor: loading tensor blk.15.ffn_down_shexp.weight
create_tensor: loading tensor blk.15.ffn_up_shexp.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_qkv.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q_norm.weight
create_tensor: loading tensor blk.16.attn_k_norm.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.exp_probs_b.bias
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.ffn_gate_shexp.weight
create_tensor: loading tensor blk.16.ffn_down_shexp.weight
create_tensor: loading tensor blk.16.ffn_up_shexp.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_qkv.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q_norm.weight
create_tensor: loading tensor blk.17.attn_k_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.exp_probs_b.bias
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.ffn_gate_shexp.weight
create_tensor: loading tensor blk.17.ffn_down_shexp.weight
create_tensor: loading tensor blk.17.ffn_up_shexp.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_qkv.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q_norm.weight
create_tensor: loading tensor blk.18.attn_k_norm.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.exp_probs_b.bias
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.ffn_gate_shexp.weight
create_tensor: loading tensor blk.18.ffn_down_shexp.weight
create_tensor: loading tensor blk.18.ffn_up_shexp.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_qkv.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.exp_probs_b.bias
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.ffn_gate_shexp.weight
create_tensor: loading tensor blk.19.ffn_down_shexp.weight
create_tensor: loading tensor blk.19.ffn_up_shexp.weight
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 21/21 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   326.19 MiB
load_tensors:        CUDA0 model buffer size = 16173.48 MiB
..............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 32768
llama_context: n_ctx_seq     = 16384
llama_context: n_batch       = 16384
llama_context: n_ubatch      = 8192
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 600000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (16384) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.20 MiB
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache:      CUDA0 KV buffer size =  1280.00 MiB
llama_kv_cache: size = 1280.00 MiB ( 16384 cells,  20 layers,  2/2 seqs), K (bf16):  640.00 MiB, V (bf16):  640.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2224
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 8192, n_seqs = 2, n_outputs = 2
graph_reserve: reserving a graph for ubatch with n_tokens = 8192, n_seqs =  2, n_outputs = 8192
graph_reserve: reserving a graph for ubatch with n_tokens =    2, n_seqs =  2, n_outputs =    2
graph_reserve: reserving a graph for ubatch with n_tokens = 8192, n_seqs =  2, n_outputs = 8192
llama_context:      CUDA0 compute buffer size =  5056.00 MiB
llama_context:  CUDA_Host compute buffer size =   576.22 MiB
llama_context: graph nodes  = 1638
llama_context: graph splits = 42
clear_adapter_lora: call
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
srv          init: initializing slots, n_slots = 2
slot         init: id  0 | task -1 | new slot, n_ctx = 16384
slot        reset: id  0 | task -1 | 
slot         init: id  1 | task -1 | new slot, n_ctx = 16384
slot        reset: id  1 | task -1 | 
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
init: chat template, chat_template: {% for message in messages %}{% set role = message['role'] | lower %}{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}{% set role = role | upper %}{{ '<role>' + role + '</role>' + message['content'] }}{% endfor %}{% if add_generation_prompt %}{{ '<role>ASSISTANT</role><think>
' }}{% endif %}, example_format: '<role>SYSTEM</role>You are a helpful assistant<role>HUMAN</role>Hello<role>ASSISTANT</role>Hi there<role>HUMAN</role>How are you?<role>ASSISTANT</role><think>
'
main: model loaded
main: server is listening on http://0.0.0.0:58093
main: starting the main loop...
que    start_loop: processing new tasks
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
srv  params_from_: Grammar: 
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: Content-only
res  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
slot        reset: id  1 | task -1 | 
slot launch_slot_: id  1 | task -1 | launching slot : {"id":1,"n_ctx":16384,"speculative":false,"is_processing":false}
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
slot launch_slot_: id  1 | task 0 | processing task
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  1 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, task.n_tokens = 17
slot update_slots: id  1 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  1 | task 0 | prompt processing progress, n_tokens = 17, batch.n_tokens = 17, progress = 1.000000
slot update_slots: id  1 | task 0 | prompt done, n_tokens = 17, batch.n_tokens = 17
srv  update_slots: decoding batch, n_tokens = 17
clear_adapter_lora: call
set_embeddings: value = 0
srv  update_chat_: Parsing chat message: Hmm
Parsing input with format Content-only: Hmm
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  1 | task 0 | n_decoded = 1, n_remaining = -1, next token: 90833 'Hmm'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  1 | task 0 | slot decode token, n_ctx = 16384, n_tokens = 18, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1765053588,"id":"chatcmpl-HKznn4KbI73lro0oh7p4qkzWprfWUwQO","model":"gpt-3.5-turbo","system_fingerprint":"b3385-ff90508d","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hmm"}}],"created":1765053588,"id":"chatcmpl-HKznn4KbI73lro0oh7p4qkzWprfWUwQO","model":"gpt-3.5-turbo","system_fingerprint":"b3385-ff90508d","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":17,"prompt_ms":171.885,"prompt_per_token_ms":10.110882352941175,"prompt_per_second":98.90333653314717,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}


srv  update_chat_: Parsing chat message: Hmm, the user just said "Say 'hello'". That's a very straightforward request. 

Okay, let me think about this. The user might be testing basic functionality, or perhaps they're new to interacting with an AI and want to see how it responds to simple commands. Maybe they're in a hurry and just need a quick confirmation. 

I notice the request is in quotes, which makes it even clearer - they want the exact phrase "hello" to be output. No additional context or explanation needed. 

Since this is a basic interaction, I should keep the response simple and direct. No need for extra fluff when the user just wants a specific output. 

The safest approach is to literally say "hello" as requested. If I add anything else, it might confuse someone who just wants the phrase. 

I wonder if this is part of some larger task? But since the user didn't provide more context, I'll assume they want exactly what they asked for. 

Alright, responding with just "hello" it is. Keeping it minimal matches the user's direct instruction.
</think>
hello
Parsing input with format Content-only: Hmm, the user just said "Say 'hello'". That's a very straightforward request. 

Okay, let me think about this. The user might be testing basic functionality, or perhaps they're new to interacting with an AI and want to see how it responds to simple commands. Maybe they're in a hurry and just need a quick confirmation. 

I notice the request is in quotes, which makes it even clearer - they want the exact phrase "hello" to be output. No additional context or explanation needed. 

Since this is a basic interaction, I should keep the response simple and direct. No need for extra fluff when the user just wants a specific output. 

The safest approach is to literally say "hello" as requested. If I add anything else, it might confuse someone who just wants the phrase. 

I wonder if this is part of some larger task? But since the user didn't provide more context, I'll assume they want exactly what they asked for. 

Alright, responding with just "hello" it is. Keeping it minimal matches the user's direct instruction.
</think>
hello
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  1 | task 0 | stopped by EOS
slot process_toke: id  1 | task 0 | n_decoded = 239, n_remaining = -1, next token: 156892 ''
slot print_timing: id  1 | task 0 | 
prompt eval time =     171.88 ms /    17 tokens (   10.11 ms per token,    98.90 tokens per second)
       eval time =    2470.39 ms /   239 tokens (   10.34 ms per token,    96.75 tokens per second)
      total time =    2642.27 ms /   256 tokens
srv  update_chat_: Parsing chat message: Hmm, the user just said "Say 'hello'". That's a very straightforward request. 

Okay, let me think about this. The user might be testing basic functionality, or perhaps they're new to interacting with an AI and want to see how it responds to simple commands. Maybe they're in a hurry and just need a quick confirmation. 

I notice the request is in quotes, which makes it even clearer - they want the exact phrase "hello" to be output. No additional context or explanation needed. 

Since this is a basic interaction, I should keep the response simple and direct. No need for extra fluff when the user just wants a specific output. 

The safest approach is to literally say "hello" as requested. If I add anything else, it might confuse someone who just wants the phrase. 

I wonder if this is part of some larger task? But since the user didn't provide more context, I'll assume they want exactly what they asked for. 

Alright, responding with just "hello" it is. Keeping it minimal matches the user's direct instruction.
</think>
hello
Parsing input with format Content-only: Hmm, the user just said "Say 'hello'". That's a very straightforward request. 

Okay, let me think about this. The user might be testing basic functionality, or perhaps they're new to interacting with an AI and want to see how it responds to simple commands. Maybe they're in a hurry and just need a quick confirmation. 

I notice the request is in quotes, which makes it even clearer - they want the exact phrase "hello" to be output. No additional context or explanation needed. 

Since this is a basic interaction, I should keep the response simple and direct. No need for extra fluff when the user just wants a specific output. 

The safest approach is to literally say "hello" as requested. If I add anything else, it might confuse someone who just wants the phrase. 

I wonder if this is part of some larger task? But since the user didn't provide more context, I'll assume they want exactly what they asked for. 

Alright, responding with just "hello" it is. Keeping it minimal matches the user's direct instruction.
</think>
hello
Parsed message: {"role":"assistant","content":"Hmm, the user just said \"Say 'hello'\". That's a very straightforward request. \n\nOkay, let me think about this. The user might be testing basic functionality, or perhaps they're new to interacting with an AI and want to see how it responds to simple commands. Maybe they're in a hurry and just need a quick confirmation. \n\nI notice the request is in quotes, which makes it even clearer - they want the exact phrase \"hello\" to be output. No additional context or explanation needed. \n\nSince this is a basic interaction, I should keep the response simple and direct. No need for extra fluff when the user just wants a specific output. \n\nThe safest approach is to literally say \"hello\" as requested. If I add anything else, it might confuse someone who just wants the phrase. \n\nI wonder if this is part of some larger task? But since the user didn't provide more context, I'll assume they want exactly what they asked for. \n\nAlright, responding with just \"hello\" it is. Keeping it minimal matches the user's direct instruction.\n</think>\nhello"}
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot      release: id  1 | task 0 | stop processing: n_tokens = 255, truncated = 0
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 243
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
srv    operator(): http: streamed chunk: data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1765053591,"id":"chatcmpl-HKznn4KbI73lro0oh7p4qkzWprfWUwQO","model":"gpt-3.5-turbo","system_fingerprint":"b3385-ff90508d","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":17,"prompt_ms":171.885,"prompt_per_token_ms":10.110882352941175,"prompt_per_second":98.90333653314717,"predicted_n":239,"predicted_ms":2470.389,"predicted_per_token_ms":10.336355648535566,"predicted_per_second":96.74589710365451}}


srv    operator(): all results received, terminating stream
srv    operator(): http: streamed chunk: data: [DONE]


srv    operator(): http: stream ended
srv  log_server_r: request: POST /v1/chat/completions 10.0.0.3 200
srv  log_server_r: request:  {"messages":[{"role":"user","content":"Say \"hello\""}],"stream":true,"reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":32768,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"timings_per_token":true}
srv  log_server_r: response: 
res  remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv          stop: all tasks already finished, no need to cancel
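
One observation from the log above: the chat template dumped by the server ends the generation prompt with '<role>ASSISTANT</role><think>\n', i.e. the opening <think> is supplied by the template rather than generated by the model, so the streamed completion only ever contains the closing </think>, and with "Chat format: Content-only" the reasoning stays inline in content (see the Parsed message line). The sketch below (assuming Python with the jinja2 package; the template string is copied from the log) renders the template to make this visible.

from jinja2 import Template

# Chat template as printed by llama-server above (tokenizer.chat_template).
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{% set role = message['role'] | lower %}"
    "{% if role == 'user' %}{% set role = 'HUMAN' %}{% endif %}"
    "{% set role = role | upper %}"
    "{{ '<role>' + role + '</role>' + message['content'] }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<role>ASSISTANT</role><think>\n' }}{% endif %}"
)

prompt = Template(CHAT_TEMPLATE).render(
    messages=[{"role": "user", "content": 'Say "hello"'}],
    add_generation_prompt=True,
)
print(repr(prompt))
# -> '<role>HUMAN</role>Say "hello"<role>ASSISTANT</role><think>\n'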

    Labels

    bug (Something isn't working), chat parser (Issues related to the chat parser and chat templates)
