Model: Qwen3 Next #16095

Open pull request: pwilkin wants to merge 433 commits into ggml-org:master from pwilkin:qwen3_next (+1,607 −32).
Commits (250 of the 433 are shown; titles ending in "…" are truncated in the source):

3d7b227 common : use cpp-httplib as a cURL alternative for downloads (#16185) [angt]
667d5f4 metal : report OOM errors (#16274) [ggerganov]
ff84e4d mtmd : fix uninitialized variable in bicubic_resize (#16275) [AlekseiNikiforovIBM]
2f0f872 codeowners : add rgerganov as owner of RPC [no ci] (#16279) [rgerganov]
9f08f25 Always show message actions for mobile UI + improvements for user mes… [allozaur]
9a25257 webui: switch to hash-based routing (alternative of #16079) (#16157) [isaac-mcfadyen]
617549f Allow viewing conversations even when llama server is down (#16255) [allozaur]
e03fa1d Enhance text file detection logic for file attachments (#16199) [allozaur]
807f6f6 devops: add s390x & ppc64le CI (#15925) [taronaeo]
6d4a32e model : make minicpm embedding_scale, residual_scale and logit_scale … [vinkal-chudgar]
ce07a80 build : add LLAMA_OPENSSL option (#16287) [angt]
b86b3bf vulkan: support GET_ROWS for k-quants (#16235) [jeffbolznv]
e3638af server : remove old LLAMA_SERVER_SSL (#16290) [angt]
6e124be vulkan: throw system error instead of SIGABRT during init on older de… [DmyMi]
ba67302 CUDA: refactor and deduplicate vector FA kernels (#16208) [JohannesGaessler]
9e9c76e CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32 (#… [am17an]
48d6dc4 Show message actions by default (#16289) [allozaur]
eae567b vulkan : make the vulkan.hpp dynamic dispatcher instance private (#16… [Acly]
71edc9d vulkan: support arbitrary KV dimension in flash attention (#16160) [jeffbolznv]
e4e91bc vulkan: handle mat_mul with A matrix > 4GB (#16176) [jeffbolznv]
dc861d2 metal : fuse non-sequential nodes (#16102) [ggerganov]
967c966 metal : extend mat-mat multiplication support (#16225) [ggerganov]
70042d5 vulkan: 64-bit im2col (#16135) [jeffbolznv]
f21a0aa Fixed a few typos in the README of the LLaMA.cpp HTTP Server [no ci] … [ImadSaddik]
8e80a01 devops: switch to using ubuntu-22.04-s390x image (#16302) [taronaeo]
db49b1c ci : fix musa docker build (#16306) [yeahdongcn]
8a7500b common : fix reasoning before forced tool call via tool_choice = requ… [crat0z]
94ee2c0 ggml : fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32 (#16307) [CISC]
a62b221 vulkan: Fix validation failure in quantized flash attention (#16292) [jeffbolznv]
97905b2 ggml : fix dependencies for ggml_set_rows (#16318) [ggerganov]
e688dc3 perplexity : show more kl-divergence data (#16321) [ddh0]
d751670 llama-cli: prevent spurious assistant token (#16202) [vinkal-chudgar]
1a60170 fix: preserved zero values in chat settings inputs and textareas by s… [ServeurpersoCom]
2ca531e Improve Mobile UI for dialogs and action dropdowns (#16222) [allozaur]
0fec081 ggml : check cuda and metal argsort limits and add test (#16323) [CISC]
3a6e259 ggml-backend : add root cause in error message if loading backend lib… [rlewczuk]
990758b ggml : bump version to 0.9.1 [ggerganov]
e5e210e ggml : prepare for development of 0.9.2-dev [ggerganov]
0c4c806 ggml : bump version to 0.9.3 (ggml/1353) [danbev]
0996bd5 ggml : remove -dev suffix from release version (ggml/1355) [danbev]
a494252 sync : whisper.cpp (ggml/1359) [ggerganov]
6fa8591 sync : ggml [ggerganov]
de31425 ggml: riscv: add riscv spacemit backend (#15288) [alex-spacemit]
a2af773 ci : add AMD runners and workflows (#16249) [ggerganov]
7e6bba9 Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` b… [ServeurpersoCom]
6d477a9 tests: override test_set_rows::max_nmse_err to allow for occasional r… [jeffbolznv]
9ca998c codeowners: add codeowners for opencl backend (#16344) [lhez]
55c2cb0 kleidiai : fix work size and threads sync for fp16 (#16246) [chaxu01]
b7f86d8 common : simplify etag tracking by removing json (#16342) [angt]
79ec093 metal : dynamic simdgroups for MV kernels (#16340) [ggerganov]
ce2071b cuda : Enable CUDA Graph usage for Nemotron Nano v2 (NemotronH) (#16328) [anavp-nvidia]
e8136cb ggml : bump version to 0.9.4 (ggml/1363) [ggerganov]
637732f ci : disable ccache for android (#16348) [CISC]
b32dde0 common : remove common_has_curl() (#16351) [angt]
e5e46df opencl: support ne3 in get_rows (#15866) [lhez]
b05167d ggml webgpu: support for rope,div,sub,glu,scale,cont operators (#16187) [reeselevine]
3684b88 Chatapi ignore empty sampling (#16330) [ServeurpersoCom]
4524a19 opencl: support pad_ext (#15888) [lhez]
8f4603e common : disable progress bar without a tty (#16352) [angt]
b0f974b ci : fix ccache key for ubuntu-cpu-cmake (#16355) [CISC]
8cb89e1 model : support GLM 4.6 (make a few NextN/MTP tensors not required) (… [bartowski1182]
b3061db webui: Remove running `llama-server` within WebUI `dev.sh` script (#1… [allozaur]
51d50b9 vulkan: make ggml_vk_default_dispatcher support older vulkan headers … [netrunnereve]
0fb527c Add optional setting for showing "Model used:" information (#16337) [allozaur]
d8356c7 ci : use registry cache for docker builds (#16366) [CISC]
e6382cc Improve code block color theming (#16325) [allozaur]
df29570 Conversation action dialogs as singletons from Chat Sidebar + apply c… [allozaur]
ea609f5 common: introduce http.h for httplib-based client (#16373) [angt]
203e157 ci: Properly install rocwmma for hip builds (#16305) [IMbackK]
9376cb8 llama : parameter conversion and loading fixes for PLaMo2 variants (#… [mitmul]
3b6b223 HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.… [IMbackK]
d5062cf CI: reenable cdna in rocm docker builds (#16376) [IMbackK]
3d70842 HIP: add IMbackK to codeowner (#16375) [IMbackK]
c400017 SYCL: Update to oneAPI 2025.2 (#16371) [NeoZhangJianyu]
a3066ee ci : fix clean-up of old logs (#16381) [ggerganov]
d683cbb ci: update vulkan ci (#16294) [netrunnereve]
9beeea2 ci : fix ubuntu-latest-cmake-rpc (disable ccache) (#16388) [CISC]
8a1a8b4 musa: update compile flags (#16265) [yeahdongcn]
2401dea model : Apertus model implementation (#15852) [pwilkin]
ac81e42 ggml webgpu: add support for soft_max, optimize rms_norm (#16357) [reeselevine]
1bdff01 test-barrier : do not use more threads than physically available (#16… [CISC]
9cf2466 fix: track viewportHeight via window.innerHeight to avoid unwanted sc… [ServeurpersoCom]
29e391e webui : Fix messages payload sent to chat completions (#16402) [allozaur]
eb76a30 vulkan: in flash attention, bounds check against nem1 (don't rely on … [jeffbolznv]
74704df Capture model name only after first token (streaming) or completed re… [allozaur]
9b3adcf ci : change macos-13 to macos-15-intel (#16401) [danbev]
1e05b4e vulkan: Fix FA coopmat1 invalid array indexing (#16365) [jeffbolznv]
3ef1cce vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE (#1… [jeffbolznv]
bdec5a2 Fix missing messages on sibling navigation (#16408) [allozaur]
5acd3e8 ggml : fix graph reallocation with multiple chunks (#16396) [Acly]
9e84fbf llama : fix shapes for bert/mpt q/k norm (#16409) [CISC]
475678f metal : fix loop bound in ggml_mem_ranges (#16412) [ggerganov]
5eff7c1 server : context checkpointing for hybrid and recurrent models (#16382) [ddh0]
48cf3db chat : support Magistral thinking (#16413) [ServeurpersoCom]
f78c8d8 vulkan : incremental shader builds (#16341) [Acly]
c12d919 rpc : add support for multiple devices (#16276) [rgerganov]
33ee9f7 rpc : check src buffer when copying tensor (#16421) [rgerganov]
a4d4236 vulkan: use a more appropriate amount of threads when generating shad… [netrunnereve]
db10b7a ggml webgpu: actually add softmax, fix rms_norm offset (#16400) [reeselevine]
94fb727 model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206) [gabe-l-hart]
0e54749 server: update readme to mention n_past_max metric (#16436) [okuvshynov]
04c340f nix : removed metal for nix (#16118) [yuannan]
c3d2fdd ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443) [danbev]
74321e2 ci : remove missing reranker model files (#16444) [danbev]
1244ada ggml : fix unaligned access in AMX code (#16315) [ggerganov]
2f6fd3e ci : refactor sdk caching to minimize storage (#16414) [CISC]
6914cd2 chat : Granite Docling stopping (#16438) [gabe-l-hart]
bae34f0 llama : add --no-host to disable host buffers (#16310) [Gadflyii]
96c4732 metal : various optimizations + refactoring (#16446) [ggerganov]
a59838d tests : add -INF blocks to the KQ mask in the FA tests (#16380) [ggerganov]
c81638c metal : add support for non-padded FA KV (#16148) [ggerganov]
c8b914f memory : use sequential equal splits for recurrent modules (#16442) [ggerganov]
ac6274e rpc : update documentation (#16441) [rgerganov]
c95d3be presets : fix pooling param for embedding models (#16455) [ggerganov]
3c5291c webui : added download action (#13552) (#16282) [srogmann]
542bee8 server : add `/v1/health` endpoint (#16461) [ggerganov]
ff4bf58 llama : support LiquidAI LFM2-MoE hybrid model (#16464) [tdakhran]
d1ff4d4 ggml webgpu: profiling, CI updates, reworking of command submission (… [reeselevine]
c20f70b server : improve context checkpoint logic (#16440) [ggerganov]
5ceda55 metal : mark FA blocks (#16372) [ggerganov]
59dee5d server : fix cancel pending task (#16467) [issixx]
30115cf Disable CUDA host buffers on integrated GPUs (#16308) [ai-fonsi]
2ed098f refactor: centralize CoT parsing in backend for streaming mode (#16394) [ServeurpersoCom]
2d3be68 model: EmbeddingGemma Adding Support for SentenceTransformers Dense M… [sfallah]
d3e2ecc [SYCL] refactor soft_max, add soft_max_back (#16472) [NeoZhangJianyu]
fc96244 kleidiai: kernel interface refactoring (#16460) [chaxu01]
e7f4508 CANN: Improve ACL graph matching (#16166) [noemotiovon]
79a0378 ci: add ARM64 Kleidiai build and test support (#16462) [sudhiarm]
34b1ad3 model-conversion : add support for SentenceTransformers (#16387) [danbev]
84b7e4f No markdown in cot (#16483) [ServeurpersoCom]
7a35736 server : host-memory prompt caching (#16391) [ggerganov]
be62614 cpu : optimize the ggml NORM operation (#15953) [duduta]
61bb80d webui: updated the chat service to only include max_tokens in the req… [ServeurpersoCom]
5e1b18f cmake : Dont define XOPENSOURCE on AIX (#16481) [mehendarkarprajwal]
a70c2f3 server : log requests to /v1/completions (#16495) [rgerganov]
055dafc server : return HTTP 400 if prompt exceeds context length (#16486) [rgerganov]
c862b0e vocab : mark EOT token for Granite models (#16499) [ggerganov]
ebf89db server : fix division by zero when reporting stats (#16501) [ggerganov]
c3e195f convert : correctly handle LLaMA tokenizer for Jamba (#16470) [amirai21]
95d06c2 cuda : avoid initializing unused devices (#16510) [slaren]
08ddcdd server / ranking : add sorting and management of top_n (#16403) [YannFollet]
fc7de03 feat: render user content as markdown option (#16358) [ServeurpersoCom]
c7a17b8 metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494) [ggerganov]
fe6f07c CUDA: faster tile FA, add oob checks, more HSs (#16492) [JohannesGaessler]
535afa5 ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (#16518) [sirus20x6]
42d3bfd hparams : add check for layer index in is_recurrent (#16511) [danbev]
4a25717 ggml : Fix FP16 ELU positive branch (#16519) [sirus20x6]
a2d0199 common : update presets (#16504) [ggerganov]
1718cfa common : handle unicode during partial json parsing (#16526) [aldehir]
b37105d ci : add Vulkan on Ubuntu with default packages build (#16532) [mbaudier]
bfb4912 [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521) [NeoZhangJianyu]
ae71fc0 webui: remove client-side context pre-check and rely on backend for l… [ServeurpersoCom]
460b03d metal : add opt_step_adamw and op_sum (#16529) [cern1710]
1ebf5b7 CANN: Update several operators to support FP16 data format (#16251) [hipudding]
324337d ggml : fix scalar path for computing norm (#16558) [ggerganov]
f21f48a metal: add support for opt_step_sgd (#16539) [cern1710]
1702939 fix: add remark plugin to render raw HTML as literal text (#16505) [ServeurpersoCom]
c9eb359 CANN: fix CPU memory leak in CANN backend (#16549) [noemotiovon]
1a0d879 ggml : fix build broken with -march=armv9-a on MacOS (#16520) [DamonFool]
df83700 CUDA: fix numerical issues in tile FA kernel (#16540) [JohannesGaessler]
9db0ead opencl: fix build targeting CL 2 (#16554) [lhez]
96f9e03 graph : support cacheless embeddings with FA and iSWA (#16528) [ggerganov]
5643f23 metal : FA support F32 K and V and head size = 32 (#16531) [ggerganov]
5a6bf4e server : dynamic token limit for prompt cache (#16560) [ggerganov]
6970d1f cuda : remove legacy copy-op pointer indirection code (#16485) [anavp-nvidia]
9c5ed34 CUDA: add fp kernel for larger batch size MoE (#16512) [am17an]
bd28c71 CUDA: use fastdiv + ggml_cuda_mad for mmvf (#16557) [am17an]
8cd1850 CUDA: enable FA for FP32 KV cache (#16546) [JohannesGaessler]
9f9ee43 vulkan: Improve build time for MSVC (#16545) [jeffbolznv]
ac97d9b vulkan: Support FA with K/V in F32 (#16543) [jeffbolznv]
cabf6ff CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion … [am17an]
2130ea8 vulkan: Add ACC_TYPE_VEC2 implementation (#16203) [SavicStefan]
db630d5 metal : avoid using Metal's gpuAddress property (#16576) [ggerganov]
1c709d1 server : fix mtmd checkpoints (#16591) [ggerganov]
54bb97e CUDA: Changing the CUDA scheduling strategy to spin (#16585) [JTischbein]
45c40ee llama-quant: add support for mmproj (#16592) [ngxson]
5248535 server : fix img token logs (#16595) [ggerganov]
0ed6745 metal: optimise `GGML_OP_SUM` (#16559) [cern1710]
f0039fe Add server-driven parameter defaults and syncing (#16515) [allozaur]
2b6a3c3 opencl: fix FA for f32 (#16584) [lhez]
7e18f24 opencl: add q8_0 mm support (#16469) [lhez]
95427cb cpu : add FLOOR, CEIL, ROUND and TRUNC unary operators (#16083) [safranowith]
b99f8c2 gguf-py : add support for endian conversion of BF16 data (#16594) [AlekseiNikiforovIBM]
393b7c6 SYCL: Add GGML_OP_MEAN operator support (#16009) [yael-works]
9990d81 ggml-cpu: replace putenv with setenv for const-correctness (#16573) [otegami]
8818d36 common : Update the docs on -t --threads (#16236) [takasurazeem]
b2e8671 CANN: format code using .clang-format (#15863) [noemotiovon]
726405c sycl : add ARANGE operator (#16362) [GittyBurstein]
29750a9 fix: added a normalization step for MathJax-style \[\] and \(\) delim… [ServeurpersoCom]
588824c mtmd : support home-cooked Mistral Small Omni (#14928) [ngxson]
a68daf0 SYCL SET operator optimized for F32 tensors (#16350) [GittyBurstein]
b590ae4 grammar : use int64_t to avoid int overflows in int schema to grammar… [ochafik]
aa9e70a metal : add `CONV_TRANSPOSE_2D` (#16542) [iliailmer]
f9efc08 vulkan: fix debug build (add_rms_len/data not found) (#16624) [jeffbolznv]
01de0fa webui: reorganize settings layout (#16607) [ServeurpersoCom]
f51e2f7 ggml : fix SpaceMit IME array out-of-bounds in task assignment (#16629) [muggle-stack]
557ca9b vulkan: Add State Space Model (SSM) Operations Support (#16463) [giuseppe]
0a52f5a rpc : report actual free memory (#16616) [rgerganov]
6a04bac llama-model: fix insonsistent ctxs <-> bufs order (#16581) [JohannesGaessler]
586ef08 opencl: transposed gemm/gemv moe kernel with mxfp4,f32 (#16602) [shawngu-quic]
ac040c3 CUDA: use registers instead of smem in topk-moe (#16647) [am17an]
a66d9cb vulkan: Implement topk_moe fused shader, ported from CUDA (#16641) [jeffbolznv]
718de8b HIP: fix GPU_TARGETS (#16642) [JohannesGaessler]
20fe2b7 CODEOWNERS: update for ggml-cuda/mmf (#16660) [am17an]
f146809 ci: include s390x release binaries (#16648) [taronaeo]
0023d90 ci : avoid manual updates of docs/ops.md (#16663) [CISC]
0aa82c5 ci : fix binaries release failure for s390x (binaries may not work ye… [taronaeo]
6b38a72 model : add Granite Hybrid types (#16635) [giuseppe]
e97a19b llama-context: only warn on pooling_type when user specified (#16674) [otegami]
4c75900 SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators (#16… [safranowith]
595453c readme: update bindings (#16651) [deadprogram]
d82537b llama-batch: fix build fails with `-Werror=missing-braces` (#16614) [otegami]
5290956 Enable per-conversation loading states to allow having parallel conve… [allozaur]
e4a83f0 Import/Export UX improvements (#16619) [allozaur]
1d4e654 Prevent premature submission on IME input (#16673) [allozaur]
b5d1a17 ggml-alloc : fix leak when reusing a tensor with a larger size (#16679) [slaren]
e2aad4c Handle legacy 'context' attachments (#16687) [allozaur]
2b33ba9 model : add BailingMoeV2 support (#16063) [CISC]
c50d04c sycl : add PAD_REFLECT_D1 operator support (#16145) [ye-NX]
738c1e8 vulkan: Handle FA with all -inf mask values (#16447) [jeffbolznv]
7ad39a3 opencl: fix warnings and clean up profiling (#16688) [lhez]
52d7866 ggml: add ggml_can_fuse_subgraph (#16662) [am17an]
5f157a9 CUDA: better error for FA kernel with 0 occupancy (#16643) [JohannesGaessler]
8c01a63 CUDA: topk-moe: add optional parameter for gpt-oss (#16649) [am17an]
718c892 CUDA: fix bug in topk-moe softmax (#16711) [am17an]
c7edfc2 tests : fix test-thread-safety when compiling with multiple backends … [Acly]
f36d31a ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v… [sirus20x6]
6819ea7 webui: introduce OpenAI-compatible model selector in JSON payload (#1… [ServeurpersoCom]
8106777 Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectoriz… [slaren]
0e04b43 Add experimental ggml-hexagon backend for the Hexagon NPU (#16547) [max-krasnyansky]
443c17e sycl: use async memory allocation to fix crashes during graph recordi… [mmichel11]
3a95385 server : send partial stop string when <EOG> is reached (#15007) [matteoserva]
0f4fedf ggml-cuda: use passed ops instead of hardcoded ops (#16712) [am17an]
1268cf7 Manually link -lbsd to resolve flock symbol on AIX (#16610) [mehendarkarprajwal]
bcb8ed4 mtmd-cli : allow using --jinja (#16718) [ngxson]
3049040 convert : Make mistral-common dependency optional (#16738) [juliendenize]
29de86b Cleanup & remove debugging stuff [pwilkin]
fbe0e22 Cleanup more debug stuff and flake / editorconfig errors [pwilkin]
1aed3d7 Merge remote-tracking branch 'origin/master' into qwen3_next [pwilkin]
729ebf8 Unified Delta.net [pwilkin]
f2c8be1 Merge remote-tracking branch 'origin/master' into qwen3_next [pwilkin]
8edcc4d Update ggml/include/ggml.h [no ci] [pwilkin]
4fbd224 Update ggml/include/ggml.h [no ci] [pwilkin]
8cfd6d9 Update ggml/include/ggml.h [no ci] [pwilkin]
2825263 Update ggml/src/ggml.c [no ci] [pwilkin]
b9a6294 Update ggml/src/ggml.c [no ci] [pwilkin]
5112380 Update ggml/include/ggml.h [no ci] [pwilkin]
61667c3 Restore comment [pwilkin]
5ecbe6e Update ggml/src/ggml-cpu/ops.cpp [pwilkin]
f63e270 Apply graph reduction changes [pwilkin]