Conversation

@amishacorns
Contributor

@amishacorns commented Oct 31, 2025

Description

Adds MXFP4 load support to enable loading the original OpenAI checkpoints, e.g. openai/gpt-oss-120b. Includes another axis swap, since the original OpenAI checkpoints store weights with [output, input] dimensions while the previous BF16 checkpoints use [input, output]. Requantization can be performed on the weights after load with an additional config, passed for both the MXFP4 and BF16 models.
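For reference, a minimal sketch of the MXFP4 dequantization plus the axis swap. The e2m1 value table, 32-element block size, and E8M0 scale bias follow the MXFP4 spec; the nibble order and the exact [out, in/2] / [out, in/32] packing are assumptions for illustration, not the PR's actual code.

```python
import jax.numpy as jnp

# e2m1 code -> value table; the high bit of each 4-bit code is the sign.
_FP4_VALUES = jnp.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=jnp.float32,
)

def dequant_mxfp4(blocks, scales):
    """blocks: uint8 [out, in // 2], two e2m1 codes per byte (low nibble first, assumed);
    scales: uint8 [out, in // 32], E8M0 block exponents with bias 127."""
    lo = _FP4_VALUES[blocks & 0x0F]
    hi = _FP4_VALUES[blocks >> 4]
    vals = jnp.stack([lo, hi], axis=-1).reshape(blocks.shape[0], -1)  # [out, in]
    # One power-of-two scale per 32-element block along the input axis.
    scale = jnp.exp2(scales.astype(jnp.float32) - 127.0)              # [out, in // 32]
    vals = vals.reshape(*scales.shape, 32) * scale[..., None]
    w = vals.reshape(blocks.shape[0], -1).astype(jnp.bfloat16)
    # OpenAI checkpoints store [output, input]; swap to the [input, output]
    # layout used by the existing BF16 path.
    return jnp.swapaxes(w, 0, 1)
```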

The sharding had to be passed to RMSNorm, wrapped in PartitionSpecs, to avoid Qwix-related issues. The expert combine op was also factored out of the MoE forward to allow Qwix to ignore it (see the sketch below). The alternative is to not use an einsum there and instead filter the quantized modules by einsum op type, but currently, if two ops of the same type sit within the same module, Qwix does not seem able to target one specific op.
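To illustrate the workaround (the function names and shapes below are hypothetical, not the PR's code): with the combine einsum factored into its own function outside the MoE module, a Qwix rule that matches einsums inside the module only sees the two expert contractions.

```python
import jax
import jax.numpy as jnp

def combine_experts(expert_out, weights):
    """Weighted combine over experts, kept outside the MoE module so a
    module-level Qwix rule never matches this einsum."""
    # expert_out: [tokens, experts, hidden], weights: [tokens, experts]
    return jnp.einsum('te,teh->th', weights, expert_out)

def moe_forward(x, w_up, w_down, router_logits):
    # The two expert einsums below live inside the module Qwix quantizes.
    weights = jax.nn.softmax(router_logits, axis=-1)        # [tokens, experts]
    h = jax.nn.gelu(jnp.einsum('th,ehf->tef', x, w_up))     # quantized
    expert_out = jnp.einsum('tef,efh->teh', h, w_down)      # quantized
    return combine_experts(expert_out, weights)             # ignored by Qwix
```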

Tests

I evaluated accuracy using scripts/vllm/benchmarking/benchmark_serving.py on MMLU across the four variants: MXFP4, BF16, BF16 requantized (subchannel size 64), and MXFP4 requantized. All give similar accuracy on 10% of the MMLU benchmark.
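For context, a minimal sketch of what subchannel requantization with group size 64 looks like; symmetric int8 is an assumption for illustration, since the actual requantization format comes from the Qwix config.

```python
import jax.numpy as jnp

def requantize_subchannel(w, group=64):
    """w: [in, out] bf16 weights -> (int8 codes, per-group bf16 scales).
    One scale per 64-element group along the contraction axis."""
    in_dim, out_dim = w.shape
    wg = w.astype(jnp.float32).reshape(in_dim // group, group, out_dim)
    # One scale per (group, output) pair; guard against all-zero groups.
    scale = jnp.maximum(jnp.abs(wg).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = jnp.round(wg / scale).astype(jnp.int8)
    return q.reshape(in_dim, out_dim), scale.squeeze(1).astype(jnp.bfloat16)
```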

@kyuyeunk
Collaborator

kyuyeunk commented Nov 1, 2025

@bzgoogle can you take a look?

Collaborator

@bzgoogle left a comment


Thanks Jordan. LGTM.

@amishacorns force-pushed the gpt-oss-mxfp4-load branch 3 times, most recently from 81954b4 to 2f2730f on November 11, 2025 at 23:51
@amishacorns changed the title from "[GPT-OSS] Load MXFP4 weights directly and dequantize online" to "[GPT-OSS] Load MXFP4 and BF16 weights directly and quantize online" on Nov 11, 2025
Signed-off-by: Jordan Dotzel <amishacorns@users.noreply.github.com>
@amishacorns force-pushed the gpt-oss-mxfp4-load branch 2 times, most recently from 5b44f2f to fb33148 on November 12, 2025 at 03:32
Signed-off-by: Jordan Dotzel <amishacorns@users.noreply.github.com>
Signed-off-by: Jordan Dotzel <amishacorns@users.noreply.github.com>
@amishacorns changed the title from "[GPT-OSS] Load MXFP4 and BF16 weights directly and quantize online" to "[GPT-OSS] Load MXFP4 and BF16 weights directly and enable online requantization" on Nov 12, 2025
@jrplatin merged commit 446464d into vllm-project:main on Nov 13, 2025
3 checks passed