[GPT-OSS] Load MXFP4 and BF16 weights directly and enable online requantization #992
Description
Adds MXFP4 load support to enable loading the original OpenAI checkpoints, e.g. openai/gpt-oss-120b. Includes another axis swap, since the original OpenAI checkpoints store weights with [output, input] dimensions while the previous BF16 checkpoints use [input, output]. Requantization can be performed on the weights after load by passing an additional config, for both the MXFP4 and BF16 models.
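For reference, a minimal sketch of what the MXFP4 decode plus axis swap looks like (not the PR's actual loader; `blocks`/`scales` names, shapes, and the nibble order are assumptions — MXFP4 packs two E2M1 values per byte with one shared E8M0 scale per 32-element block):

```python
import jax.numpy as jnp

# E2M1 (FP4) code points, indexed by the 4-bit value.
_FP4_VALUES = jnp.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=jnp.float32,
)

def dequant_mxfp4(blocks: jnp.ndarray, scales: jnp.ndarray) -> jnp.ndarray:
    """blocks: uint8 [output, input // 2] (two FP4 nibbles per byte),
    scales: uint8 [output, input // 32] (E8M0 exponents, bias 127)."""
    # Unpack the two FP4 values per byte (low-nibble-first order assumed here).
    lo = _FP4_VALUES[blocks & 0x0F]
    hi = _FP4_VALUES[blocks >> 4]
    vals = jnp.stack([lo, hi], axis=-1).reshape(blocks.shape[0], -1)
    # Apply one power-of-two scale per 32-element block.
    scale = jnp.exp2(scales.astype(jnp.float32) - 127.0)
    vals = vals.reshape(scales.shape[0], scales.shape[1], -1) * scale[..., None]
    w = vals.reshape(blocks.shape[0], -1)               # [output, input]
    # Swap to the [input, output] layout the existing BF16 path expects.
    return jnp.swapaxes(w, 0, 1).astype(jnp.bfloat16)
```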
The sharding had to be passed to RMSNorm, wrapped in PartitionSpecs, to avoid Qwix-related issues. The expert-combine op was also factored out of the MoE forward so that Qwix can ignore it; see the sketch below. The alternative is to avoid einsums and filter the quantized modules to einsum only, but when two ops of the same type live in the same module, Qwix does not currently seem able to target one of them specifically.
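A rough illustration of the factoring, assuming a Flax-style MoE layer (the function name, einsum subscripts, and shapes are placeholders, not the PR's code): the combine contraction lives in its own function/submodule path, so it can be excluded by name rather than by op type.

```python
import jax.numpy as jnp

def combine_experts(expert_outputs: jnp.ndarray,
                    combine_weights: jnp.ndarray) -> jnp.ndarray:
    """Weighted sum of per-expert outputs, kept separate from the quantized
    expert einsums inside the MoE forward so a module-level quantization
    rule can skip it."""
    # expert_outputs: [tokens, experts, model_dim]
    # combine_weights: [tokens, experts]
    return jnp.einsum("tem,te->tm", expert_outputs, combine_weights)
```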
Tests
I evaluated accuracy with scripts/vllm/benchmarking/benchmark_serving.py on MMLU across the four variants: MXFP4, BF16, BF16 requantized (subchannel size 64), and MXFP4 requantized. All give similar accuracy on 10% of the MMLU benchmark.