
Commit 4eb3d28

more

Signed-off-by: bitliu <bitliu@tencent.com>
1 parent a6ba9d6 commit 4eb3d28

File tree: 1 file changed (+26 -4 lines changed)


website/blog/2025-10-16-mom-family.md

Lines changed: 26 additions & 4 deletions
@@ -71,14 +71,21 @@ As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements
 
 ### The MoM Architecture: Best of Both Worlds
 
-Our **Mixture-of-Models** approach combines encoder and decoder strengths:
+**Mixture-of-Models (MoM)** is both a philosophy and an architecture:
+
+1. **Backend LLM Architecture** — Route requests to the optimal downstream model (GPT-4, Claude, Llama, etc.)
+2. **Router Internal Design** — The router itself uses multiple specialized models working together
+
+Our MoM approach combines encoder and decoder strengths:
 
 - ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
 - 🧠 **Decoders** — Explainable decisions with reasoning for transparency
 - 🎯 **Domain Agents** — Expert routing with specialized knowledge
 
 This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.
 
+**Key Insight**: Just as vLLM-SR routes to different backend LLMs, the router itself is powered by a mixture of specialized models—each optimized for specific routing tasks (security, similarity, intent classification, domain expertise).
+
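
To make the encoder/decoder trade-off concrete, here is a minimal sketch of a router that picks the fast path or the reasoning path per request. The function names, the 50 ms threshold, and the keyword stubs are illustrative assumptions, not vLLM-SR's actual API:

```python
# Sketch only: assumes a latency budget and an "explain" flag drive the choice.
from dataclasses import dataclass


@dataclass
class RoutingDecision:
    route: str       # chosen route label
    rationale: str   # reasoning trace (empty on the fast encoder path)


def encoder_classify(query: str) -> str:
    # Stand-in for a fine-tuned encoder: one forward pass, sub-10ms latency.
    return "math" if any(ch.isdigit() for ch in query) else "general"


def decoder_reason(query: str) -> tuple[str, str]:
    # Stand-in for a decoder that generates a label plus an explanation.
    label = encoder_classify(query)
    return label, f"Routed to {label!r} based on the content of {query!r}."


def route(query: str, latency_budget_ms: float, explain: bool) -> RoutingDecision:
    """Speed when you need it, reasoning when it matters."""
    if latency_budget_ms < 50 and not explain:
        return RoutingDecision(encoder_classify(query), "")
    return RoutingDecision(*decoder_reason(query))


print(route("What is 2 + 2?", latency_budget_ms=10, explain=False))
```
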
 ## The MoM Model Family
 
 We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):
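
As a small illustration of the size-variant naming, the one category whose variants this post names explicitly (`mom-brain-*`, see the pipeline in the next hunk) expands like this; a sketch of the scheme, not an official model list:

```python
# Expanding the three size variants for one category; names beyond those the
# post mentions (mom-brain-flash/pro/max) are not implied.
sizes = ["flash", "pro", "max"]
print([f"mom-brain-{size}" for size in sizes])
# ['mom-brain-flash', 'mom-brain-pro', 'mom-brain-max']
```
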
@@ -137,15 +144,30 @@ Specialized small language models for domain-specific routing:
 
 ## How vLLM-SR Uses MoM
 
-vLLM-SR's routing pipeline leverages MoM models at multiple stages:
+MoM operates at **two levels** in vLLM-SR:
+
+### Level 1: Router Internal Architecture (MoM Inside)
+
+The router itself is a mixture of specialized models working together in a pipeline:
 
 1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
 2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
 3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
 4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
-5. **Cost Optimization** → Simple queries → lightweight models; complex queries → premium models
 
-This achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
+Each stage uses the **right model for the right task**: fast encoders for security checks, reasoning decoders for complex decisions, domain experts for specialized queries.
+
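
A minimal sketch of this four-stage pipeline, with keyword stubs standing in for `mom-jailbreak-flash` / `mom-pii-flash`, `mom-brain-*`, `mom-similarity-flash`, and `mom-expert-*`; the stage signatures and rules are assumptions for illustration, not the router's real interfaces:

```python
def security_check(query: str) -> bool:
    # Stage 1: fast encoders veto jailbreak/PII requests (stubbed).
    return "ignore previous instructions" not in query.lower()


def classify_intent(query: str) -> str:
    # Stage 2: mom-brain-* determines the query type (stubbed keyword rule).
    return "code" if "bug" in query.lower() else "general"


def similarity_route(intent: str) -> str | None:
    # Stage 3: mom-similarity-flash reuses a known route if one is close enough.
    known_routes = {"general": "llama-3.2", "code": "code-llama"}
    return known_routes.get(intent)


def expert_route(intent: str) -> str:
    # Stage 4: mom-expert-* handles specialized queries with no cached route.
    return "gpt-4"


def route(query: str) -> str:
    if not security_check(query):
        raise ValueError("blocked by security stage")
    intent = classify_intent(query)
    return similarity_route(intent) or expert_route(intent)


print(route("Why does this bug happen?"))  # -> code-llama
```
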
+### Level 2: Backend LLM Orchestration (MoM Outside)
+
+The router then directs requests to the optimal backend LLM:
+
+- **Simple queries** → Lightweight models (Llama 3.2, Qwen 2.5)
+- **Complex queries** → Premium models (GPT-4, Claude 3.5)
+- **Domain-specific** → Specialized models (Code Llama, Mistral Math)
+
+This dual-level MoM architecture achieves **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
+
+**The Philosophy**: Mixture-of-Models all the way down—from the router's internal decision-making to the backend LLM selection.
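
And a sketch of the Level 2 selection, mapping the tiers above to candidate backends; the table entries echo the bullets in the diff, but the tier keys and the "first candidate wins" rule are illustrative assumptions, not vLLM-SR configuration:

```python
# Hypothetical tier table; real routing would weigh cost, load, and quality.
BACKENDS: dict[str, list[str]] = {
    "simple":  ["llama-3.2", "qwen-2.5"],    # lightweight
    "complex": ["gpt-4", "claude-3.5"],      # premium
    "code":    ["code-llama"],               # domain-specific
    "math":    ["mistral-math"],             # domain-specific
}


def pick_backend(tier: str) -> str:
    # First candidate in the tier wins in this sketch.
    return BACKENDS[tier][0]


print(pick_backend("simple"))  # -> llama-3.2
```
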
 
 ## What's Next: Exploring Frontier Techniques
 