Commit a6ba9d6

more

Signed-off-by: bitliu <bitliu@tencent.com>

1 parent 75cd042

3 files changed: +331 -0 lines changed

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
---
slug: mom-family
title: "MoM: Specialized Models for Intelligent Routing"
authors: [Xunzhuo]
tags: [mom, models, routing, announcement]
---

![MoM Family](/img/mom-family.png)

**One fabric. Many minds.** We're introducing **MoM** (Mixture of Models), a family of specialized routing models that power vLLM-SR's intelligent decision-making.

<!-- truncate -->
## Why MoM?

vLLM-SR solves a critical problem: **how to route LLM requests to the right model at the right time**. Not every query needs the same resources: "What's the weather?" shouldn't cost as much as "Analyze this legal contract."

## MoM System Card

A quick overview of all MoM models:

| Category | Model | Size | Base Model | Latency | Purpose |
|----------|-------|------|------------|---------|---------|
| **🧠 Intelligent Routing** | mom-brain-flash | Flash | ModernBERT | &lt;10ms | Ultra-fast intent classification |
| | mom-brain-pro | Pro | Qwen 0.6B | ~30-50ms | Balanced routing with reasoning |
| | mom-brain-max | Max | Qwen 1.7B | ~50-100ms | Maximum accuracy for complex decisions |
| **🔍 Similarity Search** | mom-similarity-flash | Flash | ModernBERT | &lt;10ms | Semantic similarity matching |
| **🔒 Prompt Guardian** | mom-jailbreak-flash | Flash | ModernBERT | &lt;10ms | Jailbreak/attack detection |
| | mom-pii-flash | Flash | ModernBERT | &lt;10ms | PII detection & privacy protection |
| **🎯 SLM Experts** | mom-expert-math-flash | Flash | Qwen 0.6B | ~30-50ms | Mathematics routing |
| | mom-expert-math-pro | Pro | Qwen 1.7B | ~50-100ms | Advanced math with reasoning |

**Key Insights:**

- **4 Categories** × **3 Size Variants** = Flexible routing architecture
- **ModernBERT** (encoder-only) → Sub-10ms latency for high-throughput scenarios
- **Qwen** (decoder-only) → Explainable decisions with reasoning capabilities
- **Flash** models achieve 10,000+ QPS on commodity hardware
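One way to read the system card is as a latency/capability menu. The sketch below picks the most capable brain variant that fits a latency budget, using the worst-case latencies from the table above; the function name and thresholds are illustrative, not part of the vLLM-SR API.

```python
# Hypothetical sketch: choosing a brain variant from a latency budget,
# using the worst-case latencies listed in the MoM system card.
# Names and thresholds are illustrative, not a real vLLM-SR API.

BRAIN_VARIANTS = [
    # (model, worst-case latency in ms), ordered fastest -> most capable
    ("mom-brain-flash", 10),   # ModernBERT, <10ms
    ("mom-brain-pro", 50),     # Qwen 0.6B, ~30-50ms
    ("mom-brain-max", 100),    # Qwen 1.7B, ~50-100ms
]

def pick_brain(latency_budget_ms: float) -> str:
    """Return the most capable brain model that fits the latency budget."""
    chosen = BRAIN_VARIANTS[0][0]  # fall back to the fastest variant
    for model, worst_case_ms in BRAIN_VARIANTS:
        if worst_case_ms <= latency_budget_ms:
            chosen = model  # later variants are more capable
    return chosen
```

With a 60ms budget this selects `mom-brain-pro`: `mom-brain-max` can take up to ~100ms, so only Flash and Pro fit, and Pro is the more capable of the two.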
## The Evolution: From Encoder-Only to Mixture-of-Models

### Where We Started: ModernBERT Foundation

vLLM-SR initially built its routing intelligence entirely on **ModernBERT** (encoder-only models):

**Advantages**:

- ⚡ **Blazing fast**: Sub-10ms inference latency
- 📊 **High throughput**: 10,000+ QPS on commodity hardware
- 💰 **Cost-effective**: Minimal compute requirements
- 🎯 **Proven accuracy**: Strong performance on classification tasks

**Limitations**:

- **Black box decisions**: No explanation for routing choices
- **Limited reasoning**: Cannot handle complex, multi-step logic
- **Fixed capabilities**: Hard to extend with new behaviors
- **No tool integration**: Cannot leverage external tools or APIs

### Why We're Evolving: Decoder-Only Models

As vLLM-SR adoption grew, we encountered more diverse scenarios and requirements:

- **Explainability**: Users need to understand *why* a query was routed to a specific model
- **Complex reasoning**: Some routing decisions require multi-step analysis
- **Agentic workflows**: Integration with tool calling, function execution, and external APIs
- **Advanced techniques**: Reinforcement learning (RL), sophisticated post-training methods
- **Domain expertise**: Specialized routing for legal, medical, and scientific domains

**The Solution**: Expand to decoder-only models while keeping encoder speed where it matters.
### The MoM Architecture: Best of Both Worlds

Our **Mixture-of-Models** approach combines encoder and decoder strengths:

- ⚡ **Encoders** — Fast classification (sub-10ms latency) for high-throughput scenarios
- 🧠 **Decoders** — Explainable decisions with reasoning for transparency
- 🎯 **Domain Agents** — Expert routing with specialized knowledge

This hybrid architecture lets you choose the right tool for each job: speed when you need it, reasoning when it matters.

## The MoM Model Family

We organize MoM models into **four categories** with **three size variants** (Flash, Pro, Max):

### 🧠 Intelligent Routing

Smart routing models with three size variants:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-brain-flash** | Flash | ModernBERT | Ultra-fast intent classification (sub-10ms latency) |
| **mom-brain-pro** | Pro | Qwen 0.6B | Balanced performance with reasoning capabilities |
| **mom-brain-max** | Max | Qwen 1.7B | Maximum accuracy for complex routing decisions |

**Architecture**: Flash is based on ModernBERT (encoder-only), while Pro and Max are based on Qwen 0.6B and 1.7B (decoder-only) models.

### 🔍 Similarity Search

Semantic similarity and vector search:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-similarity-flash** | Flash | ModernBERT | Fast semantic similarity matching for route selection |

**Architecture**: Based on ModernBERT (encoder-only) for high-speed embedding generation.
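Similarity-based route selection boils down to embedding the query and picking the nearest route description. The sketch below uses toy bag-of-words vectors in place of dense ModernBERT embeddings, and the route names are made up; it only illustrates the matching step, not the real model.

```python
# Toy sketch of similarity-based route selection, the job
# mom-similarity-flash performs with learned embeddings. The
# "embeddings" here are bag-of-words counts, not real model output,
# and the route names are invented for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for an encoder: bag-of-words token counts
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_route(query: str, routes: dict[str, str]) -> str:
    """Pick the route whose description is most similar to the query."""
    q = embed(query)
    return max(routes, key=lambda name: cosine(q, embed(routes[name])))

routes = {
    "math": "solve equations calculus algebra math",
    "weather": "weather forecast temperature rain",
}
```

Swapping `embed` for a real encoder (and cosine over dense vectors) gives the production version of the same idea.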
### 🔒 Prompt Guardian

Security and safety checks before routing:

| Model | Size | Base Model | Purpose |
|-------|------|------------|---------|
| **mom-jailbreak-flash** | Flash | ModernBERT | Jailbreak/attack detection (security) |
| **mom-pii-flash** | Flash | ModernBERT | PII detection (privacy protection) |

**Architecture**: Both are based on ModernBERT (encoder-only) for ultra-fast security checks.

### 🎯 SLM Experts

Specialized small language models for domain-specific routing:

| Model | Size | Base Model | Domain |
|-------|------|------------|--------|
| **mom-expert-math-flash** | Flash | Qwen 0.6B | Mathematics (algebra, calculus, statistics) |
| **mom-expert-math-pro** | Pro | Qwen 1.7B | Advanced mathematics with reasoning |

**Architecture**: Based on Qwen models (decoder-only) for domain-specific reasoning capabilities.

## Design Principles

**Safety-First**: Prompt Guardian models (PII and jailbreak detection) run before routing: security at the edge.

**Speed ↔ Capability**: Choose Flash for sub-10ms latency, Pro for balanced performance, or Max for maximum accuracy. Different sizes, different SLAs.

**Domain Expertise**: SLM Expert models achieve 15-25% better accuracy on domain-specific tasks vs. generalist routing. Math queries go to math experts.
## How vLLM-SR Uses MoM

vLLM-SR's routing pipeline leverages MoM models at multiple stages:

1. **Security Check** → `mom-jailbreak-flash` and `mom-pii-flash` filter malicious/sensitive requests
2. **Intent Classification** → `mom-brain-*` models (flash/pro/max) determine query type and routing decisions
3. **Similarity Search** → `mom-similarity-flash` finds semantically similar routes
4. **Domain Routing** → `mom-expert-*` models route specialized queries to optimal downstream models
5. **Cost Optimization** → Simple queries go to lightweight models; complex queries go to premium models

This achieves a **2x+ cost reduction** while maintaining quality, similar to [RouteLLM](https://arxiv.org/abs/2406.18665).
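The staged pipeline above can be sketched as plain control flow. In this minimal sketch the model calls are stubbed with simple predicates; the stub logic, backend names, and function names are assumptions for illustration, not the actual vLLM-SR implementation.

```python
# Minimal sketch of the five-stage MoM pipeline. Every "model" here is
# a trivial stub standing in for the real MoM classifier; backend names
# are invented for illustration.

def is_jailbreak(query: str) -> bool:
    # Stub for mom-jailbreak-flash
    return "ignore previous instructions" in query.lower()

def contains_pii(query: str) -> bool:
    # Stub for mom-pii-flash
    return "ssn" in query.lower()

def classify_intent(query: str) -> str:
    # Stub for mom-brain-flash: tag math-looking queries as "math"
    math_tokens = ("integral", "derivative", "solve")
    return "math" if any(t in query.lower() for t in math_tokens) else "general"

def route(query: str) -> str:
    # 1. Security check: refuse unsafe or sensitive requests up front
    if is_jailbreak(query) or contains_pii(query):
        return "blocked"
    # 2. Intent classification, then 4. domain routing for experts
    if classify_intent(query) == "math":
        return "math-expert-backend"   # via mom-expert-math-*
    # 5. Cost optimization: short/simple queries hit a lightweight model
    return "lightweight-backend" if len(query) < 80 else "premium-backend"
```

The real pipeline replaces each stub with a MoM model call (and stage 3 with similarity search), but the ordering — guard first, classify, then route for cost — is the point of the sketch.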
## What's Next: Exploring Frontier Techniques

The move to decoder-only models opens exciting possibilities for vLLM-SR:

### 🤖 Agentic Routing

Decoder models can act as intelligent agents that:

- Dynamically select and orchestrate multiple models
- Make multi-step routing decisions with tool calling
- Adapt routing strategies based on feedback

### 🎯 Reinforcement Learning (RL)

Apply RL techniques to optimize routing decisions:

- Learn from user feedback and model performance
- Discover optimal routing policies through trial and error
- Continuously improve cost-quality trade-offs

### 🔧 Advanced Post-Training

Leverage cutting-edge post-training methods:

- **Distillation**: Transfer knowledge from large models to efficient routers
- **Preference learning**: Train on human feedback (RLHF, DPO)
- **Domain adaptation**: Fine-tune for specific industries or use cases

### 🛠️ Tool Integration

Enable routers to:

- Call external APIs for context-aware routing
- Query databases for historical routing patterns
- Integrate with monitoring systems for real-time optimization

**The vision**: vLLM-SR routers that not only classify but *reason*, *learn*, and *adapt*.

## Model Naming Convention

```text
mom-{category}-{size}
mom-expert-{domain}-{size}
```
### Four Categories

1. **Intelligent Routing**: `mom-brain-{flash|pro|max}`
2. **Similarity Search**: `mom-similarity-{flash}`
3. **Prompt Guardian**: `mom-{jailbreak|pii}-{flash}`
4. **SLM Experts**: `mom-expert-{domain}-{flash|pro}`
### Three Size Variants

- **flash**: ModernBERT-based (for brain/similarity/guardian) or Qwen 0.6B (for experts) — the fastest variant; sub-10ms latency when ModernBERT-based
- **pro**: Qwen 0.6B (for brain) or Qwen 1.7B (for experts) — balanced performance with reasoning
- **max**: Qwen 1.7B (for brain) — maximum accuracy and capabilities

### Architecture Summary

- **Intelligent Routing**: Flash (ModernBERT) + Pro/Max (Qwen 0.6B/1.7B)
- **Similarity Search**: Flash (ModernBERT)
- **Prompt Guardian**: Flash (ModernBERT)
- **SLM Experts**: Flash/Pro (Qwen 0.6B/1.7B)

## Get Started

All MoM models are available on [Hugging Face](https://huggingface.co/LLM-Semantic-Router).

**Resources**:

- [GitHub](https://github.com/vllm-project/semantic-router)
- [Documentation](https://vllm-semantic-router.com)
- [Quick Start Guide](https://vllm-semantic-router.com/docs/installation)

---

**vLLM-SR · Route with intent. Think with reason.**

website/src/css/custom.css

Lines changed: 104 additions & 0 deletions
@@ -779,3 +779,107 @@ td, th {
    width: 100% !important;
  }
}

/* ============================================
   Blog Page Optimizations - Wider Content
   ============================================ */

/* Only apply blog optimizations to blog pages - use more specific selectors */

/* Hide blog sidebar (left side posts list) - only on blog pages */
[class*='blog-wrapper'] aside[class*='blogSidebar'],
[class*='blog-wrapper'] aside.col--3,
[class*='blog-wrapper'] .theme-blog-sidebar {
  display: none !important;
}

/* Hide table of contents (right side) - only on blog pages, not docs */
[class*='blog-wrapper'] .theme-doc-toc-desktop,
[class*='blog-wrapper'] .table-of-contents,
[class*='blog-wrapper'] div[class*='tableOfContents'],
[class*='blog-wrapper'] div[class*='tocCollapsible'] {
  display: none !important;
}

/* Expand blog content row to full width - only on blog pages */
[class*='blog-wrapper'] div[class*='blogContainer'] .row,
[class*='blog-wrapper'] .row {
  justify-content: center !important;
  margin: 0 auto !important;
}

/* Expand blog content column to use full width - only on blog pages */
[class*='blog-wrapper'] .col--7,
[class*='blog-wrapper'] div[class*='blogPostContent'] {
  max-width: 100% !important;
  flex: 0 0 100% !important;
  margin: 0 auto !important;
}

/* Center blog content container - wider layout - only on blog pages */
[class*='blog-wrapper'] .container,
[class*='blog-wrapper'] div[class*='blogContainer'] {
  max-width: 1600px !important;
  margin: 0 auto !important;
  padding: 0 3rem !important;
}

/* Blog post content - centered and wide - only on blog pages */
[class*='blog-wrapper'] article,
[class*='blog-wrapper'] article[class*='blogPostItem'],
[class*='blog-wrapper'] div[class*='blogPostContent'] article {
  min-width: 60vw !important;
  max-width: 1400px !important;
  margin: 2rem auto !important;
  padding: 3rem 4rem !important;
  display: block !important;
}

/* Blog post header - only on blog pages */
[class*='blog-wrapper'] header[class*='blogPostHeader'] {
  max-width: 1400px !important;
  margin: 0 auto !important;
}

/* Blog list page optimization - only on blog pages */
[class*='blog-wrapper'] .margin-vert--lg,
[class*='blog-wrapper'] div[class*='blogListPage'] {
  max-width: 1400px !important;
  margin: 2rem auto !important;
  width: 100% !important;
}

/* Ensure blog post items are centered - only on blog pages */
[class*='blog-wrapper'] .blogPostItem,
[class*='blog-wrapper'] div[class*='blogPostItem'] {
  max-width: 1400px !important;
  margin: 0 auto 2rem auto !important;
}

/* Center blog post content wrapper - only on blog pages */
[class*='blog-wrapper'] div[class*='blogPostPageContent'] {
  display: flex !important;
  justify-content: center !important;
  width: 100% !important;
}

/* Responsive adjustments for blog - only on blog pages */
@media (max-width: 996px) {
  [class*='blog-wrapper'] article,
  [class*='blog-wrapper'] article[class*='blogPostItem'] {
    min-width: auto !important;
    padding: 2rem 1.5rem !important;
  }

  [class*='blog-wrapper'] .container,
  [class*='blog-wrapper'] div[class*='blogContainer'] {
    padding: 0 1rem !important;
  }
}

@media (max-width: 768px) {
  [class*='blog-wrapper'] article,
  [class*='blog-wrapper'] article[class*='blogPostItem'] {
    padding: 1.5rem 1rem !important;
  }
}

website/static/img/mom-family.png

2.04 MB